Mistral "Mixtral" 8x7B 32k model [magnet] (opens in new tab)

(twitter.com)

546 pointsxyzzyrz2y ago239 comments

239 comments

135 comments · 37 top-level

brucethemoose22y ago· 23 in thread

In other llm news, Mistral/Yi finetunes trained with a new (still undocumented) technique called "neural alignment" are blasting other models in the HF leaderboard. The 7B is "beating" most 70Bs. The 34B in testing seems... Very good:

https://huggingface.co/fblgit/una-xaberius-34b-v1beta

https://huggingface.co/fblgit/una-cybertron-7b-v2-bf16

I mention this because it could theoretically be applied to Mistral Moe. If the uplift is the same as regular Mistral 7B, and Mistral Moe is good, the end result is a scary good model.

This might be an inflection point where desktop-runnable OSS is really breathing down GPT-4's neck.

eurekin2y ago

I just played with 7b version. It really feels different than anything I tried before. It could explain a docker compose file. It generated a simple vue application component.

I asked around a bit about the example and it was strangely coherent and focused across the whole conversation. It was really well detecting, where I'm starting a new thread (without clearing a context) or referring to things before.

It caught me off guard as well with this:

> me: What does following mean [content of the docker compose]

> cybertron-7b: In the provided YAML configuration, "following" refers to specifying dependencies

I've never seen any model using my exact wording in quotes in conversation like that.

mark_l_watson2y ago

How did you run it? Are there model files in Ollama format? Are you running on NVidia or Apple Silicon?

EDIT: just saw this “ Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.”

3 more replies

brucethemoose22y ago

Yeah, the Yi version is quite something too.

nikvdp2y ago

This piqued my interest so I made an ollama modelfile of it for the smallest variant (from TheBloke's GGUF [1] version). It does indeed seem impressively gpt4-ish for such a small model! Feels more coherent than openhermes2.5-mistral which was my previous goto local llm.

If you have ollama installed you can try it out with `ollama run nollama/una-cybertron-7b-v2`.

[1]: https://huggingface.co/TheBloke/una-cybertron-7B-v2-GGUF

fblgit2y ago

Correct. UNA can align the MoE at multiple layers, experts, nearly any part of the neural network I would say. Xaberius 34B v1 "BETA".. is the king, and its just that.. the beta. I'll be focusing on the Mixtral, its a christmas gift.. modular in that way, thanks for the lab @mistral!

brucethemoose22y ago

Do a Yi 200K version as well! That would make my Christmas, as Mistral Moe is only maybe 32K.

inciampati2y ago

Do you have any docs describing the method?

stavros2y ago

Aren't LLM benchmarks at best irrelevant, at worst lying, at this point?

sbierwagen2y ago

If you don't like machine evaluations, you can take a look at the lmsys chatbot arena. You give a prompt, two chatbots answer anonymously, and you pick which answer is better: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

On the human ratings, three different 7B LLMs (Two different Openchat models and a Mistral fine tune) beat a version of GPT-3.5.

(The top 9 chatbots are GPT and Claude versions. Tenth place is a 70B model. While it's great that there's so much interest in 7B models, and it's incredible that people are pushing them so far, I selfishly wish more effort would go into 13B models... since those are the biggest that my macbook can run.)

2 more replies

puttycat2y ago

I wonder how it will rank on benchmarks which are password-protected to prevent test contamination, for example: https://github.com/taucompling/bliss

brucethemoose22y ago

Yes, absolutely. I was just preaching this.

But its not totally irrelevant. They are still a datapoint to consider with some performance correlation. YMMV, but these models actually seem to be quite good for the size in my initial testing.

typon2y ago

Yes. The only thing that is relevant is a hidden benchmark that's never released and run by a trusted third party.

nabakin2y ago

More or less. The automated benchmarks themselves can be useful when you weed out the models which are overfitting to them.

Although, anyone claiming a 7b LLM is better than a well trained 70b LLM like Llama 2 70b chat for the general case, doesn't know what they are talking about.

In the future will it be possible? Absolutely, but today we have no architecture or training methodology which would allow it to be possible.

You can rank models yourself with a private automated benchmark which models don't have a chance to overfit to or with a good human evaluation study.

Edit: also, I guess OP is talking about Mistral finetunes (ones overfitting to the benchmarks) beating out 70b models on the leaderboard because Mistral 7b is lower than Llama 2 70b chat.

4 more replies

screye2y ago

Yeah, and Mistral doesn't particularly care about lobotomizing the model with 'safety-training'. So it can achieve much better performance per-parameter than anthropic/google/OpenAI while being more steerable as well.

behnamoh2y ago

until Mistral gets too big for lawyers to ignore.

_boffin_2y ago

Interesting. One thing i noticed is that Mistral has a `max_position_embeddings` of ~32k while these have it at 4096.

Any thoughts on that?

brucethemoose22y ago

Is complicated.

The 7B model (cybertron) is trained on Mistral. Mistral is technically a 32K model, but it uses a sliding window beyond 32K, and for all practical purposes in current implementations it behaves like an 8K model.

The 34B model is based on Yi 34B, which is inexplicably marked as a 4K model in the config but actually works out to 32K if you literally just edit that line. Yi also has a 200K base model... and I have no idea why they didn't just train on that. You don't need to finetune at long context to preserve its long context ability.

1 more reply

whimsicalism2y ago

DPO is pretty good as well.

I think that the '7b beating 70b' is mostly due to the fact that Mistral is likely trained on considerably more tokens than Chinchilla optimal. So is llama-70b, but not to the same degree.

3abiton2y ago

HF leaderboards are rarely reflective of real world performance especially in small variations, but nonetheless, this is promising. What are the HW requirements for this latest Mistral7B?

eyegor2y ago

Any 7b can run well (~50 tok/s) on an 8gb gpu if you tune the context size. 13b can sometimes run well but typically you'll end up with a tiny context window or slow inference. For cpu, I wouldn't recommend going above 1.3b unless you don't mind waiting around.

1 more reply

brucethemoose22y ago

> What are the HW requirements for this latest Mistral7B

Pretty much anything with ~6-8GB of memory that's not super old.

It will run on my 6GB laptop RTX 2060 extremely quickly. It will run on my IGP or Phone with MLC-LLM. It will run fast on a laptop with a small GPU, with the rest offloaded to CPU.

Small, CPU only servers are kinda the only questionable thing. It runs, just not very fast, especially with long prompts (which are particularly hard for CPUs). There's also not a lot of support for AI ASICs.

swyx2y ago

what is neural alignment? who came up with it?

brucethemoose22y ago

@fblgit apparently, from earlier in this thread.

kcorbitt2y ago· 18 in thread

No public statement from Mistral yet. What we know:

- Mixture of Experts architecture.

- 8x 7B parameters experts (potentially trained starting with their base 7B model?).

- 96GB of weights. You won't be able to run this on your home GPU.

MacsHeadroom2y ago

That is only 24GB in 4bit.

People are running models 2-4 times that size on local GPUs.

What's more, this will run on a MacBook CPU just fine-- and at an extremely high speed.

brucethemoose22y ago

Yeah, 70B is much larger and fits on a 24GB, admitedly with very lossy quantization.

This is just about right for 24GB. I bet that is intentional on their part.

coder5432y ago

> 96GB of weights. You won't be able to run this on your home GPU.

This seems like a non-sequitur. Doesn't MoE select an expert for each token? Presumably, the same expert would frequently be selected for a number of tokens in a row. At that point, you're only running a 7B model, which will easily fit on a GPU. It will be slower when "swapping" experts if you can't fit them all into VRAM at the same time, but it shouldn't be catastrophic for performance in the way that being unable to fit all layers of an LLM is. It's also easy to imagine caching the N most recent experts in VRAM, where N is the largest number that still fits into your VRAM.

ttul2y ago

Someone smarter will probably correct me, but I don’t think that is how MoE works. With MoE, a feed-forward network assesses the tokens and selects the best two of eight experts to generate the next token. The choice of experts can change with each new token. For example, let’s say you have two experts that are really good at answering physics questions. For some of the generation, those two will be selected. But later on, maybe the context suggests you need two models better suited to generate French language. This is a silly simplification of what I understand to be going on.

3 more replies

read_if_gay_2y ago

however, if you need to swap experts on each token, you might as well run on cpu.

1 more reply

numeri2y ago

You're not necessarily wrong, but I'd imagine this is almost prohibitively slow. Also, this model seems to use two experts per token.

tarruda2y ago

I will be super happy if this is true.

Even if you can't fit all of them in the VRAM, you could load everything in tmpfs, which at least removes disk I/O penalty.

1 more reply

tarruda2y ago

Theoretically it could fit into a single 24GB GPU if 4-bit quantized. Exllama v2 has even more efficient quantization algorithm, and was able to fit 70B models in 24GB gpu, but only with 2048 tokens of context.

jlokier2y ago

> - 96GB of weights. You won't be able to run this on your home GPU.

You can these days, even in a portable device running on battery.

96GB fits comfortably in some laptop GPUs released this year.

refulgentis2y ago

This is extremely misleading. source: been working in local LLMs since 10 months ago. Got my Mac laptop too. I'm bullish too. But we shouldn't breezily dismiss those concerns out of hand. In practice, it's single digit tokens a second on a $4500 laptop for a model with weights half this size (Llama 2 70B Q2 GGUF => 29 GB, Q8 => 36 GB)

2 more replies

michaelt2y ago

Be a lot cooler if you said what laptop, and how much quantisation you're assuming :)

2 more replies

shubb2y ago

>> You won't be able to run this on your home GPU.

Would this allow you to run each expert on a cheap commodity GPU card so that instead of using expensive 200GB cards we can use a computer with 8 cheap gaming cards in it?

dragonwriter2y ago

> Would this allow you to run each expert on a cheap commodity GPU card so that instead of using expensive 200GB cards we can use a computer with 8 cheap gaming cards in it?

I would think no differently than you can run a large regular model on a multiGPU setup (which people do!). Its still all one network even if not all of it is activated for each token, and since its much smaller than a 56B model, it seems like there are significant components of the network that are shared.

1 more reply

terafo2y ago

Yes, but you wouldn't want to do that. You will be able to run that on a single 24gb GPU by the end of this weekend.

1 more reply

miven2y ago

> You won't be able to run this on your home GPU.

As far as I understand in a MOE model only one/few experts are actually used at the same time, shouldn't the inference speed for this new MOE model be roughly the same as for a normal Mistral 7B then?

7B models have a reasonable throughput when ran on a beefy CPU, especially when quantized down to 4bit precision, so couldn't Mixtral be comfortably ran on a CPU too then, just with 8 times the memory footprint?

filterfiber2y ago

So this specific model ships with a default config of 2 experts per token.

So you need roughly two loaded in memory per token. Roughly the speed and memory of a 13B per token.

Only issues is that's per-token. 2 experts are choosen per token, which means if they aren't the same ones as the last token, you need to load them into memory.

So yeah to not be disk limited you'd need roughly 8 times the memory and it would run at the speed of a 13B model.

~~~Note on quantization, iirc smaller models lose more performance when quantized vs larger models. So this would be the speed of a 4bit 13B model but with the penalty from a 4bit 7B model.~~~ Actually I have zero idea how quantization scales for MoE, I imagine it has the penalty I mentioned but that's pure speculation.

faldore2y ago

at 4 bits you could run it on a 3090 right?

brucethemoose22y ago

Its crazy how the 3090 is such a ubiquitous local llm card these days. I despise Nvidia on linux... And yet I ended up with a 3090.

How are AMD/Intel totally missing this boat?

1 more reply

MyFirstSass2y ago· 10 in thread

Hot take but Mistral 7B is the actual state of the art of LLM's.

ChatGPT 4 is amazing yes and i've been a day 1 subscriber, but it's huge, runs on server farms far away and is more or less a black box.

Mistral is tiny, and amazingly coherent and useful for it's size for both general questions and code, uncensored, and a leap i wouldn't have believed possible in just a year.

I can run it on my Macbook Air at 12tkps, can't wait to try this on my desktop.

ipsum22y ago

State of the art for something you can run on a Macbook air, but not state of the art for LLMs, or even open source. Yi 34B and Llama2 70B still beat it.

MyFirstSass2y ago

True but it's ahead of the competition when size is considered, which is why i really look forward to their 13B, 33B models etc. because if they are as potent who knows what leaps we'll take soon.

I remember running llama1 33B 8 months ago that as i remember was on Mistral 7B's level while other 7B models were a rambling mess.

The jump in "potency" is what is so extreme.

emporas2y ago

Given that 50% of all information consumed in the internet is produced in the last 24 hours, smaller models could hold a serious advantage over bigger models.

If an LLM or a SmallLM can be retrained or fine-tuned constantly, every week or every day to incorporate recent information then outdated models trained a year or two years back hold no chance to keep up. Dunno about the licensing but OpenAI could incorporate a smaller model like Mistral7B into their GPT stack, re-train it from scratch every week, and charge the same as GPT-4. There are users who might certainly prefer the weaker, albeit updated models.

refulgentis2y ago

It's much easier to do RAG than try to shoehorn the entirety of the universe into 7B parameters every 24 hours. Mistral's great at being coherent and processing info at 7B, but you wouldn't want it as an oracle.

1 more reply

andy_xor_andrew2y ago

I am with you on this. Mistral 7B is amazingly good. There are finetunes of it (the Intel one, and Berkeley Starling) that feel like they are within throwing distance of gpt3.5T... at only 7B!

I was really hoping for a 13B Mistral. I'm not sure if this MOE will run on my 3090 with 24GB. Fingers crossed that quantization + offloading + future tricks will make it runnable.

MyFirstSass2y ago

True i've been using the OpenOrca finetune and just downloaded the new UNA Cybertron model both tuned on the Mistral base.

They are not far from GPT-3 logic wise i'd say if you consider the breadth of data, ie. very little in 7GB's; so missing other languages, niche topics and prose styles etc.

I honestly wouldn't be surprised if 13B would be indistinguishable from GPT-3.5 on some levels. And if that is the case - then coupled with the latest developments in decoding - like Ultrafastbert, Speculative, Jacobi, Lookahead etc. i honestly wouldn't be surprised to see local LLM's on current GPT-4 level within a few years.

tarruda2y ago

> I can run it on my Macbook Air at 12tkps, can't wait to try this on my desktop.

That seems kinda low, are you using Metal GPU acceleration with llama.cpp? I don't have a macbook, but saw some of the llama.cpp benchmarks that suggest it can reach close to 30tk/s with GPU acceleration.

MyFirstSass2y ago

Thanks for the tip. I'm on the M2 Air with 16 GB's of ram.

If anyone has faster than 12tkps on Air's let me know.

I'm using the LM Studio GUI over llama.cpp with the "Apple Metal GPU" option. Increasing CPU threads seemingly does nothing either without metal.

Ram usage hovers at 5.5GB with a q5_k_m of Mistral.

2 more replies

123yawaworht4562y ago

it really is. it feels at the very least equal to llama2 13b. if mistral 70b had existed and was as much an improvement over llama2 70b as it is at 7b size, it would definitely be on part with gpt3.5

nabakin2y ago

Not a hot take, I think you're right. If it was scaled up to 70b, I think it would be better than Llama 2 70b. Maybe if it was then scaled up to 180b and turned into a MoE it would be better than GPT-4.

mareksotak2y ago· 6 in thread

Some companies spend weeks on landing pages, demos and cute thought through promo videos and then there is Mistral, casually dropping a magnet link on Friday.

tananaev2y ago

I'm sure it's also a marketing move to build a certain reputation. Looks like it's working.

HlessClaudesman2y ago

Not geoblocking the entirety of Europe also makes them stand out like a ringmaster amongst clowns.

2 more replies

throwaway4aday2y ago

technically, it is marketing but at this level marketing is indistinguishable from shipping

tarruda2y ago

I'm curious about their business model.

jorge-d2y ago

Well so far their business model seems to be mostly centered about raising money[1]. I do hope they succeed in becoming a succesful contender against OpenAI.

[1] https://www.bloomberg.com/news/articles/2023-12-04/openai-ri...

1 more reply

nuz2y ago

They can make plenty by offering consulting fees for finetuning and general support around their models.

3 more replies

BryanLegend2y ago· 5 in thread

Andrej Karpathy's take:

New open weights LLM from @MistralAI

params.json: - hidden_dim / dim = 14336/4096 => 3.5X MLP expand - n_heads / n_kv_heads = 32/8 => 4X multiquery - "moe" => mixture of experts 8X top 2

Likely related code: https://github.com/mistralai/megablocks-public

Oddly absent: an over-rehearsed professional release video talking about a revolution in AI.

If people are wondering why there is so much AI activity right around now, it's because the biggest deep learning conference (NeurIPS) is next week.

https://twitter.com/karpathy/status/1733181701361451130

henrysg2y ago

> Oddly absent: an over-rehearsed professional release video talking about a revolution in AI.

crakenzak2y ago

> it's because the biggest deep learning conference (NeurIPS) is next week.

Can we expect some big announcements (new architectures, models, etc) at the conference from different companies? Sorry, not too familiar what the culture for research conferences is.

jbarrow2y ago

Typically not. Google as an example: the transformer paper (Vaswani et al., 2017) was arxiv'd in June of 2017, and NeurIPS (the conference in which it was published) was in December of that year; BERT (Devlin et al., 2019) was similarly arxiv'd before publication.

Recent announcements from companies tend to be even more divorced from conference dates, as they release anemic "Technical Reports" that largely wouldn't pass muster in a peer review.

GaggiX2y ago

>-hidden_dim / dim = 14336/4096 => 3.5X MLP expand

>- n_heads / n_kv_heads = 32/8 => 4X

These two are exactly the same as the old Mistral-7B

Der_Einzige2y ago

Also, because EMNLP 2023 is happening right now.

tarruda2y ago· 5 in thread

Still 7B, but now with 32k context. Looking forward to see how it compares with the previous one, and what the community does with it.

MacsHeadroom2y ago

Not 7B, 8x7B.

It will run with the speed of a 7B model while being much smarter but requiring ~24GB of RAM instead of ~4GB (in 4bit).

dragonwriter2y ago

Given the config parametes posted, its 2 experts per token, so the conputation cost per token should be the cost of the conponent that selects experts + 2× cost of a 7B model.

2 more replies

brucethemoose22y ago

We can't infer the actual context size from the config.

Mistral 7B is basically an 8K model, but was marked as a 32K one.

seydor2y ago

unfortunately too big for the broader community to test. Will be very interesting to see how well it performs compared to the large models

brucethemoose22y ago

Not really, looks like a ~40B class model which is very runnable.

1 more reply

nulld3v2y ago· 4 in thread

Looks to be Mixture of Experts, here is the params.json:

    {
        "dim": 4096,
        "n_layers": 32,
        "head_dim": 128,
        "hidden_dim": 14336,
        "n_heads": 32,
        "n_kv_heads": 8,
        "norm_eps": 1e-05,
        "vocab_size": 32000,
        "moe": {
            "num_experts_per_tok": 2,
            "num_experts": 8
        }
    }

sockaddr2y ago

What does expert mean in this context?

moffkalast2y ago

It means it's 8 7B models in a trench coat in a sense, it runs as fast as a 14B (2 experts at a time apparently) but takes up as much memory as a 40B model (70% * 8 * 7B). There is some process trained into it that chooses which experts to use based on the question posed. GPT 4 is allegedly based on the same architecture, but at 8*222B.

7 more replies

sp3322y ago

I don't see any code in there. What runtime could load these weights?

brucethemoose22y ago

Its presumably llama just like Mistral.

Everything open source is llama now. Facebook all but standardized the architecture.

I dunno about the moe. Is there existing transformers code for that part? It kinda looks like there is based on the config.

2 more replies

leobg2y ago· 4 in thread

I love Mistral.

It’s crazy what can be done with this small model and 2 hours of fine tuning.

Chatbot with function calling? Check.

90 +% accuracy multi label classifier, even when you only have 15 examples for each label? Check.

Craaaazy powerful.

leodriesch2y ago

Could you link me to a finetune optimized for function calling? I was looking for one a few weeks ago but did not find any.

leobg2y ago

See sibling comment.

jeanloolz2y ago

Can you point me to a function calling fine tune mistral model? This is the only feature that keeps me from migrating away from openai. I searched a few time but could not find anything in HG

leobg2y ago

Can’t share the model, since it was trained for a client. I don’t know if any public datasets exist. But Mistral will learn what you throw at it. So if you build a dataset of chat conversations that contains, say, answers in the form of {“answer”:”The answer”, “image”:”Prompt for stable diffusion”}, you’ll get a model that can generate images, and also will know when to use that capability. It’s insane how well that works.

maremmano2y ago· 4 in thread

Who know if I can run this on MBC Pro M3 max 128gb? at what TPS?

marci2y ago

If I understand correctly:

RAM Wise, you can easily run a 70b with 128GB, 8x7B is obviously less than that.

Compute wise, I suppose it would be a bit slower than running a 13b.

edit: "actually", I think it might be faster than a 13b. 8 random 7b ~= 115GB, Mixtral is under 90. I will have to wait for more info/understanding.

treprinum2y ago

I would say so based on LLaMA 2 70B; if it's 8x inference in MoE then I guess you'd see <20 tokens/sec?

M4v3R2y ago

Big chance that you’ll be able to run it using Ollama app soon enough.

deoxykev2y ago

I would like to know this as well.

aubanel2y ago· 3 in thread

Mistral sure does not bother too much with explanations, but this style gives me much more confidence in the product than Google's polished, corporate, soulless announcement of Gemini!

brucethemoose22y ago

I will take weights over docs.

Its does remind me how some Google employee was bragging that they disclosed the weights for the Gemini, and only the small mobile Gemini, as if that's a generous step over other companies.

refulgentis2y ago

I don't think that's true, because quite simply, they have not.

I am 100% in agreement with your viewpoint, but feel squeamish seeing an un-needed lie coupled to it to justify it. Just so much Othering these days.

1 more reply

whimsicalism2y ago

they did not disclose the weights for any gemini, you must have misunderstood

maremmano2y ago· 3 in thread

Do you need some fancy announcement? let's do it the 90s way: https://twitter.com/erhartford/status/1733159666417545641/ph...

eurekin2y ago

I find that a way more bold and confident than dropping a obviously manipulated and unrealistic marketing page or video

maremmano2y ago

Frankly I don't know why Google continues to act this way. Let's remind the "Google Duplex: A.I. Assistant Calls Local Businesses To Make Appointments" story. https://www.youtube.com/watch?v=D5VN56jQMWM

Not that this affects Google's user base in any way, at the moment.

2 more replies

seydor2y ago

FILE_ID.DIZ

politician2y ago· 3 in thread

Honest question: Why isn't this on Huggingface? Is this one a leaked model with a questionable training or alignment methodology?

EDIT: I mean, I guess they didn't hack their own twitter account, but still.

kcorbitt2y ago

It'll be on Huggingface soon. This is how they dropped their original 7B model as well. It's a marketing thing, but it works!

politician2y ago

Ah, well, ok. I appreciate the torrent link -- much faster distribution.

1 more reply

politician2y ago

@kcorbitt Low priority, probably not worth an email: Does using OpenPipe.ai to fine-tune a model result in a downloadable LoRA adapter? It's not clear from the website if the fine-tune is exportable.

seydor2y ago· 2 in thread

looks like they're too busy being awesome. i need a fake video to understand this!

What memory will this need? I guess it won't run on my 12GB of vram

"moe": {"num_experts_per_tok": 2, "num_experts": 8}

I bet many people will re-discover bittorrent tonight

brucethemoose22y ago

Looks like it will squeeze into 24GB once the llama runtimes work it out.

Its also a good candidate for splitting across small GPUs, maybe.

One architecture I can envision is hosting prompt ingestion and the "host" model on the GPU and the downstream expert model weights on the CPU /IGP. This is actually pretty efficient, as the CPU/IGP is really bad at the prompt ingestion but reasonably fast at ~14B token generation.

Llama.cpp all but already does this, I'm sure MLC will implement it as well.

syntaxing2y ago

BitTorrent was the craze when llama was leaked on torrent. Then Facebook started taking down all huggingface repos and a bunch of people transitioned to torrent released temporarily. llama 2 changed all this but it was a fun time.

YetAnotherNick2y ago· 2 in thread

86 GB. So it's likely a Mixture of experts model with 8 experts. Exciting.

tarruda2y ago

Damn, I was hoping it was still a single 7B model that I would be able to run on my GPU

renonce2y ago

You can, wait for a 4-bit quantized version

1 more reply

_uqgj2y ago· 2 in thread

multimodal? 32k context is pretty impressive, curious to test instructability

brucethemoose22y ago

MistralLite is already 32K, and Yi 200K actually works pretty well out to at least 75K (the most I tested)

civilitty2y ago

What kind of tests did you run out to that length? (Needle in haystack, summarization, structured data extraction, etc)

What is the max number of tokens in the output?

1 more reply

_fizz_buzz_2y ago· 1 in thread

Does anybody have a tutorial or documentation how I can run this and play around with this locally. A „getting started“ guide of sorts?

0cf8612b2e1e2y ago

Even better if a llamafile gets released.

cloudhan2y ago· 1 in thread

Might be the training code related with the model https://github.com/mistralai/megablocks-public/tree/pstock/m...

cloudhan2y ago

Mixtral-8x7B support --> Support new model

https://github.com/stanford-futuredata/megablocks/pull/45

asolidtime12y ago· 1 in thread

https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen/b...

Holy shit, this is some clever marketing.

Kinda wonder if any of their employees were part of the warez scene at some point.

userbinator2y ago

They certainly got that aesthetic right; the only thing that stands out (but might be a necessity) is using real names instead of handles.

yodsanklai2y ago· 1 in thread

Can anyone explain what this means?

ukuina2y ago

Possibly a huge leap forward in open-source model capability. GPT4's prowess supposedly comes from strong dataset + RLHF + MoE (Mixture of Experts).

Mixtral brings MoE to an already-powerful model.

sigmar2y ago

Not exactly similar companies in terms of their goals, but pretty hilarious to contrast this model announcement with Google's Gemini announcement two days ago.

cuuupid2y ago

Stark contrast with Google's "all demo no model" approach from earlier this week! Seems to be trained off Stanford's Megablocks: https://github.com/mistralai/megablocks-public

jpdus2y ago

We now have a (experimental) working HF version here: https://huggingface.co/DiscoResearch/mixtral-7b-8expert

manojlds2y ago

Google - Fake demo

Mistral - magnet link and that's it

fortunefox2y ago

Releasing a model with a magnet link and some ascii art gives me way more confidence in the product than any OpenAI blog post ever could.

Excited to play with this once it's somewhat documented on how to get it running on a dual 4090 Setup.

dzhulgakov2y ago

You can try Mixtral live at https://app.fireworks.ai/ (soon to be faster too)

Warning: the implementation might be off as there's no official one. We at Fireworks tried to reverse-engineer model architecture today with the help of awsome folks from the community. The generations look reasonably good, but there might be some details missing.

If you want to follow the reverse-engineering story: https://twitter.com/dzhulgakov/status/1733330954348085439

balnazzar2y ago

Might be relevant: https://twitter.com/dzhulgakov/status/1733217065811742863.

Anyway, if the vanilla version requires 2x80gb cards, I wonder how would it run on a M2 Ultra 192gb Mac Studio.

Anyone having the machine could try?

udev40962y ago

https://nitter.rawbit.ninja/MistralAI/status/173315051239503...

smlacy2y ago

https://nitter.net/MistralAI/status/1733150512395038967

lxe2y ago

If anyone can help running this, would be appreciated. Resources so far:

- https://github.com/dzhulgakov/llama-mistral

swah2y ago

Kinda following all this stuff from outside w/o really understanding, but why are these things released like this, instead of "competing ChatGPTs apps" with higher and higher quality/costs? Could be open sourced but also hosted version that is maybe 5 usd/minute - if the results are great I guess people would pay the fair price...

Is it mainly because its hard to apply the limitations so that it doesn't spit out bomb making instructions?

udev40962y ago

based mistral casually dropping a magnet link

sergiotapia2y ago

Stuck on "Retrieving data" from the Magnet link and "Downloading metadata" when adding the magnet to the download list.

I had to manually add these trackers and now it works: https://gist.github.com/mcandre/eab4166938ed4205bef4

Jayakumark2y ago

https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen

1 more reply

stevebmark2y ago

Mistral Mixtral Model Magnet

lagniappe2y ago

Magnet link says invalid for me

ahmetkca2y ago

Let’s go multimodal

poulpy1232y ago

is it eight 7b models in a trench coat ?

j / k navigate · click thread line to collapse

239 comments

135 comments · 37 top-level

brucethemoose22y ago· 23 in thread

https://huggingface.co/fblgit/una-xaberius-34b-v1beta

https://huggingface.co/fblgit/una-cybertron-7b-v2-bf16

I mention this because it could theoretically be applied to Mistral Moe. If the uplift is the same as regular Mistral 7B, and Mistral Moe is good, the end result is a scary good model.

This might be an inflection point where desktop-runnable OSS is really breathing down GPT-4's neck.

eurekin2y ago

I just played with 7b version. It really feels different than anything I tried before. It could explain a docker compose file. It generated a simple vue application component.

It caught me off guard as well with this:

> me: What does following mean [content of the docker compose]

> cybertron-7b: In the provided YAML configuration, "following" refers to specifying dependencies

I've never seen any model using my exact wording in quotes in conversation like that.

mark_l_watson2y ago

How did you run it? Are there model files in Ollama format? Are you running on NVidia or Apple Silicon?

EDIT: just saw this “ Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA.”

3 more replies

brucethemoose22y ago

Yeah, the Yi version is quite something too.

nikvdp2y ago

If you have ollama installed you can try it out with `ollama run nollama/una-cybertron-7b-v2`.

[1]: https://huggingface.co/TheBloke/una-cybertron-7B-v2-GGUF

fblgit2y ago

brucethemoose22y ago

Do a Yi 200K version as well! That would make my Christmas, as Mistral Moe is only maybe 32K.

inciampati2y ago

Do you have any docs describing the method?

stavros2y ago

Aren't LLM benchmarks at best irrelevant, at worst lying, at this point?

sbierwagen2y ago

On the human ratings, three different 7B LLMs (Two different Openchat models and a Mistral fine tune) beat a version of GPT-3.5.

2 more replies

puttycat2y ago

I wonder how it will rank on benchmarks which are password-protected to prevent test contamination, for example: https://github.com/taucompling/bliss

brucethemoose22y ago

Yes, absolutely. I was just preaching this.

But its not totally irrelevant. They are still a datapoint to consider with some performance correlation. YMMV, but these models actually seem to be quite good for the size in my initial testing.

typon2y ago

Yes. The only thing that is relevant is a hidden benchmark that's never released and run by a trusted third party.

nabakin2y ago

More or less. The automated benchmarks themselves can be useful when you weed out the models which are overfitting to them.

Although, anyone claiming a 7b LLM is better than a well trained 70b LLM like Llama 2 70b chat for the general case, doesn't know what they are talking about.

In the future will it be possible? Absolutely, but today we have no architecture or training methodology which would allow it to be possible.

You can rank models yourself with a private automated benchmark which models don't have a chance to overfit to or with a good human evaluation study.

Edit: also, I guess OP is talking about Mistral finetunes (ones overfitting to the benchmarks) beating out 70b models on the leaderboard because Mistral 7b is lower than Llama 2 70b chat.

4 more replies

screye2y ago

behnamoh2y ago

until Mistral gets too big for lawyers to ignore.

_boffin_2y ago

Interesting. One thing i noticed is that Mistral has a `max_position_embeddings` of ~32k while these have it at 4096.

Any thoughts on that?

brucethemoose22y ago

Is complicated.

1 more reply

whimsicalism2y ago

DPO is pretty good as well.

I think that the '7b beating 70b' is mostly due to the fact that Mistral is likely trained on considerably more tokens than Chinchilla optimal. So is llama-70b, but not to the same degree.

3abiton2y ago

HF leaderboards are rarely reflective of real world performance especially in small variations, but nonetheless, this is promising. What are the HW requirements for this latest Mistral7B?

eyegor2y ago

1 more reply

brucethemoose22y ago

> What are the HW requirements for this latest Mistral7B

Pretty much anything with ~6-8GB of memory that's not super old.

It will run on my 6GB laptop RTX 2060 extremely quickly. It will run on my IGP or Phone with MLC-LLM. It will run fast on a laptop with a small GPU, with the rest offloaded to CPU.

swyx2y ago

what is neural alignment? who came up with it?

brucethemoose22y ago

@fblgit apparently, from earlier in this thread.

kcorbitt2y ago· 18 in thread

No public statement from Mistral yet. What we know:

- Mixture of Experts architecture.

- 8x 7B parameters experts (potentially trained starting with their base 7B model?).

- 96GB of weights. You won't be able to run this on your home GPU.

MacsHeadroom2y ago

That is only 24GB in 4bit.

People are running models 2-4 times that size on local GPUs.

What's more, this will run on a MacBook CPU just fine-- and at an extremely high speed.

brucethemoose22y ago

Yeah, 70B is much larger and fits on a 24GB, admitedly with very lossy quantization.

This is just about right for 24GB. I bet that is intentional on their part.

coder5432y ago

> 96GB of weights. You won't be able to run this on your home GPU.

ttul2y ago

3 more replies

read_if_gay_2y ago

however, if you need to swap experts on each token, you might as well run on cpu.

1 more reply

numeri2y ago

You're not necessarily wrong, but I'd imagine this is almost prohibitively slow. Also, this model seems to use two experts per token.

tarruda2y ago

I will be super happy if this is true.

Even if you can't fit all of them in the VRAM, you could load everything in tmpfs, which at least removes disk I/O penalty.

1 more reply

tarruda2y ago

jlokier2y ago

> - 96GB of weights. You won't be able to run this on your home GPU.

You can these days, even in a portable device running on battery.

96GB fits comfortably in some laptop GPUs released this year.

refulgentis2y ago

2 more replies

michaelt2y ago

Be a lot cooler if you said what laptop, and how much quantisation you're assuming :)

2 more replies

shubb2y ago

>> You won't be able to run this on your home GPU.

Would this allow you to run each expert on a cheap commodity GPU card so that instead of using expensive 200GB cards we can use a computer with 8 cheap gaming cards in it?

dragonwriter2y ago

> Would this allow you to run each expert on a cheap commodity GPU card so that instead of using expensive 200GB cards we can use a computer with 8 cheap gaming cards in it?

1 more reply

terafo2y ago

Yes, but you wouldn't want to do that. You will be able to run that on a single 24gb GPU by the end of this weekend.

1 more reply

miven2y ago

> You won't be able to run this on your home GPU.

filterfiber2y ago

So this specific model ships with a default config of 2 experts per token.

So you need roughly two loaded in memory per token. Roughly the speed and memory of a 13B per token.

Only issues is that's per-token. 2 experts are choosen per token, which means if they aren't the same ones as the last token, you need to load them into memory.

So yeah to not be disk limited you'd need roughly 8 times the memory and it would run at the speed of a 13B model.

faldore2y ago

at 4 bits you could run it on a 3090 right?

brucethemoose22y ago

Its crazy how the 3090 is such a ubiquitous local llm card these days. I despise Nvidia on linux... And yet I ended up with a 3090.

How are AMD/Intel totally missing this boat?

1 more reply

MyFirstSass2y ago· 10 in thread

Hot take but Mistral 7B is the actual state of the art of LLM's.

ChatGPT 4 is amazing yes and i've been a day 1 subscriber, but it's huge, runs on server farms far away and is more or less a black box.

Mistral is tiny, and amazingly coherent and useful for it's size for both general questions and code, uncensored, and a leap i wouldn't have believed possible in just a year.

I can run it on my Macbook Air at 12tkps, can't wait to try this on my desktop.

ipsum22y ago

State of the art for something you can run on a Macbook air, but not state of the art for LLMs, or even open source. Yi 34B and Llama2 70B still beat it.

MyFirstSass2y ago

True but it's ahead of the competition when size is considered, which is why i really look forward to their 13B, 33B models etc. because if they are as potent who knows what leaps we'll take soon.

I remember running llama1 33B 8 months ago that as i remember was on Mistral 7B's level while other 7B models were a rambling mess.

The jump in "potency" is what is so extreme.

emporas2y ago

Given that 50% of all information consumed in the internet is produced in the last 24 hours, smaller models could hold a serious advantage over bigger models.

refulgentis2y ago

1 more reply

andy_xor_andrew2y ago

I am with you on this. Mistral 7B is amazingly good. There are finetunes of it (the Intel one, and Berkeley Starling) that feel like they are within throwing distance of gpt3.5T... at only 7B!

I was really hoping for a 13B Mistral. I'm not sure if this MOE will run on my 3090 with 24GB. Fingers crossed that quantization + offloading + future tricks will make it runnable.

MyFirstSass2y ago

True i've been using the OpenOrca finetune and just downloaded the new UNA Cybertron model both tuned on the Mistral base.

They are not far from GPT-3 logic wise i'd say if you consider the breadth of data, ie. very little in 7GB's; so missing other languages, niche topics and prose styles etc.

tarruda2y ago

> I can run it on my Macbook Air at 12tkps, can't wait to try this on my desktop.

MyFirstSass2y ago

Thanks for the tip. I'm on the M2 Air with 16 GB's of ram.

If anyone has faster than 12tkps on Air's let me know.

I'm using the LM Studio GUI over llama.cpp with the "Apple Metal GPU" option. Increasing CPU threads seemingly does nothing either without metal.

Ram usage hovers at 5.5GB with a q5_k_m of Mistral.

2 more replies

123yawaworht4562y ago

it really is. it feels at the very least equal to llama2 13b. if mistral 70b had existed and was as much an improvement over llama2 70b as it is at 7b size, it would definitely be on part with gpt3.5

nabakin2y ago

mareksotak2y ago· 6 in thread

Some companies spend weeks on landing pages, demos and cute thought through promo videos and then there is Mistral, casually dropping a magnet link on Friday.

tananaev2y ago

I'm sure it's also a marketing move to build a certain reputation. Looks like it's working.

HlessClaudesman2y ago

Not geoblocking the entirety of Europe also makes them stand out like a ringmaster amongst clowns.

2 more replies

throwaway4aday2y ago

technically, it is marketing but at this level marketing is indistinguishable from shipping

tarruda2y ago

I'm curious about their business model.

jorge-d2y ago

Well so far their business model seems to be mostly centered about raising money[1]. I do hope they succeed in becoming a succesful contender against OpenAI.

[1] https://www.bloomberg.com/news/articles/2023-12-04/openai-ri...

1 more reply

nuz2y ago

They can make plenty by offering consulting fees for finetuning and general support around their models.

3 more replies

BryanLegend2y ago· 5 in thread

Andrej Karpathy's take:

New open weights LLM from @MistralAI

params.json: - hidden_dim / dim = 14336/4096 => 3.5X MLP expand - n_heads / n_kv_heads = 32/8 => 4X multiquery - "moe" => mixture of experts 8X top 2

Likely related code: https://github.com/mistralai/megablocks-public

Oddly absent: an over-rehearsed professional release video talking about a revolution in AI.

If people are wondering why there is so much AI activity right around now, it's because the biggest deep learning conference (NeurIPS) is next week.

https://twitter.com/karpathy/status/1733181701361451130

henrysg2y ago

> Oddly absent: an over-rehearsed professional release video talking about a revolution in AI.

crakenzak2y ago

> it's because the biggest deep learning conference (NeurIPS) is next week.

Can we expect some big announcements (new architectures, models, etc) at the conference from different companies? Sorry, not too familiar what the culture for research conferences is.

jbarrow2y ago

Recent announcements from companies tend to be even more divorced from conference dates, as they release anemic "Technical Reports" that largely wouldn't pass muster in a peer review.

GaggiX2y ago

>-hidden_dim / dim = 14336/4096 => 3.5X MLP expand

>- n_heads / n_kv_heads = 32/8 => 4X

These two are exactly the same as the old Mistral-7B

Der_Einzige2y ago

Also, because EMNLP 2023 is happening right now.

tarruda2y ago· 5 in thread

Still 7B, but now with 32k context. Looking forward to see how it compares with the previous one, and what the community does with it.

MacsHeadroom2y ago

Not 7B, 8x7B.

It will run with the speed of a 7B model while being much smarter but requiring ~24GB of RAM instead of ~4GB (in 4bit).

dragonwriter2y ago

Given the config parametes posted, its 2 experts per token, so the conputation cost per token should be the cost of the conponent that selects experts + 2× cost of a 7B model.

2 more replies

brucethemoose22y ago

We can't infer the actual context size from the config.

Mistral 7B is basically an 8K model, but was marked as a 32K one.

seydor2y ago

unfortunately too big for the broader community to test. Will be very interesting to see how well it performs compared to the large models

brucethemoose22y ago

Not really, looks like a ~40B class model which is very runnable.

1 more reply

nulld3v2y ago· 4 in thread

Looks to be Mixture of Experts, here is the params.json:

    {
        "dim": 4096,
        "n_layers": 32,
        "head_dim": 128,
        "hidden_dim": 14336,
        "n_heads": 32,
        "n_kv_heads": 8,
        "norm_eps": 1e-05,
        "vocab_size": 32000,
        "moe": {
            "num_experts_per_tok": 2,
            "num_experts": 8
        }
    }

sockaddr2y ago

What does expert mean in this context?

moffkalast2y ago

7 more replies

sp3322y ago

I don't see any code in there. What runtime could load these weights?

brucethemoose22y ago

Its presumably llama just like Mistral.

Everything open source is llama now. Facebook all but standardized the architecture.

I dunno about the moe. Is there existing transformers code for that part? It kinda looks like there is based on the config.

2 more replies

leobg2y ago· 4 in thread

I love Mistral.

It’s crazy what can be done with this small model and 2 hours of fine tuning.

Chatbot with function calling? Check.

90 +% accuracy multi label classifier, even when you only have 15 examples for each label? Check.

Craaaazy powerful.

leodriesch2y ago

Could you link me to a finetune optimized for function calling? I was looking for one a few weeks ago but did not find any.

leobg2y ago

See sibling comment.

jeanloolz2y ago

Can you point me to a function calling fine tune mistral model? This is the only feature that keeps me from migrating away from openai. I searched a few time but could not find anything in HG

leobg2y ago

maremmano2y ago· 4 in thread

Who know if I can run this on MBC Pro M3 max 128gb? at what TPS?

marci2y ago

If I understand correctly:

RAM Wise, you can easily run a 70b with 128GB, 8x7B is obviously less than that.

Compute wise, I suppose it would be a bit slower than running a 13b.

edit: "actually", I think it might be faster than a 13b. 8 random 7b ~= 115GB, Mixtral is under 90. I will have to wait for more info/understanding.

treprinum2y ago

I would say so based on LLaMA 2 70B; if it's 8x inference in MoE then I guess you'd see <20 tokens/sec?

M4v3R2y ago

Big chance that you’ll be able to run it using Ollama app soon enough.

deoxykev2y ago

I would like to know this as well.

aubanel2y ago· 3 in thread

Mistral sure does not bother too much with explanations, but this style gives me much more confidence in the product than Google's polished, corporate, soulless announcement of Gemini!

brucethemoose22y ago

I will take weights over docs.

Its does remind me how some Google employee was bragging that they disclosed the weights for the Gemini, and only the small mobile Gemini, as if that's a generous step over other companies.

refulgentis2y ago

I don't think that's true, because quite simply, they have not.

I am 100% in agreement with your viewpoint, but feel squeamish seeing an un-needed lie coupled to it to justify it. Just so much Othering these days.

1 more reply

whimsicalism2y ago

they did not disclose the weights for any gemini, you must have misunderstood

maremmano2y ago· 3 in thread

Do you need some fancy announcement? let's do it the 90s way: https://twitter.com/erhartford/status/1733159666417545641/ph...

eurekin2y ago

I find that a way more bold and confident than dropping a obviously manipulated and unrealistic marketing page or video

maremmano2y ago

Not that this affects Google's user base in any way, at the moment.

2 more replies

seydor2y ago

FILE_ID.DIZ

politician2y ago· 3 in thread

Honest question: Why isn't this on Huggingface? Is this one a leaked model with a questionable training or alignment methodology?

EDIT: I mean, I guess they didn't hack their own twitter account, but still.

kcorbitt2y ago

It'll be on Huggingface soon. This is how they dropped their original 7B model as well. It's a marketing thing, but it works!

politician2y ago

Ah, well, ok. I appreciate the torrent link -- much faster distribution.

1 more reply

politician2y ago

@kcorbitt Low priority, probably not worth an email: Does using OpenPipe.ai to fine-tune a model result in a downloadable LoRA adapter? It's not clear from the website if the fine-tune is exportable.

seydor2y ago· 2 in thread

looks like they're too busy being awesome. i need a fake video to understand this!

What memory will this need? I guess it won't run on my 12GB of vram

"moe": {"num_experts_per_tok": 2, "num_experts": 8}

I bet many people will re-discover bittorrent tonight

brucethemoose22y ago

Looks like it will squeeze into 24GB once the llama runtimes work it out.

Its also a good candidate for splitting across small GPUs, maybe.

Llama.cpp all but already does this, I'm sure MLC will implement it as well.

syntaxing2y ago

YetAnotherNick2y ago· 2 in thread

86 GB. So it's likely a Mixture of experts model with 8 experts. Exciting.

tarruda2y ago

Damn, I was hoping it was still a single 7B model that I would be able to run on my GPU

renonce2y ago

You can, wait for a 4-bit quantized version

1 more reply

_uqgj2y ago· 2 in thread

multimodal? 32k context is pretty impressive, curious to test instructability

brucethemoose22y ago

MistralLite is already 32K, and Yi 200K actually works pretty well out to at least 75K (the most I tested)

civilitty2y ago

What kind of tests did you run out to that length? (Needle in haystack, summarization, structured data extraction, etc)

What is the max number of tokens in the output?

1 more reply

_fizz_buzz_2y ago· 1 in thread

Does anybody have a tutorial or documentation how I can run this and play around with this locally. A „getting started“ guide of sorts?

0cf8612b2e1e2y ago

Even better if a llamafile gets released.

cloudhan2y ago· 1 in thread

Might be the training code related with the model https://github.com/mistralai/megablocks-public/tree/pstock/m...

cloudhan2y ago

Mixtral-8x7B support --> Support new model

https://github.com/stanford-futuredata/megablocks/pull/45

asolidtime12y ago· 1 in thread

https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen/b...

Holy shit, this is some clever marketing.

Kinda wonder if any of their employees were part of the warez scene at some point.

userbinator2y ago

They certainly got that aesthetic right; the only thing that stands out (but might be a necessity) is using real names instead of handles.

yodsanklai2y ago· 1 in thread

Can anyone explain what this means?

ukuina2y ago

Possibly a huge leap forward in open-source model capability. GPT4's prowess supposedly comes from strong dataset + RLHF + MoE (Mixture of Experts).

Mixtral brings MoE to an already-powerful model.

sigmar2y ago

Not exactly similar companies in terms of their goals, but pretty hilarious to contrast this model announcement with Google's Gemini announcement two days ago.

cuuupid2y ago

Stark contrast with Google's "all demo no model" approach from earlier this week! Seems to be trained off Stanford's Megablocks: https://github.com/mistralai/megablocks-public

jpdus2y ago

We now have a (experimental) working HF version here: https://huggingface.co/DiscoResearch/mixtral-7b-8expert

manojlds2y ago

Google - Fake demo

Mistral - magnet link and that's it

fortunefox2y ago

Releasing a model with a magnet link and some ascii art gives me way more confidence in the product than any OpenAI blog post ever could.

Excited to play with this once it's somewhat documented on how to get it running on a dual 4090 Setup.

dzhulgakov2y ago

You can try Mixtral live at https://app.fireworks.ai/ (soon to be faster too)

If you want to follow the reverse-engineering story: https://twitter.com/dzhulgakov/status/1733330954348085439

balnazzar2y ago

Might be relevant: https://twitter.com/dzhulgakov/status/1733217065811742863.

Anyway, if the vanilla version requires 2x80gb cards, I wonder how would it run on a M2 Ultra 192gb Mac Studio.

Anyone having the machine could try?

udev40962y ago

https://nitter.rawbit.ninja/MistralAI/status/173315051239503...

smlacy2y ago

https://nitter.net/MistralAI/status/1733150512395038967

lxe2y ago

If anyone can help running this, would be appreciated. Resources so far:

- https://github.com/dzhulgakov/llama-mistral

swah2y ago

Is it mainly because its hard to apply the limitations so that it doesn't spit out bomb making instructions?

udev40962y ago

based mistral casually dropping a magnet link

sergiotapia2y ago

Stuck on "Retrieving data" from the Magnet link and "Downloading metadata" when adding the magnet to the download list.

I had to manually add these trackers and now it works: https://gist.github.com/mcandre/eab4166938ed4205bef4

Jayakumark2y ago

https://huggingface.co/someone13574/mixtral-8x7b-32kseqlen

1 more reply

stevebmark2y ago

Mistral Mixtral Model Magnet

lagniappe2y ago

Magnet link says invalid for me

ahmetkca2y ago

Let’s go multimodal

poulpy1232y ago

is it eight 7b models in a trench coat ?

j / k navigate · click thread line to collapse