Also, what's the best way to benchmark a model to compare it with others? Are there any tools to use off-the-shelf to do that?
You would have to confirm with someone deeper in the ecosystem, but I think you should be able to run this new model as is against a llamafile?
My recent work optimizing CPU evaluation https://justine.lol/matmul/ may have come at just the right time. Mixtral 8x7b always worked best at Q5_K_M and higher, which is 31GB. So unless you've got 4x GeForce RTX 4090's in your computer, CPU inference is going to be the best chance you've got at running 8x22b at top fidelity.
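Rough memory math behind that 31GB figure (assuming Q5_K_M averages around 5.5 bits per weight — an approximation; actual llama.cpp quant mixes vary per tensor):

```python
# Back-of-envelope footprint for a quantized model.
# bits_per_weight ~5.5 for Q5_K_M is an assumption, not an official figure.
def quantized_size_gb(n_params_billion, bits_per_weight):
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# Mixtral 8x7B (~46.7B params) at ~5.5 bits/weight lands near the 31GB quoted.
print(round(quantized_size_gb(46.7, 5.5), 1))
```

Scale that to 8x22B (~141B params) and you can see why high-fidelity quants won't fit in consumer VRAM.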
Really easy to search huggingface for new models to test directly in the app.
I’m sure they are already working on it.
https://api.together.xyz/playground/language/mistralai/Mixtr...
Which has the link to the tweet instead of the profile:
Why would you want another 8x7b, if you already have it ...
Language support is one big thing that is missing from open models. I’ve only found one model that can do anything useful with Norwegian, which has never been an issue with GPT-4.

I think it might be the end for 24GB 4090 cards though :(
Not surprising since GPT-4 is still state-of-the-art and much bigger. Where Mistral has been particularly impressive is when you take the size of the model into account.
But unless you’re running bs=1 it will be painful vs 8x GPU as you’re almost certain to be activating most/all of the experts in a batch.
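A quick simulation shows why batching hits most experts (this assumes uniform top-2 routing over 8 experts, which is an idealization — real routers are not uniform):

```python
import random

# How many of the 8 experts does a batch of n tokens activate on average,
# if each token picks 2 of 8 experts uniformly at random? (Idealized model.)
def avg_experts_hit(batch_tokens, n_experts=8, top_k=2, trials=2000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        hit = set()
        for _ in range(batch_tokens):
            hit.update(rng.sample(range(n_experts), top_k))
        total += len(hit)
    return total / trials

for bs in (1, 4, 16):
    print(bs, round(avg_experts_hit(bs), 2))
```

At bs=1 you touch exactly 2 experts, but by bs=16 you're activating nearly all 8, so the MoE compute savings mostly evaporate.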
Really though if you're just looking to run models personally and not finetune (which requires monstrous amounts of VRAM), Macs are the way to go for this kind of mega model: Macs have unified memory between the GPU and CPU, and you can buy them with a lot of RAM. It'll be cheaper than trying to buy enough GPU VRAM. A Mac Studio with 192GB unified RAM is under $6k — two A6000s will run you over $9k and still only give you 96GB VRAM (and God help you if you try to build the equivalent system out of 4090s or A100s/H100s).
Or just rent the GPU time as needed from cloud providers like RunPod, although that may or may not be what you're looking for.
https://www.reddit.com/r/LocalLLaMA/comments/18ituzh/mixtral...
This model is apparently surprisingly good at chat, even though it is a base model, and will take part in it to some extent. It should be really interesting once it's fine-tuned.
For example on EQbench[0], Miqu[1], a leaked continued pretrain based on Llama 2, performs extremely similarly to the mistral-medium model their API offers.
Maybe they're thinking it'd be bad PR for them to release models they didn't create from scratch, or there is some contractual obligation preventing the release.
> Our mission is to make frontier AI ubiquitous, and to provide tailor-made AI to all the builders. This requires fierce independence, strong commitment to open, portable and customisable solutions, and an extreme focus on shipping the most advanced technology in limited time.
Edit: Ah, it's the wrong link. https://news.ycombinator.com/item?id=39986047
Thanks SushiHippie!
Edit: To add to this, I've had good luck getting solid output out of mixtral 8x7b at 3-bit, so that isn't small enough to completely kill the model's quality.
If these assumptions port over to 8x22B, then 8x22B has, at 281GB, sz_expert ≈ 13.8B.
I agree with the first one: (46.3 - 7) / 7 = 5.61B.
The second one doesn't match up: (281 - 22) / 7 = 37B, or (140.5 - 22) / 7 = 16.93B. Am I doing something wrong?
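Spelling out that same arithmetic (this uses the thread's "dense model plus 7 extra expert copies" assumption and the thread's parameter estimates, not official numbers):

```python
# Per-expert size estimate under the assumption that an 8-expert MoE is one
# dense model plus 7 extra copies of the expert FFN:
#   expert_delta = (total_params - dense_params) / 7
# All figures in billions of parameters, taken from the thread, not official.
def expert_delta(total_b, dense_b, extra_copies=7):
    return (total_b - dense_b) / extra_copies

print(round(expert_delta(46.3, 7), 2))    # 8x7B:  5.61
print(round(expert_delta(140.5, 22), 2))  # 8x22B: 16.93
```

So under this model the 8x22B expert delta comes out near 17B, not the ~13.8B quoted upthread — one of the two assumptions (total size or the "7 extra copies" structure) must be off.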
This is clearly an inferior model that they are willing to share for marketing purposes.
If it was an improvement over llama, sure, but it seems like just an ad for bad AI.
In fact I would go as far as saying llama2 isn’t that good compared to some of the most recent models.
I want to add Mistral support soon, probably via together.ai or a similar service.
https://twitter.com/MistralAILabs is their other Twitter account, which is very slightly more useful though still very low traffic.
It actually does what you tell it, and won't try to silently change your prompt to conform to a specific flavor of Californian hysterics, which is what OpenAI's products do.
Also, since it's a local model, your queries aren't being datamined nor can access to the service be revoked on a whim.