Also, what's the best way to benchmark a model to compare it with others? Are there any tools to use off-the-shelf to do that?
You would have to confirm with someone deeper in the ecosystem, but I think you should be able to run this new model as is against a llamafile?
My recent work optimizing CPU evaluation https://justine.lol/matmul/ may have come at just the right time. Mixtral 8x7b always worked best at Q5_K_M and higher, which is 31GB. So unless you've got 4x GeForce RTX 4090's in your computer, CPU inference is going to be the best chance you've got at running 8x22b at top fidelity.
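Rough memory math behind that 31GB figure (assuming Q5_K_M averages around 5.5 bits per weight — an approximation; actual llama.cpp quant mixes vary per tensor):

```python
# Back-of-envelope footprint for a quantized model.
# bits_per_weight ~5.5 for Q5_K_M is an assumption, not an official figure.
def quantized_size_gb(n_params_billion, bits_per_weight):
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# Mixtral 8x7B (~46.7B params) at ~5.5 bits/weight lands near the 31GB quoted.
print(round(quantized_size_gb(46.7, 5.5), 1))
```

Scale that to 8x22B (~141B params) and you can see why high-fidelity quants won't fit in consumer VRAM.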
Really easy to search huggingface for new models to test directly in the app.
I’m sure they are already working on it.
https://api.together.xyz/playground/language/mistralai/Mixtr...
Which has the link to the tweet instead of the profile:
Why would you want another 8x7b, if you already have it ...
Language support is one big thing that is missing from open models. I’ve only found one model that can do anything useful with Norwegian, which has never been an issue with GPT-4.

I think it might be the end for 24GB 4090 cards though :(
Not surprising since GPT-4 is still state-of-the-art and much bigger. Where Mistral has been particularly impressive is when you take the size of the model into account.
But unless you’re running bs=1 it will be painful vs 8x GPU as you’re almost certain to be activating most/all of the experts in a batch.
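A quick simulation shows why batching hits most experts (this assumes uniform top-2 routing over 8 experts, which is an idealization — real routers are not uniform):

```python
import random

# How many of the 8 experts does a batch of n tokens activate on average,
# if each token picks 2 of 8 experts uniformly at random? (Idealized model.)
def avg_experts_hit(batch_tokens, n_experts=8, top_k=2, trials=2000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        hit = set()
        for _ in range(batch_tokens):
            hit.update(rng.sample(range(n_experts), top_k))
        total += len(hit)
    return total / trials

for bs in (1, 4, 16):
    print(bs, round(avg_experts_hit(bs), 2))
```

At bs=1 you touch exactly 2 experts, but by bs=16 you're activating nearly all 8, so the MoE compute savings mostly evaporate.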
Really though if you're just looking to run models personally and not finetune (which requires monstrous amounts of VRAM), Macs are the way to go for this kind of mega model: Macs have unified memory between the GPU and CPU, and you can buy them with a lot of RAM. It'll be cheaper than trying to buy enough GPU VRAM. A Mac Studio with 192GB unified RAM is under $6k — two A6000s will run you over $9k and still only give you 96GB VRAM (and God help you if you try to build the equivalent system out of 4090s or A100s/H100s).
Or just rent the GPU time as needed from cloud providers like RunPod, although that may or may not be what you're looking for.
https://www.reddit.com/r/LocalLLaMA/comments/18ituzh/mixtral...
This model is apparently surprisingly good at chat, even though it is a base model, and will take part in it to some extent. It should be really interesting once it's fine-tuned.
For example on EQbench[0], Miqu[1], a leaked continued pretrain based on Llama 2, performs extremely similarly to the mistral-medium model their API offers.
Maybe they're thinking it'd be bad PR for them to release models they didn't create from scratch, or there is some contractual obligation preventing the release.
> Our mission is to make frontier AI ubiquitous, and to provide tailor-made AI to all the builders. This requires fierce independence, strong commitment to open, portable and customisable solutions, and an extreme focus on shipping the most advanced technology in limited time.
Edit: Ah, it's the wrong link. https://news.ycombinator.com/item?id=39986047
Thanks SushiHippie!
Edit: To add to this, I've had good luck getting solid output out of mixtral 8x7b at 3-bit, so that isn't small enough to completely kill the model's quality.
If these assumptions port over to 8x22B, then 8x22B has, at 281GB, sz_expert ≈ 13.8B.
I agree with the first one: (46.3 - 7) / 7 = 5.61B.
The second one doesn't match up: (281 - 22) / 7 = 37B, or (140.5 - 22) / 7 = 16.93B. Am I doing something wrong?
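Spelling out that same arithmetic (this uses the thread's "dense model plus 7 extra expert copies" assumption and the thread's parameter estimates, not official numbers):

```python
# Per-expert size estimate under the assumption that an 8-expert MoE is one
# dense model plus 7 extra copies of the expert FFN:
#   expert_delta = (total_params - dense_params) / 7
# All figures in billions of parameters, taken from the thread, not official.
def expert_delta(total_b, dense_b, extra_copies=7):
    return (total_b - dense_b) / extra_copies

print(round(expert_delta(46.3, 7), 2))    # 8x7B:  5.61
print(round(expert_delta(140.5, 22), 2))  # 8x22B: 16.93
```

So under this model the 8x22B expert delta comes out near 17B, not the ~13.8B quoted upthread — one of the two assumptions (total size or the "7 extra copies" structure) must be off.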
This is clearly an inferior model that they are willing to share for marketing purposes.
If it was an improvement over llama, sure, but it seems like just an ad for bad AI.
In fact I would go as far as saying llama2 isn’t that good compared to some of the most recent models.
I want to add Mistral support soon, probably via together.ai or a similar service.
https://twitter.com/MistralAILabs is their other Twitter account, which is very slightly more useful though still very low traffic.
It actually does what you tell it, and won't try to silently change your prompt to conform to a specific flavor of Californian hysterics, which is what OpenAI's products do.
Also, since it's a local model, your queries aren't being datamined nor can access to the service be revoked on a whim.