MoE models don’t, in practice, selectively load experts on activation (and if a runtime were designed to do that, it would make them perform worse: the experts activated can differ from token to token, so you’d churn constantly, swapping portions of the model into and out of VRAM). But they do less computation per token for their size than monolithic models, so you can often get tolerable performance on CPU, or split between GPU and CPU at a ratio that would work poorly with a similarly sized monolithic model.
But, still, it’s going to need 262GB for weights (plus a variable amount for context) without quantization, and 66GB+ at 4-bit quantization.
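For reference, the arithmetic behind those figures is just parameters times bits per weight. A quick sketch (the ~131B parameter count is an assumption inferred from the 262GB fp16 figure, not something stated here; the KV cache for context is extra on top of this):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed parameter count, back-derived from 262 GB at 16 bits/weight.
n_params = 131e9

print(weight_memory_gb(n_params, 16))  # fp16/bf16: 262 GB
print(weight_memory_gb(n_params, 4))   # 4-bit quantization: ~65.5 GB
```

Note that real 4-bit quantization formats (e.g. GGUF's Q4 variants) carry some per-block overhead for scales, which is why practical totals land a bit above the raw 4-bit number.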