undefined | Better HN

0 pointsMichaelRazum2y ago0 comments

If I may ask, how do you load such big models? 300gb seems like a lot to play around with.

0 comments

You're right, this model is going to be too big for most people to play around with. But to answer your question I have a 128GB of RAM in my M3 MacBook Pro, so I can use most of that for GPU inferencing. But still, this model is going to need to be heavily quantized for me to be able to use it. (fwiw, I probably wont try this one)

In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it. I suspect my computer might be able to run a 3 bit quant, but it might need to go down to 2 bits to have any kind of reasonable context length. But with quants that small I'd expect the model's performance to degrade well below that of Mixtral, so it probably isn't really even worth using. But we'll see; quantization is weird, some models perform better than others when quantized.

TMWNN2y ago

>In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it.

How quickly are new models available through Ollama?

Me10002y ago

Ollama is just a wrapper around llama.cpp, so when the gguf model files come out it'll be able to run on Ollama (assuming no llama.cpp patch is needed, but even if it is ollama is usually good at getting those updates out pretty quickly).

cjbprime2y ago

Few days max.

MichaelRazumOP2y ago

Thanks a lot for the hint :)! It awesome that it might run even on a MacBook, actually this is a reason to switch to Mac. Seems, there is nothing similar for a PC laptop with linux or windows.

Me10002y ago

No problem. I hope more people try these things out, it's the best way to push the industry forward! We can't let the researchers have all the fun.

Apple had plenty of reasons to move forward with their Apple Silicon CPUs and GPUs in the mac, but they really did seem to get lucky with the unified memory architecture. It was kind of just an artifact of their design, but ends up serving the needs of deep neural net models really well!

zozbot2342y ago

A top-of-the-line Mac Studio Ultra maxes out at 192GB currently. This is also a MoE model, so only a fraction of parameters have to be in RAM.

Me10002y ago

MoE doesn’t really help with the memory requirements for the reason mentioned in the other comment. But it does help with reducing the compute needed per inference. Which is good because the M3 Max and M2 Ultra don’t have the best GPUs. A 70B parameter model is pretty slow on my M3 Max, and this model has 86B activations per inference run.

EgoIncarnate2y ago

Each token generated may only use a subset of the parameters (86billion instead of 314billion), but the next generated token might use a different subset. If it's anything like Mixtral, it will switch between experts constantly. It helps with memory bandwidth, but all the parameters still need to be in RAM or it would be unbearably slow.

j / k navigate · click thread line to collapse

0 comments

Me10002y ago

TMWNN2y ago

>In the next week or two I expect we'll see a GGUF version of the weights (might need to wait for a patch to llama.cpp first), and someone will release super small quantizations of it.

How quickly are new models available through Ollama?

Me10002y ago

cjbprime2y ago

Few days max.

MichaelRazumOP2y ago

Thanks a lot for the hint :)! It awesome that it might run even on a MacBook, actually this is a reason to switch to Mac. Seems, there is nothing similar for a PC laptop with linux or windows.

Me10002y ago

No problem. I hope more people try these things out, it's the best way to push the industry forward! We can't let the researchers have all the fun.

zozbot2342y ago

A top-of-the-line Mac Studio Ultra maxes out at 192GB currently. This is also a MoE model, so only a fraction of parameters have to be in RAM.

Me10002y ago

EgoIncarnate2y ago

j / k navigate · click thread line to collapse