You fundamentally misunderstand the bottleneck of running large LLMs: it is memory, not compute. You cannot really make gains that way.
A 405B LLM has 405 billion parameters. If you run it at full precision (16-bit floats), each parameter takes up 2 bytes, which means you need 810GB of memory. If the model does not fit in RAM or GPU memory, it will swap to disk and be unusably slow.
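To make the arithmetic concrete, here is the back-of-the-envelope version (plain Python, weights only; it ignores KV cache, activations, and runtime overhead):

```python
params = 405e9       # 405 billion parameters
bytes_per_param = 2  # fp16/bf16: 2 bytes per parameter

# Memory needed just to hold the weights
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.0f} GB")  # -> 810 GB
```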
You can run the model at reduced precision to save memory, a technique called quantisation, but this degrades the quality of the responses. The exact amount of degradation depends on the task, the specific model, and its size; larger models seem to suffer slightly less. At 8 bits (1 byte) per parameter, output is pretty much as good as full precision; 4 bits per parameter is still good quality; 3 bits is noticeably worse; and 2 bits is often bad to unusable.
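For reference, here is what those quantisation levels work out to in memory for a 405B model (same rough math as above, weights only):

```python
params = 405e9  # 405 billion parameters

# Common quantisation levels, in bits per parameter
for bits in (16, 8, 4, 3, 2):
    gb = params * bits / 8 / 1e9
    print(f"{bits:>2} bits -> {gb:.0f} GB")

# 16 bits -> 810 GB
#  8 bits -> 405 GB
#  4 bits -> 202 GB
#  3 bits -> 152 GB
#  2 bits -> 101 GB
```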
With 128GB of RAM, zero overhead, and a 405B model, you would have to quantise to about 2.5 bits per parameter, which would noticeably degrade response quality.
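You can invert the same formula to see where the 2.5-bit figure comes from (assuming, optimistically, zero overhead):

```python
ram_bytes = 128e9  # 128 GB of RAM
params = 405e9     # 405 billion parameters

# Maximum bits per parameter that fit in that budget
max_bits = ram_bytes * 8 / params
print(f"{max_bits:.2f} bits per parameter")  # -> 2.53
```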
There is also model pruning, which removes parameters entirely, but it is far more experimental than quantisation, also degrades response quality, and I have not seen it used widely.