
swyx 2y ago

Do you mind teaching how to do CPU/GPU RAM math? All I know is 34B at 16-bit = 68GB total RAM needed (because 1B parameters at 8 bits each = 1GB, by definition), but I don't know how it splits between CPU and GPU, or whether the tradeoff in tok/s is acceptable.


2 comments · 2 top-level

brucethemoose2 2y ago

If you are doing ~4-bit quantization, a good rule of thumb is just under 1 gigabyte per 1B parameters, plus a little room for the operating system. Longer contexts require a bit more VRAM.

For reference, 4-bit LlamaV1 33B fits snugly on a 24GB GPU with 2K context with the exLLaMA backend, but it won't handle really long inputs.
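
To make the arithmetic in the question and in this rule of thumb concrete, here is a minimal sketch. It counts the weights only, so context cache and runtime overhead (the reason the rule of thumb lands above the raw figure) are not included. The function name `weight_memory_gb` is made up for illustration, and 1 GB is taken as 1e9 bytes.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width, and
    nothing else (context cache, activations, runtime buffers) is counted.
    """
    # billions of params * bits each / 8 bits per byte = billions of bytes = GB
    return params_billions * bits_per_param / 8


print(weight_memory_gb(34, 16))  # 68.0, the figure from the question
print(weight_memory_gb(33, 4))   # 16.5, leaving headroom on a 24GB GPU
```

The gap between 16.5GB of weights and a 24GB card is what the context and the backend's own buffers eat into, which is roughly why 2K context is described as a snug fit.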

Llama.cpp is pretty much the only backend that can offload to CPU efficiently. It's still quite fast and offers very flexible 3-5 bit quantization, with the leanest 3-bit quant just barely fitting LlamaV1 33B on my 6GB + 16GB laptop.
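
For the CPU/GPU split itself, a rough way to reason about it is per layer: llama.cpp lets you choose how many transformer layers go to the GPU (its `--n-gpu-layers` option), and the rest stay in system RAM. The sketch below is only an estimate under loud assumptions: all layers are treated as equal in size, the helper `split_layers` and its `reserve_gb` parameter are hypothetical, and the example numbers (60 layers, about 3.5 effective bits per weight, 1.5GB reserved for context and buffers) are illustrative guesses, not measurements from this thread.

```python
def split_layers(model_gb: float, n_layers: int, vram_gb: float,
                 reserve_gb: float = 1.0) -> tuple[int, float]:
    """Estimate how many layers fit in VRAM and how much stays in CPU RAM.

    Assumptions (not from the thread): layers are equally sized, and
    reserve_gb of VRAM is held back for context cache and buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Whole layers only, and never more layers than the model has.
    gpu_layers = min(n_layers, int(usable_gb // per_layer_gb))
    cpu_gb = model_gb - gpu_layers * per_layer_gb
    return gpu_layers, cpu_gb


# Illustrative: a 33B model at ~3.5 bits per weight on a 6GB GPU.
model_gb = 33 * 3.5 / 8
gpu_layers, cpu_gb = split_layers(model_gb, n_layers=60, vram_gb=6, reserve_gb=1.5)
print(f"{gpu_layers} layers on GPU, about {cpu_gb:.1f} GB left in system RAM")
```

The tok/s tradeoff is not captured here; this only says whether the pieces fit. In practice you would start from an estimate like this and adjust the layer count until the GPU stops running out of memory.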

sbierwagen 2y ago

People running LLMs on CPU are generally running them integer quantized, so they use fewer bits per parameter.
