Where could I find a mapping of tokens/second vs hardware?
Here are some TGI 405B benchmarks that I did with the different quantized models:
https://x.com/danieldekok/status/1815814357298577718
The 405B model is very useful outside direct use in inference though, e.g. for generating synthetic data to train smaller models.
The $10k figure is likely roughly the minimum amount of money/hardware you'd need to run the model at acceptable speeds. Anything less requires compromising heavily on GPUs (e.g. Tesla P40s also have 24GB of VRAM, for half the price or less, but are much slower than 3090s), or running entirely on the CPU, which I don't think will be viable for this model even with gobs of RAM and CPU cores, just due to its sheer size.
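To see why the sheer size forces that kind of spend, here's a rough back-of-envelope sketch of the VRAM needed just to hold 405B parameters at common quantization widths, and how many 24GB cards (3090s or P40s) that implies. This is my own estimate, not a benchmark: it ignores KV cache, activations, and framework overhead, so real deployments need headroom beyond these numbers.

```python
import math

# Weights-only VRAM estimate for a 405B-parameter model.
# Ignores KV cache, activations, and runtime overhead.
PARAMS = 405e9

def weights_gb(bits_per_param: float) -> float:
    """GB of VRAM needed just to hold the weights at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    gb = weights_gb(bits)
    cards = math.ceil(gb / 24)  # minimum count of 24GB GPUs (3090/P40 class)
    print(f"{name}: ~{gb:.0f} GB weights -> at least {cards}x 24GB GPUs")
```

Even at 4-bit quantization the weights alone need roughly nine 24GB cards, which is why 3090-class builds for this model land around five figures.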
I would be curious to see relative failure rates over time of consumer vs Quadro cards as well.