
0 points · koheripbal · 3y ago

Can I spend $5K and run it at home? What GPU(s) do I need?


5 comments · 3 top-level

turmeric_root · 3y ago · 2 in thread

The 7B model runs on a CUDA-compatible card with 16GB of VRAM (assuming your card supports 16-bit floats).

I only got the 30B model running on a 4 x Nvidia A40 setup, though.
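A back-of-the-envelope check on the 16GB figure, as a minimal sketch. It assumes 2 bytes per parameter (16-bit floats), takes "7B" as a nominal 7 billion parameters, and counts the weights only, so activations and any cache during generation come on top of this. The function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same width
    (2 bytes for 16-bit floats). Runtime overhead is not included.
    """
    return n_params * bytes_per_param / 1e9


# Nominal 7 billion parameters at 16-bit precision.
print(weight_memory_gb(7e9))
```

Under those assumptions the weights come in below 16GB, which matches the claim above; at 4 bytes per parameter (32-bit floats) the same estimate doubles and no longer fits.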

q1w2 · 3y ago

The 30B is 64.8GB and the A40s have 48GB of VRAM each, so does this mean you got it working on one GPU with an NVLink to a 2nd, or is it really running on all 4 A40s?

Is there a sub/forum/discord where folks talk about the nitty-gritty?

turmeric_root · 3y ago

> so does this mean you got it working on one GPU with an NVLink to a 2nd, or is it really running on all 4 A40s?

It's sharded across all 4 GPUs (as per the README here: https://github.com/facebookresearch/llama). I'd wait a few weeks to a month for people to settle on a solution for running the model; right now people are just throwing PyTorch code at the wall and seeing what sticks.
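The arithmetic behind the question in this sub-thread can be sketched as follows. It uses only the two figures quoted here (a 64.8GB checkpoint and 48GB per A40) and gives a lower bound for holding the weights; the function name is hypothetical. The number of GPUs actually used can be higher than this bound, for example when the code expects a fixed number of shards, as in the 4-GPU setup described above.

```python
import math


def min_gpus(checkpoint_gb: float, vram_per_gpu_gb: float) -> int:
    """Smallest number of GPUs whose combined VRAM can hold the weights.

    Assumption: weights split evenly across GPUs, with no allowance
    for activations or other runtime memory.
    """
    return math.ceil(checkpoint_gb / vram_per_gpu_gb)


# Figures quoted in the thread: 64.8GB checkpoint, 48GB per A40.
print(min_gpus(64.8, 48))
```

So one A40 is not enough for the weights alone, and two is the bare minimum before any runtime overhead is counted.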


gpm · 3y ago

In principle you can run it on just about any hardware with enough storage space. It's just a question of how fast it will run. This README has some benchmarks with a similar set of models (and the code even supports swapping data out to disk if needed): https://github.com/FMInference/FlexGen

And here are some benchmarks running OPT-175B purely on (a very beefy) CPU machine. Note that the biggest llama model is only 65.2B: https://github.com/FMInference/FlexGen/issues/24

px43 · 3y ago

As the models proliferate, I guess we'll be finding out soon. The torrent has been going pretty slow for me for the past couple hours, but it looks like there are a couple seeders, so eventually it'll hit that inflection point where there are enough seeders to give all the leechers full speed downloads.

Looking forward to the YouTube videos of random tinkerers seeing what sort of performance they can squeeze out of cheaper hardware.
