I only got the 30b model running on a 4 x Nvidia A40 setup though.
Is there a sub/forum/discord where folks talk about the nitty-gritty?
It's sharded across all 4 GPUs (as per the readme here: https://github.com/facebookresearch/llama). I'd wait a few weeks to a month for people to settle on a solution for running the model. Right now people are just going to be throwing PyTorch code at the wall and seeing what sticks.
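For a rough sense of what sharding across 4 GPUs means for memory, here's a back-of-the-envelope sketch. The assumptions are mine, not from the readme: "30b" is taken as a nominal 30e9 parameters, weights are fp16 (2 bytes each), the split is perfectly even, and activations and any caches are ignored. The function name is made up for illustration.

```python
def weight_gib_per_gpu(n_params: float, bytes_per_param: int = 2, n_gpus: int = 1) -> float:
    """Approximate GiB of model weights each GPU holds under an even shard.

    Assumes weights only (no activations, no cache) and a perfectly even split.
    """
    total_bytes = n_params * bytes_per_param
    return total_bytes / n_gpus / 2**30


# Nominal 30e9 parameters, fp16 weights, sharded over 4 GPUs (all assumptions).
per_gpu = weight_gib_per_gpu(30e9, bytes_per_param=2, n_gpus=4)
print(f"{per_gpu:.1f} GiB of weights per GPU")
```

Under those assumptions that comes to roughly 14 GiB of weights per card. Real usage will be higher once activations and caches are counted.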
And here are some benchmarks running OPT-175B purely on (a very beefy) CPU machine. Note that the biggest LLaMA model is only 65.2B parameters: https://github.com/FMInference/FlexGen/issues/24
Looking forward to the YouTube videos of random tinkerers seeing what sort of performance they can squeeze out of cheaper hardware.