Where could I find a mapping of tokens/second vs hardware?
Here are some TGI 405B benchmarks that I did with the different quantized models:
https://x.com/danieldekok/status/1815814357298577718
The 405B model is very useful outside direct use in inference though, e.g. for generating synthetic data to train smaller models.
The $10k figure is likely roughly the minimum amount of money/hardware you'd need to run the model at acceptable speeds. Anything less requires compromising heavily on GPUs (e.g. Tesla P40s also have 24GB of VRAM, for half the price or less, but are much slower than 3090s), or running entirely on the CPU, which I don't think will be viable for this model even with gobs of RAM and CPU cores, just due to its sheer size.
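To see why the sheer size forces that kind of spend, here's a rough back-of-envelope sketch of the VRAM needed just to hold 405B parameters at common quantization widths, and how many 24GB cards (3090s or P40s) that implies. This is my own estimate, not a benchmark: it ignores KV cache, activations, and framework overhead, so real deployments need headroom beyond these numbers.

```python
import math

# Weights-only VRAM estimate for a 405B-parameter model.
# Ignores KV cache, activations, and runtime overhead.
PARAMS = 405e9

def weights_gb(bits_per_param: float) -> float:
    """GB of VRAM needed just to hold the weights at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    gb = weights_gb(bits)
    cards = math.ceil(gb / 24)  # minimum count of 24GB GPUs (3090/P40 class)
    print(f"{name}: ~{gb:.0f} GB weights -> at least {cards}x 24GB GPUs")
```

Even at 4-bit quantization the weights alone need roughly nine 24GB cards, which is why 3090-class builds for this model land around five figures.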
I would be curious to see relative failure rates over time of consumer vs Quadro cards as well.