You can purchase 80GB A100s right now for about $12.5k on the open market. I think the list price is $16k. I don't know what discount the big purchasers see, but 30% should be table stakes (which probably explains the $12.5k price), and 50% for the big boys wouldn't surprise me at all, based on my experience with other computing hardware.
So under the assumption that eight 80GB GPUs are required, we're talking about a one-time cost of somewhat more than $100k (for 8x 80GB A100 plus the host), plus power, not 6-7 figures annually. Huge difference!
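To make that arithmetic concrete, here is a rough back-of-envelope sketch. Only the $12.5k GPU price and the count of eight come from the figures above; the host price, power draw, and electricity rate are my own placeholder assumptions, so swap in your own numbers.

```python
# Back-of-envelope hardware cost. Every figure marked "assumed" is a
# placeholder for illustration, not a quoted or measured value.
GPU_PRICE_USD = 12_500    # open-market price quoted above
NUM_GPUS = 8              # the "8 GPUs required" assumption from above
HOST_PRICE_USD = 15_000   # assumed: chassis, CPUs, RAM, networking
GPU_POWER_W = 400         # assumed per-GPU draw under load
HOST_POWER_W = 800        # assumed draw for the rest of the box
USD_PER_KWH = 0.12        # assumed electricity price
HOURS_PER_YEAR = 24 * 365


def one_time_cost():
    """Up-front purchase cost in USD: GPUs plus host."""
    return NUM_GPUS * GPU_PRICE_USD + HOST_PRICE_USD


def annual_power_cost():
    """Yearly electricity cost in USD, assuming the box runs flat out 24/7."""
    total_kw = (NUM_GPUS * GPU_POWER_W + HOST_POWER_W) / 1000
    return total_kw * HOURS_PER_YEAR * USD_PER_KWH


if __name__ == "__main__":
    print(f"one-time: ${one_time_cost():,}")
    print(f"power/yr: ${annual_power_cost():,.0f}")
```

With these placeholder inputs it comes to $115k up front and roughly $4.2k a year in power, ignoring cooling, hosting, and staff.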
Evaluating it in a latency-limited regime, but without enough workload to enable meaningful batching, is truly a worst case. I admit that there are applications where you're stuck with that, but there are plenty that aren't.
Anyone in that regime should try to figure out how to get out of it. E.g. concurrently generating multiple completions can sometimes help you hide latency, at least to the extent that you're regenerating outputs because you were unhappy with the first sample.
> that can handle one user interacting with it at a time.
That bit I don't follow. The argument given there assumes no batching. You can do N samples concurrently at far less than N times the cost.
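A toy model of why that holds, as a sketch only: at small batch sizes a decoding step is typically dominated by reading the weights from memory, which happens once per step no matter how many samples share it. Both constants below are made-up illustrative values, and the model ignores KV-cache traffic and everything else that makes real systems messier.

```python
# Toy model of one decoding step. Both constants are assumed for
# illustration, not measured on any real hardware or model.
WEIGHT_LOAD_MS = 40.0        # assumed: time to stream the weights once per step
COMPUTE_MS_PER_SAMPLE = 1.0  # assumed: arithmetic time added by each sample


def step_time_ms(batch_size):
    """Time for one decoding step over a batch of the given size."""
    # Weights are read once per step regardless of batch size, so the step
    # is bound by the slower of the memory read and the batched arithmetic.
    return max(WEIGHT_LOAD_MS, batch_size * COMPUTE_MS_PER_SAMPLE)


def relative_cost(batch_size):
    """Cost of generating batch_size samples together, relative to one."""
    return step_time_ms(batch_size) / step_time_ms(1)


if __name__ == "__main__":
    for n in (1, 8, 32, 128):
        print(f"N={n:>3}: {relative_cost(n):.1f}x the cost of one sample")
```

Under these assumptions the extra samples are essentially free until the arithmetic catches up with the memory read, and even past that point N samples cost well under N times one sample.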
> OpenAI are charging
Ah the joys of having a monopoly!