
nullc · 3y ago

> The hardware costs alone of running something like GPT 3.5 for real time results is 6-7 figures a year.

Can you unpack your thinking there? Even at 5% interest, for ownership costs to reach six figures a year you'd be talking about millions of dollars in hardware. Inference is just not that expensive, not even with gigantic models.
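A rough sketch of that arithmetic, treating the annual ownership cost as simple interest on the hardware price. The 5% rate and the six-figure floor come from the comment above; the function name and the decision to ignore depreciation are assumptions made for illustration.

```python
def implied_hardware_value(annual_cost_usd: int, interest_percent: int) -> float:
    """Hardware price whose yearly interest equals annual_cost_usd.

    Assumes ownership cost is interest only (no depreciation, no power).
    """
    return annual_cost_usd * 100 / interest_percent


# Smallest six-figure annual cost, at the 5% rate used in the comment.
low_end = implied_hardware_value(100_000, 5)
print(f"${low_end:,.0f} of hardware")
```

Under these assumptions, even the low end of "six figures a year" corresponds to about $2 million of hardware.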

To the extent that there is an operating cost (e.g. energy), it isn't incurred while the system is offline.

I don't know how big GPT 3.5 is, but I can _train_ LLaMA 65B on hardware at home and it is nowhere near that expensive.

cbzoiav · 3y ago

https://twitter.com/tomgoldsteincs/status/160019698195510069...

That's eight $200k GPUs + all the other hardware + power consumption for one instance. You could run it on cheaper hardware, but then you'll get nowhere near real-time output, which is required for the majority of the use cases not already handled well by much smaller models.

Even if Google/Microsoft are getting the hardware at a 50% reduction (bearing in mind these are already not consumer prices), it comes to about $1M in hardware alone, again for a single instance that can handle one user interacting with it at a time.

It makes a lot of the bespoke use cases people are getting excited about (i.e. anything with data privacy concerns) far from financially viable.

If you want a dedicated instance of full-capability ChatGPT, for example (32K context), OpenAI are charging $468k for a 3 month commitment / $1,584k for a year.

nullc (OP) · 3y ago

You can purchase 80GB A100s right now for about $12.5k on the open market. I think the list price is $16k. I don't know what discount the big purchasers see, but 30% should be table stakes (which probably explains that $12.5k price), and 50% for the big boys wouldn't be at all surprising to me based on my experience with other computing hardware.

So under the assumption that eight 80GB GPUs are required, we're talking about a one-time cost of somewhat more than $100k (for 8x 80GB A100 plus the host) plus power, not 6-7 figures annually. Huge difference!
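To make the gap concrete, here is a quick comparison of the two estimates in this thread. The GPU count and per-GPU prices are the figures quoted above; the $20k host price is a placeholder assumption, since no number was given for it.

```python
GPU_COUNT = 8
OPEN_MARKET_PRICE_USD = 12_500   # 80GB A100, open-market figure from this comment
PARENT_ESTIMATE_USD = 200_000    # per-GPU figure from the parent comment
HOST_PRICE_USD = 20_000          # assumption: placeholder, not from the thread

gpus_open_market = GPU_COUNT * OPEN_MARKET_PRICE_USD
gpus_parent = GPU_COUNT * PARENT_ESTIMATE_USD
one_time_total = gpus_open_market + HOST_PRICE_USD

print(f"GPUs at open-market price: ${gpus_open_market:,}")
print(f"GPUs at parent's estimate: ${gpus_parent:,}")
print(f"Ratio between estimates:   {gpus_parent // gpus_open_market}x")
print(f"One-time total with host:  ${one_time_total:,}")
```

With the assumed host price, the one-time total stays in the low six figures, and the two GPU estimates differ by a factor of 16.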

Evaluating it in a latency-limited regime, but without enough workload to enable meaningful batching, is truly a worst case. I admit that there are applications where you're stuck with that, but there are plenty that aren't.

Anyone in that regime should try to figure out how to get out of it. E.g. concurrently generating multiple completions can sometimes help you hide latency, at least to the extent that you're regenerating outputs because you were unhappy with the first sample.

> that can handle one user interacting with it at a time.

That bit I don't follow. The argument given there is without batching. You can do N samples concurrently at far less than N times the cost.
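A toy model of why that holds, assuming each generation step pays a fixed cost (e.g. reading the weights once) plus a small per-sample cost. Both millisecond constants are made-up illustrative values, not measurements of any real model.

```python
FIXED_MS_PER_STEP = 20   # assumption: cost paid once per step, whatever the batch size
EXTRA_MS_PER_SAMPLE = 1  # assumption: marginal cost of one more sample in the batch


def step_time_ms(batch_size: int) -> int:
    """Time for one generation step at the given batch size, under the toy model."""
    return FIXED_MS_PER_STEP + EXTRA_MS_PER_SAMPLE * batch_size


for n in (1, 8, 32):
    batched = step_time_ms(n)
    unbatched = n * step_time_ms(1)
    print(f"N={n:>2}: batched {batched} ms vs {unbatched} ms run one at a time")
```

In this model the cost per sample falls as N grows, because the fixed part is shared across the batch. How far that carries on real hardware depends on memory and compute limits that the sketch ignores.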

> OpenAI are charging

Ah the joys of having a monopoly!
