Meta released their OPT model, which they claim is comparable to GPT-3. The guidance for running it [1] suggests a LOT of memory - at least 350GB of GPU memory, which is roughly 4-5 A100 80GB cards, and those are pricey.
Running this on AWS at that spec would cost about $25/hr for a single model instance - roughly $0.40 a minute. If the model takes a few seconds per request, you easily hit $0.05 per request once you factor in the rest of the infra (storage, CDN, etc.), the engineering cost, the research cost, and the fact that they probably scale to hundreds of instances for heavy traffic, which may mean less efficiently purchased servers.
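The back-of-envelope arithmetic behind that per-request figure can be sketched like this. All of the inputs (hourly rate, seconds per request, overhead multiplier) are assumptions, not measured numbers:

```python
# Back-of-envelope cost per request. Every input here is an assumption.
HOURLY_RATE = 25.0          # $/hr for ~4x A100 80GB on AWS (assumed)
SECONDS_PER_REQUEST = 3.0   # assumed generation time for one request
OVERHEAD_MULTIPLIER = 2.0   # infra, storage, CDN, engineering (assumed)

cost_per_second = HOURLY_RATE / 3600
gpu_cost = cost_per_second * SECONDS_PER_REQUEST
total = gpu_cost * OVERHEAD_MULTIPLIER
print(f"GPU cost/request: ${gpu_cost:.3f}, loaded cost: ${total:.3f}")
```

With those assumptions the raw GPU time is about 2 cents per request, and a modest overhead multiplier lands you in the ~$0.05 range the comment describes.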
OpenAI has a sweetheart deal with Azure, but this is roughly the cost structure for serving requests. And this doesn’t include the upfront cost of training.
Yet people these days believe something like the brain was bruteforced by nature into an accidental existence.
Stable Diffusion can run on a home PC, while it seems you need a supercomputer for GPT-3. I'm not sure that would have been my intuition.
For encoder-style models, the usual serving recipe is:
- run tokenization of inputs on the CPU
- sort inputs by length
- batch inputs of similar length and pad them to a uniform length
- pass the batches through so a single model can process many inputs in parallel.
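The steps above can be sketched in a few lines. This is a minimal illustration of length-sorted batching, not any particular serving framework's API:

```python
from typing import List

def make_batches(token_ids: List[List[int]], batch_size: int, pad_id: int = 0):
    """Sort sequences by length, group neighbours, pad each group to a uniform width."""
    order = sorted(range(len(token_ids)), key=lambda i: len(token_ids[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        group = [token_ids[i] for i in order[start:start + batch_size]]
        width = max(len(seq) for seq in group)
        batches.append([seq + [pad_id] * (width - len(seq)) for seq in group])
    return batches

# Similar-length inputs land in the same batch, so little compute is wasted on padding.
batches = make_batches([[1], [2, 3, 4], [5, 6], [7, 8, 9]], batch_size=2)
```

Each batch is a rectangular array, so a single forward pass processes all sequences in it at once.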
For GPT-style decoder models, however, this becomes much more challenging because inference requires a forward pass for every token generated. (Stopping criteria also may differ, but that's another tangent.)
Every generated token attends to every previous token - both the context (or “prompt”) and the previously generated tokens (important for self-consistency). This is a quadratic operation in the vanilla case.
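A quick way to see the quadratic blow-up: the token at step t attends over the prompt plus the t tokens generated so far, so total attention work is a sum of a growing series. A tiny sketch of that counting (positions attended to, ignoring per-position constants):

```python
def attention_ops(prompt_len: int, gen_len: int) -> int:
    """Total attended-to positions across a whole generation, vanilla case."""
    return sum(prompt_len + t for t in range(gen_len))

# Once generation dominates the prompt, doubling the output length
# roughly quadruples the total attention work.
short = attention_ops(prompt_len=0, gen_len=512)
long = attention_ops(prompt_len=0, gen_len=1024)
```

With prompt_len = 0 this is just the triangular number gen_len * (gen_len - 1) / 2, i.e. O(n²) in the number of generated tokens.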
Model sizes are large, often spanning multiple machines, and later layers depend on the outputs of earlier ones, so inference has to be pipelined.
The naive approach would be to have a single request processed exclusively by a single instance of the model. This is expensive! Even if each model can be crammed into a single A100, running something like Codex or ChatGPT for millions of users with low-latency inference would require thousands of GPUs preloaded with models, and each request would take a highly variable amount of time.
If a model spans n machines, you'd achieve at most 1/n utilization, because each shard has to stay loaded while the others process. And if you want to do pipeline parallelism as in PipeDream, you'd have to deal with attention caches, since you don't want to recompute every previous state on each step.
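The attention cache mentioned above (usually called a KV cache) is easiest to see in a toy single-head sketch. The projection matrices and dimensions here are made up for illustration; the point is that each decode step projects only the newest token and appends to the cache, rather than recomputing keys/values for the whole history:

```python
import numpy as np

def cached_decode_step(x_new, Wq, Wk, Wv, k_cache, v_cache):
    """One decode step with a KV cache: project only the newest token,
    append its key/value to the cache, and attend over everything cached."""
    q = x_new @ Wq                        # query for the new token only
    k_cache.append(x_new @ Wk)            # cache grows by one key...
    v_cache.append(x_new @ Wv)            # ...and one value per step
    K = np.stack(k_cache)                 # (t, d): all keys so far
    V = np.stack(v_cache)                 # (t, d): all values so far
    scores = K @ q / np.sqrt(len(q))      # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over cached positions
    return weights @ V                    # attention output for the new token
```

Without the cache, every step would re-project every previous token through Wk and Wv - exactly the recomputation a pipelined serving setup wants to avoid, at the price of keeping the cache resident on each shard.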
The answer is almost certainly "no." A service like ChatGPT is expensive because it requires heavy-duty GPU computation.
I don't think these numbers sound very out of line. It would be easier to gauge feasibility if we knew how fast those cards can execute the model: if one run takes a second, a few cents seems about right; if it takes a few milliseconds, it's a lot less than a few cents, unless Microsoft charges a huge premium for the servers.
Now there are different ways to achieve this, but in essence the model has to attend to the whole context at once, plus the instructions on how to handle it.
You can actually ask it to explain how you could create a natural language processing algorithm yourself, and it will even give you a starter framework in the language of your choice. Fair warning: for me it was about a six-hour rabbit hole :D
Read the papers.
Is the H100 deployed at Azure? I wonder how much more efficient that would be over A100s.
Memory shipped with computers has been stagnant for a decade.
We live in a really exciting age :). Local AI models will also finally give Microsoft reasons again to require hardware for coming Windows versions. Now they have to require obscure security chips and stuff but in the future they might have some local cortana thingy or something that requires a certain amount of computational power.
It will write 200 lines of code for me which would maybe take me a few hours. I have to spend 15 minutes cleaning it up, but still it saved me 80% of the time. It's a massive win.
Also great for writing articles or emails. I write what I want to say into ChatGPT and tell it to rewrite it to be pleasant and less harsh, and it does a great job of that.
For example, it gives you code. You run that code to see if the outputs are as expected.
This hubris over the top-level language in the system is so passé, so 2000s.