undefined | Better HN

0 pointsjcgrillo1y ago0 comments

> Cost as in, cost to you? Or cost to serve?

This. IIUC to serve an LLM is to perform an O(n^2) computation on the model weights for every single character of user input. These models are 40+GB so that means I need to provision about 40GB RAM per concurrent user and perform hundreds of TB worth of computations per query.

How much would I have to charge for this? Are there any products where the users would actually get enough value out of it to pay what it costs?

Compare to the cost of a user session in a normal database backed web app. Even if that session fans out thousands of backend RPCs across a hundred services, each of those calls executes in milliseconds and requires only a fraction of the LLM's RAM. So I can support thousands of concurrent users per node instead of one.

0 comments

6 comments · 2 top-level

simonw1y ago· 3 in thread

The efficiency gains over the past 18 months have been incredible. Turns out there was a lot of low hanging fruit to make these things faster, cheaper and more resource efficient. https://simonwillison.net/2024/Dec/31/llms-in-2024/#llm-pric...

jcgrilloOP1y ago

Interesting. There's obviously been a precipitous drop in the sticker price, but has there really been a concomitant efficiency increase? It's hard to believe the sticker price these companies are charging has anything to do with reality given how massively they're subsidized (free Azure compute, billions upon billions in cash, etc). Is this efficiency trend real? Do you know of any data demonstrating it?

simonw1y ago

I have personal anecdotal evidence that they're getting more efficient: I've had the same 64GB M2 laptop for three years now. Back in March 2023 it could just about run LLaMA 1, a rubbish model. Today I'm running Mistral Small 3 on the same hardware and it's giving me a March-2023-GPT-4-era experience and using just 12GB of RAM.

People who I trust in this space have consistently and credibly talked about these constant efficiency gains. I don't think this is a case of selling compute for less than it costs to run.

skydhash1y ago

People are comparing the current rush to the investments made in the early days of the internet while (purposely?) forgetting how expensive access to it was back then. Not saying that AI companies should make a profit today, but I don't see or hear that AI usage is becoming essentials in any way or form.

1 more reply

jsnell1y ago· 1 in thread

> IIUC to serve an LLM is to perform an O(n^2) computation on the model weights for every single character of user input.

The computations are not O(n^2) in terms of model weights (parameters), but linear. If it were quadratic, the number would be ludicrously large. Like, "it'll take thousands of years to process a single token" large.

(The classic transformers are quadratic on the context length, but that's a much smaller number. And it seems pretty obvious from the increases in context lengths that this is no longer the case in frontier models.)

> These models are 40+GB so that means I need to provision about 40GB RAM per concurrent user

The parameters are static, not mutated during the query. That memory can be shared between the concurrent users. The non-shared per-query memory usage is vastly smaller.

> How much would I have to charge for this?

Empirically, as little as 0.00001 cents per token.

For context, the Bing search API costs 2.5 cents per query.

jcgrilloOP1y ago

Ah got it, that's more sensible. So is anyone making money with these things yet?

j / k navigate · click thread line to collapse

0 comments

6 comments · 2 top-level

simonw1y ago· 3 in thread

jcgrilloOP1y ago

simonw1y ago

People who I trust in this space have consistently and credibly talked about these constant efficiency gains. I don't think this is a case of selling compute for less than it costs to run.

skydhash1y ago

1 more reply

jsnell1y ago· 1 in thread

> IIUC to serve an LLM is to perform an O(n^2) computation on the model weights for every single character of user input.

> These models are 40+GB so that means I need to provision about 40GB RAM per concurrent user

The parameters are static, not mutated during the query. That memory can be shared between the concurrent users. The non-shared per-query memory usage is vastly smaller.

> How much would I have to charge for this?

Empirically, as little as 0.00001 cents per token.

For context, the Bing search API costs 2.5 cents per query.

jcgrilloOP1y ago

Ah got it, that's more sensible. So is anyone making money with these things yet?

j / k navigate · click thread line to collapse