> Costs and pricing are expressed per “token”, but the published data immediately seems to admit that this is a bad choice of unit because it costs a lot more to output a token than input one. It seems to me that the actual marginal quantity being produced and consumed is “processing power”, which is apparently measured in gigawatt hours these days. In any case, I think more than anything this vindicates my original decision not to get too precise. [...]
https://backofmind.substack.com/p/new-new-rules-for-the-new-...
Is it priced that way, though? I assume next-gen TPUs will be more efficient?
And that's silly, because API pricing is more expensive for output than for input tokens: 5x so for Anthropic [1], and 6x so for OpenAI!
[1] https://platform.claude.com/docs/en/about-claude/pricing
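To make the asymmetry concrete, here's a quick back-of-the-envelope in Python. The prices are illustrative placeholders with the same 5x output/input ratio, not quoted from the page above, so check the link for current figures:

```
# Rough cost of a single request under asymmetric per-token pricing.
# Prices below are illustrative assumptions (5x output/input ratio),
# not the provider's actual rates.
PRICE_IN_PER_M = 3.00    # USD per 1M input tokens (assumed)
PRICE_OUT_PER_M = 15.00  # USD per 1M output tokens (assumed)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_IN_PER_M
            + output_tokens * PRICE_OUT_PER_M) / 1_000_000

print(f"${request_cost(50_000, 1_000):.3f}")   # $0.165 -- long prompt, short answer
print(f"${request_cost(1_000, 50_000):.3f}")   # $0.753 -- short prompt, long answer, ~4.6x dearer
```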
Large outputs dominate compute time (input tokens are processed in one parallel prefill pass, while output tokens are generated one at a time), so they are more expensive.
IMO input and output token counts are actually still a bad metric, since they linearise non-linear cost increases, and I suspect we'll see another change in the future where providers bucket pricing by context length. XL output contexts may be 20x more expensive instead of 10x.
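A minimal sketch of that linearisation point, assuming (purely for illustration) that the marginal cost of a token grows with how much context precedes it, so total serving cost grows roughly quadratically while billing stays flat per token:

```
# Toy model: "true" serving cost vs. flat per-token billing.
# Assumption: the marginal cost of token i scales with the context length i,
# so total cost grows ~n^2, while revenue is a flat price per token (~n).
FLAT_PRICE_PER_TOKEN = 1.0        # arbitrary units
COST_PER_TOKEN_PER_CTX = 0.0005   # arbitrary units, assumed

def modelled_cost(n_tokens: int) -> float:
    # sum_{i=1..n} i * k  =  k * n * (n + 1) / 2
    return COST_PER_TOKEN_PER_CTX * n_tokens * (n_tokens + 1) / 2

def billed(n_tokens: int) -> float:
    return FLAT_PRICE_PER_TOKEN * n_tokens

for n in (1_000, 10_000, 100_000):
    print(n, round(modelled_cost(n)), round(billed(n)))
# Billed revenue grows 10x per step while modelled cost grows ~100x,
# which is why bucketing prices by context length wouldn't be surprising.
```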
For Anthropic, as a business bleeding money, it's probably nice to have value-based pricing for the tokens, so innovation (like computation-efficiency improvements) can result in some extra margin. If they exposed the more direct computation cost, they could never financially benefit from any improved efficiency, including faster hardware!
This is a bit too much of a simplification.
The LLM provider batches multiple customer requests into one GPU/TPU pass over the weights, with minimal latency increase.
The LLM provider may in fact be renting GPUs by the second, but the end user isn't. We the end users are essentially timesharing a pool of GPUs without any dedicated "1 vGPU" style resource allocation. In such a setting, charging by "GPU tick" sounds valid, and the various categories of token costs are an approximation of cost+margin.
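A minimal sketch of that "GPU tick" framing, with every number below assumed purely for illustration (GPU rental price, decode speed, batch size), just to show how a per-token price can be backed out of timeshared GPU-seconds:

```
# Rough conversion from GPU-seconds to a per-output-token cost floor.
# All figures are assumptions for illustration, not real provider numbers.
GPU_COST_PER_HOUR = 3.00      # USD to rent one GPU (assumed)
TOKENS_PER_SEC_PER_REQ = 50   # decode speed seen by one request (assumed)
BATCH_SIZE = 32               # requests sharing one pass over the weights (assumed)

gpu_cost_per_sec = GPU_COST_PER_HOUR / 3600
total_tokens_per_sec = TOKENS_PER_SEC_PER_REQ * BATCH_SIZE  # batching amortises weight reads
cost_per_output_token = gpu_cost_per_sec / total_tokens_per_sec

print(f"${cost_per_output_token * 1_000_000:.2f} per 1M output tokens")
# ~$0.52/M with these assumptions; the gap to list price covers prefill,
# idle capacity, networking, redundancy, R&D, and margin.
```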
I’m assuming you can cram more chips in there if more efficient chips free up spare power capacity?
Trying to measure the actual compute is a moving target since you’d be upgrading things over time, whereas the power aspects are probably more fixed by fire code, building size, and utilities.
The car equivalent would be pricing by how much gas you burned, not by horsepower.
Compare what you need to add to AWS EC2 to get the same result, above and beyond the electricity.
I often find processing on my local hardware to be about 10x more cost-effective.
Still reaching for frontier models for coding, but I find the hosted models on OpenRouter good enough for simple work.
Feels like we are jumping to warp on flops. My cores are throttled and the fiber is lit.
Feels like the lede is buried here!
Yeah, I've seen stuff like that and it's a bit bewildering to me. Feels a bit like the early days of AWS, when everyone was competing to see who could deploy the most EC2 instances.
So I don't think those numbers are really in tension at all
They're not quite growing that fast, but there's nothing inherently inconsistent between these claims... as long as the growth curve is crazy.
1) It's in their interest to distort numbers and frame things in ways that make them look good, e.g. using 'run-rate'.
2) The numbers are not audited, and we have no idea how they are recognising revenue - this can affect the true compounding rate of revenue growth.
I think Anthropic is a more grounded company than OpenAI because Sam Altman is insane, but it is still playing the same game.
It's just not material. Broadcom make devices they need, and Broadcom want to sell those devices and exclude another VLSI company from selling, so the two have an interest in doing business. That's all there is to it.
About the most you could say is that the lawyers drafting whatever agreement they sign will, in light of what Broadcom did with VMware licensing costs, pay close attention to how the contract handles future changes in pricing and supply.
Surely there should be some more critical questions posed about why just buying a bunch of GPUs is a good idea? It feels like a cheap way to show that growth is happening, and a bit too much like FOMO. It feels like nobody with the capital is questioning whether this is actually a good idea, a desirable way to improve AI models, or even money well spent. 1 GW is a lot of power: my understanding is that it is roughly the instantaneous demand of a city like Seattle. This is absurd.
It feels like there is some awareness that asking for gigawatts, if not terawatts, of compute probably needs more justification than has been proffered, and the big banks are already trying to cover themselves by publishing reports saying AI has not contributed meaningfully to the economy, as Goldman Sachs recently did.
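As a rough scale check on the Seattle comparison (the average household load below is an assumption):

```
# How many homes does 1 GW of continuous draw correspond to?
# Assumes ~1.2 kW average (not peak) load per US household -- an assumption.
DATACENTER_GW = 1.0
AVG_HOUSEHOLD_KW = 1.2

households = DATACENTER_GW * 1e6 / AVG_HOUSEHOLD_KW  # GW -> kW, then divide
print(f"~{households:,.0f} households")  # ~833,333 -- city-scale demand
```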
It's kind of like an electric motor that exists before a solid understanding of Lorentz's and Ohm's laws. We don't really know how inefficient the thing is because we don't really know where the ceiling is, aside from some loose theoretical computational-efficiency concepts that don't strongly apply to practical LLMs.
to be clear, I don't disagree that it's the limiting factor, just that 'limits' is nuanced here between effort/ability and raw power use.
"Do you realize that the human brain has been liken to an electronic brain? Someone said and I don't know whether he is right or not, but he said, if the human brain were put together on the basis of an IBM electronic brain, it would take 7 buildings the size of the Empire State Building to house it, it would take all the water of the Niagara River to cool it, and all of the power generated by the Niagara River to operate it." (Sermon by Paris Reidhead, circa 1950s.[1])
We're there on size and power. Is there some more efficient way to do this?
[1] https://www.sermonindex.net/speakers/paris-reidhead/the-trag...
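On the power half of "we're there", a one-line comparison, taking the commonly cited ~20 W estimate for the brain and the 1 GW datacenter figure discussed upthread:

```
# Power comparison: human brain vs. a 1 GW AI datacenter.
BRAIN_WATTS = 20          # commonly cited estimate for the brain's power draw
DATACENTER_WATTS = 1e9    # the 1 GW figure discussed upthread

print(f"{DATACENTER_WATTS / BRAIN_WATTS:,.0f}x")  # 50,000,000x the brain's power budget
```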
1. Opus and Sonnet.
2. Compute capacity. Anthropic has much more of it than your average coding startup.
3. The developing ecosystem around Claude Code.
It looks to me like Anthropic is one or two Gemmas away from a lot of people using Opus only for the 20% of hard use cases and letting an on-device LLM rip through the code base on a Mac Mini or Studio with OpenCode.
Once Claude Code is not the only game in town and Cowork is made redundant by Google pulling their finger out on integration with Workspace, what else is there for Anthropic?
Where open models can make a difference for agentic use is with third-party inference at scale, which can actually be fast enough for reasonable workflows.
/s
OpenCode: you pay per token.
Claude Code: you pay a flat fee.
Claude Code Enterprise: you pay per token.
And the subscription is not Anthropic's moat either since it's likely heavily subsidized. They're just using it to acquire customers.
The moat is locking you into Anthropic's model particularities (extended thinking, getting you into their "mindset", etc.)
- reducing the surface area of “acceptable use” (e.g., blocking third-party tools like OpenClaw)
- tighter usage limits and more subscription tiers
- increasing existing subscription prices
- moving to usage based model completely
- taking away compute from training next gen models (future demand destruction)
Edit: What we have built is a natural-language interface to existing, textually recorded information. Transformers cannot learn the whole universe because the universe has not yet been recorded as text.
Strangely some people on HN seem to desperately cling to the notion that it's all going to come to a halt. This is unscientific. What evidence do you have - any evidence - that the scaling laws are due to come to an end?