Are the costs of AI agents also rising exponentially? (2025) (opens in new tab)

(tobyord.com)

305 pointslouiereederson24d ago130 comments

130 comments

While I understand why they used the METR data, a cleaner look would be against the current cost-optimal frontier of open models (e.g. GLM-5.1 and MiniMax-M2.7). That paints a very different picture. Comparing just the frontier models at the time of the METR report invariably leads to looking at providers who are pushing the limits of cost at the time of the report.

GPT-5 was shown as being on the costly end, surpassed by o3 at over $100/hr. I can't directly compare to METR's metrics, but a good proxy is the cost of the Artificial Analysis suite. GLM-5.1 is less than half the cost to complete the suite of GPT-5 and is dramatically more capable than both GPT-5 and o3.

So while their analysis is interesting, it points towards the frontier continuing to test the limits of acceptable pricing (as Mythos is clearly reinforcing) and the lagging 6-12 months of distillation and refinement continuing to bring the cost of comparable capabilities to much more reasonable levels.

avidphantasm21d ago

Calculating hourly costs for these models makes me think that the decision of when to hire an SWE vs. increase use of AI may follow a similar pattern to the decision to use cloud compute vs. on-premises. I don’t cost $120/hr (incl. fringe), but my employer pays my salary all year long, no matter if I am working or on vacation. Whereas if they use an AI model to do the same work, they may be happy to pay $120/hr or more, since they may only use the model for a small fraction of 2080 hours per year, so they’d still save money, and not have a messy human to deal with.

bbatha21d ago

I remain convinced we won’t look at project estimates as time based in software engineering as our primary cost estimate. And this is transition will happen rapidly. We’re going to shift to a capex/token spend model for project estimates where the business will say “ok I do want that feature for $1000 in tokens”.

onchainintel21d ago

I agree with you directionally that project estimates are/will be affected by this but I don't see a scenario in which time is completely removed from the equation with respects to projects & estimates to execute on them. We're all constrained by time, finite resource. It's always a factor in business.

smusamashah21d ago

Once a model is stable and good enough, for example Sonnet 4.6 or GPT 5.4 (or something else in future), it can be burned into hardware like Talaas chip reducing the cost many times and increasing the speed. At some point we can rely on old model while being productive with it.

535188B17C9374321d ago

I always wondered why the equivalent of integrated mining didn't apply to LLM inference... now it turns out it does and there's a company making it fast and robust!

bee_rider21d ago

An ASIC for bitcoin mining makes more sense, in that the algorithm is basically “set.” For LLMs, it is hard while models are still developing.

But, sounds like Taalas is trying to strike an interesting balance where they can at least spin up ASICs for new models reasonably quickly with their modular design. It’s a really interesting bet, and might pay off.

radialstub21d ago

No, burning models into hardware won't make them faster or reduce the cost. It will cost way more for similar performance as what you would get with a gpu. I am not telling you why, you can go figure that out on your own.

smusamashah21d ago

But isn't this happening here https://taalas.com/ already. They have a demo of llama running at 17000 tokens per second https://chatjimmy.ai/

gjsman-100021d ago

With some research, that chip appears like it would cost about $300-$400 to manufacture, die only.

For an 8B parameter model.

Opus is estimated at 500B-2T parameters. At that scale you’re past reticle limits and need HBM and multi-die packaging, which means you’ve essentially built an inference ASIC (like Groq or Etched) rather than something categorically cheaper than GPUs. The “burned into silicon” advantage mostly evaporates at frontier scale.

3 more replies

margalabargala21d ago

You mean the person saying "I won't tell you why" might not know what they're talking about?! Say it ain't so.

pindab0ter20d ago

I just tried chatjimmy.ai for a bit and while it is absolutely blazingly fast, it's also not a very strong model. I suppose that with time, stronger models will be able to run on such hardware, too.

zipy12421d ago

The crazy part about this is if you compare it not to US wages but european, for instance in the UK where the median software hourly wage is somewhere around $35-40 an hour, then humans are already cheaper than the best models.

allan_s21d ago

that's what I'm telling my non-tech friends when they say "Looking to how fast AI progress and robotics, me as an electrician, will a robot soon take my job?" I reply to them "My job as a software engineer will be replaced sooner than yours, because for your job, the robot will be much more expensive than the minimal wave, and you don't need to buy a human"

drzaiusx1121d ago

I wouldn't discount the worries outside of tech however. A cheap human laborer that leans on AI to provide their checklist and described actions for tasks is definitely in scope to replace hard won hands on knowledge from experience in industry professionals. You no longer have to watch 30 YouTube videos to learn and distill a task as a layperson in a field involving manual labor.

I rebuilt my house from the studs, did my own electrical and plumbing, etc. This took a significant amount of training and research back in the day. I worked under my father for a decade before making this attempt. My father is a journeyman electrician and carpenter. I think any able bodied human could soon forgo much of that and simply get a breakdown of actions to perform in a particular order and get similar results.

darepublic21d ago

I tried to use gpt for various handy work. While it does help I don't think it can adequately substitute for hard won hands. Maybe next gen if you provide a video stream and the llm can view the exact situation. Even then though I wouldn't discount the difficulty of learning dexterity when you've been a coddled white collar worker your whole life

1 more reply

segmondy21d ago

Humans are not cheaper than AI models. Let's go with $35 an hour.

24365 = 8760 8760$35 = $306,600

Yeah, a human working non stop will run $300k.

Now you said, the "best" models. I personally reckon that 80-90% of most work don't need the best models. They need a good model, and good models are super cheap. i.e, the tiny gemma4 or qwen3.6 models will be sufficient for most of those work.

AI cloud usage cost goes up near linearly, but local cost doesn't. So say someone built an under $10k system, with perhaps dual RTX 5090. That same system will be able to easily run 20 parallel requests. The only cost is electricity. You can run it 24/7. For 1 year, that's ~$6million. 20 humans will also have overhead of electricity, real estate and other things which far exceed the cost of electricity for just AI.

The thing AI agents are lacking is agency and autonomy. As they get closer and closer, the majority of humans competing in the same sort of tasks will have no chance.

zipy12421d ago

I'm going by the graph in the original artical, not stating a point. I mean if their cost lines on the graph are to be believed, then the number I quoted is cheaper.

ac2921d ago

> So say someone built an under $10k system, with perhaps dual RTX 5090. That same system will be able to easily run 20 parallel requests. The only cost is electricity. You can run it 24/7. For 1 year, that's ~$6million

I dont see how you get anywhere close to $6M of tokens out of a pair of 5090s. The class of model they could run is fairly small and extremely cheap to run via API (my math says running Gemma4-31B for 24 hours costs less than $1 on OpenRouter). Even with 20x concurrent requests you are orders of magnitude away from $6M/yr.

segmondy21d ago

I never said that, my point is that paying 20 people at $35 for 24/7 is about $6 million. You can replace that with a $10k system running 20 parallel requests for a year and save lots of money.

rcbdev21d ago

Software developers average about 80 - 120€/h in Austria. How are UK rates so bad?

zipy12420d ago

A quick google seems to suggest the median software pay in Austria is about €60k. That translates to an hourly wage of roughly 31€/h .

Your range of 80-120 gives a yearly wage of like 155k-234k which is clearly far too high for an average.

rcbdev20d ago

If you start talking about salaried work, you have to account for Austria's 5 weeks of mandatory PTO, infinite paid sick days (with a pay reduction after 6 weeks, clock resets each year), additional paid care leave (Pflegefreistellung) for sick family, paid maternity / paternity leave, etc. in contrast with whatever the British system offers. For developers in Austria who choose salaried work, the math tends to add up with at least 2 months off each year, union protection etc.

You'd have to be an expert on labor law in every country you seek to honestly compare. The only clean way to simply compare earnings is freelance rates.

mike_hearn20d ago

The discrepancy is probably consulting vs salaried.

GorbachevyChase21d ago

I have a lot of AI written software, and it doesn’t cost me anywhere close to what I’ve been quoted for other software projects in the past. I’ve had a guy spend over six months, full-time, on a CRUD application for permits. He didn’t even finish. I made a working prototype in Django, which was tossed to re-implement in PHP for some reason.

avidiax21d ago

My understanding is that this is normalized to the "best human" for the tasks.

An AI only doing a task correctly 50% of the time may in-fact be better than your N% chance of hiring a highly capable human for that task, and especially for contracting a human to a 1-2 hour task.

But your successful use of AI is still predicated on a human who can judge output and break the work into smaller tasks that fit the skill ceiling of the AI, which is currently no more than tasks that take a skilled human 2 hours.

thelastgallon22d ago

> On many task lengths (including those near their plateau) they cost 10 to 100 times as much per hour. For instance, Grok 4 is at $0.40 per hour at its sweet spot, but $13 per hour at the start of its final plateau. GPT-5 is about $13 per hour for tasks that take about 45 minutes, but $120 per hour for tasks that take 2 hours. And o3 actually costs $350 per hour (more than the human price) to achieve tasks at its full 1.5 hour task horizon. This is a lot of money to pay for an agent that fails at the task you’ve just paid for 50% of the time — especially in cases where failure is much worse than not having tried at all.

nopinsight22d ago

Ord's frontier-cost argument is right as far as it goes, but the piece doesn't engage with the counter-trend: inference cost for a fixed capability level has been falling faster than Moore's law. Pushing the frontier will likely keep getting more expensive and concentrated among a few players, while the intelligence needed for more mundane tasks keeps getting cheaper.

That raises a question: if practical-tier inference commoditizes, how does any company justify the ever-larger capex to push the frontier?

OpenAI's pitch is that their business model should "scale with the value intelligence delivers." Concretely, that means moving beyond API fees into licensing and outcome-based pricing in high-value R&D sectors like drug discovery and materials science, where a single breakthrough dwarfs compute cost. That's one possible answer, though it's unclear whether the mechanism will work in practice.

popcorncowboy21d ago

> how does any company justify the ever-larger capex to push the frontier

AGI. [waves hands at the infinite money machine]

zozbot23422d ago

This effect is likely even larger when you consider that the raw cost per inferred token grows linearly with context, rather than being constant. So longer tasks performed with higher-context models will cost quadratically more. The computational cost also grows super-linearly with model parameter size: a 20B-active model is more than four times the cost of a 5B-active model.

tibbar22d ago

Doesn't context cacheing mostly eliminate this problem? (I suppose for enough context the 90% discount is eventually a lot anyway)

zozbot23422d ago

Context caching is really storing the KV-cache for reuse. It saves running prefill for that part of the context, but tokens referencing that KV-cache will still cost more.

boxedemp22d ago

If you gave me an agent that succeeded 50% of tasks I gave it, I could take over the world in a week. Faster if I wasn't so lazy.

I think you're overestimating, or oversimplifying. Maybe both.

jurgenburgen22d ago

> If you gave me an agent that succeeded 50% of tasks I gave it, I could take over the world in a week. Faster if I wasn't so lazy.

Assuming you used o3, that would cost $58800 per week. That’s an expensive bet for only 50% odds in your favor.

Of course the agents are only that good on benchmarks, in reality your odds are worse. Maybe roulette instead?

raincole22d ago

No one is claiming an agent can do 50% of arbitrary tasks. It's just 50% of METR's benchmark set.

> I think you're overestimating, or oversimplifying

Yeah if you only read comments on HN but not the actual linked article you will get oversimplified conclusion. Like, duh?

TeMPOraL22d ago

> Yeah if you only read comments on HN but not the actual linked article you will get oversimplified conclusion. Like, duh?

Curiously, for most submissions it's the opposite - comments are much more useful and nuanced than the source being discussed.

boxedemp22d ago

Sorry for stating something so obvious. I'll comment less from now on.

naveen9921d ago

Where are you getting hourly costs for private models ? The rate limits are pretty arbitrary. If you max out by api tokens it would be like $10k / hour

JAG_Ecalona21d ago

The sweet spot thing is the real insight here and nobody seems to be talking about it.

Frontier models get hyped for their maximum task horizon, but that's also where they're 10-30x more expensive per hour than their optimal range. You're paying a massive premium for the hardest tasks and still failing half the time.

Honestly the practical takeaway is pretty boring: just break your work into smaller chunks. Not because the models can't handle longer tasks, but because the economics at shorter task lengths are just way better. The labs are racing to push the horizon out; the smart move for anyone actually paying the bills is to stay near the sweet spot and orchestrate from there.

drzaiusx1121d ago

Model specialization is in all likelihood going to be the way forward, both for cost and quality of output. Smaller, cheaper models specialized in their task domains. Many of the current model vendors are already (attempting) to do this under the hood.

Generalist models have similar problems as generalist humans. The proverbial "Jack of all trades, master of none."

That said, I've made my career as a generalist :)

margalabargala21d ago

Anyone trying to decide which of 30 different specialized models best fits their task has already failed.

Maybe the future of the backend is specialized models but the future of what faces the user is what appears to be a generalist model. Maybe it does things itself, maybe it just knows how to route to the specialist models, but the UX of a generalist model will win.

drzaiusx1121d ago

Users shouldn't be picking models directly at all, unless they really want to (almost no one does), but certainly some will.

I meant more automatic selection and negotiation of which model gets which task based on filtering criteria, etc. so happening under the hood as you say.

zozbot23421d ago

Small chunks of work start to become viable for local agentic use too. The O(N^2) dependence on context length really makes the "maximum tasks" a complete non-starter locally.

nusl21d ago

You write awfully similar to the way LLMs do. I can't tell if it's just your writing style or not

yorwba20d ago

It's an LLM account, as you can tell from the hallucinated comment in the thread on Japanese penguins: https://news.ycombinator.com/item?id=47816256

dang22d ago

Related ongoing thread:

Measuring Claude 4.7's tokenizer costs - https://news.ycombinator.com/item?id=47807006 (309 comments)

ting022d ago

No, but the AI labs would love to frame it this way so they can continue to nerf models and increase prices while they use the cheap, highly performant, highly powerful models internally to replace all of your businesses.

onchainintel21d ago

Sure is looking that way. What can't Claude do at this point?

rickandmorty9921d ago

I'm an AI engineer with a computer science and some actual AI background. I am trying to make Claude good motivation letters for applying to jobs. It currently scores a 6 out of 10. I'm much better still. And it has access to all the relevant parts of my psychology degree and data about writing good motivation letters.

All I can say is: the motivation letters don't look like they're written by AI anymore.

Grimblewald20d ago

List is massive. Anything novel? It'll fuck up without extreme handholding. Anything for which the components arent solved public published problems? It'll fuck it up.

Basically, claude can solve issues for you where it requires the implementation of existing code or a combination of existing patterns, but novel it cannot do.

kaoD21d ago

> What can't Claude do at this point?

Writing maintainable code that scales.

greenmilk22d ago

Are any inference providers currently making profit (on inference, I know google makes money)?

wsun1922d ago

Pretty much every major American inference provider claims to make a profit on API-based inference. Consumer plans might be subsidized overall, but it's hard to say since they're a black box and some consumers don't fully use their plans

henry202322d ago

Third parties selling open-weight inference on OpenRouter are surely selling on a profit. Zero reason to subsidize it.

greenmilk19d ago

They could be VC-backed and selling at a loss to grow market share

wavemode22d ago

Selling inference is not fundamentally different from selling compute - you amortize the lifetime cost of owning and operating the GPUs and then turn that into a per-token price. The risk of loss would be if there is low demand (and thus your facilities run underutilized), but I doubt inference providers are suffering from this.

Where the long-term payoff still seems speculative, is for companies doing training rather than just inference.

Gigachad22d ago

There’s a lot of debate over what the useful lifespan of the hardware is though. A number that seems very vibes based determines if these datacenters are a good investment or disastrous.

hypercube3322d ago

I specifically remember this debate coming up when the H100 was the only player on the table and AMD came out with a card that was almost as fast in at least benchmarks but like half the cost. I haven't seen a follow up with real world use though and as a home labber I know that in the last three weeks the support for AMD stuff at least has gotten impressively useful covering even cuda if you enjoy pain and suffering.

What I'm curious about are what about the other stuff out there such as the ARM and tensor chips.

raincole22d ago

All of them. It's simply impossible to sell tokens by usage at a loss now. You'll be arbitraged to death in a few days. It only makes sense to subsidize cost if you're selling a subscription.

avidiax21d ago

How do you arbitrage closed weight models? Who would buy from a middle man at increased price? Who is offering Priceline but for tokens?

dannersy22d ago

If they were they would show evidence because they'd pull in more investment. I don't believe their claim that they make profits on inference, especially not with reports like this coming out.

jagged-chisel22d ago

Google definitely makes money in other areas. Do they make money on inference?

quicklywilliam22d ago

Interesting read. I don't know if I quite buy the evidence, but it's definitely enough to warrant further investigation. It also matches up with my personal experience, which is that tools like Claude Code are burning through more and more tokens as we push them to do bigger and bigger work. But we all know the frontier model companies are burning through money in an unsustainable race to get you and your company hooked on their tools.

So: I buy that the cost of frontier performance is going up exponentially, but that doesn't mean there is a fundamental link. We also know that benchmark performance of much smaller/cheaper models has been increasing (as far as I know METR only looks at frontier models), so that makes me wonder if the exponential cost/time horizon relationship is only for the frontier models.

esperent22d ago

> But we all know the frontier model companies are burning through money in an unsustainable race to get you and your company hooked on their tools.

Do we? Because elsewhere in the thread there's people claiming they are profitable in API billing and might be at least close to break even on subscription, given that many people don't use all of their allowance.

ai-x22d ago

Anthropic has 50% gross margins on their tokens.

Step 1) Bubble callers will be proven wrong in 2026 if not already (no excess capacity)

Step 2) Models are not profitable are proven wrong (When Anthropic files their S1)

Step 3) FOMO and actual bubble (say around 2028/29)

dminik22d ago

If they had such a high margin, they wouldn't need to fuck around with token usage/pricing every three days.

I have no data to support this, but I think they just about break even on API usage and take overall loss on subscriptions/free plans.

ai-x21d ago

Math / Economics 101 thought experiment.

You have (limited) 100 Coke cans to sell (that you bought for say $1)

There are two large lines being formed for that. One line is offering an average $3 per bottle and another line is offering an average $2 per bottle.

Tell me which line they would throttle/starve even though they make a profit out of it.

Also, when the lines were formed you had no idea of the average price, but now you are getting a clear picture. Would you change your strategy / pricing or stick with your original "give the bottle to everyone for the same initial $1 price"

1 more reply

284848499522d ago

Can we see them?

ai-x21d ago

https://www.theinformation.com/articles/anthropic-lowers-pro...

I have access to that article

https://www.saastr.com/have-ai-gross-margins-really-turned-t...

Like I said, majority of people (including smart ones) are going to be surprised by the profit margins of AI labs and there will be a mad rush to buy AI stocks till it reaches bubble proportions.

2025 was merely a 1996 "Irrational Exuberance" moment. We haven't seen the late 1999 mania yet

agentifysh22d ago

Until there is some drastic new hardware, we are going to see a similar situation to proof of work, where a small group hordes the hardware and can collude on prices.

Difference is that the current prices have a lot of subsidies from OPM

Once the narrative changes to something more realistic, I can see prices increase across the board, I mean forget $200/month for codex pro, expect $1000/month or something similar.

So its a race between new supply of hardware with new paradigm shifts that can hit market vs tide going out in the financial markets.

jiggawatts22d ago

> Until there is some drastic new hardware

For inference, there is already a 10x improvement possible over a setup based on NVIDIA server GPUs, but volume production, etc... will take a while to catch up.

During inference the model weights are static, so they can be stored in High Bandwidth Flash (HBF) instead of High Bandwidth Memory (HBM). Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.

NVIDIA GPUs are general purpose. Sure, they have "tensor cores", but that's a fraction of the die area. Google's TPUs are much more efficient for inference because they're mostly tensor cores by area, which is why Gemini's pricing is undercutting everybody else despite being a frontier model.

New silicon process nodes are coming from TSMC, Intel, and Samsung that should roughly double the transistor density.

There's also algorithmic improvements like the recently announced Google TurboQuant.

Not to mention that pure inference doesn't need the crazy fast networking that training does, or the storage, or pretty much anything other than the tensor units and a relatively small host server that can send a bit of text back and forth.

zozbot23422d ago

> Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.

Isn't reading from flash significantly more power intensive than reading DRAM? Anyway, the overhead of keeping weights in memory becomes negligible at scale because you're running large batches and sharding a single model over large amounts of GPU's. (And that needs the crazy fast networking to make it work, you get too much latency otherwise.)

jiggawatts22d ago

For a given capacity of memory, Flash uses far less power than DRAM, especially when used mostly for reads.

> becomes negligible at scale

Nothing is negligible at scale! Both the cost and power draw of the HBMs is a limiting factor for the hyperscalers, to the point that Sam Altman (famously!) cornered the market and locked in something like 40% of global RAM production, driving up prices for everyone.

> sharding a single model over large amounts of GPUs

A single host server typically has 4-16 GPUs directly connected to the motherboard.

A part of the reason for sharding models between multiple GPUs is because their weights don't fit into the memory of any one card! HBF could be used to give each GPU/TPU well over a terabyte of capacity for weights.

Last but not least, the context cache needs to be stored somewhere "close" to the GPUs. Across millions of users, that's a lot of unique data with a high churn rate. HBF would allow the GPUs to keep that "warm" and ready to go for the next prompt at a much lower cost than keeping it around in DRAM and having to constantly refresh it.

1 more reply

colechristensen22d ago

Doubtful, local models are the competitive future that will keep prices down.

128GB is all you need.

A few more generations of hardware and open models will find people pretty happy doing whatever they need to on their laptop locally with big SOTA models left for special purposes. There will be a pretty big bubble burst when there aren't enough customers for $1000/month per seat needed to sustain the enormous datacenter models.

Apple will win this battle and nvidia will be second when their goals shift to workstations instead of servers.

hypercube3322d ago

Weird how you're leaving stuff like Strix Halo out. Also weird you think 128gb is the future with all of the research done to reduce that to something around 12GB being a target with all of these papers out now. I assume we'll end up with less general purpose models and more specific small ones swapped out for whatever work you are asking to do.

MrBuddyCasino22d ago

Strix Halo hasn‘t got nearly enough bandwidth, its just 256bit.

1 more reply

lookaround22d ago

> 128GB is all you need.

My guy, look around.

They are coming for personal compute.

Where are you going to get these 128GBs? Aquaman? [0]

The ones who make RAM are inexplicably attaching their fate to the future being all LLMs only everywhere.

[0] https://www.youtube.com/watch?v=0-w-pdqwiBw

naveen9922d ago

Cloud can’t make money off of you and pay more than you for the hardware at the same time.

2 more replies

foota22d ago

More like RAM producers are providing supplies to the highest bidder, no? If this doesn't peter out supply will normalize at a higher but less insane price eventually.

hyperpape21d ago

This is an interesting analysis, but "are the costs of AI agents also rising exponentially is?" is a very bad question that this doesn't answer.

What's rising exponentially is the price of the most ambitious thing cutting edge agents can do.

But to answer whether the cost of AI agents is rising in general, you would take a fixed set of problems, and for each of them, ask "once it's solvable, how does the price change?"

For that latter question, there isn't a lot of data in these charts because there aren't enough curves for models of the same family over time, but it does look like there are a number of points where newer models solve the same problems at lower prices. Look at GPT5 vs. the older GPT models--the curve for GPT5 is shifted left.

simianwords21d ago

The cost of models are almost exponentially decreasing with time.

The author performs a non sequitur by muddling two concepts of time. They say costs are getting “unsustainable” which is not a conclusion that follows.

What is true is that at a given point in time, cost to perform a task is exponentially related to the human time taken. But it does not mean it will remain that way.. far from it.

matt321022d ago

I took a month break and my side project took 2x as much tokens

lwhi22d ago

I think an interesting counterpoint, is whether the value obtained is reducing.

jillesvangurp21d ago

Exactly. AI is not really replacing people but it's definitely allowing them to do more and more interesting things. You should offset the cost of having an AI do something against the cost of doing that manually. Your mileage may vary of course. But I am definitely getting things done that I wouldn't even have started without AI assistance. And that stuff is valuable to me. Although you could argue that anything AI can do is actually deflating in value as well. The economics here will get pretty interesting. But all things considered, I'm not spending an unreasonable amount on all this AI stuff. Probably around 60-100$/month currently. It varies a bit.

lwhi21d ago

Value feels pretty relative to me. If anyone can do a 'thing', is that thing worth less?

jillesvangurp21d ago

If you still need that thing done, the value is basically however you value your time. Would you pay extra for having someone or something do that for you instead?

1 more reply

siliconc0w22d ago

Working on a oss tool to help orgs identify where they can save on token costs: https://repogauge.org

Happy to run it on your repos for a free report: hi@repogauge.org

EdvinPL22d ago

AI feels more like a gamble. People like gambling. From casinos (win-loose), to lootboxes (uncertainty) or even extramarital sex (whose baby is it?).

This way - AI work is like a slot machine - will this work or not? Either way - casino gets paid and casino always wins.

Nevertheless - if the idea or product is very good (filling high market pain) and not that difficult to build - it can enable non-coders to "gamble" for the outcome with AI for $.

Sadly - from by experiences hiring Devs - hiring people is also a gamble...

ketzu21d ago

> or even extramarital sex (whose baby is it?).

This is the weirdest example of "gambling" I have seen in my life. If you'd've written "unprotected sex" I'd see the gambling part, but "extramartial sex" covers so much more than the tiny subset of "whose baby is it" (how many people are there having sex to gamble on who will be the father of a baby? 10?).

This made my day.

sh4rks21d ago

Writing code by hand is gambling (will it compile or not, will it pass code review or not)

stavros21d ago

Under this definition, everything is gambling, including commenting on HN (will I get upvoted or downvoted?).

pocksuppet21d ago

There used to be forums without voting. It was discovered that forums with voting attract more engagement because of the emotions produced by the voting.

BarryMilo21d ago

It also used to be that reddit comments were the epitome of quality in their time, much closer to current HN if not better. I attributed that to the voting mechanism; clearly I was mistaken.

sirl1on21d ago

Everything is modeled after gambling nowadays to nudge and gamify the user experience to a desired outcome

stavros21d ago

If your definition of "gambling" is "things with uncertain outcome", then there's nothing in life that's not gambling.

metalglot21d ago

Not if you use local models, only if you're using cloud llm's

noosphr22d ago

Yet again: Transformers are fundamentally quadratic.

If they can do a task that takes 1 unit of computation for 1 dollar they will cost 100 dollars for a 10 unit task and 10,000 for a 100 unit task.

Project costs from Claude Code bear this out in the real world.

twaldin22d ago

idk over my testing, glm-5 inside opencode beats all other agents head to head

stainablesteel21d ago

it's not like cost and energy use aren't competitive factors in this game

the first model to outcompete its competitors while using less compute would be purchased more than anything else

keepamovin22d ago

My expectation: demand going up, prices will rise, supply will saturate to the point of ubiquitous "utility" status, and prices will drop, probably a bell curve shape with sine-wave undulations along the way.

chii22d ago

> supply will saturate

that depends on the ability to produce supply at a saturation rate.

It did work for internet backhaul links - ala, those dark fibres. However, i reckon those fibres are easier to manufacture than silicon chips.

I wonder if saturation is possible for ai capable chips.

keepamovin21d ago

no concerns for the hardware chain long view. saturate is AI as utility ubiquitous in everything.

simianwords21d ago

Why does the author suggest that costs are getting unsustainable when costs are actually decreasing exponentially over time?

It’s true that at a given point in time the cost to achieve a certain task follows exponential curve against time taken by a human. But.. so what?

twotwotwo21d ago

You could model more of the process: the dev's work as well as the model's, and the cost of catching a bug later or deploying it live. Those tasks push me further towards smaller tasks in general. (And they make the Gas Town type stuff seem more baffling.)

- Smaller chunks make review much easier and more effective at finding bugs, as we've known since long before LLMs.

- Greater certainty provides a better development experience. I've heard people talk about how LLM development can be tiring. One way that happens, I think, is the win-or-lose drama of feeding in huge tasks with a substantial chance of failure. I think if you're succeeding 95% of the time instead of 70%, and the 5% are easier to deal with (smaller chunks to debug), it's a better experience.

- Everything is harder about real-world tasks because they aren't clean verifiable-reward benchmarks. Developers have context that models don't, so it's common that a problem traces to an detail not in the spec where the model guessed wrong. For real-world tasks "failures" are also sometimes "that UI is bad" or "that way of coding it is hard to maintain." And it's possible to have problems the dev simply doesn't notice. The benchmarks' fully computer-checkable outcomes are 'easy mode' compared to the real world.

- Fixing agents' mess becomes more work as task sizes increase. (Like the certainty thing, but about cost in hours than the experience.) Again, if the model has spat out 1000 lines and stumped itself debugging a failure, it'll take you some time to figure out: more time than debugging 250-line patch, and the larger patch is more likely to have bugs. And if an issue bug makes it out to peer review, you can add communication and context-switching cost (point out bug, fix, re-review) on top of that.

- Bugs that reach prod are really expensive. More of a problem when a prod bug can lose you customers vs., say, on most hobby things. Ord's post gestures at it: there are "cases where failure is much worse than not having tried at all." That magnifies how important it is the review be good, and how much of a problem bugs that sneak through are, which points towards doing smaller chunks.

How significant each factor is depends on details: how easy the task is to verify, how well-specified it is (and more generally how much it's in the models' wheelhouse, and how much in mine), how bad a bug would be (fun thing? internal tool? user facing? can lose data?).

I think the dynamics above apply across a range of model strengths, but that doesn't mean the changes from say Sonnet 3.7 to Opus 4.5 didn't mean anything; the machine getting better at getting the info it needs and checking itself still helps at shorter task lengths. Harness improvements can help, e.g. they could help keep models of the 'too much context, model got silly' zone (may be less severe than it once was, but I suspect will remain a thing), build better context, and clean up code as well as spitting results out.

Besides taking more of your time up front, involving yourself more also tends to drift towards you making more of the lower-level decisions about how the code will look, which I find double-edged. You have better broad context, and you know what you find maintainable. But the implementer, model or another person, is closer to the code, which helps it make some mid-to-low-level decisions well.

Plan modes and Spec-Kit type things can help with the balance of getting involved but letting the model do its thing. I've liked asking the LLM to ask a lot of questions and surface doubts. A colleague messed with Spec-Kit so it would pick one change on its fine-grained to-do list at a time, which is a neat hack I'd like to try sometime.

j / k navigate · click thread line to collapse

130 comments

easygenes22d ago

avidphantasm21d ago

bbatha21d ago

onchainintel21d ago

smusamashah21d ago

535188B17C9374321d ago

I always wondered why the equivalent of integrated mining didn't apply to LLM inference... now it turns out it does and there's a company making it fast and robust!

bee_rider21d ago

An ASIC for bitcoin mining makes more sense, in that the algorithm is basically “set.” For LLMs, it is hard while models are still developing.

radialstub21d ago

smusamashah21d ago

But isn't this happening here https://taalas.com/ already. They have a demo of llama running at 17000 tokens per second https://chatjimmy.ai/

gjsman-100021d ago

With some research, that chip appears like it would cost about $300-$400 to manufacture, die only.

For an 8B parameter model.

3 more replies

margalabargala21d ago

You mean the person saying "I won't tell you why" might not know what they're talking about?! Say it ain't so.

pindab0ter20d ago

I just tried chatjimmy.ai for a bit and while it is absolutely blazingly fast, it's also not a very strong model. I suppose that with time, stronger models will be able to run on such hardware, too.

zipy12421d ago

allan_s21d ago

drzaiusx1121d ago

darepublic21d ago

1 more reply

segmondy21d ago

Humans are not cheaper than AI models. Let's go with $35 an hour.

24365 = 8760 8760$35 = $306,600

Yeah, a human working non stop will run $300k.

The thing AI agents are lacking is agency and autonomy. As they get closer and closer, the majority of humans competing in the same sort of tasks will have no chance.

zipy12421d ago

I'm going by the graph in the original artical, not stating a point. I mean if their cost lines on the graph are to be believed, then the number I quoted is cheaper.

ac2921d ago

segmondy21d ago

I never said that, my point is that paying 20 people at $35 for 24/7 is about $6 million. You can replace that with a $10k system running 20 parallel requests for a year and save lots of money.

rcbdev21d ago

Software developers average about 80 - 120€/h in Austria. How are UK rates so bad?

zipy12420d ago

A quick google seems to suggest the median software pay in Austria is about €60k. That translates to an hourly wage of roughly 31€/h .

Your range of 80-120 gives a yearly wage of like 155k-234k which is clearly far too high for an average.

rcbdev20d ago

You'd have to be an expert on labor law in every country you seek to honestly compare. The only clean way to simply compare earnings is freelance rates.

mike_hearn20d ago

The discrepancy is probably consulting vs salaried.

GorbachevyChase21d ago

avidiax21d ago

My understanding is that this is normalized to the "best human" for the tasks.

An AI only doing a task correctly 50% of the time may in-fact be better than your N% chance of hiring a highly capable human for that task, and especially for contracting a human to a 1-2 hour task.

thelastgallon22d ago

nopinsight22d ago

That raises a question: if practical-tier inference commoditizes, how does any company justify the ever-larger capex to push the frontier?

popcorncowboy21d ago

> how does any company justify the ever-larger capex to push the frontier

AGI. [waves hands at the infinite money machine]

zozbot23422d ago

tibbar22d ago

Doesn't context cacheing mostly eliminate this problem? (I suppose for enough context the 90% discount is eventually a lot anyway)

zozbot23422d ago

Context caching is really storing the KV-cache for reuse. It saves running prefill for that part of the context, but tokens referencing that KV-cache will still cost more.

boxedemp22d ago

If you gave me an agent that succeeded 50% of tasks I gave it, I could take over the world in a week. Faster if I wasn't so lazy.

I think you're overestimating, or oversimplifying. Maybe both.

jurgenburgen22d ago

> If you gave me an agent that succeeded 50% of tasks I gave it, I could take over the world in a week. Faster if I wasn't so lazy.

Assuming you used o3, that would cost $58800 per week. That’s an expensive bet for only 50% odds in your favor.

Of course the agents are only that good on benchmarks, in reality your odds are worse. Maybe roulette instead?

raincole22d ago

No one is claiming an agent can do 50% of arbitrary tasks. It's just 50% of METR's benchmark set.

> I think you're overestimating, or oversimplifying

Yeah if you only read comments on HN but not the actual linked article you will get oversimplified conclusion. Like, duh?

TeMPOraL22d ago

> Yeah if you only read comments on HN but not the actual linked article you will get oversimplified conclusion. Like, duh?

Curiously, for most submissions it's the opposite - comments are much more useful and nuanced than the source being discussed.

boxedemp22d ago

Sorry for stating something so obvious. I'll comment less from now on.

naveen9921d ago

Where are you getting hourly costs for private models ? The rate limits are pretty arbitrary. If you max out by api tokens it would be like $10k / hour

JAG_Ecalona21d ago

The sweet spot thing is the real insight here and nobody seems to be talking about it.

drzaiusx1121d ago

Generalist models have similar problems as generalist humans. The proverbial "Jack of all trades, master of none."

That said, I've made my career as a generalist :)

margalabargala21d ago

Anyone trying to decide which of 30 different specialized models best fits their task has already failed.

drzaiusx1121d ago

Users shouldn't be picking models directly at all, unless they really want to (almost no one does), but certainly some will.

I meant more automatic selection and negotiation of which model gets which task based on filtering criteria, etc. so happening under the hood as you say.

zozbot23421d ago

Small chunks of work start to become viable for local agentic use too. The O(N^2) dependence on context length really makes the "maximum tasks" a complete non-starter locally.

nusl21d ago

You write awfully similar to the way LLMs do. I can't tell if it's just your writing style or not

yorwba20d ago

It's an LLM account, as you can tell from the hallucinated comment in the thread on Japanese penguins: https://news.ycombinator.com/item?id=47816256

dang22d ago

Related ongoing thread:

Measuring Claude 4.7's tokenizer costs - https://news.ycombinator.com/item?id=47807006 (309 comments)

ting022d ago

onchainintel21d ago

Sure is looking that way. What can't Claude do at this point?

rickandmorty9921d ago

All I can say is: the motivation letters don't look like they're written by AI anymore.

Grimblewald20d ago

List is massive. Anything novel? It'll fuck up without extreme handholding. Anything for which the components arent solved public published problems? It'll fuck it up.

Basically, claude can solve issues for you where it requires the implementation of existing code or a combination of existing patterns, but novel it cannot do.

kaoD21d ago

> What can't Claude do at this point?

Writing maintainable code that scales.

greenmilk22d ago

Are any inference providers currently making profit (on inference, I know google makes money)?

wsun1922d ago

henry202322d ago

Third parties selling open-weight inference on OpenRouter are surely selling on a profit. Zero reason to subsidize it.

greenmilk19d ago

They could be VC-backed and selling at a loss to grow market share

wavemode22d ago

Where the long-term payoff still seems speculative, is for companies doing training rather than just inference.

Gigachad22d ago

There’s a lot of debate over what the useful lifespan of the hardware is though. A number that seems very vibes based determines if these datacenters are a good investment or disastrous.

hypercube3322d ago

What I'm curious about are what about the other stuff out there such as the ARM and tensor chips.

raincole22d ago

All of them. It's simply impossible to sell tokens by usage at a loss now. You'll be arbitraged to death in a few days. It only makes sense to subsidize cost if you're selling a subscription.

avidiax21d ago

How do you arbitrage closed weight models? Who would buy from a middle man at increased price? Who is offering Priceline but for tokens?

dannersy22d ago

If they were they would show evidence because they'd pull in more investment. I don't believe their claim that they make profits on inference, especially not with reports like this coming out.

jagged-chisel22d ago

Google definitely makes money in other areas. Do they make money on inference?

quicklywilliam22d ago

esperent22d ago

> But we all know the frontier model companies are burning through money in an unsustainable race to get you and your company hooked on their tools.

ai-x22d ago

Anthropic has 50% gross margins on their tokens.

Step 1) Bubble callers will be proven wrong in 2026 if not already (no excess capacity)

Step 2) Models are not profitable are proven wrong (When Anthropic files their S1)

Step 3) FOMO and actual bubble (say around 2028/29)

dminik22d ago

If they had such a high margin, they wouldn't need to fuck around with token usage/pricing every three days.

I have no data to support this, but I think they just about break even on API usage and take overall loss on subscriptions/free plans.

ai-x21d ago

Math / Economics 101 thought experiment.

You have (limited) 100 Coke cans to sell (that you bought for say $1)

There are two large lines being formed for that. One line is offering an average $3 per bottle and another line is offering an average $2 per bottle.

Tell me which line they would throttle/starve even though they make a profit out of it.

1 more reply

284848499522d ago

Can we see them?

ai-x21d ago

https://www.theinformation.com/articles/anthropic-lowers-pro...

I have access to that article

https://www.saastr.com/have-ai-gross-margins-really-turned-t...

Like I said, majority of people (including smart ones) are going to be surprised by the profit margins of AI labs and there will be a mad rush to buy AI stocks till it reaches bubble proportions.

2025 was merely a 1996 "Irrational Exuberance" moment. We haven't seen the late 1999 mania yet

agentifysh22d ago

Until there is some drastic new hardware, we are going to see a similar situation to proof of work, where a small group hordes the hardware and can collude on prices.

Difference is that the current prices have a lot of subsidies from OPM

Once the narrative changes to something more realistic, I can see prices increase across the board, I mean forget $200/month for codex pro, expect $1000/month or something similar.

So its a race between new supply of hardware with new paradigm shifts that can hit market vs tide going out in the financial markets.

jiggawatts22d ago

> Until there is some drastic new hardware

For inference, there is already a 10x improvement possible over a setup based on NVIDIA server GPUs, but volume production, etc... will take a while to catch up.

New silicon process nodes are coming from TSMC, Intel, and Samsung that should roughly double the transistor density.

There's also algorithmic improvements like the recently announced Google TurboQuant.

zozbot23422d ago

> Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.

jiggawatts22d ago

For a given capacity of memory, Flash uses far less power than DRAM, especially when used mostly for reads.

> becomes negligible at scale

> sharding a single model over large amounts of GPUs

A single host server typically has 4-16 GPUs directly connected to the motherboard.

1 more reply

colechristensen22d ago

Doubtful, local models are the competitive future that will keep prices down.

128GB is all you need.

Apple will win this battle and nvidia will be second when their goals shift to workstations instead of servers.

hypercube3322d ago

MrBuddyCasino22d ago

Strix Halo hasn‘t got nearly enough bandwidth, its just 256bit.

1 more reply

lookaround22d ago

> 128GB is all you need.

My guy, look around.

They are coming for personal compute.

Where are you going to get these 128GBs? Aquaman? [0]

The ones who make RAM are inexplicably attaching their fate to the future being all LLMs only everywhere.

[0] https://www.youtube.com/watch?v=0-w-pdqwiBw

naveen9922d ago

Cloud can’t make money off of you and pay more than you for the hardware at the same time.

2 more replies

foota22d ago

More like RAM producers are providing supplies to the highest bidder, no? If this doesn't peter out supply will normalize at a higher but less insane price eventually.

hyperpape21d ago

This is an interesting analysis, but "are the costs of AI agents also rising exponentially is?" is a very bad question that this doesn't answer.

What's rising exponentially is the price of the most ambitious thing cutting edge agents can do.

But to answer whether the cost of AI agents is rising in general, you would take a fixed set of problems, and for each of them, ask "once it's solvable, how does the price change?"

simianwords21d ago

The cost of models are almost exponentially decreasing with time.

The author performs a non sequitur by muddling two concepts of time. They say costs are getting “unsustainable” which is not a conclusion that follows.

What is true is that at a given point in time, cost to perform a task is exponentially related to the human time taken. But it does not mean it will remain that way.. far from it.

matt321022d ago

I took a month break and my side project took 2x as much tokens

lwhi22d ago

I think an interesting counterpoint, is whether the value obtained is reducing.

jillesvangurp21d ago

lwhi21d ago

Value feels pretty relative to me. If anyone can do a 'thing', is that thing worth less?

jillesvangurp21d ago

If you still need that thing done, the value is basically however you value your time. Would you pay extra for having someone or something do that for you instead?

1 more reply

siliconc0w22d ago

Working on a oss tool to help orgs identify where they can save on token costs: https://repogauge.org

Happy to run it on your repos for a free report: hi@repogauge.org

EdvinPL22d ago

AI feels more like a gamble. People like gambling. From casinos (win-loose), to lootboxes (uncertainty) or even extramarital sex (whose baby is it?).

This way - AI work is like a slot machine - will this work or not? Either way - casino gets paid and casino always wins.

Nevertheless - if the idea or product is very good (filling high market pain) and not that difficult to build - it can enable non-coders to "gamble" for the outcome with AI for $.

Sadly - from by experiences hiring Devs - hiring people is also a gamble...

ketzu21d ago

> or even extramarital sex (whose baby is it?).

This made my day.

sh4rks21d ago

Writing code by hand is gambling (will it compile or not, will it pass code review or not)

stavros21d ago

Under this definition, everything is gambling, including commenting on HN (will I get upvoted or downvoted?).

pocksuppet21d ago

There used to be forums without voting. It was discovered that forums with voting attract more engagement because of the emotions produced by the voting.

BarryMilo21d ago

It also used to be that reddit comments were the epitome of quality in their time, much closer to current HN if not better. I attributed that to the voting mechanism; clearly I was mistaken.

sirl1on21d ago

Everything is modeled after gambling nowadays to nudge and gamify the user experience to a desired outcome

stavros21d ago

If your definition of "gambling" is "things with uncertain outcome", then there's nothing in life that's not gambling.

metalglot21d ago

Not if you use local models, only if you're using cloud llm's

noosphr22d ago

Yet again: Transformers are fundamentally quadratic.

If they can do a task that takes 1 unit of computation for 1 dollar they will cost 100 dollars for a 10 unit task and 10,000 for a 100 unit task.

Project costs from Claude Code bear this out in the real world.

twaldin22d ago

idk over my testing, glm-5 inside opencode beats all other agents head to head

stainablesteel21d ago

it's not like cost and energy use aren't competitive factors in this game

the first model to outcompete its competitors while using less compute would be purchased more than anything else

keepamovin22d ago

chii22d ago

> supply will saturate

that depends on the ability to produce supply at a saturation rate.

It did work for internet backhaul links - ala, those dark fibres. However, i reckon those fibres are easier to manufacture than silicon chips.

I wonder if saturation is possible for ai capable chips.

keepamovin21d ago

no concerns for the hardware chain long view. saturate is AI as utility ubiquitous in everything.

simianwords21d ago

Why does the author suggest that costs are getting unsustainable when costs are actually decreasing exponentially over time?

It’s true that at a given point in time the cost to achieve a certain task follows exponential curve against time taken by a human. But.. so what?

twotwotwo21d ago

- Smaller chunks make review much easier and more effective at finding bugs, as we've known since long before LLMs.

j / k navigate · click thread line to collapse