We went from simple chatbots to thinking models, which massively increased token usage.
We then went from simple thinking models to tool calls and agents. Agents, and particularly long-horizon agents, burn truly insane numbers of tokens, blowing thinking models well out of the water.
People are trying to do agentic swarms as the next step, but I don't think those make sense right now. In particular, they are just too expensive and not that useful.
Plus right now the models just aren't good at it. It's like early agents when they first started making tool calls.
Agents are really quite bad at using subagents. They don't really internalize how to deploy them, and they don't use them in the ways that make sense (producing planning documents, requiring verifiable artifacts, breaking down tasks in ways that minimize risk, recognizing model limitations in instruction following, iterating on results, etc.).
Your last paragraph is also striking in that it exemplifies how far away from general intelligence they still are.
Most of everything tends to suck. Most projects go nowhere, most companies fail, most scientific papers are garbage.
> how far away from general intelligence they still are
Economically, the real question is to what extent these systems can replace or augment human labour. And I think right now the extent is pretty shocking, even if it is not yet very well integrated.
Scientifically, the fact that they are bad at using subagents is sort of expected. How to use agents effectively is still a bit of an open question. A human from mid-2025 would be bad at it. Why should a model trained on data from 2025 be good at it?
If these things are to be generally intelligent, they need feedback and retraining, which presumably the labs will do once these sorts of questions start having good answers and we can create good benchmarks and measures for meta-orchestration.
Claude uses up its 6-hour (or whatever) quota in a couple of coding prompts. Buy extra credits for the same amount as a monthly subscription and they're used up in 3 hours.
Kimi gives me about double what Claude does per window but uses up its entire weekly quota in the same time, for the same price as Claude. And I get worse results.
Gemini worked OK for a day or two and now is running one tool every 30 minutes and getting nothing done. Apparently they've been in constant outage status for nearly a month: https://aistudio.google.com/status
I haven't tried ChatGPT because of ethical issues but well, I'm not sure that makes any sense.
Four prompts a day isn't something where I go, wow, this has revolutionized my programming. I might very well be getting more done if I weren't fighting with the constant CLI bugs and with work left half finished for anywhere from 3 hours to 5 days when my quota is used up.
At some point in the next few years investors are going to want their returns. The only way I see that happening is through an IPO and then… I don’t know if they have a sustainable business model or one in sight.
Which doesn't mean an end to the AI race, since China is unlikely to care whether US companies secure financing.
Also, if this happens, OpenAI will probably be bailed out with taxpayer money.
Anthropic is capturing exploding enterprise demand via their agentic tools; OpenAI is failing (relatively) to do so. They’re stuck trying to squeeze more $$ out of consumer chatbots that have reached the second knee of the S-curve.
The obvious story seems to be that OpenAI was reckless and got way ahead of their revenue, assuming it would keep hockey-sticking.