I was surprised at the number and calibre of orgs that came to me who would basically say anything I wanted for cash. This opened my eyes and made me very suspicious of published media.
People get excited about an update in such a rapidly changing space. It really is that simple.
My theory is that LLMs will be commoditized within the next year. The edge OpenAI had over the competition is arguably lost. If the trend continues, we'll see inference priced like a commodity, where only the most efficient providers, like Cerebras and Groq, actually make money in the end.
Anyway, I'm sure gpt-5 will be AGI.
Yes, this is why Apple famously just dumped the original iPhone on the market without telling anybody about it ahead of time.
That's certainly not how the first iPhone is usually described.
If it were good, there would also be no need for their devs to tell people to temper their expectations, alas... [0]
[0] https://xcancel.com/alexwei_/status/1946477756738629827#m
Incredibly bad theory. It's as if you're saying every LLM is the same because they can all talk, even though the newer ones keep smashing through benchmarks the older ones couldn't touch. And it now happens quarterly instead of yearly, so you can't even claim it's slowing down.
In my mind there are really three dimensions they can differentiate on: cost, speed, and quality. Cost is hard because they’re already losing money. Speed is hard because differentiation would require better hardware (more capex).
For many tasks, perhaps even a majority right now, quality of free models is approaching good enough.
OpenAI could create models that are unambiguously more reliable than the competition, or ones that can answer questions no other model can. Neither has happened yet, afaik.
I get the impression that OpenAI will rename what's intended as o4 to gpt-5 and package it as such.
Nothing annoys me more than OpenAI acting like something is rolling out (or rolled out already) and then taking forever to do so.
> ChatGPT agent starts rolling out today to Pro, Plus, and Team; Pro will get access by the end of day, while Plus and Team users will get access over the next few days.
"Next few days" - It's 8 days later (so far). Lest one think "It's only 8 days, geez, calm down": They do this _all the time_. I don't even remember the length of the gap between announcing the enhanced voices and then forgetting about it completely before it finally rolled out.
It sours every announcement they make in my opinion.
If he didn't understand the question how could he know the model answered it perfectly?
It makes selling improvements fairly hard actually. If the last model already wrote an amazing poem about hot dogs, the English language doesn’t have superlatives to handle what the next model creates.
If we are being pedantic we could never accept "question we don't know how to answer" as a possible interpretation of "question we don't understand".
Also, 'thing that I don't know about but is broadly and uncontroversially known and documented by others' is sort of dead center of the value proposition for current-generation LLMs and also doesn't make very impressive marketing copy.
Unless he's saying that he fed it an unknown-to-experts-in-the-field question and it figured it out in which case I am very skeptical.
But more seriously: it's ridiculous to claim you understand the answer when you don't understand the question in the first place.
Turning to their reasoning models: it's also known and documented through SimpleQA and PersonQA that OpenAI's o3 hallucinates more than o1, and o4-mini more than o3. There's an unmanaged issue where training on synthetic data improves benchmark results on STEM tasks but increases hallucination rates, and it seems to trouble OpenAI models in particular (my guess: they're fine-tuned to take risks, since risk-taking is known to also increase the likelihood of getting hard tasks right).
According to an anonymous Googler who commented on this, Google has long known that OpenAI struggles with hallucinations more than they do; the aforementioned benchmarks bear that out. Anthropic also struggles less. But as far as I can tell, they're all finding synthetic data to be a double-edged sword.
So GPT-5 is going to be interesting. Exactly how well it does will say a lot about the kind of trouble OpenAI is in right now. Maybe OpenAI has found a novel approach to reducing hallucinations? I think that's among their most crucial problems right now. But otherwise, no, I don't expect a revolution, only an evolution. They might win benchmarks for now, but it will hardly be something that catapults them ahead.
If GPT-5 underwhelms, it will carry a stronger signal than the underwhelming release itself. It would mean OpenAI is in trouble with both its non-reasoning and reasoning models, and that we're likely looking at the end of the road for current GPT-based LLMs, with the eventual winners probably being cheaper open-weight models once they catch up.
This happened because training progressively larger models used to be the main path forward, which was easy to track and name, but currently it's all about quickly incorporating synthetic data chain-of-thoughts created by flash models.
Also, programming needs to be redesigned from the ground up to be LLM-first.
The most positive metaphor I have heard about why LLM coding assistance is so great is that it's like having a hard-working junior dev that does whatever you want and doesn't waste time reading HN. You still have to check the work, there will be some bad decisions in there, the code maybe isn't that great, but you can tell it to generate tests so you know it is functional.
OK, let's say I accept that 100% (I personally haven't seen evidence that LLM assistance is really even up to that level, but for the sake of argument). My experience as a senior dev is that adding juniors to a team slows down progress and makes the outcome worse. You only do it because that's how you train and mentor juniors to be able to work independently. You are investing in the team every time you review a junior's code, give them advice, answer their questions about what is going on.
With an LLM coding assistant, all the instruction and review you give it is just wasted effort. It makes you slower overall and you spend a lot of time explaining code and managing/directing something that not only doesn't care but doesn't even have the ability to remember what you said for the next project. And the code you get out, in my experience at least, is pretty crap.
I get that it's a different and, to some, interesting way of programming-by-specification, but as far as I can tell the hype about how much faster and better you can code with an AI sidekick is just that -- hype. Maybe that will be wrong next year, maybe it's wrong now with state-of-the-art tools, but I still can't help thinking that the fundamental problem, that all the effort you spend on "mentoring" an LLM is just flushed down the toilet, means that your long term team health will suffer.
I think that betrays a fundamental misunderstanding of how AI is moving the goalposts in coding.
Software engineering has operated under a fundamental assumption that code quality is important.
But why do we value the "quality" of code?
* It's easier for other developers (including your future self) to understand, and easier to document.
* It's easier to change when requirements change.
* It's more efficient with resources and performs better (CPU/network/disk).
* It's easier to write tests for if it's properly structured.
AI coding upends a lot of that, because all of those goals presume a human will, at some point, interact with that code in the future.
But the whole purpose of coding in the first place is to have a running executable that does what we want it to do.
The more we focus on the requirements and on guiding AI to write tests proving those requirements are fulfilled, the less we have to actually care about the 'quality' of the code it produces. Code quality isn't a requirement; it's a vestigial artifact of human involvement in communicating with the machine.
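To make that concrete, here's a minimal sketch of the requirements-first loop I mean (the function and its spec are entirely hypothetical): you pin down the behavior as executable checks, then treat the implementation body as an opaque, machine-generated artifact you never read.

```python
# Hypothetical example: the requirement is the test, not the code.
# Imagine the body of apply_discount was AI-generated; we only care
# that the executable checks below pass.

def apply_discount(price: float, percent: float) -> float:
    """Opaque, machine-generated implementation (for illustration)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be in [0, 100]")
    return round(price * (1 - percent / 100), 2)

# The 'requirements', expressed as checks. The internal quality of
# the function above is irrelevant as long as these hold.
assert apply_discount(100.0, 25) == 75.0
assert apply_discount(19.99, 0) == 19.99
try:
    apply_discount(10.0, 150)
except ValueError:
    pass
else:
    raise AssertionError("out-of-range discount should raise")
```

In this framing, the asserts are the spec; whether the generated body is elegant or awful never enters the picture.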
Yes, because non-deterministic systems make great software. I mean, who wants repeatable execution in the control program for their nuclear submarine or their hospital's lighting controls? Why would anyone want a computer capable of actual math running the President's nuclear "football" when we can have the output of hallucinating tools running there instead?
Screw their organization verification. I am taking my business to Claude or DeepSeek.