I tested this once: I gave the same task to a model right after release and again a couple of weeks later. On the first attempt it produced well-written code that worked beautifully; I started to worry about the jobs of software engineers. The second attempt was a nightmare, like a butcher acting as a junior developer performing surgery on a horse.
Is this empirical evidence?
And this is not only my experience.
Calling this psychological is gaslighting.
Look, I'm not defending the big labs, I think they're terrible in a lot of ways. And I'm actually suspending judgement on whether there is ~some kind of nerf happening.
But the anecdote you're describing is the definition of non-empirical. It is entirely subjective, based entirely on your experience and personal assessment.
Unless he was able to sample at temperature 0 (and got fully deterministic results both times), this could just be random chance. And experience as a SWE doesn't imply experience with statistics and experiment design.
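To make the point concrete, here's a rough sketch of what even a minimal controlled comparison would look like, assuming an OpenAI-style chat API (the model id, prompt, and sample count are all made up for illustration). The takeaway: you'd need many samples per version, not one anecdote per version.

```python
# A minimal sketch, assuming the OpenAI Python SDK (v1+). The model id,
# prompt, and sample count are placeholders, not anything from the thread.
from openai import OpenAI

client = OpenAI()

def sample(prompt: str, n: int = 20) -> list[str]:
    """Collect n completions of the same prompt at temperature 0."""
    outputs = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model id
            messages=[{"role": "user", "content": prompt}],
            temperature=0,   # greedy-ish decoding, but see the comment below
        )
        outputs.append(resp.choices[0].message.content)
    return outputs

# Even at temperature 0, served models are not guaranteed to be
# deterministic (batching and floating-point nondeterminism), so two
# single samples taken weeks apart tell you almost nothing about
# whether the model itself changed.
before = sample("Write a function that parses ISO 8601 timestamps.")
print(len(set(before)), "distinct outputs in 20 runs")
```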
Well, if we see it this way, this is true for Anthropic's benchmarks as well.
Btw the definition of empirical is: “based on observation or experience rather than theory or pure logic”
So what I described is the exact definition of empirical.
Users are not reliable model evaluators. It's a lesson the industry will, I'm afraid, have to learn and relearn over and over again.
Whether something is a bug or feature.
Whether the right thing was built.
Whether the thing is behaving correctly in general.
Whether, at a given moment, it's better that the thing occasionally works for a whole range of stuff or that it works perfectly for a small subset.
Whether fast results are more important than absolutely correct results for a given context.
Yes, all of the things above are also related to each other.
The most we have for LLMs is tallying up each user's experience using an LLM over a period of time for a wide range of "compelling" use cases (though the pairings of their prompts and results are empirical, right?).
This should be no surprise, as humans often can't agree on an end-all-be-all intelligence test for humans either.
Why? Because humans suck.
The only thing that matters, and the only way to evaluate performance, is the end result.
But hey, the solution is easy: Anthropic can release their own benchmarks, so everyone can test their models at any time. Why don't they do it?
Why in the world, if I'm paying the loss leader price for "unlimited" usage of these models, would any of these companies literally respect my preference to have unfettered access to the most expensive inference?
Especially since one of the hallmark features of GPT-5 was a fancy router system that automatically decides when to use more or less inference, I'm very wary of those `/model` settings.
The way this works is:
1) x% of users have an exceptional first experience by chance. Nobody who has a meh first experience bothers to try a second time.
2) (x%)² of users (x% of that x%) also have an exceptional second experience by chance.
3) So a lot of people with a great first experience think the model started off great and then suddenly got worse.
Suppose 25% have a really great first experience. 25% of them have a great second experience too, but the other 75% see a sudden decline in quality and decide it must be intentional. After the third experience, the population that has "seen a decline" grows again.
So by pure chance and sampling bias you end up convincing a bunch of people that the model used to be great but has gotten worse, while the population who thought it was terrible but got better is much smaller, because most of that group gave up early.
This is not in their heads: they really did see declining success. But they experienced it without any changes to the model at all.
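A quick simulation makes that arithmetic concrete. Everything here is hypothetical: the 25% per-session success rate is just the figure assumed above, and the population size is arbitrary.

```python
# Simulate the sampling-bias story: quality never changes, each session
# independently "feels great" with probability p, and users who have a
# bad first session never come back. All numbers are made up.
import random

random.seed(0)
p = 0.25            # chance any single session feels exceptional
users = 100_000

stayed = got_worse = still_great = 0
for _ in range(users):
    first = random.random() < p
    if not first:
        continue    # meh first experience -> gives up, is never heard from
    stayed += 1
    second = random.random() < p
    if second:
        still_great += 1
    else:
        got_worse += 1

print(f"{stayed} users came back after a great first session")
print(f"{got_worse} of them ({got_worse / stayed:.0%}) saw a 'sudden decline'")
print(f"{still_great} ({still_great / stayed:.0%}) still think it's great")
```

Run it and roughly 75% of the returning users report a decline, even though the "model" in the simulation literally never changed.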
The first time you see a dog that can make pancakes, you’re really focused on the fact that a dog is making pancakes.
After a few weeks of having them for breakfast, you start to notice that the pancakes are actually kind of overcooked and don’t taste that good. Sure it’s impressive that a dog made them, but what use are sub-par pancakes? You’re naturally more focused on what it can’t do than what it can.