Yet the data doesn’t show all that much difference between SOTA models. So I have a hard time believing it.
Like, I think there's definitely value in prompting a dozen LLMs with a detailed description of a CMS you want built with 12 specific features, a unit testing suite and mobile support, and then timing them to see how long they take and grading their results. But that's not how most developers use an LLM in practice.
Until LLMs become reliable one-shot machines, the thing I care most about is how well they augment my problem-solving process as I work through a problem with them. I have no earthly idea how to measure that, and I'm highly skeptical of anyone who claims they do. In the absence of empirical evidence, we have to fall back on intuition.
I found this worked surprisingly well. I was certain Claude was best, while they liked Grok and someone else liked ChatGPT. Some AIs just end up fitting best with how you like to chat, I think. I also definitely find Claude best for coding.
Thing is, everyone knows the benchmarks are being gamed. Exactly how is beside the point. In practice, anecdotally, Opus 4.5 is noticeably better than 4, and GPT 5.2 has also noticeably improved. So maybe the real question is: why do you believe this data when it seems at odds with observations by humans in the field?
> Jeff Bezos: When the data and the anecdotes disagree, the anecdotes are usually right.
https://articles.data.blog/2024/03/30/jeff-bezos-when-the-da...
Most of what I can do now with them I could do half a year to a year ago. And all the mistakes and fail loops are still there, across all models.
What changed is the number of magical incantations we throw at these models in the form of "skills" and "plugins" and "tools" hoping that this will solve the issue at hand before the context window overflows.
Unfortunately, I think the overlap between actual model improvements and what people perceive as "better" is quite small. Combine this with the fact that most people desperately want to have a strong opinion even when the factual basis is very weak... "But I can SEE it is X now".
I would think this also correlates with the type of person who hasn't done enough data analysis themselves to understand all the lies and misleading half-truths "data" often tells. Conversely, experience with data inoculates one to some degree against a bullshitting LLM, so it is probably easier to get value from the model.
I would imagine there are all kinds of factors like this that multiply, so some people are having vastly different experiences with the models than others.