The models were mostly GPT-5 and Claude Sonnet 4. The study was too early to catch the 5.x Codex or Claude 4.5 models (bar one mention of Sonnet 4.5).
This is notable because a lot of academic papers take 6-12 months to come out, by which time the LLM space has often moved on by an entire model generation.
This is a recurring argument which I don't understand. Doesn't it simply mean that whatever conclusion they drew was valid then? The research process is about approximating a better description of a phenomenon in order to understand it, not about providing a definitive answer. Being "an entire model generation" behind would matter if a fundamental problem had been solved, e.g. hallucinations eliminated, but if the changes are incremental then the conclusions most likely remain correct. Which fundamental change (I don't think labeling newer models as "better" is sufficient) do you believe invalidates their conclusions in this specific context?
Just the jump from Sonnet 3.5 to 3.7 to 4.5, and to Opus 4.5, has been pretty massive in terms of holistic reasoning, deep knowledge, and better procedural and architectural adherence.
GPT-5 Pro convinced me to pay $200/mo for an OpenAI subscription. The regular 5.2 models, and 5.2 Codex, are leagues better than GPT-4 at solving problems procedurally, using tools, and discussing scientific, mathematical, philosophical, and engineering problems in depth.
Models have increasingly long context, especially some Google models. OpenAI has released very good image models, and great editing-focused image models have been released in general. Predictably better multimodal inference is unlocking many cool near-term possibilities.
Additionally, we have seen some incredible open-source and open-weight models released this year, some fully commercially viable without restriction. And more and more smaller TTS/STT projects are in active development, with a few notable releases this year.
Honestly, the landscape at the end of the year is impressive. There has been great work all over the place, almost too much to keep up with. I'm very interested in the Genie models and a few others.
To give an idea:
At the beginning of the year, I was mildly successful at getting coding models to make changes in some of my codebases, but the more esoteric problems were out of reach. Progress in general was deliberate and required a lot of manual intervention.
By comparison, in the last week I've prototyped six applications at a level that would have taken me days to weeks each on my own, often developing several at the same time, monitoring agentic workflows and intervening only when necessary. I rely on long preproduction phases with architectural discussions and the development of documentation, requirements, SDDs... and on detailed code review and refactoring processes to ensure adherence to constraints. I'm morphing from a very busy solo developer into a very busy product manager.
A paper comes out that says "we did a study of developers and found that AI assistance had no impact on their productivity (using the state-of-the-art models available in September 2024)," and a lot of people will point to that as incontestable evidence that "AI doesn't work".
Certainly some scientists are just absurdly efficient, and maybe so are all 28 involved teams, but that's still a lot.
Personally speaking, this gives me second thoughts about their dedication to truly accurately measuring something as notoriously tricky as corporate SWE performance. Any number of cut corners in a novel & empirical study like this would be hard to notice from the final product, especially for casual readers…TBH, the clickbait title doesn’t help either!
I don’t have a specific critique on why 4 months is definitely too short to do it right tho. Just vibe-reviewing, I guess ;)
It takes about 6 months to figure out how to get LaTeX to position figures where you want them, and then another 6 months to fight with reviewers
Results are getting worse and less accurate; hell, I even had Claude drop some Chinese into a response out of the blue one day.
Off your intuition, do you think the same study with Codex 5.2 and Opus 4.5 would see even better results?
I've seen people unable to work at average speed on small features suddenly reach above-average output through an LLM CLI, and I could sense the pride in them. Which is at odds with my experience of work... I love to dig down, know a lot, model and find abstractions on my own. There, an LLM will 1) not understand how my brain works and 2) produce something workable but that requires me to stretch mentally... and most of the time I leave numb. In the last month I've seen many people expressing similar views.
ps: thanks everybody for the answers, interesting to read your pov
And if you let the AI too loose, as when you try to vibe-code an entirely new program, you end up in the situation where in one day you have a good prototype, and then you can easily spend five times as long sorting out the many issues and refactoring so it can scale to the next features.
Long iteration cycles are taxing
But it does feel less fulfilling I suppose.
Strongly suspect this is simply less efficient than doing it yourself if you have enough expertise.
> Number of Survey Respondents
> Building apps 53
> Testing 1
I think this sums up everybody's complaints about AI-generated code: don't ask me to be the one to review work you didn't even check.
Most of the work brought to me gets done before I even think about sitting down to type.
And it's interesting to see the divide here between "pure coder" and "coder + more". A lot of people seem to be in the job just to do what the PM, designer, and business people ask. A lot of the work is pushing back against some of those requests. In conversations here on HN about "essential complexity" I even see commenters arguing that the spec brought to you is entirely essential. It's not.
We're in the midst of another abstraction level becoming the working layer - and that's not a small layer jump but a jump to a completely different plane. And I think once again we'll benefit from getting tools that help us specify the high-level concepts we intend, and ways to enforce that the generated code is correct - not necessarily fast or efficient, but at least correct - same as compilers do. And this lift is happening on a much more accelerated timeline.
The problem of ensuring correctness of the generated code across all the layers we're now skipping is going to be the crux of how we manage to leverage LLM/agentic coding.
Maybe Cursor is Turbo Pascal.
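To make "enforce that the generated code is correct" concrete, here's a minimal sketch of one such gate, assuming pytest plus the hypothesis library; generated_sort and reference are illustrative names, not anything from the study or the thread. The idea is to pin generated code to a trusted spec with property-based tests:

```python
# Minimal sketch: enforcing behavioral correctness of generated code by
# checking it against a trusted reference implementation.
# (generated_sort and reference are illustrative names.)
from hypothesis import given, strategies as st


def reference(xs):
    # Trusted spec: Python's built-in sort.
    return sorted(xs)


def generated_sort(xs):
    # Stand-in for LLM-generated code under review (here, insertion sort).
    out = list(xs)
    for i in range(1, len(out)):
        j = i
        while j > 0 and out[j - 1] > out[j]:
            out[j - 1], out[j] = out[j], out[j - 1]
            j -= 1
    return out


@given(st.lists(st.integers()))
def test_generated_matches_spec(xs):
    # The "compiler-like" gate: the generated code must agree with the
    # spec on every input the property test can find.
    assert generated_sort(xs) == reference(xs)
```

The gate, not the sort, is the point: the generated implementation is disposable, while the spec-level check stays fixed, which is roughly the compiler-like guarantee described above.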
Should we be trying to put the genie back in the bottle? If not, what exactly are you suggesting?
Even if we all agreed to stop using AI tools today, what about the rest of the world? Will everybody agree to stop using it? Do you think that's even a remote possibility?
Software Devs not so much.
There is a huge difference between the two and they are not interchangeable.
Your take is this meme https://knowyourmeme.com/memes/dig-the-fucking-hole.
I. Don't. Care.
I don't even care about those debates outside. Debates about whether LLMs work and will replace programmers? Say they do. OK, so what?
I simply have too much fun programming. I am just a mere full-stack business-line programmer, a generic, random, replaceable dude; you can find us a dime a dozen.
I do use LLMs as a Stack Overflow/docs replacement, but I always write all my code by hand.
If you want to replace me, replace me. I'll go to companies that need me. If there are no companies that need my skill, fine, then I'll just do this as a hobby, and probably flip burgers outside to make a living.
I don't care about your LLM, I don't care about your agent, and I probably don't even care about the job prospects if it means being forced to use tools I don't like and workflows I don't like. You can go ahead and find others who are willing to do it for you.
As for me, I simply have too much fun programming. Now if you'll excuse me, I need to go have fun.
(1) already have enough money to survive without working, or
(2) don't realize how hard of a life it would be to "flip burgers" to make a living in 2026.
We live very good lives as software developers. Don't be a fool and think you could just "flip burgers" and be fine.
I also did dry cleaning, cleaning service, deli, delivery guy, etc.
Yup, I now have enough money to survive without working.
But I am also very low-maintenance, thanks to being raised in harsh conditions early in life.
I am not scared to go back flipping burgers again.
People need to make money to survive, now more than ever. It seems incredibly selfish to wish for that to disappear just so you can "purify" the profession.
I'd at least be more likely to get a boost in impact and ability to affect decision making, maybe.
or something like that
Like I said, I am just a generic replaceable dime a dozen programmer dude.
As with every new tech there's a hell of a lot of noise (plugins, skills, hooks, MCP, LSP - to quote Karpathy), but most of it can just be disregarded. No one is "behind" - it's all very easy to use.
It's like saying all you need is Notepad to develop. It's not wrong, but... you know.
"I’m on disability, but agents let me code again and be more productive than ever (in a 25+ year career). - S22"
Once the Social Security Administration learns this, there goes the disability benefit...
I'm in the back-and-forth camp. I expect a lot of interesting UX to develop here. I built https://github.com/backnotprop/plannotator over the weekend to give me a better way to review and collaborate around plans, all while natively integrated into the coding-agent harness.
Do it in the way that makes you feel happy, or conforms to organizational standards.
Well
There are many contexts in which programming a computer well is not important.
I've seen this with code generation tools - developers who treat AI suggestions as magic often struggle when the output doesn't work or introduces subtle bugs. The professionals who succeed are those who understand what the AI is doing, validate the output rigorously, and maintain clear mental models of their system.
This becomes especially important for code quality and technical debt. If you're just accepting AI-generated code without understanding architectural implications, you're building a maintenance nightmare. Control means being able to reason about tradeoffs, not just getting something that "works" in the moment.
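As one concrete form of "validate the output rigorously": a minimal sketch of a gate you could run on any AI-generated change before accepting it. mypy and pytest are assumptions about the project's tooling here; the script is illustrative, not a prescribed workflow.

```python
# Sketch of a "validate before you trust" gate for AI-generated changes:
# run the type checker and the test suite, and refuse the change if
# either fails. (mypy/pytest are assumed to be the project's tools.)
import subprocess
import sys


def gate(paths):
    checks = [
        ["mypy", *paths],   # static check: types still line up
        ["pytest", "-q"],   # behavioral check: tests still pass
    ]
    for cmd in checks:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"rejected: {' '.join(cmd)} failed")
            return False
    print("accepted: all checks passed")
    return True


if __name__ == "__main__":
    sys.exit(0 if gate(sys.argv[1:]) else 1)
```

Anything that fails the gate goes back for rework; nothing lands on the strength of the model's say-so alone.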
Out of curiosity, if I wanted to set up cscope for a bunch of small projects, say dozens of prototypes each in their own directory, would it be useful? Too broad?
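One hedged way to keep it useful rather than too broad: build a separate cscope database per prototype instead of one giant index. A minimal sketch, assuming a layout like prototypes/<name>/ with C sources; the paths and extensions are illustrative.

```python
# Hedged sketch: build one cscope database per prototype directory, so
# lookups stay scoped instead of spanning one giant index.
# (ROOT and EXTS are assumptions; adjust for your tree.)
import os
import subprocess

ROOT = "prototypes"   # parent dir holding the small projects
EXTS = (".c", ".h")   # extend for C++/other sources

for project in sorted(os.listdir(ROOT)):
    pdir = os.path.join(ROOT, project)
    if not os.path.isdir(pdir):
        continue
    files = [
        os.path.join(dirpath, f)
        for dirpath, _, names in os.walk(pdir)
        for f in names
        if f.endswith(EXTS)
    ]
    if not files:
        continue
    listing = os.path.join(pdir, "cscope.files")
    with open(listing, "w") as fh:
        fh.write("\n".join(files) + "\n")
    # -b: build only, -q: faster inverted index, -k: skip /usr/include
    subprocess.run(
        ["cscope", "-b", "-q", "-k",
         "-i", listing,
         "-f", os.path.join(pdir, "cscope.out")],
        check=True,
    )
```

Per-project databases keep queries scoped to the prototype you're working in; if you later want cross-project lookups, the same walk can emit a single combined cscope.files at the top level instead.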
So essentially what this means is a declarative programming system for overall system behavior.
Not a statistically significant sample size.
> This is a qualitative methods paper, so statistical significance is not relevant.
I have never heard of a "qualitative methods paper" and it sounds like something a researcher would do to push a narrative with "qualitative data" rather than data that could be measured.
Tell me why I am wrong.