Their study still shows something interesting, and quite surprising. But extrapolating from this specific setting to claim that coding assistants don't work in general isn't scientific, and you need to be careful about doing so.
I think the study should probably decrease your prior that AI assistants actually speed up development, even if developers using AI tell you otherwise. The fact that it feels faster when it is actually slower is super interesting.
Being armed with that knowledge is useful when thinking about my own productivity: I know there's a risk of me over-estimating the impact of this stuff.
But then I look at https://github.com/simonw which currently lists 530 commits over 46 repositories for the month of December, which is the month I started using Opus 4.5 in Claude Code. That looks pretty credible to me!
It (along with the hundreds of billions in investment hinging on it) explains the legions of people online who passionately defend their "system". Every gambler has a "system", and they usually earnestly believe it is helping them.
Some people even write popular (and profitable!) blogs about playing slot machines where they share their tips and tricks.
We know LLMs follow instructions meaningfully and relatively consistently; we know they are in-context learners and pull knowledge from their context window; we also know that prompt phrasing, and especially organization, can have a large effect on their behavior in general; and we know from first principles that you can improve the reliability of their results by putting them in a loop with compilers / linters / tests, because they do actually fix things when you tell them to. None of this is equivalent to a gambler's superstitions. It may not be perfectly effective, but neither are a million other systems, best practices, and paradigms in software.
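The feedback loop described here can be sketched in a few lines. This is a minimal illustration, not anyone's actual tooling: `ask_model` is a hypothetical stand-in for a real LLM call, stubbed out here so the example is self-contained, and the "checker" is a toy linter rather than a real compiler or test suite.

```python
import re

def check(code: str) -> list[str]:
    """Toy stand-in for a compiler/linter/test run: return a list of problems."""
    problems = []
    # Flag bare eval() calls; \b avoids matching ast.literal_eval(.
    if re.search(r"\beval\(", code):
        problems.append("avoid eval(); use ast.literal_eval for literals")
    return problems

def ask_model(prompt: str, code: str) -> str:
    """Hypothetical LLM call. Stubbed to deterministically apply the fix
    the checker asked for, so this sketch runs without any API."""
    return code.replace("eval(", "ast.literal_eval(")

def fix_loop(code: str, max_rounds: int = 3) -> str:
    """Feed checker output back to the model until the checks pass
    (or we give up after max_rounds). This is the 'loop' in question."""
    for _ in range(max_rounds):
        problems = check(code)
        if not problems:
            return code
        code = ask_model("Fix these problems: " + "; ".join(problems), code)
    return code
```

The point of the structure is that the checker, not the human, decides when the model is done, which is what makes the process directable rather than a pure slot-machine pull.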
Also, it doesn't "use" anything. It may be a feature of the program, but it isn't intentionally designed that way.
Also, who sits around rerunning the same prompt over and over again to see if you get a different outcome, like it's a slot machine? You just directly tell it to fix whatever was bad about the output and it does so. Sometimes the initial output is more or less bad, but still. It isn't really analogous to a slot machine.
Also, you talk as if the whole "do something -> might work / might not, stochastic to a degree, but also meaningfully directable -> dopamine rush if it does; if not goto 1" loop isn't inherent to coding lol
Simon - you are an outlier in the sense that basically your job is to play with LLMs. You don't have stakeholders with requirements that they themselves don't understand, you don't have to go to meetings, deal with a team, shout at people, review PRs, etc. The whole SDLC/process of SWE is compressed for you.
Something that's a little relevant to how I work here is that I deliberately use big-team software engineering methods - issue trackers, automated tests, CI, PR code reviews, comprehensive documentation, well-tuned development environments - for all of my personal projects, because I find they help me move faster: https://simonwillison.net/2022/Nov/26/productivity/
But yes, it's entirely fair to point out that my use of LLMs is quite detached from how they might be used on large team commercial projects.
I'm not going to browse every commit in that repo, but half of the projects were created in December. The rest are a few months old, or at most less than a year.
This is not representative of the industry.
I liked the way they did that study and I would be interested to see an updated version with new tools.
I'm not particularly sceptical myself and my guess is that using Opus 4.5 would probably have produced a different result to the one in the original study.