If a module becomes unsustainably complex, I can ask Claude questions about it, have it write tests and scripts that empirically demonstrate the code's behavior, and if worse comes to worst, rip out that code entirely and replace it with something better in a fraction of the time it used to take.
That's not to say complexity isn't bad anymore—the paper's findings on diminishing returns on velocity seem well-grounded and plausible. But while the newest (post-Nov. 2025) models often make inadvisable design decisions, they rarely do things that are outright wrong or hallucinated anymore. That makes them much more useful for cleaning up old messes.
In theory, experienced humans introduce fewer bugs. That sounds reasonable and believable, but anyone who's ever been paid to write software knows that finding reliable humans is not an easy task unless you're at a large established company.
It's the same reason a junior + senior engineer is about as fast as a senior + 100 junior engineers. The senior's review time becomes the bottleneck, and that doesn't scale.
And even with the latest models and tooling, the quality of the code is below what I expect from a junior. But you sure can get it fast.
I've been doing 10-12 hour days paired with Claude for months. The velocity gains are absolutely real: I am shipping things I would never have attempted solo before AI, and shipping them faster than ever. BUT the cognitive cost of reviewing AI output is significantly higher than reviewing human code. It's verbose, plausible-looking, and wrong in ways that require sustained deep attention to catch.
The study found "transient velocity increase" followed by "persistent complexity increase." That matches exactly. The speed feels incredible at first, then the review burden compounds and you're spending more time verifying than you saved generating.
The fix isn't "apply traditional methods" — it's recognizing that AI shifts the bottleneck from production to verification, and that verification under sustained cognitive load degrades in ways nobody's measuring yet. I think I've found some fixes to help me personally with this and for me velocity is still high, but only time will tell if this remains true for long.
Just make sure it hasn't mocked so many things that nothing is actually being tested. Which I've witnessed.
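As a minimal sketch of what that failure mode looks like (all names here are invented for illustration), a test where every collaborator is mocked can go green without exercising any real logic:

```python
import unittest
from unittest.mock import MagicMock

def checkout(cart, payment_gateway):
    # The logic we'd actually want covered: sum the prices, then charge.
    total = sum(item["price"] for item in cart)
    return payment_gateway.charge(total)

class TestCheckout(unittest.TestCase):
    def test_checkout(self):
        # Both inputs are mocks. A MagicMock iterates as an empty sequence,
        # so total is always 0 and the pricing logic is never exercised.
        cart, gateway = MagicMock(), MagicMock()
        checkout(cart, gateway)
        # Passes no matter what the price math does -- or whether it exists.
        gateway.charge.assert_called_once_with(0)

if __name__ == "__main__":
    unittest.main()
```

The test suite reports success even if the price calculation is completely wrong, because every input that would exercise it has been mocked away.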
You have to actually care about quality with these power saws or you end up with poorly-fitting cabinets and might even lose a thumb in the process.
It finds spots it missed during the refactor basically every time.
So I partially agree with you, but I think it takes multiple passes and at least enough understanding to challenge the LLM and ask pointed questions.
This is the same pattern I observed with IDEs. Autocomplete and being able to jump to a definition means spaghetti code can be successfully navigated so there's no "natural" barrier to writing spaghetti code.
I was hoping for it to work, but it didn't for me.
Still trying to figure out how to balance it.
That is completely insufficient for code of any real complexity. All this does is replace known bugs with unknown bugs.
Citation needed. Until proven otherwise, complexity is still public enemy #1. Particularly given that system complexity almost always starts causing most of its problems once a project is further along, I don't think we will know anything meaningful about that statement for at least a year.
Does the study normalize velocity between the groups by adjusting the timeframes so that we could tell if complexity and warnings increased at a greater rate per line of code added in the AI group?
I suspect it would, since I've had to simplify AI-generated code on several occasions. But right now the study just seems to say that the larger a code base grows, the more complex it gets, which is obvious.
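The kind of normalization being asked about could be as simple as a rate per unit of code added. A sketch with purely made-up numbers (not figures from the study):

```python
def warnings_per_kloc(warnings_added: int, loc_added: int) -> float:
    """Static-analysis warnings introduced per thousand lines of code added."""
    return warnings_added / (loc_added / 1000)

# Hypothetical illustrative figures, NOT data from the paper:
ai_rate = warnings_per_kloc(warnings_added=120, loc_added=40_000)      # 3.0 per KLOC
control_rate = warnings_per_kloc(warnings_added=45, loc_added=20_000)  # 2.25 per KLOC

# Only a higher per-KLOC rate would support "AI code is worse per line",
# rather than just "the AI group wrote more code".
print(ai_rate > control_rate)
```

Without that per-line normalization, an absolute increase in warnings is confounded with the AI group simply adding more code in the same timeframe.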
The conclusion of this paper aligns with the emerging understanding that AI is simply an amplifier of your existing quality assurance processes: Higher discipline results in higher velocity, lower discipline results in lower stability (e.g. https://dora.dev/research/2025/) Having strong feedback and validation loops is more critical than ever.
In this paper, for instance, they collected static analysis warnings using a local SonarQube server, which implies it was not integrated into the projects they looked at. As such, these warnings were not available to the agent. It's highly likely that if these warnings were fed back into the agent, it would fix them automatically.
Another interesting thing they mention in the conclusion: the metrics we use for humans may not apply to agents. My go-to example for this is code duplication (even though this study finds minimal increase in duplication) -- it may actually be better for agents to rewrite chunks of code from scratch rather than use a dependency whose code is not available forcing it to instead rely on natural language documentation, which may or may not be sufficient or even accurate. What is tech debt for humans may actually be a boon for agents.
This is the most common issue I find, even with the latest models. For normal logic it's not too bad, the real risk is when they start duplicating classes or other abstractions, because those tend to proliferate and cause a mess.
I don't know if it's the training or RL or something intrinsic to the attention mechanism, but these models "prefer" generating new code rather than looking around for and integrating reusable code, unless the functionality is significant or they are explicitly prompted otherwise.
I think this is why AGENTS.md files are getting so critical -- by becoming standing instructions, they help override the natural tendencies of the model.
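As a hedged illustration (the contents below are invented for this example, not from any official template), the kind of standing instruction that pushes back against the duplicate-instead-of-reuse tendency might look like:

```markdown
# AGENTS.md (example)

## Code reuse
- Before writing a new helper, search the repo for an existing one
  (e.g. `grep -r "def slugify" src/`) and extend or reuse it.
- Do not duplicate classes or abstractions; refactor shared logic
  into the existing module instead.
- If duplication is intentional, say why in the PR description.
```

Because the file is read at the start of every session, it acts as a persistent counterweight to the model's default of generating fresh code.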
So overall it seems like the pros and cons of "AI vibe coding" just cancel each other out.
To me, this sounds like after the transient increase of velocity has died down, you're left with the same development velocity as you had when you started, but a significantly worse code base.
This seems to assume the main cause is the accumulation of defects due to lack of static analysis and testing.
I think a more likely cause is, the code begins to rapidly grow beyond the maintainers' comprehension. I don't think there is a technical solution for that.
Then there's the question of whether LoC is a reliable proxy for velocity at all. The common belief amongst developers is that it's not.
-2000 lines of code
https://news.ycombinator.com/item?id=26387179
This is actually one thing I have found LLMs surprisingly useful for.
I give them a code base which has one or two orders of magnitude of bloat, and ask them to strip it away iteratively. What I'm left with usually does the same thing.
At this point the code base becomes small enough to navigate and study. Then I use it for reference and build my own solution.
Yeah, this is the biggest facepalm.
Didn't we grow out of this idiocy 40 years ago? This shit again? Really?
My theory is that at least some of this is solvable with prompting / orchestration - the question is how to measure and improve that metric. i.e. how do we know which of Claude/Codex/Cursor/Whoever is going to produce the best, most maintainable code *in our codebase*? And how do we measure how that changes over time, with model/harness updates?
Traditional software dev would be build, test, refactor, commit.
Even the Clean Coder recommends starting with messy code then tidying it up.
We just need to apply traditional methods to AI assisted coding.
>This yields 806 repositories with adoption dates between January 2024 and March 2025 that are still available on GitHub at the time of data analysis (August 2025).
There were very few people who thought that coding agents worked very well back then. I was not one of them, but I _do_ think they work today.
one more, i swear.
And when people were studying ChatGPT 3.5, everyone would go "Oh, but that wasn't 4!", and when people talk about Opus 4.5 they go "4.6 is so much better!".
My personal position right now is that people are extremely bad at evaluating model output/ changes in model capabilities. Model benchmarks do not reflect the position that models are just 10x better than they were a year ago, but with how people discuss them you'd think that 10x was underselling it.
So a cutoff point of August 2025, just before that, is a bit unfortunate (I'm sure there'll be newer studies soon).
ofc that doesn't take into account the useful high-level views and other advantages of IDEs that might mitigate slop during review, but overall Cursor was a more natural fit for vibe-coders.
This is said without judgement - I was a cheerleader for Cursor early on until it became uncompetitive in value.
It's basically outsourcing to mediocre programmers - albeit very fast ones with near-infinite patience and little to no ego.
If your ideas/studies/experiences with genAI for software development and engineering were from before January, basically /clear and re-init.
Attempting to claim the models are the future by perpetually arguing that their limitations come from people using the models wrong, or that the argument has been invalidated because the new model fixes it, might as well have been part of the training data since Claude Opus 3.5.
Also, I'm not blaming users for their shortcomings. I'm just saying they are not perfect but you can get different outcomes according to how you use them.