If a module becomes unsustainably complex, I can ask Claude questions about it, have it write tests and scripts that empirically demonstrate the code's behavior, and if worse comes to worst, rip out that code entirely and replace it with something better in a fraction of the time it used to take.
That's not to say complexity isn't bad anymore—the paper's findings on diminishing returns on velocity seem well-grounded and plausible. But while the newest (post-Nov. 2025) models often make inadvisable design decisions, they rarely do things that are outright wrong or hallucinated anymore. That makes them much more useful for cleaning up old messes.
In theory, experienced humans introduce fewer bugs. That sounds reasonable and believable, but anyone who's ever been paid to write software knows that finding reliable humans is not an easy task unless you're at a large established company.
It's the same reason a junior + senior engineer is about as fast as a senior + 100 junior engineers. The senior's review time becomes the bottleneck, and that doesn't scale.
And even with the latest models and tooling, the quality of the code is below what I expect from a junior. But you sure can get it fast.
I've been doing 10-12 hour days paired with Claude for months. The velocity gains are absolutely real: I am shipping things I would never have attempted solo before AI, and shipping them faster than ever. BUT the cognitive cost of reviewing AI output is significantly higher than reviewing human code. It's verbose, plausible-looking, and wrong in ways that require sustained deep attention to catch.
The study found "transient velocity increase" followed by "persistent complexity increase." That matches exactly. The speed feels incredible at first, then the review burden compounds and you're spending more time verifying than you saved generating.
The fix isn't "apply traditional methods" — it's recognizing that AI shifts the bottleneck from production to verification, and that verification under sustained cognitive load degrades in ways nobody's measuring yet. I think I've found some fixes to help me personally with this and for me velocity is still high, but only time will tell if this remains true for long.
Just make sure it hasn't mocked so many things that nothing is actually being tested. Which I've witnessed.
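As a minimal sketch of what that failure mode looks like (all names here are invented for illustration), a test where every collaborator is mocked can go green without exercising any real logic:

```python
import unittest
from unittest.mock import MagicMock

def checkout(cart, payment_gateway):
    # The logic we'd actually want covered: sum the prices, then charge.
    total = sum(item["price"] for item in cart)
    return payment_gateway.charge(total)

class TestCheckout(unittest.TestCase):
    def test_checkout(self):
        # Both inputs are mocks. A MagicMock iterates as an empty sequence,
        # so total is always 0 and the pricing logic is never exercised.
        cart, gateway = MagicMock(), MagicMock()
        checkout(cart, gateway)
        # Passes no matter what the price math does -- or whether it exists.
        gateway.charge.assert_called_once_with(0)

if __name__ == "__main__":
    unittest.main()
```

The test suite reports success even if the price calculation is completely wrong, because every input that would exercise it has been mocked away.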
You have to actually care about quality with these power saws or you end up with poorly-fitting cabinets and might even lose a thumb in the process.
It finds spots it missed during the refactor basically every time.
So I partially agree with you, but I think it takes multiple passes and at least enough understanding to challenge the LLM and ask pointed questions.
This is the same pattern I observed with IDEs. Autocomplete and being able to jump to a definition means spaghetti code can be successfully navigated so there's no "natural" barrier to writing spaghetti code.
I was hoping for it to work, but it didn't for me.
Still trying to figure out how to balance it.
That is completely insufficient for code of any real complexity. All this does is replace known bugs with unknown bugs.
Citation needed. Until proven otherwise, complexity is still public enemy #1. Particularly given that system complexity almost always starts causing most of its problems once a project is further along, I don't think we will know anything meaningful about that statement for at least a year.
Does the study normalize velocity between the groups by adjusting the timeframes so that we could tell if complexity and warnings increased at a greater rate per line of code added in the AI group?
I suspect it would, since I've had to simplify AI-generated code on several occasions. But right now the study just seems to say that the larger a code base grows, the more complex it gets, which is obvious.
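The kind of normalization being asked about could be as simple as a rate per unit of code added. A sketch with purely made-up numbers (not figures from the study):

```python
def warnings_per_kloc(warnings_added: int, loc_added: int) -> float:
    """Static-analysis warnings introduced per thousand lines of code added."""
    return warnings_added / (loc_added / 1000)

# Hypothetical illustrative figures, NOT data from the paper:
ai_rate = warnings_per_kloc(warnings_added=120, loc_added=40_000)      # 3.0 per KLOC
control_rate = warnings_per_kloc(warnings_added=45, loc_added=20_000)  # 2.25 per KLOC

# Only a higher per-KLOC rate would support "AI code is worse per line",
# rather than just "the AI group wrote more code".
print(ai_rate > control_rate)
```

Without that per-line normalization, an absolute increase in warnings is confounded with the AI group simply adding more code in the same timeframe.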
The conclusion of this paper aligns with the emerging understanding that AI is simply an amplifier of your existing quality assurance processes: Higher discipline results in higher velocity, lower discipline results in lower stability (e.g. https://dora.dev/research/2025/) Having strong feedback and validation loops is more critical than ever.
In this paper, for instance, they collected static analysis warnings using a local SonarQube server, which implies it was not integrated into the projects they looked at. As such, these warnings were not available to the agent. It's highly likely that if these warnings were fed back into the agent, it would fix them automatically.
Another interesting thing they mention in the conclusion: the metrics we use for humans may not apply to agents. My go-to example for this is code duplication (even though this study finds minimal increase in duplication) -- it may actually be better for agents to rewrite chunks of code from scratch rather than use a dependency whose code is not available forcing it to instead rely on natural language documentation, which may or may not be sufficient or even accurate. What is tech debt for humans may actually be a boon for agents.
This is the most common issue I find, even with the latest models. For normal logic it's not too bad, the real risk is when they start duplicating classes or other abstractions, because those tend to proliferate and cause a mess.
I don't know if it's the training or RL or something intrinsic to the attention mechanism, but these models "prefer" generating new code rather than looking around for and integrating reusable code, unless the functionality is significant or they are explicitly prompted otherwise.
I think this is why AGENTS.md files are getting so critical -- by becoming standing instructions, they help override the natural tendencies of the model.
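As a hedged illustration (the contents below are invented for this example, not from any official template), the kind of standing instruction that pushes back against the duplicate-instead-of-reuse tendency might look like:

```markdown
# AGENTS.md (example)

## Code reuse
- Before writing a new helper, search the repo for an existing one
  (e.g. `grep -r "def slugify" src/`) and extend or reuse it.
- Do not duplicate classes or abstractions; refactor shared logic
  into the existing module instead.
- If duplication is intentional, say why in the PR description.
```

Because the file is read at the start of every session, it acts as a persistent counterweight to the model's default of generating fresh code.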
So overall it seems like the pros and cons of "AI vibe coding" just cancel each other out.
To me, this sounds like after the transient increase of velocity has died down, you're left with the same development velocity as you had when you started, but a significantly worse code base.
This seems to assume the main cause is the accumulation of defects due to lack of static analysis and testing.
I think a more likely cause is, the code begins to rapidly grow beyond the maintainers' comprehension. I don't think there is a technical solution for that.
Then there's the question of whether LoC is a reliable proxy for velocity at all. The common belief amongst developers is that it's not.
-2000 lines of code
https://news.ycombinator.com/item?id=26387179
This is actually one thing I have found LLMs surprisingly useful for.
I give them a code base which has one or two orders of magnitude of bloat, and ask them to strip it away iteratively. What I'm left with usually does the same thing.
At this point the code base becomes small enough to navigate and study. Then I use it for reference and build my own solution.
Yeah, this is the biggest facepalm.
Didn't we grow out of this idiocy 40 years ago? This shit again? Really?
My theory is that at least some of this is solvable with prompting / orchestration - the question is how to measure and improve that metric. i.e. how do we know which of Claude/Codex/Cursor/Whoever is going to produce the best, most maintainable code *in our codebase*? And how do we measure how that changes over time, with model/harness updates?
Traditional software dev would be build, test, refactor, commit.
Even the Clean Coder recommends starting with messy code then tidying it up.
We just need to apply traditional methods to AI assisted coding.
>This yields 806 repositories with adoption dates between January 2024 and March 2025 that are still available on GitHub at the time of data analysis (August 2025).
There were very few people who thought that coding agents worked very well back then. I was not one of them, but I _do_ think they work today.
one more, i swear.
And when people were studying ChatGPT 3.5, everyone would go "Oh, but that wasn't 4!", and when people talk about Opus 4.5 they go "4.6 is so much better!".
My personal position right now is that people are extremely bad at evaluating model output/ changes in model capabilities. Model benchmarks do not reflect the position that models are just 10x better than they were a year ago, but with how people discuss them you'd think that 10x was underselling it.
So a cutoff point of August 2025, just before that, is a bit unfortunate (I'm sure there'll be newer studies soon).
ofc that doesn't take into account the useful high-level views and other advantages of IDEs that might mitigate slop during review, but overall Cursor was a more natural fit for vibe-coders.
This is said without judgement - I was a cheerleader for Cursor early on until it became uncompetitive in value.
It's basically outsourcing to mediocre programmers - albeit very fast ones with near-infinite patience and little to no ego.
If your ideas/studies/experiences with genAI for software development and engineering were from before January, basically /clear and re-init.
Attempting to claim the models are the future by perpetually arguing that their limitations come from people using the models wrong, or that the argument has been invalidated because the new model fixes it, might as well have been part of the training data since Claude Opus 3.5.
Also, I'm not blaming users for their shortcomings. I'm just saying they are not perfect but you can get different outcomes according to how you use them.