The swindle goes like this: AI on a good codebase can build a lot of features. You think it's faster; it even seems safer and more accurate at times, especially in domains you don't know everything about.
This goes on for a while, whilst the codebase gets bigger, exploration takes longer, and the failure rate increases. You don't want it to be true and try harder, so you only stop after it has become practically impossible to make any changes.
You look at the code again, and there is so much of it that spaghetti is an understatement; it's the Great Wall of China.
You start working… and you realize what was going on.
I deleted 75,000 of 140,000 lines of code, and I honestly feel like the 3 months I went hard into agentic coding were wasted. I failed my users by building useless features, increasing bugs, losing the mental model of my code, and not finding the problems I didn't know about: the kind of hard decisions you only see when you're in the code, the stuff that wanders in your mind for days.
The standards seem to be different for LLMs. Would anyone be surprised if they handed summary feature descriptions to some random "developer" they'd only ever met online, and got back an absolute dung pile of half-broken implementation?
For some reason, people seem to expect miracles from some machine that they would not expect of other humans, especially not ones with a proven penchant for rambling hallucinations every once in a while.
I'd like to know, ideally from people who've been there, why they think that is. Where does the trust come from?
They can assimilate hundreds of thousands of tokens of context in a few seconds or minutes and do exceptional pattern matching beyond what any human can do; that's a main factor in why it looks like "miracles" to us. When a model actually solves a long-standing issue that was never addressed for lack of funding, time, or knowledge, it does feel miraculous, and when you are exposed to this a couple of times it's easy to give them more trust, just like you would trust someone who lent you a helping hand a couple of times more than a total stranger.
I suppose it's difficult to account for the inconsistency of something able to perform up to standard (and fast!) at one time, but then lose the plot in subtle or not-so-subtle ways the next.
We're wired to see and treat this machine as a human and therefore are tempted to trust it as if it were a human who demonstrated proficiency. Then we're surprised when the machine fails to behave like one.
I have to say, I'm still flabbergasted by the willingness to check out completely and not even keep on top of, and a mental model of, what gets produced. But the mind is easily tempted into laziness, I presume, especially when the fun part of thinking gets outsourced and only the less fun work of checking is left. At least that's what makes the difference for me between coding and reviewing. One is considerably more interesting than the other, and the two are much less similar than they should be, given that both should require gaining a similar understanding of the code.
It is surprising how bad it is at taking the lead, given how effective it is with a much more limited prompt, particularly if you buy into all the hype that it can take the place of human intelligence. It is capable of applying an incredible amount of knowledge while having virtually no real understanding of the problem.
LLMs also don't solve the much bigger problem of most software engineers having no ability to work with others to clarify requests or offer alternatives. So now bad and/or misunderstood requests can be implemented faster.
Or really the same reason people fall for get rich quick schemes.
I don't understand this. A large codebase should be a collection of small codebases, just like a large city is a collection of small cities. There is a map and you zoom into your local area and work within that scope. You don't need to know every detail of NYC to get a cup of coffee.
It's your responsibility to build a sane architecture that is maintainable. AI doesn't prevent you from doing that, and in fact it can help you do so if you hold the tool correctly.
To speak more directly - every codebase has local reasoning and global reasoning. When looking at a single piece of code that's well-isolated, you can fully understand its behavior "locally" without knowing anything about any other part of the code. But when a piece of code is tightly coupled to many other parts of the codebase, you have to reason globally - you have to understand the whole system to even understand what that one piece of code is doing, because it has tendrils touching the whole system. That's typically what we call spaghetti code.
If you leave an AI to its own devices, it will happily "punch holes" and create shortcuts through your architecture to implement a specific feature, not caring about what that does to the comprehensibility of the system.
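To make the contrast concrete, here's a minimal Python sketch; every name in it (apply_discount, session_cache, and so on) is invented purely for illustration:

    # Local reasoning: everything needed to verify this function is right here.
    def apply_discount(price: float, rate: float) -> float:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        return round(price * (1.0 - rate), 2)

    # Global reasoning: shared mutable state stands in for "the rest of the system".
    session_cache: dict = {}
    _RATE_TABLE = {"gold": 0.2, "basic": 0.0}

    def apply_discount_spaghetti(order: dict) -> None:
        # To know what this does, you must know who else reads/writes the cache.
        rate = session_cache.get("rate", _RATE_TABLE[order["tier"]])
        order["price"] *= 1 - rate              # mutates a shared object in place
        session_cache["last_order"] = order     # side effect other code relies on

The first function can be verified by reading it; the second can only be verified by auditing everything else that touches session_cache.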
There is software that works like this (e.g. a website's unrelated pages and their logic), but in general, composing simple functions can result in vastly non-proportional complexity. (The usual example is a simple loop plus a simple conditional, with which you can easily encode Goldbach or Collatz.)
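For the curious, the Collatz version of that construction really is just one loop and one conditional, and whether it terminates for every positive n is an open problem:

    def collatz_steps(n: int) -> int:
        # One loop, one conditional; nobody has proved the loop always ends.
        steps = 0
        while n != 1:
            n = n // 2 if n % 2 == 0 else 3 * n + 1
            steps += 1
        return steps

    print(collatz_steps(27))  # 111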
E.g. you write a runtime with a garbage collector and a JIT compiler. What is your map? You can't really zoom in on the district for the GC, because on every other street there you have a portal opening to another street in the JIT district, which in turn has portals to the ISS, where you don't even have gravity.
And if you think this might be a contrived example and not everyone is writing JIT-ted runtimes: something like a banking app with special logging requirements (cross-cutting concerns) sits somewhere between these two extremes.
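One conventional way to keep such a cross-cutting requirement from smearing across the whole map is to factor it out once. A minimal Python sketch; the audited decorator and the transfer function are invented for illustration, not anyone's real banking code:

    import functools
    import logging

    log = logging.getLogger("audit")

    def audited(fn):
        # Wraps any operation with the audit-logging requirement, so the
        # concern isn't re-implemented inside every business function.
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            log.info("start %s args=%r", fn.__name__, args)
            try:
                result = fn(*args, **kwargs)
                log.info("ok %s", fn.__name__)
                return result
            except Exception:
                log.exception("failed %s", fn.__name__)
                raise
        return wrapper

    @audited
    def transfer(src: str, dst: str, amount: int) -> None:
        ...  # core banking logic goes here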
On a tiny project it doesn't matter. But when you have millions of lines of code, you have to be able to trust that your code works in isolation, without knowing all the details.
Oh, great analogy there.
Just like there's almost nothing in common between a large city and a collection of small cities, a large codebase is completely different from a collection of small codebases too.
Mostly because of the same kinds of effects.
Rather than arguing about the specifics, it's easier to point to numerous concrete examples: a fairly simple system that should be easy to implement in 8-15k lines of code, depending on certain choices (I've been writing code long enough to estimate this relatively accurately), still incomplete while approaching 150k lines. These kinds of atrocities are usually economically infeasible in hand-written code, for two reasons: 1) the cost of producing that much code is very high, and 2) the cost of maintaining that much code is insurmountable.
I guess you could say that AI is great at generating code that only AI can understand and maintain.
E.g., treating AI-generated code as immediately legacy, with tight encapsulation boundaries, well-defined interfaces, etc., and integrating it into a more manual workflow.
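As a rough sketch of that kind of boundary (the RateLimiter protocol and TokenBucketLimiter are hypothetical names, not a real library's API): the rest of the codebase depends only on a hand-written contract, and the generated implementation can be regenerated or discarded behind it.

    from typing import Protocol

    class RateLimiter(Protocol):
        # Hand-written contract: the only thing the rest of the codebase sees.
        def allow(self, key: str) -> bool: ...

    class TokenBucketLimiter:
        # Stand-in for a generated implementation living behind the boundary.
        def __init__(self, capacity: int = 10):
            self._capacity = capacity
            self._tokens: dict[str, int] = {}

        def allow(self, key: str) -> bool:
            left = self._tokens.get(key, self._capacity)
            if left <= 0:
                return False
            self._tokens[key] = left - 1
            return True

    def handle_request(limiter: RateLimiter, user: str) -> str:
        # Callers depend on the protocol, never on the generated internals.
        return "ok" if limiter.allow(user) else "throttled"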
There's a range from single-shot prompts to inline code generation, and which end makes more sense depends on the problem and where in the codebase it sits.
Single-shot stuff is going to make more sense for a prototyping phase with extensive spec iteration. Once that prototype is in place, you then probably want to drop down into per-module/per-file generation and be more systematic, always maintaining a reasonably good mental model at this layer.
I could see value in using it during the prototyping phase, but wouldn’t like to work like you described for a serious project for end users.
I care more about code quality now, because typing is no longer what determines whether refactoring something feels worth it or not.
This seems to me like it requires an impossible level of discipline, judgement, and foresight.
This is good advice regardless of whether you're using AI or not, yet in real life "let's have well-defined boundaries and interfaces" always loses against "let's keep having meetings for years and then duct-tape whatever works once the situation gets urgent".
You can have very good diffs and then find that the whole codebase is a collection of slightly disjointed parts.
AI doesn't necessarily have to increase your throughput; it can also serve as a flexible exploration and refactoring tool that supports either later hand-crafted code or an agentic implementation.
I still have a lot of uses for AI: exploration, double-checking me, teaching me. But letting it write code became very tough for me to accept. Next-edit autocompletes, mainly.
I'm ready to give up on having it even review my code at this point. It's been so frustrating. It hallucinates bugs, especially in places where "best practices" are at odds with reality.
Recently it informed me of a bug, suggesting the line of code in question couldn't possibly do anything because on Linux the specific stdlib behaved in X ways, when it was obvious from the line of code that it was running on Windows, which doesn't have this problem at all. Of course, it didn't actually mention that this is only an issue on Linux, just that there was a bug here. It vomited up a paragraph of $WORDS explaining why this was a high-priority bug that absolutely needed to be fixed because it was failing in subtle ways. Yet the line of code in question has been running in production, producing exactly the results it is expected to, for ~3 years.
And this is just one simple example of the many dozens of times it has failed at this task this year. In that same review run, the agent suggested 3 additional "bugs" or other issues that should be addressed, all of which were flatly wrong or subjective. I'm at a point of absolute exhaustion with this sort of shit. It's worse than a junior half of the time because of how strongly opinionated it is. And the solution to this sort of problem is an endless amount of configuration and customization that will be forgotten by all of us over time, leading to who knows what sort of knock-on effects (especially as we migrate from one model to the next). We have a guy on our team with ~17,000 words in his agent and instructions files, yet he sees nothing wrong with this. I guess he just really loves YAML and Markdown.
There comes a realization, to many an engineer's horror, that AI won't be able to save them: they will have to manually comprehend, and possibly write by hand, a ton of code to fix major issues, all while upper management breathes down their necks, furious as to why the product has become a piece of shit and customers are leaving for competitors.
The engineers who sink further into denial thrash around with AI, hoping they are a few prompts or orchestrations away from everything being fixed again.
But the solution doesn’t come. They realize there is nothing they can do. It’s over.