Besides, this post has nothing specific to code produced by an LLM, and placing AI in the stated reasons feels completely arbitrary, or is rather a fallacy of our times:
- I reject [AI] code when I can’t explain the approach in my own words.
- I reject [AI] code when the diff is bigger than the problem.
- I reject [AI] code when it introduces abstractions before proving they’re needed.
- I reject [AI] code when it works locally but makes the system harder to reason about.
- I reject [AI] code when I’m trusting the output more than my understanding.
These are the people who spit out an incredible volume of code with AI, to the point that reviews simply can’t keep up.
The last person who said this to me works in embedded, where we look at the assembly all the time. Scary.
If you're given two embedded devices and both pass the same testing, how would you tell which one was 100% AI code and which was beautifully handcrafted line by line?
Also, when something inevitably doesn’t work (maybe I told Claude “delay 1 sec after each swing of the axe the robot makes if the proximity sensor trips, to avoid the puppy that walks across the axe’s path once every month”, and meant to type “2 sec”), I still have to go down to the level of the code sometimes. I’m sure the counterargument is “well then that just means your testing wasn’t good enough”. Sure, but I’ve never seen any project with hardware in the loop where the testing was good enough 100% of the time. Sometimes it’s hard to cover once-a-month events in a regression test suite.
FWIW I hover around 80-90% AI-written code these days. I still look at every line of code it makes.
However, if you’re highly familiar with a domain, then LLMs are much less useful.
(For as long as that's true, "software developer" is still a job. It's not clear for how long it will be true.)
Meanwhile, those codebases often require a ton of boilerplate and drudgery to get anything done.
In these spaces it's very easy to read and comprehend AI-generated output and review it fairly quickly. So the time savings from dealing with all that boilerplate and conforming to all that existing infrastructure are potentially substantial.
Being able to step back and say "this was a failure and we need to discard the day's work and start over" is still hard with LLMs.
It's like if I 3D-printed something I haven't modelled myself and the print goes wonky. I don't spend days trying to glue and file it back together. I chuck it in the bin and start a new one.
But if I had handcrafted the same item over multiple days, of course I'd try to salvage it - because there was a sunk cost of me spending time doing it.
It’s really eye-opening to work with these tools on a codebase you know deeply because these problems are everywhere.
However, if I opened an unfamiliar project in another language and I wanted to add a little feature with no intention of maintaining it, I’d happily accept the changes and loop until it worked well enough for my temporary needs.
The scary middle is when you’re dealing with coworkers who don’t care about anything other than closing tickets and collecting credit. With enough of a token budget you can now wrap loops around an LLM and have it try things until the program appears to work. Ask it to do a code review and then submit the PR without having understood what it was doing. There are a lot of workplaces where there isn’t a good mechanism to push back on this and the tech debt just keeps growing.
The problem is this. Human cognitive resources are finite, so we inevitably become shallow outside our own expertise. There is no programmer who can do everything well. And as systems grow in scale, they become more modularized and fragmented, making it impossible to understand the whole system. So what should we do about this? That's always the question.
In the end, do I choose not to use AI, finish the project with uneven code outside my domain, and deliver it? Or do I use AI and deliver a program that is uniform and consistent, but not in my own style? I still don't know. I haven't found the answer yet.
The plan also solves "I can't explain this code" because you wrote the plan before the build, so you can explain it.
After tracking some internal metrics recently, we found that plan review costs 0.7 hours on average, compared to 16 hours for PR review. We rejected 13 out of 165 plans, meaning no code was written for them.
The one gap this doesn't close is that the agent drifts from the plan. We run a separate adversarial check that compares the diff against the approved plan and flags anything the plan didn't specify. That catches scope drift without reading every line.
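A rough sketch of what a mechanical version of that diff-versus-plan check could look like. This is not the commenter's actual tooling: it assumes the approved plan can be reduced to a list of file glob patterns (a hypothetical format), and it only flags drift at the file level by reading the `diff --git` headers of a git-style unified diff. Paths containing spaces are not handled.

```python
import re
from fnmatch import fnmatch

# Matches the per-file header that `git diff` emits. Assumes paths without spaces.
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")


def touched_files(diff_text):
    """Collect every path named in the diff's file headers (old and new, so renames count)."""
    files = set()
    for line in diff_text.splitlines():
        match = DIFF_HEADER.match(line)
        if match:
            files.update(match.groups())
    return files


def scope_drift(diff_text, allowed_patterns):
    """Return the touched paths that no pattern from the approved plan covers."""
    return sorted(
        path
        for path in touched_files(diff_text)
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    )


# Hypothetical diff: the agent edited billing code (planned) and auth code (not planned).
example_diff = """\
diff --git a/src/billing/invoice.py b/src/billing/invoice.py
--- a/src/billing/invoice.py
+++ b/src/billing/invoice.py
@@ -1 +1 @@
-total = 0
+total = 0.0
diff --git a/src/auth/session.py b/src/auth/session.py
--- a/src/auth/session.py
+++ b/src/auth/session.py
@@ -1 +1 @@
-TTL = 60
+TTL = 600
"""

# Hypothetical plan scope, expressed as glob patterns.
plan_patterns = ["src/billing/*", "tests/billing/*"]

print(scope_drift(example_diff, plan_patterns))  # ['src/auth/session.py']
```

A file-level check like this is deliberately crude. It catches an agent wandering into modules the plan never mentioned, but it says nothing about unplanned changes inside an approved file, which is presumably where a more adversarial reviewer earns its keep.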
I'm more interested right now in what that abstraction looks like for AI-generated code. Is there some reasonable solution wherein a sandboxed component in the enterprise architecture has various attributes (e.g. the bytes I stuff into this file store component are always the exact bytes I get back from it) confirmed by methods other than a human reading its code? Are those methods cheaper, faster, safer than just having a human do it?
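One candidate method for the file store example is a black-box round-trip check: throw a mix of edge-case and random payloads at the component and confirm every one comes back byte for byte. The sketch below assumes the component exposes only `put(key, data)` and `get(key)`; both store classes are hypothetical stand-ins, not a real API. A passing run is evidence, not proof, so whether it is "safer" than a human reading the code remains an open question.

```python
import random


class InMemoryFileStore:
    """Hypothetical stand-in for the sandboxed component; only put/get are assumed."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = bytes(data)

    def get(self, key):
        return self._blobs[key]


class NewlineManglingStore(InMemoryFileStore):
    """Deliberately buggy store that normalizes CRLF, as a text-mode write might."""

    def put(self, key, data):
        super().put(key, data.replace(b"\r\n", b"\n"))


def check_round_trip(store, trials=200, max_len=1024, seed=0):
    """Return the first payload that does not survive put/get unchanged, or None."""
    rng = random.Random(seed)  # fixed seed so a failure is reproducible
    # Hand-picked edge cases first, then random byte strings of varying length.
    payloads = [b"", b"\x00", b"\r\n", bytes(range(256))]
    for _ in range(trials):
        length = rng.randint(0, max_len)
        payloads.append(bytes(rng.getrandbits(8) for _ in range(length)))
    for index, payload in enumerate(payloads):
        key = f"blob-{index}"
        store.put(key, payload)
        if store.get(key) != payload:
            return payload
    return None


print(check_round_trip(InMemoryFileStore()))     # None
print(check_round_trip(NewlineManglingStore()))  # b'\r\n'
```

The check never looks at the implementation, so it works the same whether the component was handwritten or regenerated from requirements. Its cost is that it only covers the properties someone thought to write down.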
If your enterprise architects have to read every line of code in your system today, then I'd claim your architecture practices have room to mature. What can be derived from that, and in which scenarios, for the purposes of safely leveraging immutable write-only code? I'm not interested in evolving the code if it wasn't handcrafted by a human (lines of code spent to solve a business problem were never an asset; they were always a cost). I still have the requirements, so I can just regenerate the entire thing with the revised requirement.