For those of us with decades of experience and who use coding agents for hours per-day, we learned that even with extended context engineering these models are not magically covering the testing space more than 50%.
If you asked your coding agent to develop a memory allocator, it would not also 'automatically verify' the memory allocator against all failure modes. It is your responsibility as an engineer to have long-term learning and regular contact with the world to inform the testing approach.
The core of the issue is that LLMs are sycophants, they want to make the user happy above all. The most important thing is to make sure what you are asking the LLM to do is correct from the beginning. I’ve found the highest value activity is the in the planning phase.
When I have gotten good results with Claude Code, it’s because I spent a lot of time working with it to generate a detailed plan of what I wanted to build. Then by the time it got to the coding step, actually writing the code is trivial because the details have all been worked out in the plan.
It’s probably not a coincidence that when I have worked in safety critical software (DO-178), the process looks very similar. By the time you write a line of code, the requirements for that line have been so thoroughly vetted that writing the code feels like an afterthought.
I wrote a short blog about this phenomenon here if you're interested https://www.stet.sh/blog/both-pass
also +1 on placing heavy emphasis on the plan. if you have a good plan, then the code becomes trivial. I have started doing a 70/30 or even 80/20 split of time spent on plan / time implementing & reviewing
It happens all the time, even when I only scan the code or simply run it and use it. It's uncanny how many such "smells" I find even with the most trivial applications. Sometimes its replies in Codex or Claude Code are enough to trigger it.
These are mistakes only a very (very) inexperienced developer would make.
Or in other words: Test only guarantees their own result, not the code. The value of the test is because you know the code is trying to solve the general problem, not the test’s assertions.
I recognize this discrepancy where review effort becomes more than the coding itself. I don't think I could sustain that for long.
I'd consider shipping LLM generated code without review risky. Far riskier than shipping human-generated code without review.
But it's arguably faster in the short run. Also cheaper.
So we have a risk vs speed to market / near term cost situation. Or in other words, a risk vs gain situation.
If you want higher gains, you typically accept more risk. Technically it's a weird decision to ship something that might break, that you don't understand. But depending on the business making that decision, their situation and strategy, it can absolutely make sense.
How to balance revenue, costs and risks is pretty much what companies do. So that's how I think about this kind of stuff. Is it a stupid risk to take for questionable gains in most situations? I'd say so. But it's not my call, and I don't have all the information. I can imagine it making sense for some.
This doesn’t matter in the age of AI - when you get a new requirement just tell the AI to fulfill it and the old requirements (perhaps backed by a decent test suite?) and let it figure out the details, up to and including totally trashing the old implementation and creating an entirely new one from scratch that matches all the requirements.
For performance, give the AI a benchmark and let it figure it out as well. You can create teams of agents each coming up with an implementation and killing the ones that don’t make the cut.
Or so goes the gospel in the age of AI. I’m being totally sarcastic, I don’t believe in AI coding
Let me guess, you've never worked in a real production environment?
When your software supports 8, 9, 10 or more zeroes of revenue, "trash the old and create new" are just about the scariest words you can say. There's people relying on this code that you've never even heard of.
Really good post about why AI is a poor fit in software environments where nobody even knows the full requirements: https://www.linkedin.com/pulse/production-telemetry-spec-sur...
You may think you are being sarcastic, but I guarantee that a significant percentage of developers think that both the following are true:
a) They will never need to write code again, and
b) They are some special snowflake that will still remain employed.
I've worked at places where I've trusted everyone on my team to the extent that most PRs got only a quick glance before getting a "LGTM". On the flipside, I've also worked on teams where every person was a different kind of liability with the code that they pushed, and for those teams I implemented every linting / pre-commit / testing tool possible that all needed to pass inspection (including human review) before any code arrived on production.
A year ago, AI was like that latter team I mentioned -- something I had to check, double check, and correct until I was happy with what it produced. Over the past 6 months, it's gotten closer (but still fairly far away) from the former team I mentioned -- I have to correct it about 10% of the time, whereas for most things it gets it right.
The fact that AI produces a much _larger_ volume of code than the average engineer is perhaps slightly concerning, but I don't see it much differently than code at large companies. Does every Facebook engineer review every junior engineer's pull request to make sure bad code doesn't slip in?
That isn't to say I'm for letting AI go wild with code -- but I think if at worse we consider AI to be a junior engineer we need to reign in with static analysis tools / linters / testers etc, we will probably be able to mitigate a lot of the downside.
1) Humans were never held accountable, really
Outside of a few regulated industries, the worst that happens to an engineer who pushes negligent code is that they get fired. But after that happens, what actually changes? The organizational structure of the company that allowed the employee to push bad code still exists.
2) Humans will still be held accountable
If a human (managing a fleet of AI agents, let's say) ends up deploying bad code to production, they won't be able to point to the AI agent and say "it was them that did it!" -- it will still be the human at the end of the line that is held responsible.
When even Linus Torvalds compliments AI code (ref: https://www.reddit.com/media?url=https%3A%2F%2Fi.redd.it%2Fa...) I think we can say he wouldn't have said that about any junior engineer.
That's not to say it won't ship bugs, but so does any engineer (junior or senior). It's up to you as to what level of tooling you surround the AI with (automated testing / linting / etc), but at the very least it doesn't also hurt to have that set up anyways (automated tests have helped prevent senior devs from shipping bad code too).
So now I prompted this: "can you generate a fizzbuzz implementation in Cairn that showcases as much as possible the TEST/PROVE/VERIFY characteristics of the language? "
Producing this (working) monstrosity (can't paste here, it's 200+ lines of crazy): https://gist.github.com/cairnlang/a7589de126b14e50a53b9bdc28...
The features of both are quite orthogonal. Cairn is a general purpose language with features that help in writing probably working code. Mog is more like "let's constraint our features so bad code can't do much but trade that for good agent ergonomy".
Cairn is a crazy sprawling idea, Mog is a little attempt at something limited but practical.
Mog seems like something someone has thought about. No one has thought about Cairn, it's pure LLM hallucination, the fact that it exists and can do a lot of stuff it's just the result of someone (me) not knowing when a joke has gone too far.
But if you actually can specify what the program is supposed to do, this can work. It's appropriate where the task is hard to do but easy to specify. A file system or a database can be specified in terms of large arrays. Most of the complexity of a file system is in performance and reliability. What it's supposed to do from the API perspective isn't that complicated. The same can be said for garbage collectors, databases, and other complex systems that do something that's conceptually simple but hard to do right.
Probably not going to help with a web page user interface. If you had a spec for what it was supposed to do, you'd have the design.
We are simply shuffling cognitive and entropic complexity around and calling it intelligence. As you said, at the end of the day the engineer - like the pilot - is ultimately the responsible party at all stages of the journey.
Again, I'm not opposed to AI coding. I know a lot of people are. I have multiple open source projects that were 100% created with AI assistants, and wrote a blog post about it you can see in my post history. I'm not anti-ai, but I do think that developers have some responsibility for the code they create with those tools.
There are a subset of things that it would be ok to do this right now. Instances where the cost of utter failure is relatively low. For visual results the benchmark is often 'does it look right?' rather than 'Is it strictly accurate?"
Who writes the tests? It can be ok to trust code that passes tests if you can trust the tests.
There are, however, other problems. I frequently see agents write code that's functionally correct but that they won't be able to evolve for long. That's also what happened with Anthropic's failed attempt to have agents write a C compiler (not a trivial task, but far from an exceptionally difficult one). They had thousands of good human-written tests, but the agents couldn't get the software to converge. They fixed one bug only to create another.
The agent that writes the tests must have read-only access to the spec & the API. It MUST NOT have access to the implementation, even to read it.
The agent that writes the implementation must have read-only access to the spec. It MUST NOT have access to the tests implementation, only to the output report from running them.
This is a PITA to manage with classic UNIX permissions, but is doable with ACLs (`setfacl`/`getfacl`). Actually getting the agent processes to run as different users in an IDE setting instead of a CLI is not supported out of the box by any of the major vendors AFAICT, so IMO they're not really fit-for-purpose.
Just write your business requirements in a clear, unambiguous and exhaustive manner using a formal specification language.
Bam, no coding required.
https://softwaredoug.com/blog/2026/03/10/the-tests-are-the-c...
No side effects is a hefty constraint.
Systems tend to have multiple processes all using side effects. There are global properties of the system that need specification and tests are hard to write for these situations. Especially when they are temporal properties that you care about (eg: if we enter the A state then eventually we must enter the B state).
When such guarantees involve multiple processes, even property tests aren’t going to cover you sufficiently.
Worse, when it falls over at 3am and you’ve never read the code… is the plan to vibe code a big fix right there? Will you also remember to modify the specifications first?
Good on the author for trying. Correctness is hard.
...It's one that does what a specific set of humans want. There's no other useful definition. One man's feature is another's bug.
It logically follows that there must be a human review step. How else would you know what the human wants, with sufficient detail?
Otherwise, there's an infinite number of undesired programs with passing test suites that AI can generate for you.
No test catches any of that. Code works, tests pass, database is wide open.
It's not even a correctness problem. It's that the LLM never thought about rate limiting, CORS headers, CSRF tokens, a sane .gitignore — because nobody asked it to. Those are things devs add from muscle memory, from getting burned. The AI has no scars.
I would like to put my marker out here as vigorously disagreeing with this. I will quote my post [1] again, which given that this is the third time I've referred to a footnote via link rather suggests this should be lifted out of the footnote:
"It has been lost in AI money-grabbing frenzy but a few years ago we were talking a lot about AIs being “legible”, that they could explain their actions in human-comprehensible terms. “Running code we can examine” is the highest grade of legibility any AI system has produced to date. We should not give that away.
"We will, of course. The Number Must Go Up. We aren’t very good at this sort of thinking.
"But we shouldn’t."
Do not let go of human-readable code. Ask me 20 years ago whether we'd get "unreadable code generation" or "readable code generation" out of AIs and I would have guessed they'd generate completely opaque and unreadable code. Good news! I would have been completely wrong! They in fact produce perfectly readable code. It may be perfectly readable "slop" sometimes, but the slop-ness is a separate issue. Even the slop is still perfectly readable. Don't let go of it.
[1]: https://jerf.org/iri/post/2026/what_value_code_in_ai_era/
I’m actually starting to get annoyed about how much material is getting spread around about software analysis / formal methods by folks ignorant about the basics of the field.
Each of these approaches is just fine and widely used, and none of them can be called "automated verification", which, if my understanding of the term is correct, is more about mathematical proof that the program works as expected.
The article mostly talks about automatic test generation.