Writing _all_ (waves hands around various LLM wrapper git repos) these frameworks and harnesses, built on top of ever-changing models, sure doesn't feel sensible.
I don't know what the best way of using these things is, but from my personal experience, the defaults get me a looong way. Letting these things churn away overnight, burning money in the process, with no human oversight seems like something we'll collectively look back at in a few years and laugh about, like using PHP!
Not if you are an AI gold rush shovel salesman.
From the article:
> I've run Claude Code workshops for over 100 engineers in the last six months
I really like this analogy: AI is (or should be) like an exoskeleton that helps people do things. If you put your car in drive, step out, and go to sleep, the next day it will be farther along, but the question is whether it is still on the road.
Yes, this matches my experience with codebases before AI was a thing.
What I could see happening in your scenario is the company suffering from diminishing returns as every task becomes more expensive (new feature, debugging session, library update, refactoring, security audit, rollouts, infra cost). They could also end up with an incoherent, gigantic product that doesn't make sense to their customers.
Both pitfalls are avoidable, but they require focus and attention to detail. Things we still need humans for.
Bad idea. Use another agent to do automatic review. (And a third agent writing tests.)
Don't forget the architecting and orchestrating agent too!
I actually feel that things I built 15 years ago in PHP were better than anything I am trying to achieve with modern things that get outdated every 6 months.
It's not more work; it's a convergence of roles. BA/PO/QA/SWE are merging.
AI has automated aspects of those roles that have made the traditional separation of concerns less desirable. A new hybrid role is emerging. The person writing these acceptance criteria can be the one guiding the AI to develop them.
So now we have dev-BAs or BA-devs or however you'd like to frame it. They're closer to the business than a dev might have been or closer to development than a BA might have been. The point is, smaller teams are able to play wider now.
It literally is. You're spending weeks of effort babysitting harnesses and evaluating models while shipping nothing at all.
Before anyone gets too confused, I love tests. They're great. They help a lot. But to believe they prove correctness is absolutely laughable. Even the most general tests are very narrow. I'm sure they help LLMs just as they help us, but they're not some cure-all. You have to think long and hard about problems and shouldn't let tests drive your development. They're guardrails for checking bounds and reducing footguns.
Oh, who could have guessed, Dijkstra wrote about program completeness. (No, this isn't the foolishness of natural language programming, but it is about formalism ;)
https://www.cs.utexas.edu/~EWD/transcriptions/EWD02xx/EWD288...
The price you pay for tests is that they need to be written and maintained. Writing and maintaining code is much more expensive than people think.
Or at least it used to be. Writing code with claude code is essentially free. But the defect rate has gone up. This makes TDD a better value proposition than ever.
TDD is also great because claude can fix bugs autonomously when it has a clear failing test case. A few weeks ago I used claude code and experts to write a big conformance test suite (300+ tests) for JMAP, a protocol for email. For fun, I asked claude to implement a simple JMAP-only mail server in rust. Then I ran the test suite against claude's output. Something like 100 of the tests failed. Then I asked claude to fix all the bugs found by the test suite. It took about 45 minutes, but now the conformance test suite fully passes. I didn't need to prompt claude at all during that time. This style of TDD is a very human-time-efficient way to work with an LLM.
I think of it more as "locking" the behavior to whatever it currently is.
Either you do the red-green-with-multiple-adversarial-sub-agents thing, or you just do the feature, poke the feature manually, and if it looks good, have the LLM write tests that confirm it keeps doing what it's supposed to do.
The #1 reason TDD failed is because writing tests is BOORIIIING. It's a bunch of repetition with slight variations of input parameters, a ton of boilerplate or helper functions that cover 80% of the cases, but the last 20% is even harder because you need to get around said helpers. Eventually everyone starts copy-pasting crap and then you get more mistakes into the tests.
LLMs will write 20 test cases with zero complaints in two minutes. Of course they're not perfect, but human made bulk tests rarely are either.
Especially for backend software and also for tools, seems like automated tests can cover quite a lot of use cases a system encounters. Their coverage can become so good that they'll allow you to make major changes to the system, and as long as they pass the automated tests, you can feel relatively confident the system will work in prod (have seen this many times).
But maybe you're separating automated testing and TDD as two separate concepts?
You don't need to believe this to practice TDD. In fact I challenge you to find one single mainstream TDD advocate who believes this.
Sounds like a lack of tests for the correct things.
Our society is obsessed with work. Work will never end. If things become easier we just do more of them. Whether putting all our efforts into recycling things created by those that came before is good for us will remain to be seen.
He still has to water the plants on his own. It's just that it costs him quite a bit when all of that could be managed with an alarm to remind him to water the plants.
Tooling around LLMs is a natural next step that will become your default one day.
They're all just tools. You decide how to use them.
Engineer at tech firms and WebShops writing WordPress plugins for single clients where Squarespace doesn't cut it.
Is AI another field of people, or is it killing one or both of those? TBD.
lmao, chuckled
I also spend most of my time reviewing the spec to make sure the design is right. Once I'm done, the coding agent can take 10 minutes or 30 minutes. I'm not really in that much of a rush.
Add to that I have worked on many projects that take more than 20 minutes to fully build and run tests... unfortunately. And I would consider that part of the job of implementing a feature, and to reduce cycles I have to take.
After the "green" signal I will manually review or send off some secondary reviews in other models. Is it wasteful? Probably. But it's pretty damn fun (as long as I ignore the elephant in the room.)
I still think that we, programmers, having to pay money in order to write code is a travesty. And I'm not talking about paying the license for the odd text editor or even for an operating system, I'm talking about day-to-day operations. I'm surprised that there isn't a bigger push-back against this idea.
No. But it is noteworthy. A lot of what one previously needed a SWE to do can now be brute forced well enough with AI. (Granted, everything SWEs complained about being tedious.)
From the customer’s perspective, waiting for buggy code tomorrow from San Francisco, buggy code tonight from India or buggy code from an AI at 4AM aren’t super different for maybe two thirds of use cases.
Only if you ignore everything they generate. Look at all the comments saying that the agent hallucinates a result, generates always-passing tests, etc. Those are absolutely true observations -- and don't touch on the fact that tests can pass, the red/green approach can give thumbs up and rocket emojis all day long, and the code can still be shitty, brittle and riddled with security and performance flaws. And so now we have people building elaborate castles in the sky to try to catch those problems. Except that the things doing the catching are themselves prone to hallucination. And around we go.
So because a portion of (IMO always bad, but previously unrecognized as bad) coders think that these random text generators are trustworthy enough to run unsupervised, we've moved all of this chaotic energy up a level. There's more output, certainly, but it all feels like we've replaced actual intelligent thought with an army of monkeys making Rube Goldberg machines at scale. It's going to backfire.
I've never met those people. I've met a LOT of PMs who tried. I've met a LOT of entrepreneurs who also tried. They never cared about, nor even understood, code. They only cared about "value" (and they are not necessarily wrong about it), so now they can "produce" something that does what they need until it doesn't. When that's the case, they inexorably go back to someone else (might be a SWE, ironically enough, but might also be someone else like them whom they shift responsibility to, for money).
Brute force works until you have to backtrack, then it becomes prohibitively expensive until one has to actually grok the problem landscape. It's amazing for toy projects though, maybe.
The trick is just not mixing/sharing the context. Different instances of the same model do not recognize each other to be more compliant.
It helps, but it definitely doesn't always work, particularly as refactors go on and tests have to change. Useless tests start to grow in count, and important new things aren't tested or aren't tested well.
I've had both Opus 4.6 and Codex 5.3 recently tell me the other (or another instance) did a great job with test coverage and depth, only to find tests in there that just asserted the test harness had been set up correctly, while functionality those tests used to cover was now only checked for existence, with its behavior virtually untested.
Reward hacking is very real and hard to guard against.
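To make the "asserts it exists but not how it behaves" failure concrete, here is a minimal sketch. The function `apply_discount` and both checks are hypothetical names invented for illustration, not taken from any of the codebases mentioned above.

```python
def apply_discount(price: float, percent: float) -> float:
    """Reference implementation (hypothetical example function)."""
    return round(price * (1 - percent / 100), 2)

def broken_discount(price: float, percent: float) -> float:
    """A regression: the discount is silently ignored."""
    return price

def existence_check(fn) -> bool:
    # The kind of test described above: it only proves the symbol is wired up.
    return callable(fn)

def behavior_check(fn) -> bool:
    # A test that pins down actual outputs for known inputs.
    return fn(200.0, 25) == 150.0 and fn(80.0, 0) == 80.0
```

The existence check passes for both implementations, so it cannot catch the regression; only the behavior check tells them apart.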
The concept is:
Red Team (Test Writers), write tests without seeing implementation. They define what the code should do based on specs/requirements only. Rewarded by test failures. A new test that passes immediately is suspicious as it means either the implementation already covers it (diminishing returns) or the test is tautological. Red's ideal outcome is a well-named test that fails, because that represents a gap between spec and implementation that didn't previously have a tripwire. Their proxy metric is "number of meaningful new failures introduced" and the barrier prevents them from writing tests pre-adapted to pass.
Green Team (Implementers), write implementation to pass tests without seeing the test code directly. They only see test results (pass/fail) and the spec. Rewarded by turning red tests green. Straightforward, but the barrier makes the reward structure honest. Without it, Green could satisfy the reward trivially by reading assertions and hard-coding. With it, Green has to actually close the gap between spec intent and code behavior, using error messages as noisy gradient signal rather than exact targets. Their reward is "tests that were failing now pass," and the only reliable strategy to get there is faithful implementation.
Refactor Team, improve code quality without changing behavior. They can see implementation but are constrained by tests passing. Rewarded by nothing changing (pretty unusual in this regard). Reward is that all tests stay green while code quality metrics improve. They're optimizing a secondary objective (readability, simplicity, modularity, etc.) under a hard constraint (behavioral equivalence). The spec barrier ensures they can't redefine "improvement" to include feature work. If you have any code quality tools, it makes sense to give the necessary skills to use them to this team.
It's worth being honest about the limits. The spec itself is a shared artifact visible to both Red and Green, so if the spec is vague, both agents might converge on the same wrong interpretation, and the tests will pass for the wrong reason. The Coordinator (your main claude/codex/whatever instance) mitigates this by watching for suspiciously easy green passes (just tell it) and probing the spec for ambiguity, but it's not a complete defense.
Is it really about rewards? I'm genuinely curious. Because it's not an RL model.
You can use coverage information, and you should cull your tests every once in a while I guess.
Property based testing also helps.
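For anyone who hasn't seen property-based testing: instead of hand-picking inputs, you state a property that must hold for all inputs and throw generated data at it. Below is a hand-rolled sketch using only the standard library (a real project would more likely use a dedicated library); the run-length encoder is a made-up example.

```python
import random

def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse runs of equal characters into (char, count) pairs."""
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def run_length_decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in pairs)

def check_round_trip(trials: int = 500, seed: int = 0) -> int:
    """Property: decode(encode(s)) == s for every generated string s."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        s = "".join(rng.choice("ab ") for _ in range(rng.randint(0, 20)))
        assert run_length_decode(run_length_encode(s)) == s, repr(s)
    return trials
```

A property like this is harder to satisfy with hard-coded answers than a handful of example-based assertions, which is why it helps against the test gaming discussed above.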
[1] https://simonwillison.net/guides/agentic-engineering-pattern...
https://www.joegaebel.com/articles/principled-agentic-softwa... https://github.com/JoeGaebel/outside-in-tdd-starter
> When asking Claude Code to write tests, I find they are inevitably coupled to implementation details, mockist, brittle, and missing coverage.
Interestingly, I haven't noticed any of that so far, using Claude Code on a new-ish project (couple 10k loc). However, I also went out of my way in my CLAUDE.md to instruct it to write functional code, avoid side effects / push side effects to the shell (functional core, imperative shell), avoid mocks in tests, etc. etc.
You write a failing test for the new functionality that you’re going to add (which doesn’t exist yet, so the test is red). You then write the code until the test passes (that is, goes green).
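A tiny sketch of that cycle with `unittest`. The `slugify` function is a hypothetical example; the comments mark which part would be written at which step.

```python
import unittest

def slugify(title: str) -> str:
    # Green step: added only after the test below was run and seen to fail.
    return "-".join(title.lower().split())

class SlugifyTest(unittest.TestCase):
    # Red step: written first, while slugify() does not exist yet,
    # so the first run fails with a NameError.
    def test_spaces_become_hyphens(self):
        self.assertEqual(slugify("Hello  Agent World"), "hello-agent-world")

def run_suite() -> bool:
    """Run the test case and report whether everything is green."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTest)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

Seeing the test fail first is the point: it shows the test can actually detect the missing behavior.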
● Separation of concerns. No single agent plans, implements, and verifies. The agent that writes the code is never the agent that checks it.
https://benhouston3d.com/blog/the-rise-of-test-theater
You have to actively work against it.
I've written about this and have a POC here for those interested: https://www.joegaebel.com/articles/principled-agentic-softwa...
> Most teams don't [write tests first] because thinking through what the code should do before writing it takes time they don't have.
It's astonishing to me how much our industry repeats the same mistakes over and over. This doesn't seem like what other engineering disciplines do. Or is this just me not knowing what it looks like behind the curtain of those fields?
I like to think that people writing actual mission-critical software try their absolute best to get it right before shipping, and that the rest of our industry exists in a totally separate world where a bug in the code is just not that big of a deal. Yeah, it might be expensive to fix, but usually it can be reverted or patched with only an inconvenience to the user and to the business.
It’s like the fines that multinational companies pay when breaking the law. If it’s a cost of doing business, it’s baked into the price of the product.
You see this also in other industries. OSHA violations on a residential construction site? I bet you can find a dozen if you really care to look. But 99% of the time, there are no consequences big enough for people to care so nobody wears their PPE because it “slows them down” or “makes them less nimble”. Sound familiar?
With other engineering professions, all projects are like that. You cannot "deploy a bridge to production" to see what happens and fix it after a few people have died.
So now people just ignore broken tests.
> Claude, please implement this feature.
> Claude, please fix the tests.
The only thing we've gained from this is that we can brag about test coverage.
1. one agent writes/updates code from the spec
2. one agent writes/updates tests from identified edge cases in the spec.
3. a QA agent runs the tests against the code. When a test fails, it examines the code and the test (the only agent that can see both) to determine blame, then gives feedback to the code and/or test writing agent on what it perceives the problem as so they can update their code.
(repeat 1 and/or 2 then 3 until all tests pass)
Since the code can never fix itself to directly pass the test and the test can never fix itself to accept the behavior of the code, you have some independence. The failure case is that the tests simply never pass, not that the test writer and code writer agents both have the same incorrect understanding of the spec (which is very improbable, like something that will happen before the heat death of the universe improbable, it is much more likely the spec isn't well grounded/ambiguous/contradictory or that the problem is too big for the LLM to handle and so the tests simply never wind up passing).
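The loop above could be sketched roughly like this. The three callables are stand-ins for the agents (how you actually invoke a model is left out), and all names are hypothetical. The point is the information flow: only the QA step sees both artifacts, and the writers only ever receive feedback text.

```python
from typing import Callable, Optional, Tuple

Feedback = Optional[str]  # None means "nothing to fix"

def run_until_green(
    write_code: Callable[[Feedback], str],
    write_tests: Callable[[Feedback], str],
    qa_review: Callable[[str, str], Tuple[Feedback, Feedback]],
    max_rounds: int = 5,
) -> Optional[int]:
    """Return the round in which QA passed, or None if it never did."""
    code = write_code(None)    # step 1: code from the spec
    tests = write_tests(None)  # step 2: tests from the spec
    for round_no in range(1, max_rounds + 1):
        code_fb, test_fb = qa_review(code, tests)  # step 3: assign blame
        if code_fb is None and test_fb is None:
            return round_no
        if code_fb is not None:
            code = write_code(code_fb)
        if test_fb is not None:
            tests = write_tests(test_fb)
    return None  # the failure mode described above: tests never pass
```

The `max_rounds` cap is an added assumption so that the "tests simply never pass" case ends with a clear signal instead of looping forever.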
Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?
Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.
One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.
If you find a big problem in commit #20 of #40, you'll have to potentially redo the last 20 commits, which is a pain.
You seem to be gated on your review bandwidth and what you probably want to do is apply backpressure - stop generating new AI code if the code you previously generated hasn't gone through review yet, or limit yourself to say 3 PRs in review at any given time. Otherwise you're just wasting tokens on code that might get thrown out. After all, babysitting the agents is probably not 'free' for you either, even if it's easier than writing code by hand.
Of course if all this agent work is helping you identify problems and test out various designs, it's still valuable even if you end up not merging the code. But it sounds like that might not be the case?
Ideally you're still better off, you've reduced the amount of time being spent on the 'writing the PR' phase even if the 'reviewing the PR' phase is still slow.
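One way to picture the backpressure idea is a bounded queue: once the review slots are full, new agent work is refused instead of piling up. A minimal sketch, using the limit of 3 mentioned above; the function names are made up.

```python
import queue

# At most 3 PRs may wait for human review at any time.
review_queue: "queue.Queue[str]" = queue.Queue(maxsize=3)

def submit_for_review(pr_id: str) -> bool:
    """Try to enqueue a PR; False means 'stop generating, go review'."""
    try:
        review_queue.put_nowait(pr_id)
        return True
    except queue.Full:
        return False

def finish_review() -> str:
    """A human finished reviewing the oldest PR, freeing one slot."""
    return review_queue.get_nowait()
```

The agent launcher would check the return value of `submit_for_review` before kicking off the next task.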
Get an LLM to generate a list of things to check based on those plans (and pad that out yourself with anything important to you that the LLM didn't add), then have the agents check the codebase file by file for those things and report any mismatches to you. As well as some general checks like "find anything that looks incorrect/fragile/very messy/too inefficient". If any issues come up, ask the agents to fix them, then continue repeating this process until no more significant issues are reported. You can do the same for unit tests, asking the agents to make sure there are tests covering all the important things.
Code review is a skill, as is reading code. You're going to quickly learn to master it.
> It's like 20k of line changes over 30-40 commits.
You run it, in a debugger and step through every single line along your "happy paths". You're building a mental model of execution while you watch it work.
> One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.
Not going to be a time saver, but next time you want to take nibbles and bites, and then merge the branches in (with the history). The hard lesson here is around task decomposition, in line documentation (cross referenced) and digestible chunks.
But if you get step debugging running and do the hard thing of getting through reading the code you will come out the other end of the (painful) process stronger and better resourced for the future.
Redoing the work as smaller PRs might help with readability, but then you get the opposite problem: it becomes hard to hold all the PRs in your head at once and keep track of the overall purpose of the change (at least for me).
IMO the real solution is figuring out which subset of changes actually needs human review and focusing attention there. And even then, not necessarily through diffs. For larger agent-generated changes, more useful review artifacts may be things like design decisions or risky areas that were changed.
Same as before. Small PRs, accept that you won't ship a month of code in two days. Pair program with someone else so the review is just a formality.
The value of the review is _also_ for someone else to check if you have built the right thing, not just a thing the right way, which is exponentially harder as you add code.
i think we will need some kind of automated verification so humans are only reviewing the “intent” of the change. started building a claude skill for this (https://github.com/opslane/verify)
What if instead, the goal of using agents was to increase quality while retaining velocity, rather than the current goal of increasing velocity while (trying to) retain quality? How can we make that world come to be? Because TBH that's the only agentic-oriented future that seems unlikely to end in disaster.
Then, what comes next feels less like a new software practice and more like a new religion, where trust has to replace understanding, and the code is no longer ours to question.
- Highly paid FAANG engineers that are working on side projects / startup ideas, and will pay whatever it takes. They have the means to do so.
- Startups with funds.
- Regular tech workers that are allowed to use the company card.
It's currently burning through the TESTING.md backlog: https://github.com/alpeware/datachannel-clj
But review fatigue and resulting apathy is real. Devs should instead be informed if incorrect code for whatever feature or process they are working on would be high-risk to the business. Lower-risk processes can be LLM-reviewed and merged. Higher risk must be human-reviewed.
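A rough sketch of what that routing could look like if risk is approximated by which paths a change touches. The glob patterns are purely hypothetical; a real team would derive them from what is actually high-risk to their business.

```python
from fnmatch import fnmatch

# Hypothetical high-risk areas; adjust to your own business.
HIGH_RISK_GLOBS = ["billing/*", "auth/*", "migrations/*"]

def review_route(changed_paths: list[str]) -> str:
    """Return 'human' if any changed file is high-risk, else 'llm'."""
    for path in changed_paths:
        if any(fnmatch(path, pattern) for pattern in HIGH_RISK_GLOBS):
            return "human"
    return "llm"
```

For example, `review_route(["docs/readme.md", "auth/login.py"])` routes to a human because one file is under `auth/`.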
If the business you're supporting can't tolerate much incorrectness (at least until discovered), then guess what - you aren't going to get much of a speed increase from LLMs. I've written about and given conference talks on this over the past year. Teams can improve this problem at the requirements level: https://tonyalicea.dev/blog/entropy-tolerance-ai/
I can't understand the mindset that would lead someone not to have realized this from the beginning.
TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice. It’s “red green refactor repeat”, and each step is only a handful of lines of code.
TDD is not “write the tests, then write the code.” It’s “write the tests while writing the code, using the tests to help guide the process.”
Thank you for coming to my TED^H^H^H TDD talk.
I would like to emphasize that feedback includes being alerted to breaking something you previously had working in a seemingly unrelated/impossible way.
I have been asking these tools to build other types of projects where it seems much more difficult to verify without a human in the loop. One example: I asked Codex to build a simulation of the solar system using a Metal renderer. It produced a fun working app quickly.
I asked it to add bloom. It looped for hours, failing. I would have to manually verify — because even from images — it couldn't tell what was right and wrong. It only got it right when I pasted a how-to-write-a-bloom-shader-pass-in-Metal blog post into it.
Then I noticed that all of the planet textures were rotating oddly every time I orbited the camera. Codex got stuck in another endless loop of "Oh, the lookAt matrix is in column major, let me fix that <proceeds to break everything>." or focusing (incorrectly) on UV coordinates and shader code. Eventually Codex told me what I was seeing "was expected" and that I just "felt like it was wrong."
When I finally realised the problem was that Codex had drawn the planets with back-facing polygons only, I reported the error, to which Codex replied, "Good hypothesis, but no"
I insisted that it change the culling configuration and then it worked fine.
These tools are fun, and great time savers (at times), but take them out of their comfort zone and it becomes real hard to steer them without domain knowledge and close human review.
Even better though - external test suites. I recently made an S3 server, and the LLM made quick work of the MVP. Then I found a Ceph S3 test suite that I could run against it, and oh boy. It ended up working really well as TDD though.
People are so enamored with how fast the 20% part is now and yes it’s amazing. But the 80% part by time (designing, testing, reviewing, refactoring, repairing) still exists if you want coherent systems of non-trivial complexity.
All the old rules still apply.
Not a rhetorical question. Trillion token burners and such.
If you are on a sinking ship would you not do your best to position yourself?
Or do you see your actions morally equivalent to others regardless of scale?
Well, a) it's a hobby, and b) this is still a free country/free society.
One example I have been experimenting with is Learning Tests[1]. The idea is that when something new is introduced in the system, the agent must execute a high-value test to teach itself how to use this piece of code. Because these should be high leverage, i.e. they can really help anyone understand the code base better, they should be exceptionally well chosen for AIs to iterate with. But again, this is just the expert human judgement shifted to identifying these tests for the AI to learn from. In code bases that add millions of LoC of new features in days, this would require careful work by the human.
[1] https://anthonysciamanna.com/2019/08/22/the-continuous-value...
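For illustration, a learning test does not test your own code; it pins down how a dependency behaves so that a human or an agent can rely on it. A small sketch against the standard library's `urljoin`, chosen only as an example of an API whose edge cases are easy to misremember:

```python
from urllib.parse import urljoin

def learn_urljoin() -> dict[str, str]:
    """Record how urljoin treats trailing slashes and absolute paths."""
    base_file = "https://example.com/a/b"
    base_dir = "https://example.com/a/b/"
    return {
        "relative_to_file": urljoin(base_file, "c"),  # replaces last segment
        "relative_to_dir": urljoin(base_dir, "c"),    # appends to directory
        "absolute_path": urljoin(base_file, "/c"),    # resets the whole path
    }
```

If a library upgrade ever changes one of these answers, the learning test fails before your own code does.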
When I graduated in 2012 it was pushed everywhere, including my uni so my undergrad thesis was done in Java.
Everyone was learning it, certifying, building things on top of other things.
EJB, JPA, JTA, JNDI, JMS and JCA.
And then more things to make it even more powerful with Servlets, JSP, JSTL, JSF.
Many companies invested and built various application servers, used by enterprises to this day.
Every engineer I've met said Java is server side future, don't bother with other tech. You'll just draw data schema, persistence mapping, business logic and ship it.
I switched to C++ after a talk by Bjarne that I attended in 2013. I'm glad I did, although I never worked as a software engineer. Following my passion and going deep into technology was bliss for me; the difference between my undergrad Java, master's C++ and PhD Rust is like that between a kid's toy and a real turboprop engine.
Don't follow the hype - it will go away and you'll be left with what you've invested into.
If an agent runs unattended for hours, small errors compound quickly. Even simple misunderstandings about file structure or instructions can derail the whole process.
LLMs don't actually have a reward system like some other ML models.
maybe it still sends you to the same valley, but there's so many parameters and dimensions that i dont think its very likely without also being correct
But there's a second problem underneath that one. Acceptance criteria are ephemeral. You write them before prompting, Playwright runs against them, and then where do they go? A Notion doc. A PR comment. Nowhere permanent. Next time an agent touches that feature, it's starting from zero again.
The commit that ships the feature should carry the criteria that verified it. Git already travels with the code. The reasoning behind it should too.
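One possible way to make criteria travel with the commit is to put them in trailer-style lines at the end of the commit message. A minimal sketch; the trailer key `Acceptance-Criterion` is an invented convention, not an established one.

```python
TRAILER_KEY = "Acceptance-Criterion"  # hypothetical trailer name

def commit_message(subject: str, body: str, criteria: list[str]) -> str:
    """Build a message whose last paragraph lists the criteria as trailers."""
    trailers = "\n".join(f"{TRAILER_KEY}: {c}" for c in criteria)
    return "\n\n".join([subject, body, trailers])

def read_criteria(message: str) -> list[str]:
    """Recover the criteria from a commit message produced above."""
    prefix = f"{TRAILER_KEY}: "
    return [
        line[len(prefix):]
        for line in message.splitlines()
        if line.startswith(prefix)
    ]
```

An agent touching the feature later could read the message of the relevant commit and pass it to `read_criteria` instead of starting from zero.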
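#!/usr/bin/env python3
import sys
# Exit code 2 marks a blocking failure; the message goes to stderr so the model sees it.
print("fix needed: method ABC needs a return type annotation on line 45", file=sys.stderr)
sys.exit(2)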
Claude Code will show that output to the model. This lets you enforce anything from TDD to a ban on window.alert() in code - deterministically.
This can be the basis for much more predictable enforcement of rules and standards in your codebase.
Once you get used to code based guardrails, you’ll see how silly the current state of the art is: why do we pack the context full of instructions, distract the model from its task, then act all surprised when it doesn’t follow them perfectly!
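As a sketch of the window.alert() ban mentioned above: a checker that scans files and reports a non-zero status when it finds a violation. How the hook receives the file paths depends on your hook configuration; here they are simply passed in as a list, which is an assumption for the sake of the example.

```python
import re
import sys
from pathlib import Path

BANNED = re.compile(r"\bwindow\.alert\s*\(")

def find_violations(text: str, name: str) -> list[str]:
    """Return one message per line that calls window.alert()."""
    return [
        f"{name}:{lineno}: window.alert() is banned"
        for lineno, line in enumerate(text.splitlines(), start=1)
        if BANNED.search(line)
    ]

def main(paths: list[str]) -> int:
    """Check each file; return 2 (blocking) on violations, else 0."""
    problems: list[str] = []
    for path in paths:
        problems += find_violations(Path(path).read_text(encoding="utf-8"), path)
    for message in problems:
        print(message, file=sys.stderr)
    return 2 if problems else 0
```

A wrapper script would finish with `sys.exit(main(sys.argv[1:]))`. The rule lives in code, so it costs no context and behaves the same way on every run.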
The idea is to measure how much of the invariant string or byte structure a new version still covers.
In the context of agents that churn out code overnight, review fatigue hits the human reviewers. One way to prioritize review effort is to treat the codebase's own history as a "corpus" and flag commits that deviate structurally from past patterns, like an injection test for code. If an agent adds 20k lines that are mostly boilerplate, the coverage of established code patterns might drop, signalling something worth a closer look. It's not a substitute for tests or semantic verification, but a cheap way to surface outliers.
We tested this on Alpine apk-tools across nine releases: the 3.23 rewrite dropped from 71‑80% string coverage to 40%, and from 23‑28% byte n‑gram coverage to 13.6%—detected automatically just from size distribution. Applied to code, you could imagine a dashboard where every PR gets a "historical coverage" score; when it drops, the human knows to zoom in.
It complements existing verification: a signature tells you who signed, not whether the artifact is consistent with its own version history. Same for AI‑generated PRs: tests may pass, but if the code looks nothing like what came before, that’s a useful signal.
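A toy version of a "historical coverage" score, to show the shape of the idea. This is a guess at one reasonable definition (the share of the candidate's byte n-grams already seen in earlier versions), not necessarily the exact method used for the apk-tools numbers above.

```python
def ngrams(data: bytes, n: int = 4) -> set[bytes]:
    """All distinct byte n-grams in data (empty if data is shorter than n)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def historical_coverage(history: list[bytes], candidate: bytes, n: int = 4) -> float:
    """Fraction of the candidate's n-grams that occur in any past version."""
    seen: set[bytes] = set()
    for old in history:
        seen |= ngrams(old, n)
    new = ngrams(candidate, n)
    if not new:
        return 1.0  # nothing to compare, treat as fully covered
    return len(new & seen) / len(new)
```

A PR dashboard would compute this per change and flag the ones whose score falls well below the project's usual range.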
Details here if anyone's curious: https://lf3.gitlab.io/blog/binary-string-mask/
Whenever I coded any serious solution as a technical co-founder, every single day there was a major new debate about the product direction. Though we made massive 'progress' and built out a whole new universe in software, we haven't yet managed to find product-market fit. It's like constant tension. If the intelligence of two relatively intelligent humans with a ton of experience and complementary expertise isn't enough to find product-market fit after one year, this gives you an idea about how high the bar is for an AI agent.
It's like the problem was that neither me nor my domain expert co-founder who had been in his industry for over 15 years had a sufficiently accurate worldview about the industry or human psychology to be able to produce a financially viable solution. Technically, it works perfectly but it just doesn't solve anyone's problem.
So just imagine how insanely smart AI has to be to compete in the current market.
Maybe you could have 100 agents building and promoting 100 random apps per day... But my feeling is that you're going to end up spending more money on tokens and domain names than you will earn in profits. Maybe deploy them all under the same domain with different subdomains? Not great for SEO... Also, the market for all these basic low-end apps is going to be extremely competitive.
IMO, the best chance to win will be on medium and complex systems and IMO, these will need some kind of human input.
The cost concern is real but manageable. The key is routing models by task. Complex reasoning gets Opus, routine work gets Sonnet, mechanical tasks get Haiku. Not everything needs the expensive model.
The quality concern is the bigger one. What people miss about autonomous agents is that "running unsupervised" doesn't mean "running without guardrails." Each of my agents has explicit escalation rules, a security agent that audits the others, and a daily health report system that catches failures. The agents that work best are the ones with built-in disagreement, not the ones that just pass things through.
Wrote up the full architecture here if anyone's curious about the multi-agent coordination patterns: https://clelp.com/blog/how-we-built-8-agent-ai-team
This resonates with my experience, and it is also a refreshingly honest take: pushing back on heavy upfront process isn't laziness, it's just the engineer's natural drive to build things and feel productive.
They crash. The interesting question isn't how to prevent that — it's how to make it not matter.
The gnarliest failure I hit: my agents share a knowledge graph through an MCP memory server. When multiple agents fire parallel tool calls (say, create_entities and create_relations in the same batch), you get a classic read-modify-write race. Both operations read the same JSONL state, both write back the full graph plus their additions. Second write obliterates the first. No error, no warning — data just vanishes. Sometimes the write gets interrupted mid-line and you end up with a half-written JSON line that breaks the parser on next load.
My fix was a local fork of the memory server with three things: an async mutex to serialize writes, atomic writes (write to .tmp then rename), and auto-repair on load that skips corrupt lines and deduplicates. But the meta-point is that on the BEAM, this entire class of bug doesn't exist. A GenServer processes messages sequentially from its mailbox — mutual exclusion is the execution model, not something you bolt on with a mutex. Supervision trees restart crashed processes in microseconds. Each process has its own heap, so one agent going haywire can't corrupt another's state.
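For readers who want the shape of that three-part fix, here is a rough sketch in Python rather than the memory server's own code. It assumes one JSON object per line; a `threading.Lock` stands in for the async mutex, and `load_graph` and `append_records` are hypothetical names.

```python
import json
import os
import threading
from pathlib import Path

_write_lock = threading.Lock()  # stand-in for the async mutex in the comment


def load_graph(path: Path) -> list[dict]:
    """Load JSONL records, skipping corrupt lines and exact duplicates."""
    records: list[dict] = []
    seen: set[str] = set()
    if not path.exists():
        return records
    for line in path.read_text(encoding="utf-8").splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # auto-repair: drop half-written or garbled lines
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            records.append(record)
    return records


def append_records(path: Path, new_records: list[dict]) -> None:
    """Read-modify-write under a lock, then atomically swap the file in."""
    with _write_lock:  # serialize writers so no update is lost
        records = load_graph(path) + new_records
        tmp = path.with_suffix(path.suffix + ".tmp")
        with tmp.open("w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
        os.replace(tmp, path)  # atomic when tmp and path share a filesystem
```

The lock only protects writers inside one process; several server processes sharing the file would need a file lock or a single owning process, which is the point the comment makes about GenServers.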
Erlang, whose development began in 1986, was built to solve this for telecom switches that needed 99.999% uptime. The pattern maps almost perfectly to AI agents: many concurrent, stateful, failure-prone processes that need to communicate without taking each other down.
I wrote a detailed post about this with actual code and the full corruption story: https://dev.to/setas/why-erlangs-supervision-trees-are-the-m...
You can have Gemini write the tests and Claude write the code. And have Gemini do review of Claude's implementation as well. I routinely have ChatGPT, Claude and Gemini review each other's code. And having AI write unit tests has not been a problem in my experience.
To everyone who plans on automating themselves out of a job by taking the human element out, this is the endgame that management wants: replacing your (expensive and non-tax-optimized) labor with scalable Opex.
The hardest bug I hit was a shared JSONL memory store: two agents wrote at once, one update silently overwrote the other, and sometimes the file ended up partially corrupted. I fixed it with a mutex and atomic writes, but that mostly taught me I was working against my runtime.
The reason I keep ending up back at Erlang/OTP is that supervision and isolated processes are the default model, not a patch. If agents are going to run while you sleep, restart behavior matters more than clever prompts.
Telling Claude to turn your notes into a blog post with simple, terse language does not hide your own lack of taste.
Outage is the easy failure mode. I can work around a service that's up 80% of the time, but is 100% correct. A service that's up 100% of the time but is 80% correct is useless.
Honestly, sometimes the harnesses, specs, predefined structures for skills, etc. all feel like over-engineering. 99% of the time a bloody prompt will do. Claude Code is capable of planning, spawning sub-agents, writing tests and so on.
A Claude.md file with general guidelines about our repo has worked extraordinarily well, without any external wrappers, harnesses or special prompts. Even the MD file has no specific structure, just instructions or notes in English.
I've been playing around with agent orchestration recently and at least tried to make useful outputs. The biggest differences were having pipelines talk to each other and making most of the work deterministic scripts instead of more LLM calls (funnily enough).
Made a post about it here in case anyone is interested in the technical details: https://www.frequency.sh/blog/introducing-frequency/
I've been doing some DIY/citizen science type agent orchestration as well: https://blog.unratified.org/2026-03-06-receiving-side-agent-...
Not quite to the same scale, but I share the same sentiment - working through scripts instead of the LLM is an important key, I think.
The architecture we landed on: ingest goes through a certainty scoring layer before storage. Contradictions get flagged rather than silently stacked. Memories that get recalled frequently get promoted; stale ones fade.
It's early but the difference in agent coherence over long sessions is noticeable. Happy to share more if anyone's going down this path.
How do you implement the scoring layer, and when and how is it invoked?
What he describes is like that. Just that the plan step is suggesting docs, not writing actual docs.
Seems things still haven't changed in half a century
https://www.cs.utexas.edu/~EWD/transcriptions/EWD02xx/EWD288...
Since you have to test that manually anyway, you can have AI write the code first; you test it; if it's the right result, you tell AI this is correct, so write test cases for this result.
Seems like QA is the new prompt engineering
> Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do.
> I care about this. I don't want to push slop
They clearly didn't care about that. They only cared about nonstop generation of lines of code and shipping anything fast. Otherwise they wouldn't need weeks to realise that they weren't reading or testing this code - it's obvious from the outset.
Maybe their approach to this changed and that's fine, but at the beginning they very much did not care, and I feel people only keep saying that they do because otherwise they'd need to be the ones to admit the emperor isn't wearing clothes.
That’s really putting the cart before the horse. How do you get to “merging 50 PRs a week” before thinking “wait, does this do the right thing?”
[1] https://code.claude.com/docs/en/devcontainer
If you want to try it just ask Claude to set it up for your project and review it after.
It will probably comply, and at least if it does change the tests you can always revert those files to where you committed them
One could even make zero-knowledge test development this way.
- privacy policy links to marketing company `beehiiv.com`. the blog author doesn't show up there.
- the profile picture url is `.../Generated_Image_March_03__2026_-_1_55PM.jpg.jpeg`
i didn't dig or read further.
I want to subscribe, but I never end up reading newsletters if they land in my email inbox.
I've been building OctopusGarden (https://github.com/foundatron/octopusgarden), which is basically a dark software factory for autonomous code generation and validation. A lot of the techniques were inspired by StrongDM's production software factory (https://factory.strongdm.ai/). The autoissue.py script (https://github.com/foundatron/octopusgarden/blob/main/script...) does something really close to what others in this thread are describing with information barriers. It's a 6-phase pipeline (plan, review plan, implement, cold code review, fix findings, CI retry) where each phase only gets the context it actually needs. The code review phase sees only the diff. Not the issue, not the plan. Just the diff. That's not a prompt instruction, it's how the pipeline is wired. Complexity ratings from the review drive model selection too, so simple stuff stays on Sonnet and complex tasks get bumped to Opus.
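To illustrate what "wired, not prompted" can look like, here is a small sketch that is not taken from autoissue.py. The phase names follow the six phases listed above, but only the code review row (diff only) is stated in the comment; every other row, and the names `PHASE_INPUTS` and `context_for`, are assumptions.

```python
# Which artifacts each phase is allowed to see. Only the
# "cold_code_review" entry is described in the comment; the rest are guesses.
PHASE_INPUTS: dict[str, set[str]] = {
    "plan": {"issue"},
    "review_plan": {"issue", "plan"},
    "implement": {"issue", "plan"},
    "cold_code_review": {"diff"},
    "fix_findings": {"diff", "findings"},
    "ci_retry": {"diff", "ci_logs"},
}


def context_for(phase: str, artifacts: dict[str, str]) -> dict[str, str]:
    """Return only the artifacts the given phase is wired to receive."""
    allowed = PHASE_INPUTS[phase]
    return {name: value for name, value in artifacts.items() if name in allowed}
```

Because the filter runs before any prompt is built, the reviewing model cannot be talked into reading the plan; the data is simply not in its context.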
On the test freezing discussion, OctopusGarden takes a different approach. Instead of locking test files, the system treats hand-written scenarios as a holdout set that the generating agent literally never sees. And rather than binary pass/fail (which is totally gameable, the specification gaming point elsewhere in this thread is spot on), an LLM judge scores satisfaction probabilistically, 0-100 per scenario step. The whole thing runs in an iterative loop: generate, build in Docker, execute, score, refine. When scores plateau there's a wonder/reflect recovery mechanism that diagnoses what's stuck and tries to break out of it.
The point about reviewing 20k lines of generated code is real. I don't have a perfect answer either, but the pipeline does diff truncation (caps at 100KB, picks the 10 largest changed files, truncates to 3k lines) and CI failures get up to 4 automated retry attempts that analyze the actual failure logs. At least overnight runs don't just accumulate broken PRs silently.
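A rough sketch of that kind of diff truncation, using the three limits quoted above. The order in which the limits apply, the use of diff length as the measure of "largest", and the name `truncate_diff` are assumptions, not the project's actual logic.

```python
MAX_FILES = 10           # keep the 10 largest changed files
MAX_LINES = 3_000        # then truncate to 3k lines
MAX_BYTES = 100 * 1024   # then cap the payload at 100KB


def truncate_diff(file_diffs: dict[str, str]) -> str:
    """Reduce a {path: diff text} mapping to a bounded review payload."""
    # Assumption: "largest" means the longest per-file diff text.
    largest = sorted(
        file_diffs.items(), key=lambda item: len(item[1]), reverse=True
    )[:MAX_FILES]
    lines: list[str] = []
    for path, diff in largest:
        lines.append(f"### {path}")
        lines.extend(diff.splitlines())
    text = "\n".join(lines[:MAX_LINES])
    # Byte cap last; errors="ignore" drops a multibyte character split by the cut.
    return text.encode("utf-8")[:MAX_BYTES].decode("utf-8", errors="ignore")
```

Whatever the real ordering is, the trade-off is the same: the reviewer sees a bounded slice, so anything outside the ten largest files goes unreviewed.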
Also want to shout out Ouroboros (https://github.com/Q00/ouroboros), which comes at the problem from the opposite direction. Instead of better verification after generation, it uses Socratic questioning to score specification ambiguity before any code gets written. It literally won't let you proceed until ambiguity drops below a threshold. The core idea ("AI can build anything, the hard part is knowing what to build") pairs well with the verification-focused approaches everyone's discussing here. Spec refinement upstream, holdout validation downstream.
How is this even possible? Am I the only SWE who feels like the easiest part of my job is writing code and this was never the main bottleneck to PR?
Before CC I probably spent around 20-30% of my day just writing code into an IDE. That's maybe 10% now. I'd probably also spend 20-30% of my day reading code and investigating issues, which is now maybe 10-15% of my day, using CC to help with investigation and explanations.
But there's a huge part of my day, perhaps the majority of it, where I'm just thinking about technical requirements, trying to figure out the right data model & right architecture given those requirements, thinking about the UX, attending meetings, code reviews, QA, etc, etc, etc...
Are these people who are spitting out code literally doing nothing but writing code all day without any thought so now they're seeing 4-5x boosts in output?
For me it's probably made me 50% more efficient in about 40-50% of my work. So I'm probably only like 20-25% more efficient overall. And this assumes that the code I'm getting CC to produce is even comparable to my own, which in my experience it's not without significant effort, which just erodes any productivity benefit from the production of code.
If your developers are raising 5x more PRs something is seriously wrong. I suspect that's only possible if they're not thinking through things and just getting CC to decide the requirements, come up with the architecture, decide on implementation details, write the code and test it. Presumably they're also not reviewing PRs, because if they were and there are this many PRs being raised, then how does the team have time to spit out code all day using CC?
People who talk about 5x or 10x productivity boosts are either doing something wrong, or just building prototypes. As someone who has worked in this industry for 20 years, I literally don't understand how what some people describe can even be happening in functional SWE teams building production software.
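The back-of-the-envelope estimate above can be checked two ways; which one applies depends on what "50% more efficient" means, so treat both readings as assumptions. The function names are made up for illustration.

```python
def overall_time_saved(fraction: float, saving: float) -> float:
    """Share of total time saved when `saving` of the time spent on
    `fraction` of the work disappears (straight multiplication)."""
    return fraction * saving


def overall_speedup(fraction: float, speedup: float) -> float:
    """Amdahl's law: overall speedup when only `fraction` of the work
    becomes `speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / speedup)


for fraction in (0.4, 0.5):
    print(
        f"fraction={fraction:.0%}: "
        f"time saved={overall_time_saved(fraction, 0.5):.0%}, "
        f"Amdahl speedup={overall_speedup(fraction, 1.5):.2f}x"
    )
```

Straight multiplication reproduces the 20-25% figure. Reading "50% more efficient" as a 1.5x speedup on that slice gives roughly 1.15x to 1.2x overall, so the estimate is, if anything, slightly generous.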
I don't think AI will ever solve this problem. It will never be more than a tool in the arsenal. Probably the best tool, but a tool nonetheless.
Good luck doing that in any company that does something meaningful. I can't believe anybody can seriously be ok with such a workflow, except maybe for your little pet project at home.
If you don't review the result, who is going to want to use or even pay for this slop?
Reviewing is the new bottleneck. If you cannot review any more code, stop producing new code.
Don't get me wrong, I use agentic coding often, when I feel it's going to type it faster than me (e.g. a lot of scaffolding and filler code).
Otherwise, what's the point?
I feel the whole industry is having its "Look ma! no hands!" moment.
Time to mature up, and stop acting like sailing is going where the seas take you.
These are fundamentals of CS that we are forgetting as we dismantle all truth and keep rocketing forward into LLM psychosis.
> I care about this. I don't want to push slop, and I had no real answer.
The answer is to write and understand code. You can't not want to push slop, and also want to just use LLMs.
Code Review: https://news.ycombinator.com/item?id=47313787
If you don’t trust the agent to do it right in the first place why do you trust them to implement your tests properly? Nothing but turtles here.
1. Write tons of documentation first, i.e. NASA style: every single known piece of information that is important to implementation. As it's a rewrite of a legacy project, I know pretty much everything I need, so there is very little idea validation/discovery in the loop for that stage. Documentation is structured in nested folders and multiple small .md files, because the total is already larger than Claude Code's context (it still fits into Gemini's). Some of the core design documents are included in AGENTS.md (with symlinks to the GEMINI/CLAUDE mds).
For that particular project I spent around 1.5 months writing those docs. I used Claude to help with docs, especially based on the existing code base, but the docs are read and validated by humans, as a single source of truth. For every document I was also throwing Gemini and Codex onto it for analyzing for weaknesses or flaws (that worked great, btw).
2. TDD in its extreme version, with unit tests, integration tests, e2e, visual testing in Maestro, etc. The whole implementation process is split into multiple modules and phases, but each phase starts with writing tests first. Again, as soon as the test plan is ready, I also throw it at Gemini and Codex to find flaws, missed edge cases, etc. After implementing the tests, one more time: give them to Gemini/Codex to analyze and critique.
3. Actual coding. This part is the fastest now, especially with docs and tests in place, but it's still crucial to split the work into manageable phases/chunks, validate every phase manually, and occasionally run rounds of Gemini/Codex independently verifying that the code matches the docs and doesn't contain flaws, extra duplication, etc.
I never let Claude commit to git. I review changes quickly, checking if the structure of the code makes sense, skimming over the most important files to see if it looks good to me (i.e. no major bullshit, which, frankly, has never happened yet) and commit everything myself. Again, I try to make those phases small enough that my quick skim-review is still meaningful.
If my manual inspection/test after each phase shows something missing or deviating, the first thing I ask is "check if that is in our documentation". And then I repeat the loop: update docs, update/add tests, implement.
The project is still in progress, but so far I'm quite happy with the process and the speed. In a way, I feel that "writing documentation" and "TDD" have always been good practice, but too expensive given that the same time could've been spent on writing actual code. AI writing code flipped that dynamic, so I'm happy to spend more time on actual architecting/debating/making choices than on finger tapping.