this has been an obvious thing to do since at least January (since Geoffrey Huntley published "everything is a ralph loop"), and this is how I've been working: build enough orchestration tooling to automate everything: development container bringup, building, running the unit tests, integration testing, and eventually using the software the way an end user would. then, to iterate, set performance goals on that already solid basis so the automated agent ("gym") can iterate autonomously and let you know when it's "done".
I understand this probably does not work if you're on some subscription and not using the API (tokens burn fast), but this has been extremely productive for me.
I've slowly been optimizing for token use through the stack and Claude ends up making very tight for loops for most of the process and keeping token count even lower. It's been nice. A lot of my toil at work is just gone.
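A minimal sketch of that kind of gated loop, assuming each stage (build, unit tests, integration tests) can be reduced to a callable that returns pass/fail. `Stage`, `run_pipeline`, `iterate`, `attempt_fix` and `goal_met` are hypothetical names for illustration, not part of any real harness; `attempt_fix` stands in for whatever invokes the agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the stage passes


def run_pipeline(stages: List[Stage]) -> Tuple[bool, Optional[str]]:
    """Run stages in order and stop at the first failure."""
    for stage in stages:
        if not stage.run():
            return False, stage.name
    return True, None


def iterate(
    stages: List[Stage],
    attempt_fix: Callable[[Optional[str]], None],
    goal_met: Callable[[], bool],
    max_rounds: int = 5,
) -> str:
    """Loop until every stage passes and the goal is met, or the budget runs out."""
    for round_no in range(1, max_rounds + 1):
        ok, failed = run_pipeline(stages)
        if ok and goal_met():
            return f"done after {round_no} round(s)"
        # `failed` is None when all stages pass but the goal is still unmet.
        attempt_fix(failed)
    return "gave up: round budget exhausted"
```

The `max_rounds` cap is the part that keeps token spend bounded: the loop reports back instead of retrying forever.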
"The goal to make longer unattended sessions safe enough to be useful without fully removing the human from the loop. It should also reduce the number of low-quality PRs your teammates have to review for details the agent should have caught itself."
>safe enough to be useful without fully removing the human from the loop
This is the fundamental concept for AI usage, assistance and adoption in every field, not only code generation.
Essentially AI, including LLMs, ML and DL, is just a tool like any other automation tool, operating on the principle of an expert in the loop as the safety and quality gatekeeper for sensible and responsible decision making [1].
[1] Domain expertise has always been the real moat (brethorsting.com) (519 comments):
with enough scaffolding around self-reflectivity and metrics, it will converge.
Do you have infinite money?
It also presents tradeoffs in compute budget. Cycles spent executing large arrays of tests could mean fewer tokens spent debugging.
Ime successful creative execution looks like micro-iterations where each output informs the next creative move.
I can build something incredibly fast from essentially caveman grunt instructions through an LLM harness, iterating as I go.
Optimizing for feeding a huge plan to an agent sounds to me like a net waste of time. And looking over the shoulder of industry peers trying to do this, I don’t see their output or throughput showing any remarkable improvement over what I can produce with minimal fanfare.
Maybe I've chosen hard mode by learning C with LLM assistance, plus my pet project turned out to be a bit less trivial than anticipated. But I know that I have to think three times about my choices for dealing with C problems, and seeing how an LLM struggles to give reasonable answers is a huge red flag and forces me to think about it a fourth time.
Doing all this with a fast autonomous workflow and only a little user guidance is asking for trouble.
I suspect that the “right” way to use LMs in coding, including accounting for focus, control, and costs, is not a settled debate. We probably haven’t even seen the best ideas yet. But I really dislike the maximalist approach.
Could you speak a little more about how you're approaching this? I was thinking of doing something similar.
So e.g. I may have 1 agent that I ask and iterate on with directly, and 9 agents that work separately on their own.
I will utilize this 1 agent on features I care most about and want to guide and iterate on in as much detail as possible.
it works.
The three main problems are 1) API usage is deadly expensive 2) Claude is about to make all automation very expensive 3) all the flows where a model has the initiative are strictly biased towards unwarranted stops (checkpointing).
Also, I wouldn't call that "backpressure": there is no producer-consumer imbalance or anything similar. From what I can see, the author just proposes a structured feedback loop. That's a discussion about organizational principles for systems which consist of multiple unreliable but very complex components, and this "backpressure" is just one of the aspects. Personally I find the viable system model framework productive as both a mental model and a literal implementation guideline.
A lesser problem is that agent SDKs are bad and building a custom harness is hard.
I think teams need to be able to write nested workflows that transition between code-led and agent-led, with either supporting human-in-the-loop checkpoints.
Been iterating on what this should look like at our startup (https://www.amika.dev/). Model labs are also improving capabilities here, such as Codex's `/goal` and Claude Code's dynamic workflows[1]
The points about API usage cost still stand, but model intelligence is getting cheaper every month! No need to use the frontier model for every part of the work.
/goal is a dynamic workflow itself, from what I know. Dynamic workflows do not hold the initiative (and can't use any libraries or I/O).
Dynamic workflows do not prevent checkpointing.
I don't see the actual point of your startup, it's a cheap idea - like most LLM startups out there.
I don't see how models are getting cheaper - I clearly see the opposite trend.
Can you elaborate on what you think causes such a bias? My experience is that Qwen3.6, Claude Sonnet 4.6 and Opus 4.6/4.7 will work as far as they can given direction and a way to test their work. My so-far limited experience with Opus 4.8 is that it does stop somewhat earlier for feedback, but in places where I am glad it is checking assumptions or where I agree with it identifying a change in scope (for example, where the following work deserves a separate commit or merge request). I would call those justified stops rather than unwarranted.
You can't express orchestration in terms of "backpressure" only, I think.
Implement-Review-Repeat loop does not involve backpressure in the strict meaning of the term.
OP quoted the correct definition right at the start:
> In systems engineering, backpressure is the mechanism by which a downstream component signals upstream that it can't accept more work
(the "downstream component" being the human reviewer in this case)
But the measures they propose don't actually do that. They are more like fixed throttle elements which would slow down the rate of submissions of an agent and weed out some low-quality submissions before hitting "downstream".
I'm missing the connection to the actual capacity (or will) that the human developers have to review the submissions.
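To make the distinction concrete, here is a rough sketch of what backpressure in the strict sense could look like: the review queue has a capacity tied to what reviewers can actually absorb, and a submission is refused when it is full. `ReviewQueue`, `submit` and `review_next` are hypothetical names, and the capacity number is an assumption you would have to set from real reviewer availability.

```python
import queue
from typing import Optional


class ReviewQueue:
    """Bounded queue: the downstream reviewer's capacity limits the upstream agent."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            # queue.Queue treats maxsize <= 0 as unbounded, which defeats the point.
            raise ValueError("capacity must be at least 1")
        self._q: "queue.Queue[str]" = queue.Queue(maxsize=capacity)

    def submit(self, pr_id: str) -> bool:
        """Return False (the backpressure signal) when reviewers are saturated."""
        try:
            self._q.put_nowait(pr_id)
            return True
        except queue.Full:
            return False

    def review_next(self) -> Optional[str]:
        """A reviewer takes the oldest pending item, freeing one slot."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None
```

A fixed throttle would reject on a timer or a lint result; here the only thing that lets the agent submit again is a reviewer actually finishing a review.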
- single-piece flow means not making large batches of things and then sending them all downstream at once, but instead working on one thing at a time so downstream has a chance to reject before too much of the wrong thing is produced.
- autonomation (or jidoka) means giving the machine the ability to detect when something is wrong and not continue at that point.
- poka-yoke is a process that forces results to be conformant by construction.
Any and all of these terms would be better than backpressure in this context.
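As a loose illustration of how these three ideas differ in code (a sketch under my own assumptions, with hypothetical names throughout): `single_piece_flow` handles one item at a time and stops the line at the first defect, which covers single-piece flow and jidoka, while `CommitMessage` is a poka-yoke style type that cannot be constructed in a non-conformant state. The 72-character limit is just an example rule.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class LineStopped(Exception):
    """Raised when an inspection fails, so nothing further is produced."""


def single_piece_flow(
    items: Iterable[T],
    produce: Callable[[T], R],
    inspect: Callable[[R], bool],
) -> List[R]:
    """Produce and inspect one item at a time; halt on the first defect."""
    accepted: List[R] = []
    for item in items:
        result = produce(item)
        if not inspect(result):
            raise LineStopped(
                f"defect at {item!r}; {len(accepted)} item(s) accepted before stopping"
            )
        accepted.append(result)
    return accepted


@dataclass(frozen=True)
class CommitMessage:
    """Conformant by construction: an invalid message cannot exist as a value."""

    text: str

    def __post_init__(self) -> None:
        if not self.text.strip():
            raise ValueError("commit message must not be empty")
        if len(self.text.splitlines()[0]) > 72:
            raise ValueError("subject line must be at most 72 characters")
```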
(This made me realise that lean people have been spending decades dealing with the problems we encounter with the new robots that write code. Half of the lean philosophy is about setting up processes and structures that have positive optionality on people's creativity, without undue requirement on their level of responsibility. That's exactly what we want for robots that write code too. We want to capture the benefits of what they do well, without suffering from their innumerable mistakes. But we can't just chastise them for making mistakes, so we have to think the way lean people do.)
It comes from previous posts I’ve come across, but I haven’t considered exactly what you mentioned. That’s on me.
Oh boy.
It is the responsibility of the person running the coding agent to make sure the resulting PRs are high quality. Putting that on your team mates, or worse, random open source project maintainers on the internet, is the definition of an extractive contribution.
Put another way, who are they supposed to hire to tell these low quality PRs apart from the high quality ones? Who even knows how to do something like that?!
If you put all these checks in your stop hook and your git commit hook, your repo docs can tell your agent that checks will run automatically when it stops work, and it should fix any problems found.
It’s wonderful to reintroduce determinism at the QA end of your process. I find it very calming to know the agent can’t skip or forget to check its work because with hooks the checks are run by the harness.
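A minimal sketch of such a check runner, assuming the checks are ordinary commands that signal failure through a nonzero exit code. `run_checks` and `exit_code` are hypothetical names, and the actual check commands are left to the caller. A git pre-commit hook aborts the commit when it exits nonzero; how a stop hook is wired up depends on the harness, so check its documentation rather than trusting this sketch.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple

Check = Tuple[str, Sequence[str]]  # (human-readable name, argv to execute)


def run_checks(checks: List[Check]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    failures: List[str] = []
    for name, argv in checks:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(name)
            # Surface the output so the agent (or human) sees what to fix.
            print(f"[FAIL] {name}\n{result.stdout}{result.stderr}", file=sys.stderr)
    return failures


def exit_code(checks: List[Check]) -> int:
    """Map the results to a process exit code: 0 means all checks passed."""
    return 1 if run_checks(checks) else 0


# Example wiring (the check list is an assumption; use your own commands):
#   sys.exit(exit_code([("tests", [sys.executable, "-m", "unittest"])]))
```

Running all checks rather than stopping at the first failure gives the agent the full list of problems in one pass.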
It absolutely makes sense to have a system in place that allows the code generated by an LLM to be automatically validated, but there’s no need to resort to a non-deterministic system for this sort of deterministic pass/fail condition.
One thing I've been wondering about is how to reliably protect specific portions of the system from unexpected/unnecessary change (for example, a failing test that Claude decides to comment out or rewrite to get it to pass). My only thought for this was to automatically revert test changes during specific portions of the implementation, but that feels overly rigid and potentially prevents things like refactoring code.
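One less rigid variant, sketched here under assumptions: instead of reverting, flag changes to protected paths as a failed check, and unlock them only in phases where test edits are expected. `protected_changes`, `violations` and the phase names are hypothetical; the list of changed paths could come from something like `git diff --name-only`.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence


def protected_changes(
    changed_paths: Iterable[str], protected_globs: Sequence[str]
) -> List[str]:
    """Return the changed paths that match any protected glob pattern."""
    return sorted(
        path
        for path in changed_paths
        if any(fnmatchcase(path, glob) for glob in protected_globs)
    )


def violations(
    changed_paths: Iterable[str],
    protected_globs: Sequence[str],
    phase: str,
    unlocked_phases: Sequence[str] = ("refactor",),
) -> List[str]:
    """Protected files may only change during explicitly unlocked phases."""
    if phase in unlocked_phases:
        return []
    return protected_changes(changed_paths, protected_globs)
```

Note that `fnmatchcase` lets `*` match across `/`, so a pattern like `tests/*` also covers nested directories.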
- Define the task and the goal, write a short spec document (markdown is fine)
- Point the agent at it in plan mode and have it write the plan to disk with phases. Iterate on its plan if necessary here and now.
- Have each agent tackle a phase and have it update it as a living document (switch models if some phases are more difficult than others)
- Clear and repeat until done
I've never had to overcomplicate this and it's worked both on enterprise-scale projects and personal projects. I am not sure what I'm missing - if anything.
By all means add tons of quality gates to your SDLC pipeline. But thinking about slowness purely for the sake of slowness will not solve your problems.
My gut reaction, as a professional developer, to my (previous) company's AI mandates was an instinctive "wait but..." -- it didn't logic out to me. Now that I have much more AI experience under my belt, I understand the tension, it's a superpower and net-net ok so more features and more "stuff" will be built. But it's a very hard thing to balance. It's always been a bad idea for a company to position themselves as the one with more "stuff" in it.
My agent forces this workflow by disabling modifications outside the coding step.
I added looping to this not too long ago. https://github.com/hsaliak/std_slop/blob/main/docs/mail-loop...
This gives me the best of both worlds, hand curated reviews and automation. I often get the best quality if I do both, with an agent doing a pass first.
Next and Vercel already handle this correctly. It takes special effort to violate "least surprise" here. Cmd-click on a link should open it in a new tab.
It does appear to be an issue with SimpleAnalytics, now Adobe's,
onclick="saAutomatedLink(this, 'outbound'); return false;"
Free debugging of how the site tweaks, and breaks, the 30-year consensus web standard behavior. Good sites, good blogs, *don't override onclick for links.* Or they handle it correctly. I'll leave an issue on the github.
Between your footer, and dotfiles repo, OP does seem to appreciate standards & norms, in principle.
The more guardrails you provide the more it cheats.
AI is like a wild animal that needs to do something, and it takes a fair bit of work to corner it. And only when it's cornered and at the point of giving up, can you then offer it a way out.
If you don't do what I said, I can guarantee it's fooling you somehow.
A pre-commit hook has been wonderful. Sure, you can add instructions, but pre-commit hooks are where you want to put the guards.
Called it rik, and it's on GitHub if anyone's interested in checking it out.
https://github.com/puraxyz/puraxyz/blob/main/docs/paper/main...
as usual, the tool isn't really doing what's listed on its label.
however, people are different, so this might improve someone's capability to deploy LLMs. it might even provide better evidence of where actual brain power is needed.
The main kind of pressure I'm feeling is the pressure of the giant AI, GPU & datacenter companies with their insane capital expenditure and circular deals, trying to get enough people to develop an expensive reliance on their service. And the more expensive, the better, so don't just pay for the LLM to code for you, have another LLM interact with the first LLM and pay double, treble, 5x or whatever. Then you can get the most refined slop.
Fuck, we’re so cooked.
Increased complexity of your systems. Increased pipelines of your system.
You might reduce the likelihood of errors, but at a disproportionate cost in the time it takes to complete (which some might argue is irrelevant, but it costs human context), and with far more time and focus needed for all the bugs the system doesn’t catch.
You’ll have to fix, adapt and maintain all your verification layers, because just setting them up doesn’t make them perfect.
Your testing pipeline becomes incredibly slow and you need to maintain it as well.
It’s tremendously weaker than a hands-on approach.
I wrote this exact same article in January and have since completely switched my position.
Good luck to everyone trying this. You’re shoveling your own grave and wasting time.