Here's why: The slot machine can drop any hard requirement that you specify in your AGENTS.md, memory.md, or your dozens of skill markdowns. Pretty much guaranteed.
These harness approaches pretend that LLMs are strict, perfect rule followers and that the only problem is failing to specify enough rules clearly enough. That's a fundamental misunderstanding of how LLMs operate.
That leaves only one option, not reliable but more reliable nevertheless: human review and oversight. Possibly two rounds of it, one after the other.
Everything else is snake oil. But at that point, you also realize that the promised productivity gains are also snake oil, because reading code and building a mental model is way harder than having a mental model and writing it into code.
> ... you also realize that the promised productivity gains are also snake oil, because reading code and building a mental model is way harder than having a mental model and writing it into code.
Not really, though it depends on the code; reading code is a skill that gets easier with practice, like any other. This is common whenever you're in a situation where you're reading much more code than you're writing (e.g. any time you have to work with a large, sprawling codebase that existed long before you touched it).
What makes it even easier, though, is if you're armed with an existing mental model of the code, gleaned from documentation, from past experience with the code, or from poking your colleagues.
And you can do this with agents too! I usually already have a good mental model of the code before I prompt the AI. It requires decomposing the tasks a bit carefully, but because I have a good idea of what the code should look like, reviewing the generated code is a breeze. It's like reading a book I've read before. Or, much more rarely, there's something wrong and it jumps out at me right away, so I catch most issues early. Either way, the speed-up is significant.
They work for MVPs, mock-ups, prototypes, or in the hands of an expert coder. You can't let them go unsupervised. The promise of automated intelligence falls far short of the reality.
I've seen a disturbing trend where a process that could have been a script, or a requirement that could have been enforced deterministically, is instead "automated" through a set of instructions for an LLM.
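To make the contrast concrete, here is a minimal sketch of what deterministic enforcement looks like. The rule ("no file may import `requests` directly") and the function names are hypothetical, invented for illustration; the point is that a few lines of code enforce the rule every time, instead of a sentence in AGENTS.md that the model may or may not honor.

```python
import re

# Hypothetical rule: no file may import the `requests` library directly.
FORBIDDEN = re.compile(r"^\s*(import|from)\s+requests\b")

def check_file(text: str) -> list[int]:
    """Return the 1-based line numbers that violate the rule."""
    return [i for i, line in enumerate(text.splitlines(), 1)
            if FORBIDDEN.match(line)]

# A violation is caught mechanically, with no probabilistic step involved.
print(check_file("import os\nimport requests\n"))  # -> [2]
```

Wired into a pre-commit hook or CI job, a check like this fails the build deterministically rather than relying on an LLM to remember an instruction.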
Large parts of human civilization rest on our ability to make something unreliable less unreliable through organisational structures and processes.
However, I have been using spec-kit (which is basically this style of AI usage) for the last few months and it has been AMAZING in practice. I am building really great things and have not run into any of the issues you are talking about as hypotheticals. Could they eventually happen? Sure, maybe. I am still cautious.
But once you have personally used it in practice for long enough, you can't just dismiss it as snake oil. I have been a computer programmer for over 30 years, and I feel like I have a good read on what works and what doesn't in practice.
Give it a few more months and I'm sure you'll see some of what I see if not all.
I'm saying all of the above having tried and tested all sorts of systems with AI; that experience is what leads me to say what I said.
Now, part of that is my own advancement as well, as I learn how to specify my instructions to the AI and how to anticipate where the AI might have issues, but advancements are also happening in the models themselves. They are just getting better, and rapidly.
The combination of me getting better at steering the AI and the AI itself getting better is leading me to the opposite of your conclusion. I have production systems that I wrote using spec-kit that have been running in production for months and have been doing spectacularly. I have been able to consistently add the new features I need without losing any cohesion or adherence to the principles I have defined. Now, are there mistakes? Of course, but nothing that can't be caught and fixed, and not at a higher rate than with traditional programming.
I kind of get what you're saying, but let us not pretend that SW engineers are perfect rule followers either.
Having a framework to work within, whether you are an LLM or a human, can be helpful.
The only downside I see is getting out of practice, which is why I don't use it for my passion projects. Work is just work, and pressing 1 or 2 and having 'good enough' can be a fine way to get through the day. (Lucky me, I don't write production code ;D... goals...)
By that time, they will have realized immense value before seeing some of what you see. Sounds like an endorsement of spec-kit.
Indeed. That said, I’ve had some success with agent skills, but I use them to make the LLM aware of things it can do using specific external tools. I think it is a really bad idea to use this mechanism to enforce safety rules. We need good sandboxing for this, and promises from a model prone to getting off the rails are not a good substitute.
But I have taught my coding agent to use some ad hoc tools to gather statistics from a directory containing experimental data, and things like that. Nobody is going to fine-tune an LLM specifically for my field (condensed matter physics), but using skills I can still make it do useful work. Like monitoring simulations where some runs can fail for various reasons, and each time we must choose whether to run another iteration or restart from a previous point, based on eyeballing the results ("the energy is very strange, we should restart properly and flag for review if it is still weird", this sort of thing). I don’t give too many rules to the agent, I just give it ways of solving specific problems that may arise.
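The triage described above can largely be expressed as a small deterministic helper that the agent (or a plain cron job) calls, rather than a prose rule the model has to remember. This is only a sketch under assumed conventions: the function name, the thresholds, and the idea of comparing the last energy to an expected value are all hypothetical, not taken from any real monitoring setup.

```python
def triage_run(energies: list[float], expected: float, tol: float = 0.05) -> str:
    """Decide what to do with a simulation run from its energy trace.

    Returns 'continue', 'restart', or 'flag' (flag = ask a human).
    Thresholds are illustrative placeholders.
    """
    if not energies:
        return "restart"                 # run died before producing output
    drift = abs(energies[-1] - expected) / abs(expected)
    if drift <= tol:
        return "continue"                # energy looks normal, keep iterating
    if drift <= 5 * tol:
        return "restart"                 # suspicious: retry from a checkpoint
    return "flag"                        # "very strange": flag for review

print(triage_run([-1.02, -1.01, -1.00], expected=-1.0))  # -> continue
```

The agent then only has to report which branch fired, instead of being trusted to eyeball the numbers consistently.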
I hope to see harnesses that will demand instead of ask. Kill an agent that was asked to be in plan mode but did not play the prescribed planning game. Even if it's not perfect, it'd have to be better than the current regime when combined with a human in the loop.
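"Demand instead of ask" can be sketched as a harness that inspects each proposed action and kills the session on the first one plan mode forbids, instead of trusting the agent's promise to behave. The `Action` type, the action kinds, and the allow-list here are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "read", "plan", "edit", "shell" (hypothetical)
    detail: str = ""

# In plan mode, only inspection and planning actions are permitted.
PLAN_MODE_ALLOWED = {"read", "plan"}

def run_plan_phase(actions: list[Action]) -> tuple[bool, list[Action]]:
    """Replay agent actions; abort on the first one plan mode forbids.

    Returns (completed_cleanly, actions_that_were_allowed_through).
    """
    accepted: list[Action] = []
    for act in actions:
        if act.kind not in PLAN_MODE_ALLOWED:
            return False, accepted       # kill the session, discard the rest
        accepted.append(act)
    return True, accepted

ok, kept = run_plan_phase([Action("read"), Action("edit", "main.py")])
print(ok, len(kept))  # -> False 1
```

The enforcement is mechanical: the agent never gets the chance to "promise" anything, because the disallowed edit simply never executes.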
I do find that asking the same agent to do and then check its own work is not particularly reliable.
A slot machine gives you rewards when the stars align; snake oil never does :)
I am not, however, going to share any of this with my work colleagues and make myself redundant.
Couldn't non-manual oversight, e.g. sandboxes, also help?