Here's why: The slot machine can drop any hard requirement that you specify in your AGENTS.md, memory.md, or your dozens of skill markdowns. Pretty much guaranteed.
These harness approaches pretend that LLMs are strict, perfect rule followers and that the only problem is failing to specify enough rules clearly enough. That's a fundamental misunderstanding of how LLMs operate.
That leaves only one option, not reliable but more reliable nevertheless: human review and oversight. Possibly two rounds of it, one after the other.
Everything else is snake oil. But at that point, you also realize that the promised productivity gains are also snake oil, because reading code and building a mental model is way harder than having a mental model and writing it into code.
> ... you also realize that the promised productivity gains are also snake oil, because reading code and building a mental model is way harder than having a mental model and writing it into code.
Not really, though it depends on the code; reading code is a skill that gets easier with practice, like any other. This is common whenever you're in a situation where you're reading much more code than you're writing (e.g. any time you have to work with a large, sprawling codebase that existed long before you touched it).
What makes it even easier, though, is if you're armed with an existing mental model of the code, gleaned from documentation, from past experience with the code, or from poking your colleagues.
And you can do this with agents too! I usually already have a good mental model of the code before I prompt the AI. It requires decomposing the tasks a bit carefully, but because I have a good idea of what the code should look like, reviewing the generated code is a breeze. It's like reading a book I've read before. Or, much more rarely, there's something wrong and it jumps out at me right away, so I catch most issues early. Either way, the speed-up is significant.
They work for MVPs, mock-ups, prototypes, or in the hands of an expert coder. You can't let them go unsupervised. The promise of automated intelligence falls far short of the reality.
I've seen a disturbing trend where a process that could have been a script, or a requirement that could have been enforced deterministically, is instead "automated" through a set of instructions for an LLM.
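To make the contrast concrete, here is a minimal sketch of what deterministic enforcement looks like. The rule ("no file may import `requests` directly") and the function names are hypothetical, invented for illustration; the point is that a few lines of code enforce the rule every time, instead of a sentence in AGENTS.md that the model may or may not honor.

```python
import re

# Hypothetical rule: no file may import the `requests` library directly.
FORBIDDEN = re.compile(r"^\s*(import|from)\s+requests\b")

def check_file(text: str) -> list[int]:
    """Return the 1-based line numbers that violate the rule."""
    return [i for i, line in enumerate(text.splitlines(), 1)
            if FORBIDDEN.match(line)]

# A violation is caught mechanically, with no probabilistic step involved.
print(check_file("import os\nimport requests\n"))  # -> [2]
```

Wired into a pre-commit hook or CI job, a check like this fails the build deterministically rather than relying on an LLM to remember an instruction.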
Large parts of human civilization rest on our ability to make something unreliable less unreliable through organisational structures and processes.
However, I have been using spec-kit (which is basically this style of AI usage) for the last few months and it has been AMAZING in practice. I am building really great things and have not run into any of the issues you are talking about as hypotheticals. Could they eventually happen? Sure, maybe. I am still cautious.
But once you have personally used it in practice for long enough, you can't just dismiss it as snake oil. I have been a computer programmer for over 30 years, and I feel like I have a good read on what works and what doesn't in practice.
Give it a few more months and I'm sure you'll see some of what I see if not all.
I'm saying all of the above having tried and tested all sorts of systems with AI; that experience is what leads me to say what I said.
Now, part of that is my own advancement as well, as I learn how to specify my instructions to the AI and how to anticipate where the AI might have issues, but advancements are also happening in the models themselves. They are just getting better, and rapidly.
The combination of me getting better at steering the AI and the AI itself getting better is leading me to the opposite of your conclusion. I have production systems that I wrote using spec-kit that have been running in production for months and have been doing spectacularly. I have been able to consistently add the new features I need without losing any cohesion or adherence to the principles I have defined. Now, are there mistakes? Of course, but nothing that can't be caught and fixed, and not at a higher rate than with traditional programming.
I kind of get what you're saying, but let us not pretend that SW engineers are perfect rule followers either.
Having a framework to work within, whether you are an LLM or a human, can be helpful.
The only downside I see is getting out of practice, which is why I don't use it for my passion projects. Work is just work, and pressing 1 or 2 and having 'good enough' can be a fine way to get through the day. (Lucky me, I don't write production code ;D... goals...)
By that time, they will have realized immense value before seeing some of what you see. Sounds like an endorsement of spec-kit.
Indeed. That said, I’ve had some success with agent skills, but I use them to make the LLM aware of things it can do using specific external tools. I think it is a really bad idea to use this mechanism to enforce safety rules. We need good sandboxing for this, and promises from a model prone to getting off the rails are not a good substitute.
But I have taught my coding agent to use some ad hoc tools to gather statistics from a directory containing experimental data, and things like that. Nobody is going to fine-tune an LLM specifically for my field (condensed matter physics), but using skills I can still make it do useful work. Like monitoring simulations where some runs can fail for various reasons, and each time we must choose whether to run another iteration or restart from a previous point, based on eyeballing the results ("the energy is very strange, we should restart properly and flag for review if it is still weird", this sort of thing). I don’t give too many rules to the agent, I just give it ways of solving specific problems that may arise.
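The triage described above can largely be expressed as a small deterministic helper that the agent (or a plain cron job) calls, rather than a prose rule the model has to remember. This is only a sketch under assumed conventions: the function name, the thresholds, and the idea of comparing the last energy to an expected value are all hypothetical, not taken from any real monitoring setup.

```python
def triage_run(energies: list[float], expected: float, tol: float = 0.05) -> str:
    """Decide what to do with a simulation run from its energy trace.

    Returns 'continue', 'restart', or 'flag' (flag = ask a human).
    Thresholds are illustrative placeholders.
    """
    if not energies:
        return "restart"                 # run died before producing output
    drift = abs(energies[-1] - expected) / abs(expected)
    if drift <= tol:
        return "continue"                # energy looks normal, keep iterating
    if drift <= 5 * tol:
        return "restart"                 # suspicious: retry from a checkpoint
    return "flag"                        # "very strange": flag for review

print(triage_run([-1.02, -1.01, -1.00], expected=-1.0))  # -> continue
```

The agent then only has to report which branch fired, instead of being trusted to eyeball the numbers consistently.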
I hope to see harnesses that will demand instead of ask. Kill an agent that was asked to be in plan mode but did not play the prescribed planning game. Even if it's not perfect, it'd have to be better than the current regime when combined with a human in the loop.
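"Demand instead of ask" can be sketched as a harness that inspects each proposed action and kills the session on the first one plan mode forbids, instead of trusting the agent's promise to behave. The `Action` type, the action kinds, and the allow-list here are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "read", "plan", "edit", "shell" (hypothetical)
    detail: str = ""

# In plan mode, only inspection and planning actions are permitted.
PLAN_MODE_ALLOWED = {"read", "plan"}

def run_plan_phase(actions: list[Action]) -> tuple[bool, list[Action]]:
    """Replay agent actions; abort on the first one plan mode forbids.

    Returns (completed_cleanly, actions_that_were_allowed_through).
    """
    accepted: list[Action] = []
    for act in actions:
        if act.kind not in PLAN_MODE_ALLOWED:
            return False, accepted       # kill the session, discard the rest
        accepted.append(act)
    return True, accepted

ok, kept = run_plan_phase([Action("read"), Action("edit", "main.py")])
print(ok, len(kept))  # -> False 1
```

The enforcement is mechanical: the agent never gets the chance to "promise" anything, because the disallowed edit simply never executes.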
I do find that asking the same agent to do and then check its own work is not particularly reliable.
A slot machine gives you rewards when the stars align; snake oil never does :)
I am not, however, going to share any of this with my work colleagues and make myself redundant.
Couldn't non-manual oversight, e.g. sandboxes, also help?