It's unsurprising that round-tripping long content through an LLM results in corruption. Frequent LLM users already know not to do that.
They claim that tool use didn't help, which surprised me... but they also said:
> To test this, we implemented a basic agentic harness (Yao et al., 2022) with file reading, writing, and code execution tools (Appendix M). We note this is not an optimized state-of-the-art agent system; future work could explore more sophisticated harnesses.
And yeah, their basic harness consists of read_file() and write_file() - that's just round-tripping with an extra step!
The modern coding agent harnesses put a LOT of work into the design of their tools for editing files. My favorite current example of that is the Claude edit suite described here: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
The str_replace and insert commands are essential for avoiding risky round-trip edits of the whole file.
They do at least provide a run_python() tool, so it's possible the better models figured out how to run string replacement using that. I'd like to see their system prompt and if it encouraged Python-based manipulation over reading and then writing the file.
Update: found that harness code here https://github.com/microsoft/delegate52/blob/main/model_agen...
The relevant prompt fragment is:
> You can approach the task in whatever way you find most effective: programmatically or directly by writing files
As with so many papers like this, the results reflect the design of the authors' harness more than the capabilities of the models themselves. I'm confident an experienced AI engineer / prompt engineer / pick your preferred title could get better results on this test by iterating on the harness itself.