It's unsurprising that round-tripping long content through an LLM results in corruption. Frequent LLM users already know not to do that.
They claim that tool use didn't help, which surprised me... but they also said:
> To test this, we implemented a basic agentic harness (Yao et al., 2022) with file reading, writing, and code execution tools (Appendix M). We note this is not an optimized state-of-the-art agent system; future work could explore more sophisticated harnesses.
And yeah, their basic harness consists of read_file() and write_file() - that's just round-tripping with an extra step!
The modern coding agent harnesses put a LOT of work into the design of their tools for editing files. My favorite current example of that is the Claude edit suite described here: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
The str_replace and insert commands are essential for avoiding risky round-trip edits of the whole file.
They do at least provide a run_python() tool, so it's possible the better models figured out how to run string replacement using that. I'd like to see their system prompt and if it encouraged Python-based manipulation over reading and then writing the file.
Update: found that harness code here https://github.com/microsoft/delegate52/blob/main/model_agen...
The relevant prompt fragment is:
> You can approach the task in whatever way you find most effective: programmatically or directly by writing files
As with so many papers like this, the results reflect the design of the authors' harness more than the capabilities of the models themselves. I'm confident an experienced AI engineer / prompt engineer / pick your preferred title could get better results on this test by iterating on the harness itself.