@simonw explains how hilariously misguided that paper is in one of the top comments, and how it doesn't apply remotely to a real agent harness. Plus it's not even clearly relevant here, because the model isn't trying to regurgitate the original document, but generate a new one, and there are guardrails to put it back on track in the form or a compiler and tests. Also, the test suite is very thorough, and pre-existing, and the vast majority passes already. This is skepticism for the sake of it.