a small harness that stores text files and manages context could be useful. otherwise you lose all ability to measure that skill (and that's important because it represents real-world use cases on large codebases)
ARC-AGI isn't testing a model's ability to store files and code things. it's testing its ability to reason through puzzles given the same information as a human
if you tested my ability to reason and you gave me some challenging problems that involved arithmetic, it might be a better test if you gave me a scratch pad so I don't mess up the reasoning parts by failing arithmetic.