> There is no evidence to support their claims
Do you make a habit of not presuming even basic competence? You believe that Anthropic left the task running for hours, got a score back, and never bothered to examine the solution? Not even out of curiosity?
Also, if it were cheating, you'd expect the final score to be unbelievably low, unless you also suppose that the LLM actively attempted to deceive the human reviewers by adding extra code to burn (approximately the correct number of) cycles.