Claude 3.5 Sonnet (latest) is barely able to stay coherent on 500 LOC files, and easily gets tripped up when there are several files in the same directory.
I have tried the same with o1-preview, 4o, and Gemini Pro...
If Google is using an LLM with a 5M-token context window and 100k+ token output, trained on all the code that is not public... then I can believe this claim.
This just goes to show how critical an issue it is that these models are behind closed doors.