Pre-release, I would have expected Gemini 3.1 Pro to get ahead of Opus 4.6, with GPT-5.4 and Grok 4.20 trailing. Guess I shouldn't have bet against Anthropic.
Not like it's a big lead as of yet. I expect to see more action within the next few months, as people tune the harnesses and better models roll in.
This is far more of a "VLA" task than it is an "LLM" task at its core, but I guess ARC-AGI-3 is making an argument that human intelligence is VLA-shaped.