Or the test wasn't testing anything meaningful, which IMO is what happened here. I think ARC was basically looking at the distribution of what AI is capable of, picked an area that it was bad at and no one had cared enough to go solve, and put together a benchmark. And then we got good at it because someone cared and we had a measurement. Which is essentially the goal of ARC.
But I don't much agree that it is any meaningful step towards AGI. Maybe it's a nice proofpoint that that AI can solve simple problems presented in intentionally opaque ways.