>> The first time a top lab spent millions trying to beat ARC was actually in 2021, and the effort failed.
Which top lab was that? What did they try?
>> ARC was the only benchmark that highlighted o3 as having qualitatively different abilities compared to all models that came before.
Unfortunately, observations support a simpler hypothesis: o3 was trained on enough data about ARC-1 that it could solve it well. There is currently insufficient data on ARC-2, so o3 can't solve that one. No super magical and mysterious abilities, qualitatively different from all models that came before, required whatsoever.
Indeed, that is a common pattern in machine learning research: newer models perform better on benchmarks than earlier models not because their capabilities genuinely increased, but because they were trained on more data with more compute. They're just bigger, slower, more expensive, and just as dumb as their predecessors.
That's 90% of deep learning research in a nutshell.