I was impressed until I read the caveat about the high-compute version using 172x more compute.
Assuming for a moment that the cost per task has a linear relationship with compute, then it costs a little more than $1 million to get that score on the public eval.
The results are cool, but man, this sounds like such a busted approach.
So what? I’m serious. Our current level of progress would have been sci-fi fantasy with the computers we had in 2000. The cost may be astronomical today, but we have proven a method to achieve human performance on tests of reasoning over novel problems. WOW. Who cares what it costs. In 25 years it will run on your phone.
So your claim for optimism here is that something that today took ~10^22 floating point operations (based on an estimate earlier in the thread) to execute will be running on phones in 25 years? Phones currently run at O(10^12) flops, so that means ten orders of magnitude of improvement for it to run in a reasonable amount of time. That's a scale-up similar to going from ENIAC (500 flops) to a modern desktop (5-10 teraflops).
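For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch in Python. Every number in it is one of the estimates quoted in this comment, not a measurement, and the one-second target for "a reasonable amount of time" is my own assumption.

```python
import math

# Rough figures quoted in the comment; all are estimates, not measurements.
TASK_FLOP = 1e22       # total floating point operations (thread estimate)
PHONE_FLOPS = 1e12     # order-of-magnitude phone throughput, operations per second
ENIAC_FLOPS = 500      # ENIAC figure quoted in the comment
DESKTOP_FLOPS = 5e12   # low end of the 5-10 teraflops desktop figure

# Assumption: "a reasonable amount of time" means about one second.
TARGET_SECONDS = 1.0

SECONDS_PER_YEAR = 365.25 * 24 * 3600

# How long the workload would take on today's phone at the quoted throughput.
seconds_on_phone = TASK_FLOP / PHONE_FLOPS
years_on_phone = seconds_on_phone / SECONDS_PER_YEAR

# Speed-up needed to hit the target, expressed in orders of magnitude.
phone_gap = math.log10(seconds_on_phone / TARGET_SECONDS)

# The historical comparison: ENIAC to a modern desktop.
eniac_gap = math.log10(DESKTOP_FLOPS / ENIAC_FLOPS)

print(f"time on a phone today: about {years_on_phone:.0f} years")
print(f"speed-up needed: about {phone_gap:.0f} orders of magnitude")
print(f"ENIAC to desktop: about {eniac_gap:.0f} orders of magnitude")
```

With these inputs the run would take roughly 317 years on a current phone, and both gaps come out to about ten orders of magnitude. Relaxing the target to an hour instead of a second only shaves off three to four of those.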
It's not so much the cost as the fact that they got a slightly better result by throwing 172x more compute at each task. The fact that it may have cost somewhere north of $1 million simply helps to give a better idea of how absurd the approach is.
It feels a lot less like a breakthrough when the solution looks so much like simple brute-forcing.
But you might be right, who cares? Does it really matter how crude the solution is if we can achieve true AGI and bring the cost down by increasing the efficiency of compute?