Like suppose there were only two tasks, each with a baseline of 100 steps to solve. You come along and solve one in only 50 steps, and the other in 200 steps. You might hope that since you solved one twice as fast as the baseline, but the other twice as slowly, those would balance out and you'd get full credit. Instead, your scores are 1.0 for the first task and 0.25 (the scoring is quadratic in the speed ratio) for the second task, so your total benchmark score is a mere 0.625.
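A minimal sketch of that arithmetic. The capped, squared ratio here is my reading of the example's numbers, not necessarily the exact ARC-AGI formula:

```python
def task_score(baseline_steps: int, solved_steps: int) -> float:
    # Hypothetical rule matching the example: credit is the ratio of
    # baseline steps to actual steps, capped at 1.0 (no extra credit
    # for beating the baseline), then squared (the quadratic penalty).
    ratio = min(1.0, baseline_steps / solved_steps)
    return ratio ** 2

scores = [task_score(100, 50), task_score(100, 200)]
benchmark = sum(scores) / len(scores)
# scores -> [1.0, 0.25], benchmark -> 0.625
```

Note the asymmetry: the cap means fast solves can't earn more than 1.0, while slow solves are penalized quadratically, so a fast/slow pair averages below full credit.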
Really? This happens plenty with human testing. Does that mean humans aren't general?
The score is convoluted and messy. If the same score can mean materially different things about capability, then that's a bad scoring methodology.
I can't believe I have to spell this out, but it seems critical thinking goes out the window when we start talking about machine capabilities.
Apparently someone here doesn't know how outliers affect a mean. Or, for that matter, have any clue about the purpose of the ARC-AGI benchmark.
For anyone who is interested in critical thinking, this paper describes the original motivation behind the ARC benchmarks: