Like suppose there were only two tasks, each with a baseline of 100 steps to solve. You come along and solve one in only 50 steps, and the other in 200 steps. You might hope that since you solved one twice as fast as the baseline, but the other twice as slowly, those would balance out and you'd get full credit. Instead, your scores are 1.0 for the first task and 0.25 (the scoring is quadratic in the speed ratio) for the second task, so your total benchmark score is a mere 0.625.
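A minimal sketch of that arithmetic. The capped, squared ratio here is my reading of the example's numbers, not necessarily the exact ARC-AGI formula:

```python
def task_score(baseline_steps: int, solved_steps: int) -> float:
    # Hypothetical rule matching the example: credit is the ratio of
    # baseline steps to actual steps, capped at 1.0 (no extra credit
    # for beating the baseline), then squared (the quadratic penalty).
    ratio = min(1.0, baseline_steps / solved_steps)
    return ratio ** 2

scores = [task_score(100, 50), task_score(100, 200)]
benchmark = sum(scores) / len(scores)
# scores -> [1.0, 0.25], benchmark -> 0.625
```

Note the asymmetry: the cap means fast solves can't earn more than 1.0, while slow solves are penalized quadratically, so a fast/slow pair averages below full credit.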
Really? This happens plenty with human testing. Does that mean humans aren't general?
The score is convoluted and messy. If the same score can mean materially different things about capability, then that's a bad scoring methodology.
I can't believe I have to spell this out, but it seems critical thinking goes out the window when we start talking about machine capabilities.
Apparently someone here doesn't know how outliers affect a mean. Or, for that matter, have any clue about the purpose of the ARC-AGI benchmark.
For anyone who is interested in critical thinking, this paper describes the original motivation behind the ARC benchmarks: