0 points · famouswaffles · 3mo ago · 0 comments

>"Making up for" a poor score on one test with an excellent score on another would be the opposite of generality.

Really? This happens plenty with human testing. Humans aren't general?

The score is convoluted and messy. If the same score can say materially different things about capability then that's a bad scoring methodology.

I can't believe I have to spell this out, but it seems critical thinking goes out the window when we start talking about machine capabilities.

3 comments · 1 top-level

daveguy · 3mo ago · 2 in thread

Just because humans are usually tested in a particular way that allows them to make up for a lack of generality with an outstanding performance in their specialization doesn't mean that is a good way to test generalization itself.

Apparently someone here doesn't know how outliers affect a mean. Or, for that matter, have any clue about the purpose of the ARC-AGI benchmark.

For anyone who is interested in critical thinking, this paper describes the original motivation behind the ARC benchmarks:

https://arxiv.org/abs/1911.01547

famouswaffles (OP) · 3mo ago

>Apparently someone here doesn't know how outliers affect a mean.

If the concern is that easy questions distort the mean, then the obvious fix is to reduce the proportion of easy questions, not to invent a convoluted scoring method to compensate for them after the fact. Standardized testing has dealt with this issue for a long time, and there’s a reason most systems do not handle it the way ARC-AGI 3 does. Francois is not smarter than all those people, and certainly neither are you.

This shouldn't be hard to understand.

daveguy · 3mo ago

How do you define "easy question" for a potential alien intelligence? The solution, like most solutions when dealing with outliers, in my opinion, is to minimize the impact of outliers.
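
To make the point about outliers and means concrete, here is a rough sketch with made-up numbers. The scores, the cap value, and the `capped_mean` helper are all hypothetical and are not the ARC-AGI scoring method; they only show how one outstanding result can pull up a plain average and how a median or a capped mean dampens it.

```python
from statistics import mean, median

# Hypothetical per-task scores in [0, 1]; illustrative only, not benchmark data.
# Four weak results and one outstanding result (the outlier).
scores = [0.1, 0.1, 0.1, 0.1, 1.0]


def capped_mean(values, cap):
    """Mean after clipping every value to at most `cap` (hypothetical helper)."""
    return mean(min(v, cap) for v in values)


plain = mean(scores)                    # pulled up by the single 1.0
mid = median(scores)                    # ignores how large the outlier is
capped = capped_mean(scores, cap=0.5)   # limits the outlier's contribution

print(f"plain mean:  {plain:.2f}")
print(f"median:      {mid:.2f}")
print(f"capped mean: {capped:.2f}")
```

Which of these is the right aggregate is exactly what the thread disagrees about; the sketch only shows that they can rank the same set of results quite differently.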
