
YeGoblynQueenne 1y ago

>> The reason these tasks require fluid intelligence is because they were designed this way -- with task uniqueness/novelty as the primary goal.

That's in no way different than claiming that LLMs understand language, or reason, etc, because they were designed that way.

Neural nets of all sorts have been beating benchmarks since forever, e.g. there's a ton of language understanding benchmarks pretty much all saturated by now (GLUE, SUPERGLUE, ULTRASUPERAWESOMEGLUE ... OK I made that last one up) but passing them means nothing about the ability of neural net-based systems to understand language, regardless of how much their authors designed them to test language understanding.

Failing a benchmark also doesn't mean anything. A few years ago, at the first Kaggle competition, the entries were ad-hoc and amateurish. The first time a well-resourced team tried ARC (OpenAI) they ran roughshod over it and now you have to make a new one.

At some point you have to face the music: ARC is just another benchmark, destined to be beaten in good time whenever anyone makes a concentrated effort at it, and still proving nothing about intelligence, natural or artificial.


szvsw 1y ago

I mostly agree with what you are saying but…

> passing them means nothing about the ability of neural net-based systems to understand language, regardless of how much their authors designed them to test language understanding.

Does this implicitly suggest that it is impossible to quantitatively assess a system’s ability to understand language? (Using the term “system” in the broadest possible sense)

Not agreeing or disagreeing or asking with skepticism. Genuinely asking what your position is here, since it seems like your comment eventually leads to the conclusion that it is unknowable whether a system external to yourself understands language, or, if it is possible, then only in a purely qualitative way, or perhaps purely in a Stewart-style-pornographic-threshold-test - you’ll know it when you see it.

I don’t have any problem if that’s your position; it might even be mine! I’m more or less of the mindset that debating whether artificial systems can have certain labels attached to them revolving around words like “understanding,” “cognition,” “sentience” etc is generally unhelpful, and it’s much more interesting to just talk about what the actual practical capabilities and functionalities of such systems are on the one hand in a very concrete, observable, hopefully quantitative sense, and how it feels to interact with them in a purely qualitative sense on the other hand. Benchmarks can be useful in the former but not the latter.

Just curious where you fall. How would you recommend we approach the desire to understand whether such systems can “understand language” or “solve problems” etc etc… or are these questions useless in your view? Or only useful inasmuch as they (the benchmarks/tests etc) drive the development of new methodologies/innovations/measurable capabilities, but not in assigning qualitative properties to said systems?

YeGoblynQueenne OP 1y ago

>> Does this implicitly suggest that it is impossible to quantitatively assess a system’s ability to understand language? (Using the term “system” in the broadest possible sense)

I don't know and I don't have an opinion. I know that tests that claimed to measure language understanding, historically, haven't. There's some literature on the subject if you're curious (sounds like you are). I'd start here:

Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data

Emily M. Bender, Alexander Koller

https://aclanthology.org/2020.acl-main.463/

Quoting the passage that I tend to remember:

>> While large neural LMs may well end up being important components of an eventual full-scale solution to human-analogous NLU, they are not nearly-there solutions to this grand challenge. We argue in this paper that genuine progress in our field — climbing the right hill, not just the hill on whose slope we currently sit —depends on maintaining clarity around big picture notions such as meaning and understanding in task design and reporting of experimental results.

fchollet 1y ago

The first time a top lab spent millions trying to beat ARC was actually in 2021, and the effort failed.

By the time OpenAI attempted ARC in 2024, a colossal amount of resources had already been expended trying to beat the benchmark. The OpenAI run itself cost several million in inference compute alone.

ARC was the only benchmark that highlighted o3 as having qualitatively different abilities compared to all models that came before. o3 is a case of a good approach meeting an appropriate benchmark, rather than an effort to beat ARC specifically.

YeGoblynQueenne OP 1y ago

>> The first time a top lab spent millions trying to beat ARC was actually in 2021, and the effort failed.

Which top lab was that? What did they try?

>> ARC was the only benchmark that highlighted o3 as having qualitatively different abilities compared to all models that came before.

Unfortunately, observations support a simpler hypothesis: o3 was trained on sufficient data about ARC-1 that it could solve it well. There is currently insufficient data on ARC-II to solve it, therefore o3 can't solve it. No super magickal and mysterious qualitatively different abilities to all models that came before required whatsoever.

Indeed, that is a common pattern in machine learning research: newer models perform better on benchmarks than earlier models not because their capabilities increase with respect to earlier models but because they're bigger models, trained on more data and more compute. They're just bigger, slower, more expensive, and just as dumb as their predecessors.

That's 90% of deep learning research in a nutshell.

bubblyworld 1y ago

I'm sorry, but what observations support that hypothesis? There were scores of teams trying exactly that - training LLMs directly on ARC-AGI data - and by and large they achieved mediocre results. It just isn't an approach that works for this problem set.

To be honest your argument sounds like an attempt to motivate a predetermined conclusion.
