undefined | Better HN

0 pointstymonPartyLate1y ago0 comments

Isn’t this like a brute force approach? Given it costs $ 3000 per task, thats like 600 GPU hours (h100 at Azure) In that amount of time the model can generate millions of chains of thoughts and then spend hours reviewing them or even testing them out one by one. Kind of like trying until something sticks and that happens to solve 80% of ARC. I feel like reasoning works differently in my brain. ;)

0 comments

tikkun1y ago

They're only allowed 2-3 guesses per problem. So even though yes it generates many candidates, it can't validate them - it doesn't have tool use or a verifier, it submits the best 2-3 guesses. https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50...

TrapLord_Rhodo1y ago

Chain of thought can entirely self validate. The OP is saying the LLM is acting like a photon, evaluate all possible solutions and choosing the most "Right" path. not quoting the OP here but my initial thought is that is does seem quite wasteful.

the LLM only gets two guesses at the "end solutions". The whole chain of thought is breaking out the context, and levels of abstraction. How many "Guesses" is it self generating and internally validating, well that's all just based on compute power and time.

My counter point to OP here would be is that is exactly how our brain works. In every given scenario, we are also evaluating all possible solutions. Our entire stack is constantly listening and eithier staying silent, or contributing to an action potential (eithier excitatory, or inhibitory). but our brain is always "Evaluating all potential possibilities" at any given moment. We have a society of mind always contributing their opinion, but the ones who don't have as much support essentially get "Shouted down".

thomasahle1y ago

> How many "Guesses" is it self generating and internally validating

That's completely fair game. That's just search.

nmca1y ago

It is allowed exactly two guesses, per the ARC rules.

trescenzi1y ago

How many guesses is the human comparison based on? I’d hope two as well but haven’t seen this anywhere so now I’m curious.

nmca1y ago

The real turker studies, resulting in the ~70% number, are scored correctly I believe. Higher numbers are just speculated human performance as far as I’m aware.

macrolime1y ago

The trick with AlphaGo was brute force combined with learning to extract strategies from brute force using reinforcement learning, that's what we'll see here. So maybe it costs a million dollars in compute to get a high score, but use reinforcement learning ala alphazero to learn from the process and it won't cost a million dollars next time and let it do lots of hard benchmarks, math problems and coding tasks and it'll keep getting better and better.

nextworddev1y ago

The best interpretation of this result is probably that it showed tackling some arbitrary benchmark is something you can throw money at, aka it’s just something money can solve.

Its not agi obviously in the sense that you still need to some problem framing and initialization to kickstart the reasoning path simulations

torginus1y ago

this might be quite an important point - if they created an algorithm that can mimic human reasoning, but scales terribly with problem complexity (in terms of big O notation), it's still a very significant result, but it's not a 'humans brains are over' moment quite yet.

strangescript1y ago

"We have created artificial super intelligence, it has solved physics!"

"Well, yeah, but its kind of expensive" -- this guy

tymonPartyLateOP1y ago

Haha. Hopefully you’re right and solving the ARC puzzle translates to solving all of physics. I just remain skeptical about the OpenAI hype. They have a track record of exaggerating the significance of their releases and their impact on humanity.

jeremyjh1y ago

Please do show me a novel result in physics from any LLM. You think "this guy" is stupid because he doesn't extrapolate from this $2MM test that nearly reproduces the work of a STEM graduate to a super intelligence that has already solved physics. Maybe you've got it backwards.

strangescript1y ago

And two years ago this result would have been thought impossible.

Picks up goalpost, looks for stadium exit

freehorse1y ago

The problem is not that it is expensive, but that, most likely, it is not superintelligence. Superintelligence is not exploring the problem space semi-blindly, if the thounsands $$$ per task are actually spent for that. There is a reason the actual ARC-AGI prize requires efficiency, because the point is not "passing the test" but solving the framing problem of intelligence.

j / k navigate · click thread line to collapse

0 comments

tikkun1y ago

TrapLord_Rhodo1y ago

thomasahle1y ago

> How many "Guesses" is it self generating and internally validating

That's completely fair game. That's just search.

nmca1y ago

It is allowed exactly two guesses, per the ARC rules.

trescenzi1y ago

How many guesses is the human comparison based on? I’d hope two as well but haven’t seen this anywhere so now I’m curious.

nmca1y ago

The real turker studies, resulting in the ~70% number, are scored correctly I believe. Higher numbers are just speculated human performance as far as I’m aware.

macrolime1y ago

nextworddev1y ago

The best interpretation of this result is probably that it showed tackling some arbitrary benchmark is something you can throw money at, aka it’s just something money can solve.

Its not agi obviously in the sense that you still need to some problem framing and initialization to kickstart the reasoning path simulations

torginus1y ago

strangescript1y ago

"We have created artificial super intelligence, it has solved physics!"

"Well, yeah, but its kind of expensive" -- this guy

tymonPartyLateOP1y ago

jeremyjh1y ago

strangescript1y ago

And two years ago this result would have been thought impossible.

Picks up goalpost, looks for stadium exit

freehorse1y ago

j / k navigate · click thread line to collapse