If you look at the ARC tasks failed by o3, they're really not well suited to humans. They lack the living context humans thrive on, and have relatively simple, analytical outcomes that are readily processed by simple structures. We're unlikely to see AI as "smart" until it can be asked to accomplish useful units of productive professional work at a "seasoned apprentice" level. Right now they're consuming ungodly amounts of power just to pass some irritating, sterile SAT questions. Train a human for a few hours a day over a couple weeks and they'll ace this no problem.
It works the same with humans: the more time they spend on a puzzle, the more likely they are to solve it.
While beyond current models, that would be the final test of AGI capability.
Though to be clear, this wasn't a one-shot thing - iirc it was a few months of back-and-forth chats, with plenty of wrong turns too.
If you disagree with me, state why instead of just downvoting.