OpenAI o1 Results on ARC-AGI-Pub (arcprize.org)

187 points · z7 · 1y ago · 118 comments


alphabetting · 1y ago

This is the best AGI benchmark out there, in my opinion. Surprising results that underscore how good Sonnet is.

krackers · 1y ago

If ARC-AGI were a good benchmark for "AGI", then MindsAI should effectively be blowing away current frontier models by an order of magnitude. I don't know what MindsAI is, but the post implies they're basically fine-tuning or using a very specific strategy for ARC-AGI that isn't really generalizable to other tasks.

I think it's a nice benchmark of a certain type of spatial/visual intelligence, but if you have a model or technique specifically fine-tuned for ARC-AGI then it's no longer A"G"I

drdeca · 1y ago

Perhaps a benchmark could be a good approximate upper bound for something without being a good approximate lower bound for that thing?

alphabetting · 1y ago

I clarified in another post that I mean for benchmarking standalone models, not ones fine-tuned for solving ARC.

nightski · 1y ago

I mean, there are a lot of tasks that frontier models excel at which many humans wouldn't be able to complete.

zone411 · 1y ago

Disagree. My opinion is that solving ARC-AGI won't get us any closer to AGI and it's mostly a distraction.

typon · 1y ago

I think solving ARC-AGI will be necessary but not sufficient. My bet is that we won't see the converse: a model that is considered "AGI" but does poorly on ARC-AGI. So in that sense, I think this is an important benchmark.

meowface · 1y ago

I mostly agree, but I think it's fair to say that ARC-AGI is a necessary but definitely not sufficient milestone when it comes to the evaluation of a purported AGI.

alphabetting · 1y ago

How so? I think that could be true if a team is fine-tuning specifically to beat ARC, but when you look at Sonnet and o1 getting 20%, I think a standalone frontier model beating it would mean we are close to or already at AGI.

glial · 1y ago

Is that mainly because AGI is one of those "I'll know it when I see it" things?

killthebuddha · 1y ago

In my opinion this blog post is a little bit misleading about the difference between o1 and earlier models. When I first heard about ARC-AGI (a few months ago, I think) I took a few of the ARC tasks and spent a few hours testing all the most powerful models. I was kind of surprised by how completely the models fell on their faces, even with heavy-handed feedback and various prompting techniques. None of the models came close to solving even the easiest puzzles. So today I tried again with o1-preview, and the model solved (probably the easiest) puzzle without any kind of fancy prompting:

https://chatgpt.com/share/66e4b209-8d98-8011-a0c7-b354a68fab...

Anyways, I'm not trying to make any grand claims about AGI in general, or about ARC-AGI as a benchmark, but I do think that o1 is a leap towards LLM-based solutions to ARC.

kobe_bryant · 1y ago

So it gives you the wrong answer and then you keep telling it how to fix it until it does? What does fancy prompting look like then, just feeding it the solution piece by piece?

killthebuddha · 1y ago

Basically yes, but there's a very wide range of how explicit the feedback could be. Here's an example where I tell gpt-4 exactly what the rule is and it still fails:

https://chatgpt.com/share/66e514d3-ca0c-8011-8d1e-43234391a0...

and an example using gpt-4o:

https://chatgpt.com/share/66e515da-a848-8011-987f-71dab56446...

I'd share similar examples using claude-3.5-sonnet but I can't figure out how to do it from the claude.ai UI.

To be clear, my point is not at all that o1 is so incredibly smart. IMO the ARC-AGI puzzles show very clearly how dumb even the most advanced models are. My point is just that o1 does seem to be noticeably better at solving these problems than previous models.

mikeknoop · 1y ago

Author here. Which aspects are misleading? How can it be improved?

killthebuddha · 1y ago

I think the post is great, clear and fair and all that. And I definitely agree with the general point that o1 shows some amount of improvement on generality but with a massive tradeoff on cost.

I'm going to think through what I find "misleading" as I write this...

Ok so I guess it's that I wouldn't be surprised at all if we learn that models can improve a ton w.r.t. human-in-the-loop prompt engineering (e.g. ChatGPT) without a commensurate improvement in programmatic prompt engineering.

It's very difficult to get a Python-driven claude-3.5-sonnet agent to solve ARC tasks and it's also very difficult to get claude-3.5-sonnet to solve ARC tasks via the claude.ai UI. The blog post shows that it's also very difficult to get a Python-driven o1-preview agent to solve ARC tasks. From a cursory exploration of o1-preview's capabilities in the ChatGPT UI my intuition is that it's significantly smarter than claude-3.5-sonnet based on how much better it responds to my human-in-the-loop feedback.

So I guess my point is that many people will probably come away from the blog post thinking "there's nothing to see here, o1-preview is more of the same thing", but it seems to me that it's very clearly qualitatively different from previous models.

Aside: This isn't a problem with the blog post at all IMO, we don't need to litter every benchmark post with a million caveats/exceptions/disclaimers/etc.

wokwokwok · 1y ago

I think the parent post is complaining that insufficient acknowledgement is given to how good o1 is, because in their contrived testing, it seems better than previous models.

I don’t think that’s true though, it’s hard to be more fair and explicit than:

> OpenAI o1-preview and o1-mini both outperform GPT-4o on the ARC-AGI public evaluation dataset. o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.

I.e., it’s just not that great, and it’s enormously slow.

That probably wasn’t what people wanted to hear, even if it is literally what the results show.

You can’t run away from the numbers:

> It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.

(Side note: readers may be getting confused about what “test-time scaling” is, and why that’s important. TLDR: more compute is getting better results at inference time. That’s a big deal, because previously, pouring more compute at inference time didn’t seem to make much real difference; but overall I don’t see how anything you’ve said is either inaccurate or misleading)

usaar333 · 1y ago

Both ChatGPT 4o and Claude 3.5 can trivially solve this puzzle if you direct them to do program synthesis to solve it (that is, write a program that solves it, e.g. https://pastebin.com/wDTWYcSx).

Without program synthesis (the way you are doing it), the LLM inevitably fails to change the correct position (bad counting and whatnot).
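
A minimal sketch of the program-synthesis idea mentioned above, under stated assumptions: the candidate programs would normally be written by the LLM, but here they are two hand-written stand-ins. The grids and the names `fill_corners`, `identity`, `passes_all_examples`, and `select_program` are hypothetical and are not taken from the linked pastebin. The point is only the verification step: a candidate is accepted if it reproduces every training output exactly, and only then is it run on the test input.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, expected output grid)


def passes_all_examples(program: Callable[[Grid], Grid], examples: List[Example]) -> bool:
    """Return True only if the program reproduces every training output exactly."""
    return all(program(inp) == out for inp, out in examples)


def identity(grid: Grid) -> Grid:
    """Hypothetical candidate: return a copy of the grid unchanged."""
    return [row[:] for row in grid]


def fill_corners(grid: Grid) -> Grid:
    """Hypothetical candidate: set the four corner cells to 1."""
    result = [row[:] for row in grid]
    for r in (0, len(grid) - 1):
        for c in (0, len(grid[0]) - 1):
            result[r][c] = 1
    return result


def select_program(
    candidates: List[Callable[[Grid], Grid]], examples: List[Example]
) -> Optional[Callable[[Grid], Grid]]:
    """Return the first candidate that is consistent with all examples, if any."""
    for program in candidates:
        if passes_all_examples(program, examples):
            return program
    return None


# Made-up training pairs for illustration only.
train_examples: List[Example] = [
    ([[0, 0, 0], [0, 0, 0], [0, 0, 0]], [[1, 0, 1], [0, 0, 0], [1, 0, 1]]),
    ([[0, 0], [0, 0]], [[1, 1], [1, 1]]),
]

chosen = select_program([identity, fill_corners], train_examples)
test_input: Grid = [[0, 0, 0, 0], [0, 0, 0, 0]]
prediction = chosen(test_input) if chosen is not None else None
print(prediction)
```

Because the check is exact, a program that miscounts a single cell is rejected before it ever touches the test input, which is the failure mode described for direct (non-program) answers.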

riku_iki · 1y ago

And what prompt did you give them to generate the program? Did you explicitly tell them that they need to fill cornered cells? If so, that is not what the benchmark is about. The benchmark asks the LLM to figure out what the pattern is.

I entered the task into Claude and asked it to write Python code, and it failed to recognize the pattern:

> To solve this puzzle, we need to implement a program that follows the pattern observed in the given examples. It appears that the rule is to replace 'O' with 'X' when it's adjacent (horizontally, vertically, or diagonally) to exactly two '@' symbols. Let's write a Python program to solve this:

riku_iki · 1y ago

The interesting part, if you check the CoT output, is the way it solved it: it said the pattern is to make the number of filled cells even in each row with a neat layout, which is an interesting side effect, but not what the task was about.

It also refers to some "assistant"; it looks like they have some mysterious component in addition to the LLM, or another LLM.

w4 · 1y ago

> o1's performance increase did come with a time cost. It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.

Sheesh. We're going to need more compute.
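
For scale, here is the quoted figure worked out per task. This is a back-of-the-envelope sketch that assumes the 70 hours and 30 minutes are total wall-clock times over the same 400 tasks; the variable names are illustrative.

```python
# Figures quoted from the post: total run time over the 400 public tasks.
NUM_TASKS = 400
O1_TOTAL_MINUTES = 70 * 60      # 70 hours
BASELINE_TOTAL_MINUTES = 30     # 30 minutes for GPT-4o and Claude 3.5 Sonnet

o1_minutes_per_task = O1_TOTAL_MINUTES / NUM_TASKS
baseline_seconds_per_task = BASELINE_TOTAL_MINUTES * 60 / NUM_TASKS
slowdown = O1_TOTAL_MINUTES / BASELINE_TOTAL_MINUTES

print(f"o1: {o1_minutes_per_task} min/task")
print(f"baseline: {baseline_seconds_per_task} s/task")
print(f"ratio of total run times: {slowdown}x")
```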

typon · 1y ago

Polar icecaps shuddering at the thought

asimpleusecase · 1y ago

That is the next major challenge. OK, you can solve a logic puzzle with a gazillion watts; now go power that same level of compute with a cheeseburger or, if you are vegan, a nice salad.

Davidzheng · 1y ago

Intelligence is something that gets monotone easier as compute increases and trivial at the large compute limit (for instance, you can brute force simulate a human at large enough compute). So increasing compute is the surest way to ensure success at reaching above-human-level intelligence (AGI).

etrautmann · 1y ago

This is…highly speculative and fairly ridiculous to anyone who’s attempted to do so

logicchains · 1y ago

>Intelligence is something that gets monotone easier as compute increases and trivial at the large compute limit (for instance can brute force simulate a human at large enough compute)

It gets monotone easier but the increase can be so slow that even using all the energy in the observable universe wouldn't make a meaningful difference, e.g. for problems in the exponential complexity class.

trehalose · 1y ago

How does one "brute force simulate a human"? If compute is the limiting factor, then isn't it currently possible to brute force simulate a human, just extremely slowly?

dr_dshiv · 1y ago

Now is a good time to spend with families and do work that feels satisfying. Much change is coming.

meowface · 1y ago

Takeaway:

>o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.

Scores:

>GPT-4o: 9%

>o1-preview: 21%

>Claude 3.5 Sonnet: 21%

>MindsAI: 46% (current highest score)

GaggiX · 1y ago

The takeaway is also that o1-preview is a major improvement compared to GPT-4o.

Anthropic is just ahead.

disgruntledphd2 · 1y ago

It's a little embarrassing for OpenAI though?

meowface · 1y ago

True. I've updated my post to include some of the scores.

Alifatisk · 1y ago

How the hell is Anthropic this far ahead? I am yet impressed

krackers · 1y ago

There were rumors that 3.5 Sonnet heavily used synthetic data for training, in the same way that OpenAI plans to use o1 to train Orion. Maybe this confirms it?

attentive · 1y ago

who knows what MindsAI is?

fsndz · 1y ago

As expected, I've always believed that with the right data allowing the LLM to be trained to imitate reasoning, it's possible to improve its performance. However, this is still pattern matching, and I suspect that this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the "reasoning programs" or "reasoning patterns" the model learned during the reinforcement learning phase. https://www.lycee.ai/blog/openai-o1-release-agi-reasoning

skepticATX · 1y ago

My feeling is that this is one reason they decided to hide the reasoning tokens.

fsndz · 1y ago

yes indeed

wslh · 1y ago

So basically it's a kind of overfitting with pattern matching features? This doesn't undermine the power of LLMs but it is great to study their limitations.

poopiokaka · 1y ago

“As expected I’m right”

fsndz · 1y ago

Shouldn't I expect to be right when I have a thesis? That doesn't mean I can't see when I am wrong.

ec109685 · 1y ago

Why is this considered such a great AGI test? It seems possible to extensively train a model on the algorithms used to solve these cases, and some cases feel beyond what a human could straightforwardly figure out.

isotypic · 1y ago

Do you have some examples of ones you found beyond what a human could straightforwardly figure out? I tried a bunch and they all seemed reasonable, so I would be interested in seeing - I didn't try all 400, for obvious reasons, so I don't doubt there are difficult ones.

I think regardless one of the reasons people are interested in it is that it is a fairly simple logic puzzle - given some examples, extrapolate a pattern, execute the pattern - that humans achieve high accuracy on (a study linked on the website has ~84% accuracy for humans, some more recent study seems to put it closer to 75%). Yet ML approaches have yet to reach that level, in contrast to other problems ML has been applied to.

Given there is a large prize pool for the challenge, I would imagine actually training a model in the way you describe would already have been tried and is more difficult than it seems.

ec109685 · 1y ago

I realize I didn’t scroll to other examples for one I found very hard.

I guess the question is whether someone who solves this will have cracked AGI as a necessary precondition or like other Turing tests that have been solved, someone will find a technique that isn’t broadly applicable to general intelligence.

riku_iki · 1y ago

I think a huge advantage is that they keep the eval tests private, so corps can't fine-tune their models on them and claim a breakthrough, which possibly happened with many other benchmarks.

visarga · 1y ago

There is a hidden test set with new puzzle types not seen in the open part. It's designed so that humans do well and AI models have a hard time.

YeGoblynQueenne · 1y ago

"Designed" is not right. What gives "AI models" (i.e. deep neural nets) a hard time is that there are very few examples in the public training and evaluation set: each task has three examples. So basically it's not a test of intelligence but a test of sample efficiency.

Besides which, it is unfair because it excludes an entire category of systems, not to mention a dominant one. If F. Chollet really believes ARC is a test of intelligence, then why not provide enough examples for deep nets or some other big data approach to be trained effectively? The answer is: because a big data approach would then easily beat the test. But if the test can be beaten without intelligence, just with data, then it's not a test of intelligence.

My guess for a long time has been that ARC will fall just like the Winograd Schema challenge (WSC) [1] fell: someone will do the work to generate enough (tens of thousands) examples of ARC-like tasks, then train a deep neural net and go to town. That's what happened with the WSC. A large dataset of Winograd schema sentences was crowd-sourced and a big BERT-era Transformer got around 90% accuracy on the WSC [2]. Bye bye WSC, and any wishful thinking about Winograd schemas requiring human intuition and other undefined stuff.

Or, ARC might go the way of the Bongard Problems [3]: the original 100 problems by Bongard still stand unsolved, but the machine learning community has effectively sidestepped them. Someone made a generator of Bongard-like problems [4], and while this was not enough to solve the original problems, everyone simply switched to training CNNs and reporting results on the new dataset [5].

We basically have no idea how to create a test for intelligence that computers cannot beat by brute force or big data approaches so we have no effective way to test computers for (artificial) intelligence. The only thing we know humans can do that computers can't is identify undecidable problems (like Barber Paradoxes i.e. statements of the form "this sentence is false", as in Gödel's second incompleteness theorem). Unfortunately we already know there is no computer that can ever do that, and even if we observe say ChatGPT returning the right answer we can be sure it has only memorised, not calculated it, so we're a bit stuck. ARC won't get us unstuck in any way shape or form and so it's just a distraction.

_____________________

[1] https://en.wikipedia.org/wiki/Winograd_schema_challenge

[2] WinoGrande: An Adversarial Winograd Schema Challenge at Scale

https://arxiv.org/abs/1907.10641

Although note the results are interpreted to mean LLMs are more or less memorising answers, which is right of course.

[3] Index of Bongard Problems

https://www.foundalis.com/res/bps/bpidx.htm

[4] Comparing machines and humans on a visual categorization test

https://www.pnas.org/doi/abs/10.1073/pnas.1109168108

[5] 25 years of CNNs: Can we compare to human abstraction capabilities?

https://arxiv.org/abs/1607.08366

mrcwinn · 1y ago

How is Anthropic accomplishing this despite (seemingly) arriving later? What advantage do they have?

Satam · 1y ago

I think it's because OpenAI's leadership lacks good taste and talent. Realistically, they haven't shifted the needle with anything really interesting in 2 years now. They're using the inertia well but that's about it. Their model is not the best, the UI is not the best, and their pace of improvement is not great either.

falcor84 · 1y ago

I find the chatgpt-4o advanced mode to absolutely be "really interesting". And the video input they showed in the demos (and that I hope they will some day release) could be a real game changer. One thing I would like to try, once that's out, is to put a computer with it amongst a group of students listening to a short lecture about something outside its training set and then check how the AI does on a comprehension quiz following the lecture - my feeling is that it would do significantly better than the average human student on most subjects.

WiSaGaN · 1y ago

Anthropic currently does much less hype stuff compared to OpenAI. It's remarkable that OpenAI was like this until the GPT-4 release, and completely changed once Sam Altman started touring countries.

changoplatanero · 1y ago

One theory I heard is that Dario was always interested in RL whereas Ilya was interested in other stuff until more recently. So Anthropic could have had an earlier start on some of this latest RL stuff.

GaggiX · 1y ago

It really shows how far ahead Anthropic is/was when they released Claude 3.5 Sonnet.

That being said, the ARC-AGI test is mostly a visual test that would be much easier to beat once these models are truly multimodal (not just appending a separate vision encoder after training), in my opinion.

I wonder what the graph will look like in a year from now, the models have improved a lot in the last one.

threeseed · 1y ago

> I wonder what the graph will look like in a year from now, the models have improved a lot in the last one.

Potentially not great.

If you look at the AIME accuracy graph on the OpenAI page [1] you will notice that the x-axis is logarithmic. Which is a problem because (a) compute in general has never scaled that well and (b) semiconductor fabrication will inevitably get harder as we approach smaller sizes.

So it looks like unless there is some ground-breaking research in the pipeline the current transformer architecture will likely start to stall out.

[1] https://openai.com/index/learning-to-reason-with-llms/
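
To make the logarithmic-axis point concrete, here is a toy calculation. It assumes accuracy grows linearly in log10(compute); the intercept and slope are made-up illustrative numbers, not values read off the OpenAI graph, and `accuracy` and `compute_needed` are hypothetical helper names.

```python
import math

# Hypothetical coefficients for illustration only (not measured values).
INTERCEPT = 20.0          # accuracy (%) at 1 unit of compute
SLOPE_PER_DECADE = 10.0   # accuracy points gained per 10x increase in compute


def accuracy(compute: float) -> float:
    """Accuracy under the assumed log-linear trend."""
    return INTERCEPT + SLOPE_PER_DECADE * math.log10(compute)


def compute_needed(target_accuracy: float) -> float:
    """Invert the trend: compute required to reach a target accuracy."""
    return 10 ** ((target_accuracy - INTERCEPT) / SLOPE_PER_DECADE)


for target in (30, 40, 50):
    print(f"{target}% accuracy needs about {compute_needed(target):g} units of compute")
```

Under this assumption every additional fixed step in accuracy multiplies the required compute by the same factor, which is why a straight line on a log-scaled x-axis can still imply rapidly growing cost.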

accountnum · 1y ago

It's not a problem, because the point at which we are in the logarithmic curve is the only thing that matters. No one in their right mind ever expected anything linear, because that would imply that creating a perfect oracle is possible.

More compute hasn't been the driving factor of the last developments, the driving factor has been distillation and synthetic data. Since we've seen massive success with that, I really struggle to understand why people continue to doomsay the transformer. I hear these same arguments year after year and people never learn.

GaggiX · 1y ago

I'm very optimistic about it because native multimodal LLMs have hardly been explored.

Also in general, I have yet to see these models plateau; Claude 3.5 Sonnet is a night-and-day difference compared to previous models.

devit · 1y ago

Am I missing something, or is this "ARC-AGI" thing so ludicrously terrible that it seems to be completely irrelevant?

It seems that the tasks consist of giving the model examples of a transformation of an input colored grid into an output colored grid, and then asking it to provide the output for a given input.

The problem is of course that the transformation is not specified, so any answer is actually acceptable since one can always come up with a justification for it, and thus there is no reasonable way to evaluate the model (other than only accepting the arbitrary answer that the authors pulled out of who knows where).

It's like those stupid tests that tell you "1 2 3 ..." and you are supposed to complete with 4, but obviously that's absurd since any continuation is valid given that e.g. you can find a polynomial that fits any four numbers, and the test maker didn't provide any objective criteria to determine which algorithm among multiple candidates is to be preferred.
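
The polynomial claim can be checked directly. The sketch below uses Lagrange interpolation with exact fractions; `lagrange_eval` is a hypothetical helper name, and the candidate fourth values are arbitrary. For each candidate it builds the cubic through the positions 1 to 4 and confirms that it reproduces "1 2 3" followed by that candidate.

```python
from fractions import Fraction
from typing import List


def lagrange_eval(xs: List[int], ys: List[int], x: int) -> Fraction:
    """Evaluate at x the unique polynomial of degree < len(xs) through (xs[i], ys[i])."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total


positions = [1, 2, 3, 4]
for fourth in (4, 17, -5):  # arbitrary continuations of "1 2 3 ..."
    values = [1, 2, 3, fourth]
    fitted = [lagrange_eval(positions, values, x) for x in positions]
    assert fitted == values  # the cubic passes through all four points
    print(f"fourth term {fourth}: fifth term would be {lagrange_eval(positions, values, 5)}")
```

Each choice of fourth term is matched by some cubic, and each cubic then predicts a different fifth term, which is the ambiguity being described.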

Basically, something like this is about guessing how the test maker thinks, which is completely unrelated to the concept of AGI (i.e. the ability to provide correct answers to questions based on objectively verifiable criteria).

And if instead of AGI one is just trying to evaluate how the model predicts how the average human thinks, then it makes no sense at all to evaluate language model performance by performance on predicting colored grid transformations.

For instance, since normal LLMs are not trained on colored grids, it means that any model specifically trained on colored grid transformations as performed by humans of similar "intelligence" as the ARC-"AGI" test maker is going to outperform normal LLMs at ARC-"AGI", despite the fact that it is not really a better model in general.

YeGoblynQueenne · 1y ago

No no, that's not right. They're not asking for specific solutions. Any transformation of one grid to another will do.

devit · 1y ago

What?

They say: "ARC-AGI tasks are a series of three to five input and output tasks followed by a final task with only the input listed. Each task tests the utilization of a specific learned skill based on a minimal number of cognitive priors.

Tasks are represented as JSON lists of integers. These JSON objects can also be represented visually as a grid of colors using an ARC-AGI task viewer.

A successful submission is a pixel-perfect description (color and position) of the final task's output."

As far as I can tell, they are asking to reproduce exactly the final task's output.
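
A sketch of what "pixel-perfect" scoring amounts to, assuming a task is a JSON object with `train` and `test` lists of `input`/`output` grids. That layout, the toy grids, and the function name `is_pixel_perfect` are assumptions made for illustration, not something stated in the quoted text.

```python
import json
from typing import List

Grid = List[List[int]]


def is_pixel_perfect(predicted: Grid, expected: Grid) -> bool:
    """True only if both grids have the same shape and every cell matches."""
    if len(predicted) != len(expected):
        return False
    return all(p_row == e_row for p_row, e_row in zip(predicted, expected))


# Toy task in the assumed layout; the integers stand for colors.
task_json = """
{"train": [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],
 "test":  [{"input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]]}]}
"""
task = json.loads(task_json)
expected = task["test"][0]["output"]

exact = [[0, 0], [1, 1]]    # identical grid
one_off = [[0, 0], [1, 0]]  # a single wrong cell
wrong_shape = [[0, 0]]      # missing a row

print(is_pixel_perfect(exact, expected))
print(is_pixel_perfect(one_off, expected))
print(is_pixel_perfect(wrong_shape, expected))
```

There is no partial credit in this sketch: one wrong cell or a wrong grid shape counts the same as a completely wrong answer.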

Stevvo · 1y ago

"Greenblatt" shown with 42% in the bar chart is GPT-4o with a strategy: https://substack.com/@ryangreenblatt/p-145731248

So, how well might o1 do with Greenblatt's strategy?

mikeknoop · 1y ago

I bet pretty well! Someone should try this. It's likely expensive but sampling could give you confidence to keep going. Ryan's approach costs about $10k to run the full 400 public eval set at current 4o prices -- which is the arbitrary limit we set for the public leaderboard.

Terretta · 1y ago

TL;DR (direct quote):

“In summary, o1 represents a paradigm shift from "memorize the answers" to "memorize the reasoning" but is not a departure from the broader paradigm of fitting a curve to a distribution in order to boost performance by making everything in-distribution.”

“We still need new ideas for AGI.”

sashank_1509 · 1y ago

This sounds very fair, but I think fundamentally humans memorize reasoning a lot more than you’d expect. A spark of inspiration is not memorized reasoning, but not many people can claim to enjoy that capability.

a_wild_dandan · 1y ago

This tests vision, not intelligence. A reasoning test dependent on noisy information is borderline useless.

falcor84 · 1y ago

What's noisy about it? The input matrix is discrete and converting it into any sort of structured input is trivial.

perching_aix · 1y ago

Is it possible for me, a human, to undertake these benchmarks?

terhechte · 1y ago

There's examples on the homepage, and there's a link to the Kaggle notebook in the article.

https://arcprize.org

fancyfredbot · 1y ago

I found the level-headed explanation of why log-linear improvements in test score with increased compute aren't revolutionary the best part of this article. That's not to say the rest wasn't good too! One of the best articles on o1 I've read.

benreesman · 1y ago

The test you really want is the apples-to-apples comparison between GPT-4o faced with the same CoT and other context annealing that presumably, uh, Q* sorry Strawberry now feeds it (on your dime). This would of course require seeing the tokens you are paying for instead of being threatened with bans for asking about them.

Compared to the difficulty in assembling the data and compute and other resources needed to train something like GPT-4-1106 (which are staggering), training an auxiliary model with a relatively straightforward, differentiable, well-behaved loss on a task like "which CoT framing is better according to human click proxy" is just not at that same scale.

lossolo · 1y ago

It seems like o1 is a lot worse than Claude on coding tasks https://livebench.ai

Alifatisk · 1y ago

This is great marketing for Anthropic.

bulbosaur123 · 1y ago

Ok, I have a practical question. How do I use this o1 thing to view codebase for my game app and then simply add new features based on my prompts? Is it possible rn? How?

j / k navigate · click thread line to collapse

118 comments

71 comments · 18 top-level

alphabetting1y ago· 9 in thread

This is best AGI benchmark out there in my opinion. Surprising results that underscore how good Sonnet is.

krackers1y ago

I think it's a nice benchmark of a certain type of spatial/visual intelligence, but if you have a model or technique specifically fine-tuned for ARC-AGI then it's no longer A"G"I

drdeca1y ago

Perhaps a benchmark could be a good approximate upper bound for something without being a good approximate lower bound for that thing?

alphabetting1y ago

I clarified in a another post I mean for benchmarking standalone models, not ones fine-tuned for solving ARC

nightski1y ago

I mean, there are a lot of tasks that frontier models excel at which many humans wouldn't be able to complete.

zone4111y ago

Disagree. My opinion is that solving ARC-AGI won't get us any closer to AGI and it's mostly a distraction.

typon1y ago

1 more reply

meowface1y ago

I mostly agree, but I think it's fair to say that ARC-AGI is a necessary but definitely not sufficient milestone when it comes to the evaluation of a purported AGI.

alphabetting1y ago

1 more reply

glial1y ago

Is that mainly because AGI is one of those "I'll know it when I see it" things?

killthebuddha1y ago· 8 in thread

https://chatgpt.com/share/66e4b209-8d98-8011-a0c7-b354a68fab...

Anyways, I'm not trying to make any grand claims about AGI in general, or about ARC-AGI as a benchmark, but I do think that o1 is a leap towards LLM-based solutions to ARC.

kobe_bryant1y ago

So it gives you the wrong answer and then you keep telling it how to fix it until it does? What does fancy prompting look like then, just feeding it the solution piece by piece?

killthebuddha1y ago

Basically yes, but there's a very wide range of how explicit the feedback could be. Here's an example where I tell gpt-4 exactly what the rule is and it still fails:

https://chatgpt.com/share/66e514d3-ca0c-8011-8d1e-43234391a0...

and an example using gpt-4o:

https://chatgpt.com/share/66e515da-a848-8011-987f-71dab56446...

I'd share similar examples using claude-3.5-sonnet but I can't figure out how to do it from the claud.ai ui.

3 more replies

mikeknoop1y ago

Author here. Which aspects are misleading? How can it be improved?

killthebuddha1y ago

I think the post is great, clear and fair and all that. And I definitely agree with the general point that o1 shows some amount of improvement on generality but with a massive tradeoff on cost.

I'm going to think through what I find "misleading" as I write this...

Aside: This isn't a problem with the blog post at all IMO, we don't need to litter every benchmark post with a million caveats/exceptions/disclaimers/etc.

wokwokwok1y ago

I think the parent post is complaining that insufficient acknowledgement is given to how good o1 is, because in their contrived testing, it seems better than previous models.

I don’t think that’s true though, it’s hard to be more fair and explicit than:

Ie. it’s just not that great, and it’s enormously slow.

That probably wasn’t what people wanted to hear, even if it is literally what the results show.

You cant run away from the numbers:

> It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.

4 more replies

usaar3331y ago

Both Chat GPT 4o and Claude 3.5 can trivially solve this puzzle if you direct them to do program synthesis to solve it. (that is write a program that solves it - e.g. https://pastebin.com/wDTWYcSx).

Without program synthesis (the way you are doing it), the LLM inevitably fails to change the correct position (bad counting and what not)

riku_iki1y ago

I entered task to Claude and asked to write py code, and it failed to recognize pattern:

1 more reply

riku_iki1y ago

It is also referring on some "assistant", looks like they have some mysterious component in addition to LLM, or another LLM.

w41y ago· 7 in thread

> o1's performance increase did come with a time cost. It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.

Sheesh. We're going to need more compute.

typon1y ago

Polar icecaps shuddering at the thought

asimpleusecase1y ago

That is the next major challenge. Ok you can solve a logic puzzle with a gilzillon watts now go power that same level of compute with a cheese burger, or if you are vegan a nice salad.

Davidzheng1y ago

etrautmann1y ago

This is…highly speculative and fairly ridiculous to anyone who’s attempted to do so

1 more reply

logicchains1y ago

>Intelligence is something that gets monotone easier as compute increases and trivial at the large compute limit (for instance can brute force simulate a human at large enough compute)

trehalose1y ago

How does one "brute force simulate a human"? If compute is the limiting factor, then isn't it currently possible to brute force simulate a human, just extremely slowly?

3 more replies

dr_dshiv1y ago

Now is a good time to spend with families and do work that feels satisfying. Much change is coming.

meowface1y ago· 6 in thread

Takeaway:

>o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.

Scores:

>GPT-4o: 9%

>o1-preview: 21%

>Claude 3.5 Sonnet: 21%

>MindsAI: 46% (current highest score)

GaggiX1y ago

The takeaway is also that o1-preview is a major improvement compare to GPT-4o.

Anthropic is just ahead.

disgruntledphd21y ago

It's a little embarrassing for OpenAI though?

1 more reply

meowface1y ago

True. I've updated my post to include some of the scores.

Alifatisk1y ago

How the hell is Anthropic this far ahead? I am yet impressed

krackers1y ago

There were rumors that 3.5 Sonnet heavily used synthetic data for training, in the same way that OpenAI plans to use o1 to train Orion. Maybe this confirm it?

attentive1y ago

who knows what MindsAI is?

fsndz1y ago· 5 in thread

skepticATX1y ago

My feeling is that this is one reason they decided to hide the reasoning tokens.

fsndz1y ago

yes indeed

wslh1y ago

So basically it's a kind of overfitting with pattern matching features? This doesn't undermine the power of LLMs but it is great to study their limitations.

poopiokaka1y ago

“As expected I’m right”

fsndz1y ago

shouldn't I expect to be right when I have a thesis ? doesn't mean I can't see when I am wrong.

ec1096851y ago· 5 in thread

isotypic1y ago

Given there is a large prize pool for the challenge, I would imagine actually training a model in the way you describe would already have been tried and is more difficult that it seems.

ec1096851y ago

I realize I didn’t scroll to other examples for one I found very hard.

riku_iki1y ago

I think a huge advantage is that they keep the eval tests private, so corporations can't fine-tune their models on them and claim a breakthrough, which has possibly happened with many other benchmarks.

visarga1y ago

There is a hidden test set with new puzzle types not seen in the open part. It's designed so that humans do well and AI models have a hard time.

YeGoblynQueenne1y ago

_____________________

[1] https://en.wikipedia.org/wiki/Winograd_schema_challenge

[2] WinoGrande: An Adversarial Winograd Schema Challenge at Scale

https://arxiv.org/abs/1907.10641

Although note the results are interpreted to mean LLMs are more or less memorising answers, which is right of course.

[3] Index of Bongard Problems

https://www.foundalis.com/res/bps/bpidx.htm

[4] Comparing machines and humans on a visual categorization test

https://www.pnas.org/doi/abs/10.1073/pnas.1109168108

[5] 25 years of CNNs: Can we compare to human abstraction capabilities?

https://arxiv.org/abs/1607.08366

mrcwinn1y ago· 4 in thread

How is Anthropic accomplishing this despite (seemingly) arriving later? What advantage do they have?

Satam1y ago

falcor841y ago

WiSaGaN1y ago

Anthropic currently does much less hype compared to OpenAI. It's remarkable that OpenAI was like this until the GPT-4 release and has completely changed since Sam Altman started touring countries.

changoplatanero1y ago

GaggiX1y ago· 3 in thread

It really shows how far ahead Anthropic is/was when they released Claude 3.5 Sonnet.

I wonder what the graph will look like in a year from now, the models have improved a lot in the last one.

threeseed1y ago

> I wonder what the graph will look like in a year from now, the models have improved a lot in the last one.

Potentially not great.

So it looks like, unless there is some ground-breaking research in the pipeline, the current transformer architecture will likely start to stall out.

[1] https://openai.com/index/learning-to-reason-with-llms/

accountnum1y ago

GaggiX1y ago

I'm very optimistic about it because native multimodal LLMs have hardly been explored.

Also, in general, I have yet to see these models plateau; Claude 3.5 Sonnet is a night-and-day difference compared to previous models.

devit1y ago· 2 in thread

Am I missing something, or is this "ARC-AGI" thing so ludicrously terrible that it seems to be completely irrelevant?

It seems that the tasks consist of giving the model examples of a transformation of an input colored grid into an output colored grid, and then asking it to provide the output for a given input.

YeGoblynQueenne1y ago

No no, that's not right. They're not asking for specific solutions. Any transformation of one grid to another will do.

devit1y ago

What?

"Tasks are represented as JSON lists of integers. These JSON objects can also be represented visually as a grid of colors using an ARC-AGI task viewer.

A successful submission is a pixel-perfect description (color and position) of the final task's output."

As far as I can tell, they are asking to reproduce exactly the final task's output.
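To make the "pixel-perfect" criterion concrete, here is a minimal sketch of how such a check might look. It assumes the public ARC task layout (a "train" list of input/output pairs plus a "test" list, with grids as nested lists of integers); the toy task, the `flip_horizontal` rule, and the helper names are made up for illustration and are not the official scoring code.

```python
import json

# Hypothetical toy task in the assumed ARC layout: "train" pairs plus a "test" pair.
TASK_JSON = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
  ]
}
"""


def is_pixel_perfect(predicted, expected):
    """True only if both grids have the same shape and every cell matches."""
    return predicted == expected


def flip_horizontal(grid):
    """Toy candidate rule (an assumption for this example): mirror each row."""
    return [list(reversed(row)) for row in grid]


task = json.loads(TASK_JSON)

# A candidate rule has to reproduce every training output exactly.
rule_fits_train = all(
    is_pixel_perfect(flip_horizontal(pair["input"]), pair["output"])
    for pair in task["train"]
)

# The final answer is scored the same way: exact match or nothing.
test_pair = task["test"][0]
prediction = flip_horizontal(test_pair["input"])
solved = is_pixel_perfect(prediction, test_pair["output"])
```

Under this reading, a prediction that differs from the expected grid in even one cell counts as a failure, which matches the quoted description.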

Stevvo1y ago· 1 in thread

"Greenblatt" shown with 42% in the bar chart is GPT-4o with a strategy: https://substack.com/@ryangreenblatt/p-145731248

So, how well might o1 do with Greenblatt's strategy?

mikeknoop1y ago

Terretta1y ago· 1 in thread

TL;DR (direct quote):

“We still need new ideas for AGI.”

sashank_15091y ago

a_wild_dandan1y ago· 1 in thread

This tests vision, not intelligence. A reasoning test dependent on noisy information is borderline useless.

falcor841y ago

What's noisy about it? The input matrix is discrete and converting it into any sort of structured input is trivial.
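As a rough illustration of how simple that conversion can be, here is a sketch that serializes a grid of integers into plain text and parses it back. The function names and the space-separated format are assumptions chosen for the example, not something the benchmark prescribes.

```python
def grid_to_text(grid):
    """Render a grid (list of rows of ints) as one space-separated line per row."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def text_to_grid(text):
    """Parse the text form back into a list of rows of ints."""
    return [[int(token) for token in line.split()] for line in text.splitlines()]


example_grid = [[0, 1, 0], [2, 2, 2], [0, 3, 0]]
as_text = grid_to_text(example_grid)
round_tripped = text_to_grid(as_text)
```

The round trip is lossless because every cell is a discrete integer, so nothing about the input has to be estimated from pixels.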

perching_aix1y ago· 1 in thread

Is it possible for me, a human, to undertake these benchmarks?

terhechte1y ago

There are examples on the homepage, and there's a link to the Kaggle notebook in the article.

https://arcprize.org

fancyfredbot1y ago

benreesman1y ago

lossolo1y ago

It seems like o1 is a lot worse than Claude on coding tasks https://livebench.ai

Alifatisk1y ago

This is great marketing for Anthropic

bulbosaur1231y ago

Ok, I have a practical question. How do I use this o1 thing to view the codebase for my game app and then simply add new features based on my prompts? Is it possible rn? How?
