I think it's a nice benchmark of a certain type of spatial/visual intelligence, but if you have a model or technique specifically fine-tuned for ARC-AGI then it's no longer A"G"I
https://chatgpt.com/share/66e4b209-8d98-8011-a0c7-b354a68fab...
Anyways, I'm not trying to make any grand claims about AGI in general, or about ARC-AGI as a benchmark, but I do think that o1 is a leap towards LLM-based solutions to ARC.
https://chatgpt.com/share/66e514d3-ca0c-8011-8d1e-43234391a0...
and an example using gpt-4o:
https://chatgpt.com/share/66e515da-a848-8011-987f-71dab56446...
I'd share similar examples using claude-3.5-sonnet but I can't figure out how to do it from the claude.ai UI.
To be clear, my point is not at all that o1 is so incredibly smart. IMO the ARC-AGI puzzles show very clearly how dumb even the most advanced models are. My point is just that o1 does seem to be noticeably better at solving these problems than previous models.
I'm going to think through what I find "misleading" as I write this...
Ok so I guess it's that I wouldn't be surprised at all if we learn that models can improve a ton w.r.t. human-in-the-loop prompt engineering (e.g. ChatGPT) without a commensurate improvement in programmatic prompt engineering.
It's very difficult to get a Python-driven claude-3.5-sonnet agent to solve ARC tasks and it's also very difficult to get claude-3.5-sonnet to solve ARC tasks via the claude.ai UI. The blog post shows that it's also very difficult to get a Python-driven o1-preview agent to solve ARC tasks. From a cursory exploration of o1-preview's capabilities in the ChatGPT UI my intuition is that it's significantly smarter than claude-3.5-sonnet based on how much better it responds to my human-in-the-loop feedback.
So I guess my point is that many people will probably come away from the blog post thinking "there's nothing to see here, o1-preview is more of the same thing", but it seems to me that it's very clearly qualitatively different from previous models.
Aside: This isn't a problem with the blog post at all IMO, we don't need to litter every benchmark post with a million caveats/exceptions/disclaimers/etc.
I don’t think that’s true though, it’s hard to be more fair and explicit than:
> OpenAI o1-preview and o1-mini both outperform GPT-4o on the ARC-AGI public evaluation dataset. o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.
I.e., it’s just not that great, and it’s enormously slow.
That probably wasn’t what people wanted to hear, even if it is literally what the results show.
You can’t run away from the numbers:
> It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.
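For scale, the quoted totals can be turned into per-task averages. This is just a back-of-the-envelope sketch that assumes both runs covered the same 400 tasks and uses only the numbers quoted above; the variable names are mine.

```python
# Per-task averages from the wall-clock totals quoted in the blog post.
TASKS = 400
O1_PREVIEW_HOURS = 70      # quoted total for o1-preview
BASELINE_MINUTES = 30      # quoted total for GPT-4o / Claude 3.5 Sonnet

o1_seconds_per_task = O1_PREVIEW_HOURS * 3600 / TASKS
baseline_seconds_per_task = BASELINE_MINUTES * 60 / TASKS
slowdown = o1_seconds_per_task / baseline_seconds_per_task

print(f"o1-preview: {o1_seconds_per_task / 60:.1f} min per task")   # 10.5 min
print(f"baseline:   {baseline_seconds_per_task:.1f} s per task")    # 4.5 s
print(f"slowdown:   {slowdown:.0f}x")                               # 140x
```

Taken at face value these totals work out to roughly 140x rather than 10x, so the two quoted figures may not be measuring quite the same thing.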
(Side note: readers may be getting confused about what “test-time scaling” is, and why that’s important. TLDR: more compute at inference time is getting better results. That’s a big deal, because previously, pouring in more compute at inference time didn’t seem to make much real difference.) But overall I don’t see how anything you’ve said is either inaccurate or misleading.
Without program synthesis (the way you are doing it), the LLM inevitably fails to change the correct position (bad counting and whatnot).
I gave the task to Claude and asked it to write Python code, and it failed to recognize the pattern:
To solve this puzzle, we need to implement a program that follows the pattern observed in the given examples. It appears that the rule is to replace 'O' with 'X' when it's adjacent (horizontally, vertically, or diagonally) to exactly two '@' symbols. Let's write a Python program to solve this:
It also refers to some "assistant"; it looks like they have some mysterious component in addition to the LLM, or another LLM.
Sheesh. We're going to need more compute.
It gets monotonically easier, but the improvement can be so slow that even using all the energy in the observable universe wouldn't make a meaningful difference, e.g. for problems in an exponential complexity class.
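To make the "easier, but hopelessly slowly" point concrete, here is a toy sketch. It assumes a hypothetical brute-force problem that costs exactly 2**n operations at size n; the cost model and the function name are illustrative assumptions, not anything from the blog post.

```python
def largest_solvable_size(budget_ops: int) -> int:
    """Largest n with 2**n <= budget_ops, under an assumed cost of 2**n operations."""
    if budget_ops < 1:
        raise ValueError("budget must be at least one operation")
    # For a positive integer, bit_length() - 1 equals floor(log2(budget_ops)).
    return budget_ops.bit_length() - 1

for budget in (10**9, 10**12, 10**15, 10**18):
    print(f"{budget:.0e} ops -> n = {largest_solvable_size(budget)}")
```

Under that assumption, a billion-fold increase in compute (10**9 to 10**18 operations) only moves the solvable size from 29 to 59, and every further doubling buys exactly one more unit of n.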
>o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.
Scores:
>GPT-4o: 9%
>o1-preview: 21%
>Claude 3.5 Sonnet: 21%
>MindsAI: 46% (current highest score)
Anthropic is just ahead.
I think regardless, one of the reasons people are interested in it is that it is a fairly simple logic puzzle - given some examples, extrapolate a pattern, execute the pattern - that humans achieve high accuracy on (a study linked on the website has ~84% accuracy for humans, a more recent study seems to put it closer to 75%). Yet ML approaches have yet to reach that level, in contrast to other problems ML has been applied to.
Given there is a large prize pool for the challenge, I would imagine actually training a model in the way you describe would already have been tried and is more difficult than it seems.
I guess the question is whether someone who solves this will have cracked AGI as a necessary precondition or whether, like other Turing tests that have been solved, someone will find a technique that isn’t broadly applicable to general intelligence.
Besides which, it is unfair because it excludes an entire category of systems, not to mention a dominant one. If F. Chollet really believes ARC is a test of intelligence, then why not provide enough examples for deep nets or some other big data approach to be trained effectively? The answer is: because a big data approach would then easily beat the test. But if the test can be beaten without intelligence, just with data, then it's not a test of intelligence.
My guess for a long time has been that ARC will fall just like the Winograd Schema challenge (WSC) [1] fell: someone will do the work to generate enough (tens of thousands) examples of ARC-like tasks, then train a deep neural net and go to town. That's what happened with the WSC. A large dataset of Winograd schema sentences was crowd-sourced and a big BERT-era Transformer got around 90% accuracy on the WSC [2]. Bye bye WSC, and any wishful thinking about Winograd schemas requiring human intuition and other undefined stuff.
Or, ARC might go the way of the Bongard Problems [3]: the original 100 problems by Bongard still stand unsolved, but the machine learning community has effectively sidestepped them. Someone made a generator of Bongard-like problems [4], and while this was not enough to solve the original problems, everyone simply switched to training CNNs and reporting results on the new dataset [5].
We basically have no idea how to create a test for intelligence that computers cannot beat by brute force or big data approaches so we have no effective way to test computers for (artificial) intelligence. The only thing we know humans can do that computers can't is identify undecidable problems (like Barber Paradoxes i.e. statements of the form "this sentence is false", as in Gödel's second incompleteness theorem). Unfortunately we already know there is no computer that can ever do that, and even if we observe say ChatGPT returning the right answer we can be sure it has only memorised, not calculated it, so we're a bit stuck. ARC won't get us unstuck in any way shape or form and so it's just a distraction.
_____________________
[1] https://en.wikipedia.org/wiki/Winograd_schema_challenge
[2] WinoGrande: An Adversarial Winograd Schema Challenge at Scale
https://arxiv.org/abs/1907.10641
Although note the results are interpreted to mean LLMs are more or less memorising answers, which is right of course.
[3] Index of Bongard Problems
https://www.foundalis.com/res/bps/bpidx.htm
[4] Comparing machines and humans on a visual categorization test
https://www.pnas.org/doi/abs/10.1073/pnas.1109168108
[5] 25 years of CNNs: Can we compare to human abstraction capabilities?
That being said, the ARC-AGI test is mostly a visual test that would be much easier to beat once these models are truly multimodal (not just appending a separate vision encoder after training) in my opinion.
I wonder what the graph will look like a year from now; the models have improved a lot in the last one.
Potentially not great.
If you look at the AIME accuracy graph on the OpenAI page [1] you will notice that the x-axis is logarithmic. Which is a problem because (a) compute in general has never scaled that well and (b) semiconductor fabrication will inevitably get harder as we approach smaller sizes.
So it looks like unless there is some ground-breaking research in the pipeline the current transformer architecture will likely start to stall out.
More compute hasn't been the driving factor of the last developments, the driving factor has been distillation and synthetic data. Since we've seen massive success with that, I really struggle to understand why people continue to doomsay the transformer. I hear these same arguments year after year and people never learn.
Also, in general, I have yet to see these models plateau; Claude 3.5 Sonnet is a night-and-day difference compared to previous models.
It seems that the tasks consist of giving the model examples of a transformation of an input colored grid into an output colored grid, and then asking it to provide the output for a given input.
The problem is of course that the transformation is not specified, so any answer is actually acceptable since one can always come up with a justification for it, and thus there is no reasonable way to evaluate the model (other than only accepting the arbitrary answer that the authors pulled out of who knows where).
It's like those stupid tests that give you "1 2 3 ..." and you are supposed to complete with 4, but obviously that's absurd since any continuation is valid given that e.g. you can find a polynomial that passes through any four points, and the test maker didn't provide any objective criteria to determine which algorithm among multiple candidates is to be preferred.
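The polynomial claim is easy to check. A minimal sketch using Lagrange interpolation with exact fractions (the helper name `lagrange_eval` and the arbitrary fourth value 42 are mine, chosen only for illustration):

```python
from fractions import Fraction

def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through the points."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

# The "obvious" continuation (4) and an arbitrary one (42) are both fit exactly.
for fourth in (4, 42):
    points = [(1, 1), (2, 2), (3, 3), (4, fourth)]
    fitted = [lagrange_eval(points, x) for x in range(1, 6)]
    print(fourth, [int(v) if v.denominator == 1 else v for v in fitted])
```

With 4 as the fourth term the fitted polynomial is just x and continues 1, 2, 3, 4, 5; with 42 it is a cubic that continues 1, 2, 3, 42, 157. Both are "valid" in the purely formal sense the comment describes.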
Basically, something like this is about guessing how the test maker thinks, which is completely unrelated to the concept of AGI (i.e. the ability to provide correct answers to questions based on objectively verifiable criteria).
And if instead of AGI one is just trying to evaluate how the model predicts how the average human thinks, then it makes no sense at all to evaluate language model performance by performance on predicting colored grid transformations.
For instance, since normal LLMs are not trained on colored grids, it means that any model specifically trained on colored grid transformations as performed by humans of similar "intelligence" as the ARC-"AGI" test maker is going to outperform normal LLMs at ARC-"AGI", despite the fact that it is not really a better model in general.
They say: "ARC-AGI tasks are a series of three to five input and output tasks followed by a final task with only the input listed. Each task tests the utilization of a specific learned skill based on a minimal number of cognitive priors.
Tasks are represented as JSON lists of integers. These JSON objects can also be represented visually as a grid of colors using an ARC-AGI task viewer.
A successful submission is a pixel-perfect description (color and position) of the final task's output."
As far as I can tell, they are asking to reproduce exactly the final task's output.
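A small sketch of what that scoring rule amounts to. The task below is made up, and the JSON layout ("train" and "test" lists of input/output grid pairs, cells as small integers) is my assumption about how the public task files are organized rather than something stated in the quote.

```python
import json

# A made-up toy task; the "train"/"test" layout is an assumed file format.
TASK_JSON = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]]}
  ],
  "test": [
    {"input": [[0, 0], [0, 1]], "output": [[1, 1], [1, 0]]}
  ]
}
"""

def is_pixel_perfect(predicted, expected):
    """True only if both grids have the same shape and every cell matches."""
    return predicted == expected

task = json.loads(TASK_JSON)
test_pair = task["test"][0]

# Hypothetical solver output: swap 0 and 1 in every cell.
prediction = [[1 - cell for cell in row] for row in test_pair["input"]]
almost = [[1, 1], [1, 1]]  # differs from the expected grid in a single cell

print(is_pixel_perfect(prediction, test_pair["output"]))  # True
print(is_pixel_perfect(almost, test_pair["output"]))      # False
```

There is no partial credit in this reading: a grid that is off by one cell scores the same as a completely wrong one.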
So, how well might o1 do with Greenblatt's strategy?
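For anyone unfamiliar, my understanding of that strategy is generate-and-verify: sample many candidate programs, keep the ones that reproduce every training pair, and run a survivor on the test input. A rough sketch of that loop, with three hand-written functions standing in for LLM-sampled programs (all names and the toy task here are hypothetical, and this is not Greenblatt's actual code):

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]

# Hand-written stand-ins for candidate programs an LLM might sample.
def flip_horizontal(grid: Grid) -> Grid:
    return [row[::-1] for row in grid]

def flip_vertical(grid: Grid) -> Grid:
    return grid[::-1]

def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]

CANDIDATES: List[Program] = [flip_horizontal, flip_vertical, transpose]

def consistent_programs(
    train_pairs: List[Tuple[Grid, Grid]], candidates: List[Program]
) -> List[Program]:
    """Keep only candidates that reproduce every training output exactly."""
    kept = []
    for program in candidates:
        try:
            if all(program(inp) == out for inp, out in train_pairs):
                kept.append(program)
        except Exception:
            continue  # a crashing candidate is simply discarded
    return kept

train_pairs = [
    ([[1, 2], [3, 4]], [[3, 4], [1, 2]]),
    ([[5, 0], [0, 5]], [[0, 5], [5, 0]]),
]
survivors = consistent_programs(train_pairs, CANDIDATES)
test_input = [[7, 8], [9, 0]]

print([p.__name__ for p in survivors])  # ['flip_vertical']
print(survivors[0](test_input))         # [[9, 0], [7, 8]]
```

Note that the second training pair alone cannot tell the two flips apart; it takes both pairs to rule out `flip_horizontal`, which is why the number and variety of examples matter for this kind of filtering.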
“In summary, o1 represents a paradigm shift from "memorize the answers" to "memorize the reasoning" but is not a departure from the broader paradigm of fitting a curve to a distribution in order to boost performance by making everything in-distribution.”
“We still need new ideas for AGI.”
Compared to the difficulty in assembling the data and compute and other resources needed to train something like GPT-4-1106 (which are staggering), training an auxiliary model with a relatively straightforward, differentiable, well-behaved loss on a task like "which CoT framing is better according to human click proxy" is just not at that same scale.