Alongside Mike Knoop and François Chollet, we’re launching ARC-AGI-2, a frontier AI benchmark that measures a model’s ability to generalize on tasks it hasn’t seen before, and the ARC Prize 2025 competition to beat it.
In December 2024, ARC-AGI-1 (released in 2019) pinpointed the moment AI moved beyond pure memorization, as demonstrated by OpenAI's o3.
ARC-AGI-2 targets test-time reasoning.
My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.
Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) are <4%.
Every ARC-AGI-2 task (100%), however, has been solved by at least two humans, quickly and easily. We know this because we tested 400 people live.
Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.
Change log from ARC-AGI-1 to ARC-AGI-2:

* The two main evaluation sets (semi-private eval, private eval) have increased to 120 tasks
* Solving tasks requires more reasoning vs. pure intuition
* Each task has been confirmed to have been solved by at least 2 people (often many more) out of an average of 7 test takers, in 2 attempts or less
* Non-training task sets are now difficulty-calibrated
The 2025 Prize ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) drew 1.5K participating teams and produced 40+ research papers.
The Kaggle competition goes live later this week and you can sign up here: https://arcprize.org/competition
We're in an idea-constrained environment. The next AGI breakthrough might come from you, not a giant lab.
Happy to answer questions.
I don’t think that follows. Just because people fail to create ARC-AGI problems that are difficult for an AI to solve, doesn’t mean that said AI can just be plugged into a humanoid robot and it will now reliably cook dinner, order a pizza and drive to pick it up, take a bus to downtown to busk on the street and take the money back home, etc.
ARC-AGI is an interesting benchmark, but it’s extremely presumptive to think that these types of tests are going to demonstrate AGI.
Who said that cooking dinner couldn't be part of ARC-AGI-<N>?
There are humans who cannot perform these tasks, at least without assistive/adapted systems such as a wheelchair and accessible bus.
Which is precisely what the robotic body I mentioned would be.
You're talking about humans who have the mental capacity to do these things, but who don't control a body capable of doing them. That's the exact opposite of an AI that controls a body capable of doing these things, but lacks the mental capacity to do them.
Put the computer in a wheelchair of its choice and let it try to catch the bus. How would you compare program and human reasoning abilities while disregarding the human's ability to interact with the outside world?
Edit: ARC-AGI itself is only approachable by humans with full visual and manual abilities; everyone else needs assistive devices.
The scenarios you listed are examples of what they’re talking about. Those are tasks that humans can easily do but robots have a hard time with.
1. Public Train - 1,000 tasks that are public
2. Public Eval - 120 tasks that are public
So for those two we don't have protections.
3. Semi Private Eval - 120 tasks that are exposed to 3rd parties. We sign data agreements where we can, but we understand this is exposed and not 100% secure. It's a risk we are open to in order to keep testing velocity. In practice it is very difficult to secure this 100%. The cost to create a new semi-private test set is lower than the effort needed to secure it 100%.
4. Private Eval - Only on Kaggle, not exposed to any 3rd parties at all. Very few people have access to this. Our trust vectors are with Kaggle and the internal team only.
You have my wheels turning on how to get computers better at these. Looking forward to seeing the first computer tech that can get 30-50% on these!
The success of o3 directly contradicts us being in an "idea-constrained environment", what makes you believe that?
From ChatGPT 3.5 to o1, all LLM progress came from investment in training: either by using much more data, or by using higher-quality synthetic data.
o1 (and then o3) broke this paradigm by applying a novel idea (RL + search on CoT), and it's because of this that it was able to make progress on ARC-AGI.
So IMO the success of o3 goes in favor of the argument of how we are in an idea-constrained environment.
And going back even further, there's Goal Oriented Action Planning - an old timey video game AI technique, that's basically searching through solution space to construct a plan:
https://medium.com/@vedantchaudhari/goal-oriented-action-pla...
(besides the fact that almost all old timey AI is state space solution search)
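To make the GOAP idea concrete, here's a minimal sketch of that kind of planner: breadth-first search over world states, where each action has preconditions and effects. The actions and facts here are purely hypothetical illustrations, not from any real GOAP implementation:

```python
from collections import deque

# Each action maps name -> (preconditions, effects), both sets of facts.
# These actions are made up for illustration only.
ACTIONS = {
    "get_axe":   ({"axe_available"}, {"has_axe"}),
    "chop_wood": ({"has_axe"},       {"has_wood"}),
    "make_fire": ({"has_wood"},      {"fire_lit"}),
}

def plan(start, goal):
    """Return the shortest action sequence that reaches a state containing `goal`."""
    queue = deque([(frozenset(start), [])])
    seen = {frozenset(start)}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:          # all goal facts satisfied
            return steps
        for name, (pre, effects) in ACTIONS.items():
            if pre <= state:       # action is applicable
                nxt = state | effects
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None                    # goal unreachable

print(plan({"axe_available"}, {"fire_lit"}))
# ['get_axe', 'chop_wood', 'make_fire']
```

Real GOAP implementations typically use A* with a heuristic instead of plain BFS, but the core idea is the same: the plan is constructed by search, not stored.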
It's not a real solution because:
- It's way too expensive
- It doesn't scale the way a real solution does
This is self-referential: the benchmark pinpointed the moment AI went from memorization to problem solving because the benchmark requires problem solving to complete. And how do we know it requires problem-solving skills? Because memorization-only LLMs can't do it but humans can.
I think ARC are producing some great benchmarks, and I think they probably are pushing forward the state of the art, however I don't think they identified anything particular with o3, at least they don't seem to have proven a step change.
ARC 1 was released long before in-context learning was identified in LLMs (and designed before Transformer-based LLMs existed), so the fact that LLMs can't do ARC was never a design consideration. It just turned out this way, which confirmed our initial assumption.
I think a similar claim could be levelled against other benchmarks or LLM evaluation tasks. One could say that the Turing test was designed to assess human intelligence, and LLMs pass it, therefore LLMs have human intelligence. This is generally considered to be false now, because we can plainly see that LLMs do not have intelligence in the same way as humans (yet? debatable, not the point), and instead we concluded that the Turing test was not the right benchmark. That's not to diminish its importance, it was hugely important as a part of AI education and possibly even AI development for decades.
ARC does seem to be pushing the boundaries, I'm just not convinced that it's testing a provable step change.
"Turing did not explicitly state that the Turing test could be used as a measure of "intelligence", or any other human quality. He wanted to provide a clear and understandable alternative to the word "think", which he could then use to reply to criticisms of the possibility of "thinking machines" and to suggest ways that research might move forward."
That's in no way different than claiming that LLMs understand language, or reason, etc, because they were designed that way.
Neural nets of all sorts have been beating benchmarks since forever, e.g. there's a ton of language understanding benchmarks pretty much all saturated by now (GLUE, SUPERGLUE ULTRASUPERAWESOMEGLUE ... OK I made that last one up) but passing them means nothing about the ability of neural net-based systems to understand language, regardless of how much their authors designed them to test language understanding.
Failing a benchmark also doesn't mean anything. A few years ago, at the first Kaggle competition, the entries were ad-hoc and amateurish. The first time a well-resourced team tried ARC (OpenAI) they ran roughshod over it and now you have to make a new one.
At some point you have to face the music: ARC is just another benchmark, destined to be beaten in good time whenever anyone makes a concentrated effort at it, and it will still prove nothing about intelligence, natural or artificial.
> passing them means nothing about the ability of neural net-based systems to understand language, regardless of how much their authors designed them to test language understanding.
Does this implicitly suggest that it is impossible to quantitatively assess a system’s ability to understand language? (Using the term “system” in the broadest possible sense)
Not agreeing or disagreeing or asking with skepticism. Genuinely asking what your position is here, since it seems like your comment eventually leads to the conclusion that it is unknowable whether a system external to yourself understands language, or, if it is possible, then only in a purely qualitative way, or perhaps purely in a Stewart-style-pornographic-threshold-test - you’ll know it when you see it.
I don’t have any problem if that’s your position- it might even be mine! I’m more or less of the mindset that debating whether artificial systems can have certain labels attached to them revolving around words like “understanding,” “cognition,” “sentience” etc is generally unhelpful, and it’s much more interesting to just talk about what the actual practical capabilities and functionalities of such systems are on the one hand in a very concrete, observable, hopefully quantitative sense, and how it feels to interact with them in a purely qualitative sense on the other hand. Benchmarks can be useful in the former but not the latter.
Just curious where you fall. How would you recommend we approach the desire to understand whether such systems can “understand language” or “solve problems” etc etc… or are these questions useless in your view? Or only useful in as much as they (the benchmarks/tests etc) drive the development of new methodologies/innovations/measurable capabilities, but not in assigning qualitative properties to said systems?
By the time OpenAI attempted ARC in 2024, a colossal amount of resources had already been expended trying to beat the benchmark. The OpenAI run itself cost several million dollars in inference compute alone.
ARC was the only benchmark that highlighted o3 as having qualitatively different abilities compared to all models that came before. o3 is a case of a good approach meeting an appropriate benchmark, rather than an effort to beat ARC specifically.
I wonder if this can be shown to be a valid IQ test, and if so, what IQ would a person need to solve e.g. 90% of them in 1 or 2 tries.
To be fair I've spent a lot of time thinking about cellular automata and Conway's game of life, which definitely seems to be influencing the design of these puzzles.
Congrats on the launch, let's see how long it'll take to get saturated.
I'd encourage you to review the definition of "brute force", and then consider the absolutely immense combinatoric space represented by the grids these puzzles use.
"Brute force" simply cannot touch these puzzles. An amount of understanding and pattern recognition is strictly required, even with the large quantities of test-time compute that were used against arc-agi-1.
Now, I wonder what surprises are to be found in the full dataset.
The focus on solving discrete tasks cost-efficiently might actually lead us toward deep learning systems that can be used reliably in production, instead of ones that just produce a wow effect or need constant supervision.
Our human ability to abstract things is underrated.
These benchmarks, and specifically the constraints placed on solving them (compute etc) seem to me to incentivize the opposite of "general intelligence"
Have any of the technical contributions used to win the past competition been used to advance general AI in any way?
We have transformer based systems constantly gaining capabilities. On the other hand have any of the Kaggle submissions actually advanced the field in any way outside of the ARC Challenge?
To me (a complete outsider, admittedly) the ARC prize seems like an operationalization of the bitter lesson
We had 40 papers submitted last year and 8 were awarded prizes. [1]
One of the main teams, MindsAI, just published their paper on their novel test-time fine-tuning approach. [2]
Jan/Daniel (1st place winners last year) talk all about their progress and journey building out here [3]. Stories like theirs help push the field forward.
[1] https://arcprize.org/blog/arc-prize-2024-winners-technical-r...
[2] https://github.com/MohamedOsman1998/deep-learning-for-arc/bl...
Defining the reward function, which is basically what ARC is doing, is 50% of the problem solving process.
Reasoner passed on first try.
“Correct!”
(See screenshot that shows one rated “hard” -- https://www.linkedin.com/posts/waynechang_tried-reasoner-on-...)
All ARC tasks are built entirely on top of "Core Knowledge" priors, the kind of elementary knowledge that a small child has already mastered and that is possessed universally by all humans.
Or let me ask differently. Can we still design text questions that are easy for humans and tough for AI?