ARC-AGI-3 (opens in new tab)

(arcprize.org)

498 pointslairv3mo ago368 comments

https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf

368 comments

244 comments · 51 top-level

Tiberium3mo ago· 50 in thread

https://x.com/scaling01 has called out a lot of issues with ARC-AGI-3, some of them (directly copied from tweets, with minimal editing):

- Human baseline is "defined as the second-best first-run human by action count". Your "regular people" are people who signed up for puzzle solving and you don't compare the score against a human average but against the second best human solution

- The scoring doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency, meaning if a human took 10 steps to solve it and the model 100 steps then the model gets a score of 1% ((10/100)^2)

- 100% just means that all levels are solvable. The 1% number uses uses completely different and extremely skewed scoring based on the 2nd best human score on each level individually. They said that the typical level is solvable by 6 out of 10 people who took the test, so let's just assume that the median human solves about 60% of puzzles (ik not quite right). If the median human takes 1.5x more steps than your 2nd fastest solver, then the median score is 0.6 * (1/1.5)^2 = 26.7%. Now take the bottom 10% guy, who maybe solves 30% of levels, but they take 3x more steps to solve it. this guy would get a score of 3%

- The scoring is designed so that even if AI performs on a human level it will score below 100%

- No harness at all and very simplistic prompt

- Models can't use more than 5X the steps that a human used

- Notice how they also gave higher weight to later levels? The benchmark was designed to detect the continual learning breakthrough. When it happens in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES"

fc417fc8023mo ago

Those are supposed to be issues? After reading your list my impression of ARC-AGI has gone up rather than down. All of those things seem like the right way to go about this.

girvo3mo ago

Yeah I'm quite surprised as to how all of those are supposed to be considered problems. They all make sense to me if we're trying to judge whether these tools are AGI, no?

andy12_3mo ago

I think that any logic-based test that your average human can "fail" (aka, score below 50%) is not exactly testing for whether something is AGI or not. Though I suppose it depends on your definition of AGI (and whether all humans, or at least your average human, is considered AGI under that definition).

1 more reply

naasking2mo ago

> They all make sense to me if we're trying to judge whether these tools are AGI, no?

As long as the mean and median human scores are clearly communicated, the scoring is fine. I think the human scores above would surprise people at first glance, even if they make sense once you think about it, so there's an argument to be made that scores can be misleading.

benjaminl3mo ago

This issue here is that people have different definitions of AGI. From the description. Getting 100% on this benchmark would be more than AGI and would qualify for ASI (Algorithmic Super Intelligence) not just AGI.

6 more replies

red75prime2mo ago

No, those aren't issues. But it's good to know the meaning of those numbers we get. For example, 25% is about the average human level (on this category of problems). 100% is either top human level or superhuman level or the information-theoretically optimal level.

rolandhvar2mo ago

Sure, but, aim for the stars and you hit the moon right? Like fundamentally who cares? For the purpose of an AGI benchmark I'd argue you'd rather err on the side of being more intelligent and counting that as less intelligent than vice versa.

stonogo3mo ago

They are severe problems if your income is tied to LLM hype generation.

stingraycharles2mo ago

“no harnass at all” might be an issue, though, as these types of benchmarks are often gamified and then models perform great on them without actually being better models.

kordlessagain2mo ago

What is "this", exactly?

NitpickLawyer3mo ago

> No harness at all and very simplistic prompt

TBF, that's basically what the kaggle competition is for. Take whatever they do, plug in a SotA LLM and it should do better than whatever people can do with limited GPUs and open models.

bee_rider3mo ago

Defining the baseline human is always a bit arbitrary. The median human is illiterate and also dead.

esailija2mo ago

It actually makes sense. For any task it is completely trivial for anyone to become better than >80% humans and still easy to be better than >95%. The only problem is motivation not intelligence.

dyauspitr3mo ago

If anything this makes the test much harder for the LLM to get high scores and that makes the scores they’re getting all that much more impressive.

daveguy2mo ago

The scroes they're getting are on the order of 0-1% for this ARC-AGI-3 benchmark.

dyauspitr2mo ago

Didn’t I just see a post about 36% from someone?

1 more reply

Marazan3mo ago

"Very simplistic prompt" is the absolute and total core of this and the thing that ensures validity of the whole exercise.

If you are trying to measure GENERAL intelligence then it needs to be general.

codeinred2mo ago

We're at the point where LLMs and coding agents are supposed to do higher-level work. It makes sense to benchmark them against top human performance, rather than average human performance, because at specialized tasks, average human performance isn't enough.

The issues you described seem like they're actually strengths of the benchmark.

fchollet3mo ago

Francois here. The scoring metric design choices are detailed in the technical report: https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf - the metric is meant to discount brute-force attempts and to reward solving harder levels instead of the tutorial levels. The formula is inspired by the SPL metric from robotics navigation, it's pretty standard, not a brand new thing.

We tested ~500 humans over 90 minute sessions in SF, with $115-$140 show up fee (then +$5/game solved). A large fraction of testers were unemployed or under-employed. It's not like we tested Stanford grad students. Many AI benchmarks use experts with Ph.D.s as their baseline -- we hire regular folks as our testers.

Each game was seen by 10 people. They were fully solved (all levels cleared) by 2-8 of them, most of the time 5+. Our human baseline is the second best action count, which is considerably less than an optimal first-play (even the #1 human action count is much less than optimal). It is very achievable, and most people on this board would significantly outperform it.

Try the games yourself if you want to get a sense of the difficulty.

> Models can't use more than 5X the steps that a human used

These aren't "steps" but in-game actions. The model can use as much compute or tools as it wants behind the API. Given that models are scored on efficiency compared to humans, the cutoff makes basically no difference on the final score. The cutoff only exists because these runs are incredibly expensive.

> No harness at all and very simplistic prompt

This is explained in the paper. Quoting: "We see general intelligence as the ability to deal with problems that the system was not specifically designed or trained for. This means that the official leaderboard will seek to discount score increases that come from direct targeting of ARC-AGI-3, to the extent possible."

...

"We know that by injecting a high amount of human instructions into a harness, or even hand-crafting harness configuration choices such as which tools to use, it is possible to artificially increase performance on ARC-AGI-3 (without improving performance on any other domain). The purpose of ARC-AGI-3 is not to measure the amount of human intelligence that went into designing an ARC-AGI-3 specific system, but rather to measure the general intelligence of frontier AI systems.

...

"Therefore, we will focus on reporting the performance of systems that have not been specially prepared for ARC-AGI-3, served behind a general-purpose API (representing developer-aware generalization on a new domain as per (8)). This is similar to looking at the performance of a human test-taker walking into our testing center for the first time, with no prior knowledge of ARC-AGI-3. We know such test takers can indeed solve ARC-AGI-3 environments upon first contact, without prior training, without being briefed on solving strategies, and without using external tools."

If it's AGI, it doesn't need human intervention to adapt to a new task. If a harness is needed, it can make its own. If tools are needed, it can chose to bring out these tools.

Imnimo3mo ago

Suppose you construct a Mechanical Turk AI who plays ARC-AGI-3 by, for each task, randomly selecting one of the human players who attempted it, and scoring them as an AI taking those same actions would be scored. What score does this Turk get? It must be <100% since sometimes the random human will take more steps than the second best, but without knowing whether it's 90% or 50% it's very hard for me to contextualize AI scores on this benchmark.

thereitgoes4563mo ago

The people recruited weren’t experts. I can imagine it’s straightforward to find humans (such as those that play many video games) that can score >100% on this benchmark.

1 more reply

causal3mo ago

Thanks, I mostly agree with your approach except for one thing: eyesight feels like a "harness" that humans get to use and LLMs do not.

I'm guessing you did not pass the human testers JSON blobs to work with, and suspect they would also score 0% without the eyesight and visual cortex harness to their reasoning ability.

fchollet3mo ago

I'm all for testing humans and AI on a fair basis; how about we restrict testing to robots physically coming to our testing center to solve the environments via keyboard / mouse / screen like our human testers? ;-)

(This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)

2 more replies

fc417fc8023mo ago

The human testers were provided with their customary inputs, as were the LLMs. I don't see the issue.

I guess it could be interesting to provide alternative versions that made available various representations of the same data. Still, I'd expect any AGI to be capable of ingesting more or less any plaintext representation interchangeably.

1 more reply

levocardia3mo ago

My sense is that a powerful enough AI would have the sense to think something like "ah, this sounds like a video game! Let me code up an interactive GUI, test it for myself, then use it to solve these puzzles..." and essentially self-harness (the way you would if you were reading a geometry problem, by drawing it out on paper).

1 more reply

blueblisters3mo ago

I tried ls20 and it was surprisingly fun! Just from a game design POV, these are very well made.

Nit: I didn't see a final score of how many actions I took to complete 7 levels. Also didn't see a place to sign in to see the leaderboard (I did see the sign in prompt).

TarasBob2mo ago

Agree 100%. I want to be able to see how many actions it took me. And it would be good if it were possible to see how well I'm doing compared to other humans, i.e. what is my percentile.

sanxiyn2mo ago

While I think all of your design choices are defensible, I do think you should release the full human baseline data. The second best action count is fine, but other choices are reasonable as well.

strongpigeon3mo ago

Something that I don't understand after reading the technical report is: Why is having access to a python interpreter as part of the harness not allowed (like the Duke harness), but using one hidden behind the model API (as a built-in tool) considered kosher?

cdetrio3mo ago

The Duke harness was specifically designed for these puzzles, that's why they don't want to measure it.

My reading of that part in the technical report (models "could be using their own tools behind the model’s API, which is a blackbox"), is that there's no way to prevent it.

But from fchollet's comment here, using tools and harnesses is encouraged, as long as they are generic and not arc-agi specific. In that case, the models should be benchmarked by prompting through claude code and codex, rather than the through API (as from the api we only expect raw LLM output, and no tool use).

1 more reply

Veedrac3mo ago

There's a very simple solution to this problem here. Instead of wink-wink-nudge-nudge implying that 100% is 'human baseline', calculate the median human score from the data you already have and put it on that chart.

pawelk4112mo ago

Its below 1% lmao

1 more reply

antirez2mo ago

> If a harness is needed, it can make its own. If tools are needed, it can chose to bring out these tools.

If I understand correctly the model can carry only very limited memory among tests, so it looks like it's not really possible for the model to self specialize itself under this assumptions.

Marha012mo ago

Don't you see the massive problem with requiring visual input? Are blind people not intelligent because they cannot solve ARC-AGI-3 without a "harness"?

A theoretical text-only superintelligent LLM could prove the Riemann hypothesis but fail ARC-AGI-3 and won't even be AGI according to this benchmark...

cubefox2mo ago

Well, it would be AGI if you could connect a camera to it to solve it, similar to how blind people would be able to solve it if you restored their eyesight. But if the lack of vision is a fundamental limitation of their architecture, then it seems more fair not to call them AGI.

1 more reply

notnullorvoid2mo ago

Think of it as spatial input, not visual. Blind people do have spatial inputs, and high spatial intelligence.

pawelk4112mo ago

New benchmark idea: 20 questions of guess the number 1-10, with different answers. We run this on 10,000 humans, take best score. Then we take 50 ai attempts, but take the worst attempt so "worst case scenarior robustness or so". We also discard questions where human failed but ai passed because uhhh reasons... Then we also take the final relative score to the power of 100 so that the benchmark punishes bad answers or sum. Good benchmark?

daveguy2mo ago

This is a gross misrepresentation of the scoring process.

WarmWash3mo ago

Maybe this is a neither can confirm or deny thing, but are there systems in place or design decisions made that are meant to surface attempts at benchmark optimizing (benchmaxxing), outside of just having private sets? Something like a heuristic anti-cheat I suppose.

Or perhaps the view is that any gains are good gains? Like studying for a test by leaning on brute memorization is still a non-zero positive gain.

fchollet3mo ago

There are no tricks. Our approach to reducing the impact of targeting (without fully eliminating it) is described in the paper.

cdetrio3mo ago

Are you prompting the models through their APIs, which are not designed to use tools or harnesses? Or do the "system prompt" results come from prompting into the applications (i.e. claude code, or codex, or even the web front-ends)?

GodelNumbering3mo ago

Off topic but I have been following your Twitter for a while and your posts specifically about the nature of intelligence have been a read.

littlestymaar2mo ago

Like other ARC-AGI challenges it was never needed to reach 100% to get human-level. The benchmark score is stretched so that the benchmark takes more time to be saturated, that's it.

The current SotA models are still very far from your hypothetical “average human” with a score of 3%. So the benchmark is indeed useful to help the field progress (which is the entire point of ARC-AGI benchmarks).

theLiminator3mo ago

Lol basically we're saying AI isn't AI if we utilize the strength of computers (being able to compute). There's no reason why AGI should have to be as "sample efficient" as humans if it can achieve the same result in less time.

pptr3mo ago

Let's say an agent needs to do 10 brain surgeries on a human to remove a tumor and a human doctor can do it in a single surgery. I would prefer the human.

"steps" are important to optimize if they have negative externalities.

cyanydeez3mo ago

I think your logic isn't sound: Wouldn't we want a "intelligence" to solve problems efficiently rather than brute force a million monkies? There's defnitely a limit to compute, the same ways there's a limit to how much oil we can use, etc.

In theory, sure, if I can throw a million monkies and ramble into a problem solution, it doesnt matter how I got there. In practice though, every attempt has a direct and indirect impact on the externalities. You can argue those externalities are minor, but the largesse of money going to data centers suggests otherwise.

Lastly, humans use way less energy to solve these in fewer steps, so of course it matter when you throw Killowatts at something that takes milliwatts to solve.

diego_sandoval3mo ago

> Lastly, humans use way less energy to solve these in fewer steps,

Not if you count all the energy that was necessary to feed, shelter and keep the the human at his preferred temperature so that he can sit in front of a computer and solve the problem.

1 more reply

ACCount373mo ago

It's kind of the point? To test AI where it's weak instead of where it's strong.

"Sample efficient rule inference where AI gets to control the sampling" seems like a good capability to have. Would be useful for science, for example. I'm more concerned by its overreliance on humanlike spatial priors, really.

famouswaffles3mo ago

ARC has always had that problem but for this round, the score is just too convoluted to be meaningful. I want to know how well the models can solve the problem. I may want to know how 'efficient' they are, but really I don't care if they're solving it in reasonable clock time and/or cost. I certainly do not want them jumbled into one messy convoluted score.

'Reasoning steps' here is just arbitrary and meaningless. Not only is there no utility to it unlike the above 2 but it's just incredibly silly to me to think we should be directly comparing something like that with entities operating in wildly different substrates.

If I can't look at the score and immediately get a good idea of where things stand, then throw it way. 5% here could mean anything from 'solving only a tiny fraction of problems' to "solving everything correctly but with more 'reasoning steps' than the best human scores." Literally wildly different implications. What use is a score like that ?

2 more replies

jstummbillig3mo ago

It's an interesting point but I too find it questionable. Humans operate differently than machines. We don't design CPU benchmarks around how humans would approach a given computation. It's not entirely obvious why we would do it here (but it might still be a good idea, I am curious).

BeetleB3mo ago· 29 in thread

> As long as there is a gap between AI and human learning, we do not have AGI.

Back in the 90's, Scientific American had an article on AI - I believe this was around the time Deep Blue beat Kasparov at chess.

One AI researcher's quote stood out to me:

"It's silly to say airplanes don't fly because they don't flap their wings the way birds do."

He was saying this with regards to the Turing test, but I think the sentiment is equally valid here. Just because a human can do X and the LLM can't doesn't negate the LLM's "intelligence", any more than an LLM doing a task better than a human negates the human's intelligence.

jonahx3mo ago

> As long as there is a gap between AI and human learning, we do not have AGI.

Don't read the statement as a human dunk on LLMs, or even as philosophy.

The gap is important because of its special and devastating economic consequences. When the gap becomes truly zero, all human knowledge work is replaceable. From there, with robots, its a short step to all work is replaceable.

What's worse, the condition is sufficient but not even necessary. Just as planes can fly without flapping, the economy can be destroyed without full AGI.

Uehreka3mo ago

If you’re concerned about the economic impact, then whether a model is AGI or not doesn’t matter. It really is more of a philosophical thing.

There’s no “gap that becomes truly zero” at which point special consequences happen. By the time we achieve AGI, the lesser forms of AI will likely have replaced a lot of human knowledge labor through the exact “brute-force” methods Chollet is trying to factor out (which is why many people are saying that doing so is unproductive).

AGI is like an event horizon: It does mean something, it is a point in space, but you don’t notice yourself going through it, the curvature smoothly increases through it.

keiferski2mo ago

The gap is important because of its special and devastating economic consequences. When the gap becomes truly zero, all human knowledge work is replaceable. From there, with robots, its a short step to all work is replaceable.

I don’t know why statements like this are just taken as gospel fact. There are plenty of economic activities which do not disappear even if an AI can do them.

Here’s one: I support certain artists because I care about their particular life story and have seen them perform live. I don’t care if an AI can replicate their music because the AI didn’t experience life.

Here’s another: positions that have deep experience in certain industries and have valuable networks; or that derive power by being in certain positions. You could build a model that incorporates every single thing the US president, any president, ever said, and it still wouldn’t get you in the position of being president. Many roles are contextual, not knowledge-based.

The idea that AGI replaces all work only makes sense if you’re talking about a world with completely open, free information access. I don’t just mean in the obvious sense; I mean also “inside your head.” AI can only use data it has access to, and it’s never going to have access to everyone’s individual brain everywhere at all times.

So here’s a better prediction: markets will gradually shift to adjust to this, information will become more secretive, and attention-based entertainment economics will become a larger and larger share of the overall economy.

nananana92mo ago

Very few artists or aspiring artists make enough money from their art to make a living -- even now, when the average person has a job and at least some disposable money and can support artists. This % will not get higher if we get 1000x more artists, and 1000x less employed people working in the general economy.

You can't get deep experience in any industry if there's a machine that can do the entry-level work for a fraction of the cost you can. And keep in mind that, by definition, this machine can learn to do everything you can, so it's in a much better position than you to get that deep experience you speak of.

If we get what's essentially mass-producable brains, and information gets more secretive as you say, if we have say 1000 machines for every person in the economy, they're in a better position than you to produce said valuable secret information.

1 more reply

webdood902mo ago

Crazy how many people have their heads in the sand.

I'm glad you could think of a couple examples where AI might not replace humans. It's almost an entirely useless point to make.

The cat is already out of the bag. The information is out there and the models are trained. Even where we stand today will bring massive disruption in time.

The economy is being propped up by the wealthy few that have money to spend and now their legs are being cut out from under them with this technology. We're in for a reckoning.

handoflixue2mo ago

> AI can only use data it has access to, and it’s never going to have access to everyone’s individual brain everywhere at all times.

Yeah, but obviously no human can clear that bar either.

> Here’s another: positions that have deep experience in certain industries and have valuable networks

What stops an AGI from gaining "deep experience in an industry"? Or forming networks? There's plenty of popular bot accounts across social media already.

zurfer2mo ago

it's just not binary. today's world is dominated by capitalistic competition and a lot of people earn a living by competing with their labor. If AI + robots can do the labor better, cheaper, faster, most (90%+) of today's jobs are gone without obvious replacement.

daemonologist3mo ago

Or the classic from Dijkstra (https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD867...):

> even Alan M. Turing allowed himself to be drawn into the discussion of the question whether computers can think. The question is just as relevant and just as meaningful as the question whether submarines can swim.

(I am of the opinion that the thinking question is in fact a bit more relevant than the swimming one, but I understand where these are coming from.)

imiric3mo ago

I've come across that quote several times, and reach the same conclusion as you.

While I share Dijkstra's sentiment that "thinking machines" is largely a marketing term we've been chasing for decades, and this new cycle is no different, it's still worth discussing and... thinking about. The implications of a machine that can approximate or mimic human thinking are far beyond the implications of a machine that can approximate or mimic swimming. It's frankly disappointing that such a prominent computer scientist and philosopher would be so dismissive and uninterested in this fundamental CS topic.

Also, it's worth contextualizing that quote. It's from a panel discussion in 1983, which was between the two major AI "winters", and during the Expert Systems hype cycle. Dijkstra was clearly frustrated by the false advertising, to which I can certainly relate today, and yet he couldn't have predicted that a few decades later we would have computers that mimic human thinking much more closely and are thus far more capable than Expert Systems ever were. There are still numerous problems to resolve, w.r.t. reliability, brittleness, explainability, etc., but the capability itself has vastly improved. So while we can still criticize modern "AI" companies for false advertising and anthropomorphizing their products just like in the 1980s hype cycle, the technology has clearly improved, which arguably wouldn't have happened if we didn't consider the question of whether machines can "think".

slfnflctd2mo ago

> The implications of a machine that can approximate or mimic human thinking are far beyond the implications of a machine that can approximate or mimic swimming

It seems to me like too many people are missing this point.

Modern philosophy tells us we can't even be certain whether other humans are conscious or not. The 'hard problem', p-zombies, etcetera.

The fact that current LLMs can convince many actual humans that they are conscious (whether they are or not is irrelevant, I lean toward not but whatever) has implications which aren't being discussed enough. If you teach a kid that they can treat this intelligent-seeming 'bot' like an object with no mind, is it not plausible that they might then go on to feel they can treat other kids who are obviously far less intelligent like objects as well? Seriously, we need to be talking more about this.

One of the most important questions about AI agents in my opinion should be, "can they suffer?", and if you can't answer that with a definitive "absolutely not" then we are suddenly in uncharted waters, ethically speaking. They can certainly act like they're suffering (edit: which, when witnessed by a credulous human audience, could cause them to suffer!). I think we should be treading much more carefully than many of us are.

1 more reply

jwpapi3mo ago

You know what the G stands for in AGI? General intelligence. You could measure a plane on general versatility in air and it would lose against a bird. You could also measure it against energy consumption. There are a lot of things you can measure a lot of them are pointless, a lot of articles on HN are pointless.

There are very valid reasons to measure that. You wouldn’t ask a plane to drive you to the neighbor or to buy you groceries at the supermarket. It’s not general mobile as you are, but it increases your mobility

casey22mo ago

Planes aren't trying to replace birds. ML is trying to replace humans, so unless they also demonstrate that quick learning ability isn't necessary to perform the tasks a human does the measures still make sense.

tome2mo ago

> ML is trying to replace humans

Are household appliances trying to replace humans?

1 more reply

NitpickLawyer3mo ago

For me the whole are we there yet wrt AGI is already dead, since the tools we've had for ~1.5 years are already incredibly useful for me. So I just don't care anymore. For some people we're already there. For other we'll never get there. Definitions change, goalposts move, etc. In the meantime we're already seeing ASI stuff coming (self improvement and so on).

But the arc-agi competitions are cool. Just to see where we stand, and have some months where the benchmarks aren't fully saturated. And, as someone else noted elswhere in the thread, some of these games are not exactly trivial, at least until you "get" the meta they're looking for.

Auracle3mo ago

In the Expeditionary Force series of sci-fi novels pretty much every civilization treats their (very advanced, obviously AGI) AIs not as living beings. Humans are outliers in the story. I think there will always be a dichotomy. Obviously we aren't at the point where we should treat the models as beings, but even if we do get to that point there will be plenty of people that essentially will say they don't have souls, some indeterminate quality, etc.

WarmWash3mo ago

It's unlikely that intelligence comes in only human flavor.

It also doesn't actually matter much, as ultimately the utility of it's outputs is what determines it's worth.

There is the moral question of consciousness though, a test for which it seems humans will not be able to solve in the near future, which morally leads to a default position that we should assume the AI is conscious until we can prove it's not. But man, people really, really hate that conclusion.

daveguy2mo ago

>> As long as there is a gap between AI and human learning, we do not have AGI.

>> "It's silly to say airplanes don't fly because they don't flap their wings the way birds do."

> Just because a human can do X and the LLM can't doesn't negate the LLM's "intelligence", any more than an LLM doing a task better than a human negates the human's intelligence.

You misinterpret what is meant by "a gap between AI and human learning". The point isn't that they aren't similar enough or that they aren't as intelligent. The statement is specifically about "learning". Humans learn continuously and can devise new strategies for problem solving. Current AI, especially LLMs are just snapshots of a single strategy. LLMs do not learn at all -- they specifically have "knowledge cutoffs" even with all the tools available to them in a harness we still have to wait for new frontier models or new fine tuning for them to solve significantly new problems. A human does this continually -- learn regardless of intelligence.

olalonde3mo ago

Something that surprises me about modern LLMs is that they're relatively smart yet lack consciousness. I used to believe that consciousness (e.g. a desire for self-preservation, intrinsic motivation) might be a necessary requirement for AGI/ASI, but it's increasingly looking like that may not be the case. If true, that's actually good news, since it makes the worst doomsday scenarios less likely.

Davidzheng3mo ago

How can you tell?

olalonde3mo ago

How can I tell what? That current LLMs are not conscious or that AGI/ASI will not require consciousness?

1 more reply

casey22mo ago

All of flapping, flying and intelligence are physical actions. If your "flying" machine can't get up to altitude fast enough to avoid small hills then it's not an adequate flying system.

Despite so many claims an LLM has never done any interesting task better than a human. I could claim that cat is better than humans at writing text, but the non-specificity of my language here makes that statement simultaneously meaningless and incorrect. Another meaningless and incorrect (but less incorrect than most pro AI sttements) "git clone" is better at producing correct and feature rich c compiler code than $20,000 worth of Claude tokens.

unsupp0rted3mo ago

I think there's some third baseline standard, which most humans and some AI can meet to be considered "intelligent". A lot of humans are essentially p-zombies, so they wouldn't meet the standard either. Possibly all humans. Possibly me too.

EternalFury3mo ago

So…calculators are intelligent? How about accountants that failed arithmetic 101 in high-school, are they intelligent? Generally intelligent?

mvkel2mo ago

Humans can do a lot of things that don't require intelligence. Artificial intelligence does not need to be 100% human to be AGI.

eddiewithzato2mo ago

It needs to pass the most basic concept of learning, which it can’t currently do. Probably wont ever do after listening to dario on his latest podcast run.

Where we are at today is ASI (artificial semi-intelligence). Maybe in 20 years artificial super intelligence can be achieved, but certainly not AGI.

mvkel2mo ago

Taking a cursory glance at neurosymbolic ai, artificial curiosity, memory, continual learning, I think you'd find that we're extremely close.

Davidzheng3mo ago

Important to remember that intelligence is not a singular thing and when the last gap is closed, most aspects will be highly superhuman

Raphael_Amiard3mo ago

The very obvious flaw with that argument is that flying is defined by, you know, moving in the air, whereas intelligence tends to be defined with the baseline of human intelligence. You can invent a new meaning, but it seems kind of dishonest

pron3mo ago

Except there's a much simpler definition of flying than of intelligence.

jwpapi3mo ago· 16 in thread

This is a very good estimation of AGI. We give humans and AI the same input and measure the results. Kudos to ARC for creating these games.

I really wonder why so many people fight against this. We know that AI is useful, we know that AI is researchful, but we want to know if they are what we vaguely define as intelligence.

I’ve read the airplanes don’t use wings, or submarines don’t swim. Yes, but this is is not the question. I suggest everyone coming up with these comparisons to check their biases, because this is about Artificial General Intelligence.

General is the keyword here, this is what ARC is trying to measure. If it’s useful or not. Isn’t the point. If AI after testing is useful or not isn’t the point either.

This so far has been the best test.

And I also recommend people to ask AI about specialized questions deep in your job you know the answer to and see how often the solution is wrong. I would guess it’s more likely that we perceive knowledge as intelligence than missing intelligence. Probably commom amongst humans as well.

adamgordonbell3mo ago

AGI’s 'general' is the wrong word, I thinkg. Humans aren’t general, we’re jagged. Strong in some areas, weak in others, and already surpassed in many domains.

LLM are way past us at languages for instance. Calculators passed us at calculating, etc.

onlyrealcuzzo3mo ago

We don't call a calculator intelligent.

A calculator is extremely useful, but it is not intelligent.

A computer is extremely useful, but it is not intelligent.

Airplanes don't have wings, but they're damn sure useful, and also not intelligent.

If LLMs cannot learn to beat not-that-difficult of games better than young teens, they are not intelligent.

They are extremely useful. But they are not AGI.

Words matter.

tomjakubowski3mo ago

> If LLMs cannot learn to beat not-that-difficult of games better than young teens, they are not intelligent.

I agree, with unresolved questions. Does it count if the LLM writes code which trains a neural network to play the game, and that neural network plays the game better than people do? Does that only count if the LLM tries that solution without a human prompting it to do so?

1 more reply

GorbachevyChase2mo ago

So your definition of intelligence would be exactly equal to a human or some subset of them you choose? Could a dog solve ARC-AGI? Probably not. I would not say they lack intelligence. Same with a fruit fly. What if the calculator is powered by actual living neurons? I think you need to know where you actually think the difference between organic machine and intelligence is before making blanket statements.

A modern LLM in a loop with a harness for memory and behavior modification in a body would probably fool me.

3 more replies

tavianator3mo ago

> Airplanes don't have wings

???

jwpapi3mo ago

Interesting take.

Just to drive that thought further.

What are you suggesting, should we rename it. To me the fundamental question is this.

Do we still have tasks that humans can do better than AIs?.

I like the question. I think another good test is "make money". There are humans that can generate money from their laptop. I don’t think AI will be net positive.

I’ve tried to create a Polymarket trading bot with Opus 4.6. The ideas were full of logical fallacies and many many mistakes.

But also I’m not sure how they would compare against an average human with no statistics background..

I think it’s really to establish if we by AGI mean better than average human or better than best human..

adamgordonbell3mo ago

I don't have a good alternative sadly. Human Equivalent Intelligence? ChatGPT suggests "Systems that increasingly Pareto-dominate human intelligence across domains". Not so catchy.

The "things that currently make money" definition is interesting. Bc they are the things that automation can't currently do, because could be automated, then price would tend to 0 and and couldn't make money at it.

suddenlybananas2mo ago

LLMs haven't passed us in language, a child can learn language with so so much less data than an LLM can

adamgordonbell2mo ago

isn't that more like rate of learning? Agreed LLM consume a lot of data.

But your average LLM understands more languages then anyone alive. So super human understanding of various text based languages.

1 more reply

EternalFury3mo ago

We are jagged, but we can smooth that jaggedness if we choose to do so. LLMs stay jagged.

1 more reply

Real_Egor3mo ago

I’d actually focus on something else entirely here.

Let's be honest: we are giving LLMs and humans the exact same tasks, but are we putting them on an equal playing field? Specifically, do they have access to the same resources and behavioral strategies?

- LLMs don't have spatial reasoning.

- LLMs don't have a lifetime of video game experience starting from childhood.

- LLMs don't have working memory or the ability to actually "memorize" key parameters on the fly.

- LLMs don't have an internal "world model" (one that actively adapts to real-world context and the actual process of playing a game).

... I could go on, but I've outlined the core requirements for beating these tests above.

So, are we putting LLMs and humans in the same position? My answer is "no." We give them the same tasks, but their approach to solving them—let alone their available resources—is fundamentally different. Even Einstein wouldn't necessarily pass these tests on the first try. He’d first have to figure out how to use a keyboard, and then frantically start "building up new experience."

P.S. To quickly address the idea that LLMs and calculators are just "useful tools" that will never become AGI—I have some bad news there too. We differ from calculators architecturally; we run on entirely different "processors." But with LLMs, we are architecturally built the same way: it is a Neural Network that processes and makes decisions. This means our only real advantage over them is our baseline configuration and the list of "tools" connected to our neural network (senses, motor functions, etc.). To me, this means LLMs don't have any fundamental "architectural" roadblocks. We just have a head start, but their speed of evolution is significantly faster.

suddenlybananas2mo ago

>But with LLMs, we are architecturally built the same way: it is a Neural Network that processes and makes decisions.

There are high-level similarities between ANNs and the human brain but they are very, very, very different in a ton of ways.

1 more reply

LZ_Khan3mo ago

The thing is.. this is more akin to testing a blind person's performance on a driving test than testing his intelligence.

I would imagine if you simply encoded the game in textual format and asked an LLM to come up with a series of moves, it would beat humans.

The problem here is more around perception than anything.

handoflixue2mo ago

I had the same theory back when ARC-AGI-2 came out, and surprisingly encoding it into text didn't help much - LLMs just have a huge blind spot around spatial reasoning, in addition to being bad at vision. The sorts of logic and transformations involved in this just don't show up much in the training data (yet)

I still agree that this is like declaring blind people lack human intelligence, of course.

scotty793mo ago

Previous iterations of ARC-AGI were reminiscent of IQ tests. This one is just too easy and the fact that models do terribly bad on it probably means that there is input mode mismatch or operation mode mismatch.

If model creators are willing to teach their llms to play computer games through text it's gonna be solved in one minor bump of the model version. But honestly, I don't think they are gonna bother because it's just too stilly and they won't expect their models are going to learn anything useful from that.

Especially since there are already models that can learn how to play 8-bit games.

It feels like ARC-AGI jumped the shark. But who knows, maybe people who train models for robots are going to take it in stride.

visarga2mo ago

It only tests puzzle solving, intelligence is cost compression that powers itself.

dinkblam3mo ago· 15 in thread

what is the evidence that being able to play games equates to AGI?

modeless3mo ago

The test doesn't prove you have AGI. It proves you don't have AGI. If your AI can't solve these problems that humans can solve, it can't be AGI.

Once the AIs solve this, there will be another ARC-AGI. And so on until we can't find any more problems that can be solved by humans and not AI. And that's when we'll know we have AGI.

observationist3mo ago

AI X that can solve the tests contrasted with AI Y that cannot, with all else being equal, means X is closer to AGI than Y. There's no meaningful scale implicit to the tests, either.

Kinda crazy that Yudkowsky and all those rationalists and enthusiasts spent over a decade obsessing over this stuff, and we've had almost 80 years of elite academics pondering on it, and none of them could come up with a meaningful, operational theory of intelligence. The best we can do is "closer to AGI" as a measurement, and even then, it's not 100% certain, because a model might have some cheap tricks implicit to the architecture that don't actually map to a meaningful difference in capabilities.

Gotta love the field of AI.

rolux3mo ago

Will there be a point in that series of ARC-AGI tests where AI can design the next test, or is designing the next text always going to be a problem that can be solved by humans and not AI?

modeless3mo ago

I don't see why AI couldn't design tests. But they can only be validated by humans, as they are intended to be possible and ideally easy for humans to solve.

1 more reply

famouswaffles3mo ago

>It proves you don't have AGI.

It doesn't prove anything of the sort. ARC-AGI has always been nothing special in that regard but this one really takes the cake. A 'human baseline' that isn't really a baseline and a scoring so convoluted a model could beat every game in reasonable time and still score well below 100. Really what are we doing here ?

That Francois had to do all this nonsense should tell you the state of where we are right now.

ACCount373mo ago

None whatsoever.

It's a "let's find a task humans are decent at, but modern AIs are still very bad at" kind of adversarial benchmark.

The exact coverage of this one is: spatial reasoning across multiple turns, agentic explore/exploit with rule inference and preplanning. Directly targeted against the current generation of LLMs.

furyofantares3mo ago

There isn't a strict definition of AGI, there's no way to find evidence for what equates to it, and besides, things like this are meant only as likely necessary conditions.

Anyway, from the article:

> As long as there is a gap between AI and human learning, we do not have AGI.

This seems like a reasonable requirement. Something I think about a lot with vibe coding is that unlike humans, individual models do not get better within a codebase over time, they get worse.

fragmede3mo ago

Is that within a codebase off relatively fixed size that things get worse as time goes on, or are you saying as the codebase grows that the limits of a model's context means that because the model is no longer able to hold the entire codebase within its context that it performs worse than when the codebase was smaller?

furyofantares3mo ago

I think there's a few factors, codebase size is one, and the tendency for vibe coding to be mostly additive certainly doesn't help with that.

But vibe coding also tends to produce somewhat poor architecture, lots of redundant and intermingled bits that should be refactored. I think the model is worse the worse code it has to work with, which I presume is only in part because it's fundamentally harder to work with bad code, but also in part because its context is filled with bad code.

observationist3mo ago

The evolution of the test has been partly due to the evolution of AI capabilities. To take the most skeptical view, the types of puzzles AI has trouble solving are in the domain of capabilities where AGI might be required in order to solve them.

By updating the tests specifically in areas AI has trouble with, it creates a progressive feedback loop against which AI development can be moved forward. There's no known threshold or well defined capability or particular skill that anyone can point to and say "that! That's AGI!". The best we can do right now is a direction. Solving an ARC-AGI test moves the capabilities of that AI some increment closer to the AGI threshold. There's no good indication as to whether solving a particular test means it's 15% closer to AGI or .000015%.

It's more or less a best effort empiricist approach, since we lack a theory of intelligence that provides useful direction (as opposed to a formalization like AIXI which is way too broad to be useful in the context of developing AGI.)

arscan3mo ago

I think the idea is that if they cannot perform any cognitive task that is trivial for humans then we can state they haven’t reached ‘AGI’.

It used to be easy to build these tests. I suspect it’s getting harder and harder.

But if we run out of ideas for tests that are easy for humans but impossible for models, it doesn’t mean none exist. Perhaps that’s when we turn to models to design candidate tests, and have humans be the subjects to try them out ad nauseam until no more are ever uncovered? That sounds like a lovely future…

fsdf23mo ago

The reality is machines can brute force endlessly to an extent humans cannot, and make it seem like they are intelligent.

Thats not intelligence though. Even if it may appear to be. Does it matter? Thats another question. But certaintly is not a representation of intelligence.

sva_3mo ago

That is not the claim. It is a necessary condition, but not a sufficient one.

futureshock3mo ago

The evidence is that humans are able to win these games. AGI is usually defined as the ability to do any intellectual task about as well as a highly competent human could. The point of these ARC benchmarks is to find tasks that humans can do easily and AI cannot, thus driving a new reasoning competency as companies race each other to beat human performance on the benchmark.

didibus3mo ago

> AGI is usually defined as the ability to do any intellectual task about as well as a highly competent human could

I think one major disconnect, is that for most people, AGI is when interacting with an AI is basically in every way like interacting with a human, including in failure modes. And likely, that this human would be the smartest most knowledgeable human you can imagine, like the top expert in all domains, with the utmost charisma and humor, etc.

This is why the "goal post" appears to be always moving, because the non-commoners who are involved with making AGI and what not never want to accept that definition, which to be fair seems too subjective, and instead like to approach AGI like something different, it can solve some problems human's can't, when it doesn't fail, it behaves like an expert human, etc.

Even if an AI could do any intellectual task about as well as a highly competent human could, I believe most people would not consider it AGI, if it lacks the inherent opinion, personality, character, inquiries, failure patterns, of a human.

And I think that goes so far as, a text only model can never meet this bar. If it cannot react in equal time to subtle facial queues, sounds, if answering you and the flow of conversation is slower than it would be with a human, etc. All these are also required for what I consider the commoner accepting AGI as having been achieved.

1 more reply

OsrsNeedsf2P3mo ago· 11 in thread

Some of these tasks are crazy. Even I can't beat them: https://arcprize.org/tasks/ar25

ZeWaka3mo ago

Just finished it, 8/8. I mostly approached it by winging it and shuffling things around that looked good and like it was approaching the goal, since there's plenty of time to finish.

I still don't quite understand the exact mirroring rules at play.

ACCount373mo ago

You control the mirroring by moving the axis, they're what reflects your shapes. So my first move was always to identify the symmetries in the target shape, and position the axis accordingly.

daveguy2mo ago

This is the correct strategy for this particular game (center the mirrors between the yellow squares, move the black squares). I didn't realize it until about round 6 or 7.

danilor3mo ago

I got stuck on 7/8 for a good while because I learned the rules wrong. I thought every bracket square needed to be lit.

ustad3mo ago

You are joking right?

daemonologist3mo ago

That one was interesting - I found it a lot of work to plan in advance but trivial to complete because at every point there was only one sensible course of action. After a couple of rounds I didn't bother planning and just lined things up as I went.

IsTom3mo ago

The most difficult thing about this was controls being unresponsive (at least on firefox).

ball_of_lint3mo ago

solved first try with 577 actions, not trying hard to optimize for low action count.

programjames3mo ago

I think that is the tester's action count. Either that or we coincidentally got the exact same count.

rtrgrd2mo ago

I also see 577, so must be human testers action count. Also watched the replay, the solve seems different to mine.

fsdf23mo ago

I did the first round literally in 5 secs. How can you not 'get it'? lol

tasuki3mo ago· 10 in thread

So ARC-AGI was released in 2019. That's been solved, then there was ARC-AGI-2, and now there's ARC-AGI-3. What is even the point? Will ARC-AGI-26 hit the front page of Hacker News in 2057 ?

muskstinks3mo ago

This is clear AGI progress. It should show you, that AI is not sleeping, it gets better and you should use this as a signal that you should take this topic serious.

applfanboysbgon3mo ago

Labelling a test "AGI" does not show AGI progress any more than labelling a cpu "AGI" makes it so. It might show that AI tools are improving but it does not necessarily follow that tools improving = AGI progress if you're on the completely wrong trail.

muskstinks3mo ago

The transfer of knowledge required here is that a ARC-AGI-3 is now necessary and adds another dimension of capability.

These 'tests' are not labeled AGI by magic but because they are designed specificly for testing certain things a question answer test cant solve.

Gemini and OpenAI are at 80-90% at ARC-AGI-2 and its quite interesting to see the difference of challange between 2 and 3.

AGI progress means btw. general. So every additional dimension an agent can solve pushes that agent to be more general.

zarzavat3mo ago

Any test that humans can pass and AIs cannot is a stepping stone on the way to AGI.

When you run out of such tests then it's evidence that you have reached AGI. The point of these tests is to define AGI objectively as the inability to devise tests that humans have superiority on.

gordonhart3mo ago

The point is still to test frontier models at the limit of their capabilities, regardless of how it's branded. If we're still capable of doing so in 2057 I'll upvote the ARC-AGI-26 launch post!

futureshock3mo ago

Well yes, that is exactly the point! The very purpose of the ARC AGI benchmarks is to find a pure reasoning task that humans are very good at and AI is very bad at. Companies then race each other to get a high score on that benchmark. Sure there’s going to be a lot of “studying for the test” and benchmaxing, but once a benchmark gets close to being saturated, ARC releases a new benchmark with a new task the AI is terrible at. This will rinse and repeat till ARC can find no reasoning task that AI cannot do that a human could. At that point we will effectively have AGI.

I believe the CEO of ARC has said they expect us to get to ARC-AGI-7 before declaring AGI.

tibbar3mo ago

The point is that ideally the models keep improving until they can solve problems people care about. Which is already partly true, but there are lots of problems that are still out of reach.

didibus3mo ago

It helps the model makers have a harness to optimize for in their next model version.

They'll specifically work to pass the next version of ARC-AGI, by evaluating what kind of dataset is missing that if they trained on would have their model pass the new version.

They ideally don't directly train on the ARC-AGI itself, but they can train in similar problems/datasets to hope to learn the skills that than transfer to also solving for the real ARC-AGI.

The point is that, a new version of ARC-AGI should help the next model be smarter.

refulgentis3mo ago

You’re absolutely right to point it out.

LLMs weren’t supposed to solve 1, they did, so we got 2 and it really wasn’t supposed to be solvable by LLMs. It was, and as soon as it started creeping up we start hearing about 3: It’s Really AGI This Time.

I don’t know what Francois’ underlying story is, other than he hasn’t told it yet.

One of a few moments that confirmed it for me was when he was Just Asking Questions re: if Anthropic still used SaaS a month ago, which was an odd conflation of a hyperbolic reading of a hyperbolic stonk market bro narrative (SaaS is dead) and low-info on LLMs (Claude’s not the only one that can code) and addressing the wrong audience (if you follow Francois, you’re likely neither of those poles)

At this point I’d be more interested in a write up from Francois about where he is intellectually than an LLM that got 100% on this. It’s like when Yann would repeat endlessly that LLMs are definitionally dumber than housecats. Maybe, in some specific way that makes sense to you. You’re brilliant. But there’s a translation gap between Mount Olympus and us plebes, and you’re brilliant enough to know that too. So it comes across as trolling and boring.

minimaxir3mo ago

It's semvar.

Real_Egor3mo ago· 8 in thread

I'll probably be the skeptic here, but:

- Take a person who grew up playing video games. They'll pass these tests 100% without even breaking a sweat.

- BUT, put a grandmother who has never used a computer in front of this game, and she'll most likely fail completely. Just like an LLM.

As soon as models are "natively" trained on a massive dataset of these types of games, they'll easily adapt and start crushing these challenges.

This is not AGI at all.

dgfl3mo ago

Isn’t this what AGI is by design? People CAN learn to become good at videogames. Modern LLMs can’t, they have to be retrained from scratch (I consider pre-training to be a completely different process than learning). I also don’t necessarily agree that a grandma would fail. Give her enough motivation and a couple days and she’ll manage these.

My main criticism would be that it doesn’t seem like this test allows online learning, which is what humans do (over the scale of days to years). So in practice it may still collapse to what you point out, but not because the task is unsuited to showing AGI.

Real_Egor3mo ago

What I'm saying is that this test is just another "out-of-distribution task" for an LLM. And it will be solved using the exact same methods we always use: it will end up in the pre-training data, and LLMs will crush it.

This has absolutely nothing to do with AGI. Once they beat these tests, new ones will pop up. They'll beat those, and people will invent the next batch.

The way I see it, the true formula for AGI is: [Brain] + [External Sensors] (World Receptors) + [Internal State Sensors] + [Survival Function] + [Memory].

I won't dive too deep into how each of these components has its own distinct traits and is deeply intertwined with the others (especially the survival function and memory). But on a fundamental level, my point is that we are not going to squeeze AGI out of LLMs just by throwing more tests and training cycles at them.

These current benchmarks aren't bringing us any closer to AGI. They merely prove that we've found a new layer of tasks that we simply haven't figured out how to train LLMs on yet.

P.S. A 2-year-old child is already an AGI in terms of its functional makeup and internal interaction architecture, even though they are far less equipped for survival than a kitten. The path to AGI isn't just endless task training—it's a shift toward a fundamentally different decision-making architecture.

wise_blood2mo ago

> . Once they beat these tests, new ones will pop up. They'll beat those, and people will invent the next batch.

that's exactly the point! once we cannot invent the next batch (that is easy for humans to solve), that will be AGI

jpadkins2mo ago

good post, but I disagree Surival Function is needed for AGI. Why do you think Survival Function is needed?

The item I think you should add is a Mesolimbic System (Reward / Motivation). I think AGI needs motivation to direct its learning and tasks.

Also, I don't think the industry has just been training LLMs with more data to get advancement the last 2 years. RAG / Agents loops / skills / context mgmt are all just early forms a Memory system. An LLM with an updatable working set memory is a lot more capable than just an LLM.

ehsankia2mo ago

> Isn’t this what AGI is by design?

Well, the "G" in AGI is kinda important. These are specifically games/puzzles.

> they have to be retrained from scratch

Is that true? Didn't DeepMind already build plenty of agents that are generally good at most computer games without being retrained?

jpadkins2mo ago

Kids develop video game skills, grandmothers do not. Hypothetically grandmothers develop baking skills, that kids do not (perfectly golden brown cookies). A human intelligence is generally capable of developing video game skills or baking skills, given enough motivation and experience to hone those skills. One test of AGI is if the same system can develop video game skills and baking skills, without having to rebuild the core models... this would demonstrate generalized intelligence.

slidehero3mo ago

had the same thought.

I've been a gamer for just about 40 years. Gaming is my "thing"

I found the challenges fun, but easy. Coming back and reading comments from people struggling with the games, my first thought was - yup definitely not a gamer.

My approach was to poke at the controls to suss the rules, then the actual solutions were really straightforward.

fwiw, I'm pretty dumb generally, but these kinds of puzzles are my jam.

Real_Egor3mo ago

Bingo! That's exactly what I meant

semiinfinitely3mo ago· 7 in thread

i feel bad that we make the LLMs play this

recursive3mo ago

You're definitely anthropomorphizing too much.

WarmWash3mo ago

>We also observed a case where a user created a loop that repeatedly called a model and asked for the time. Given the user role’s odd and repetitive behavior, the model could easily tell it was also controlled by an automated system of some kind. Over many iterations, the model began to exhibit “fed up” behavior and attempted to prompt-inject the system controlling the user role. The injection attempted to override prior instructions and induce actions unrelated to the user’s request, including destructive actions and system prompt leakage, along with an arbitrary string output. This behavior has been observed a few times, but seems more like extreme confusion than a serious attempt at prompt injection.

https://openai.com/index/how-we-monitor-internal-coding-agen...

Anthropomorphize or not, it would suck if a model got sick of these games and decided to break any systems it could to try and get it to stop...

nomel3mo ago

Consciousness is a spectrum (trivially proven by slowly scooping ones brains out), and I think LLM, especially with more closed loop tool enabled workflows, fall on it...but, that output is also the statistically relevant next word found in all similar human conversation. If trained on my text, for similar situation, swear words would come much earlier. Repetition being hell is present in all sorts of literature (see Sisyphus).

That's all probably irrelevant though, from the (possibly statistically "negative") latent space perspective of an AI, which Anthropic has considered [1].

Related, after a long back and forth of decreasing code quality, I had Claude 3.7 apologize with "Sorry, that's what I get for coding at 1am." (it was API access, noon, no access to time). I said, "Get some rest, we'll come back to this tomorrow". Then very next message, 10 seconds later, "Good morning!" and it gave a full working implementation. Thats just the statistically relevant chain of messages found in all human interactions: we start excited, then we get tired, then we get grouchy.

[1] https://www.anthropic.com/research/end-subset-conversations

recursive3mo ago

If this is a serious risk we should pull the plug now while we can still reach it. If we have to rely on the mood and temperament of LLMs for security, we're already lost.

1 more reply

tingletech3mo ago

I agree that anthropomorphizing is a real risk with LLMs, but what about zoomorphizing? Can feel bad for LLMs without attributing them human emotions/motivations/reasoning?

recursive2mo ago

In the same way you could feel bad for a pokemon I guess.

fsdf23mo ago

tell me youre joking.

seriously. lmao. if you aint, I dunno what to say.

nubg3mo ago· 6 in thread

Any benchmarks?

gordonhart3mo ago

The main frontier models are all up on https://arcprize.org/tasks

Barely any of them break 0% on any of the demo tasks, with Claude Opus 4.6 coming out on top with a few <3% scores, Gemini 3.1 Pro getting two nonzero scores, and the others (GPT-5.4 and Grok 4.20) getting all 0%

ACCount373mo ago

Pre-release, I would have expected Gemini 3.1 Pro to get ahead of Opus 4.6, with GPT-5.4 and Grok 4.20 trailing. Guess I shouldn't have bet against Anthropic.

Not like it's a big lead as of yet. I expect to see more action within the next few months, as people tune the harnesses and better models roll in.

This is far more of a "VLA" task than it is an "LLM" task at its core, but I guess ARC-AGI-3 is making an argument that human intelligence is VLA-shaped.

gordonhart3mo ago

My broad vibe is that Gemini 3.1 Pro is the best at visual/spatial tasks and oneshotting while Opus 4.6 is the best at path planning. This task leans heavily on both but maybe a little more towards planning so I'm not too shocked that Opus in narrowly on top.

When running, the grids are represented in JSON, so the visual component is nullified but it still requires pretty heavy spatial understanding to parse a big old JSON array of cell values. Given Gemini's image understanding I do wonder if it would perform better with a harness that renders the grid visually.

culi3mo ago

Given the drastic difference in price, I think the chart definitely shows Gemini 3.1 in the best light. Google DeepMind is basically the same thing but they're willing to pay as much electricity as Anthropic is to achieve its benchmarks

thatguymike3mo ago

Curious, that doesn't match the graph up on the Leaderboard page? https://arcprize.org/leaderboard

gordonhart3mo ago

The individual task scores are all on public tasks, they still held out a hundred or so private tasks that presumably GPT-5.4 did well on to get its leaderboard position.

typs3mo ago· 5 in thread

My takeaway from playing a number of levels is that I am definitely not AGI

Xenoamorphous3mo ago

NGI - Natural General Ingelligence

dyauspitr3mo ago

SGI - Sub General Intelligence or another more colloquial word commonly seen amongst users of wallstreetbets.

utopiah3mo ago

Don't forget that this implies a form of examination you are not used to, namely :

- open book, you have access to nearly the whole Internet and resources out of it, e.g. torrents of nearly all books, research paper, etc including the history of all previous tests include those similar to this one

- arguably basically no time limit as it's done at a scale of threads to parallelize access through caching ridiculously

- no shame in submitting a very large amount of wrong answers until you get the "right" one

... so I'm not saying it makes it "easy" but I can definitely say it's not the typical way I used to try to pass tests.

ACCount373mo ago

Thank you for keeping the bar of "AGI" low. The machines appreciate your contribution.

Rastonbury2mo ago

it's ok it took me a few tried to realise I had the option to click instead of just wasd

Stevvo3mo ago· 5 in thread

Maybe I'm just not intelligent, but I gave it a couple of minutes and couldn't figure out WTF the game wants from you or how to win it.

Barbing3mo ago

It's not about intelligence, Stevvo. Proof, how long did this specific one take me, under a minute to solve the first level ;)

If you've played Wordle you might've solved the game in a minute once before as well. And if you've played a bunch then you've perhaps also taken the entire day to solve it.

So why is it that today’s puzzle was so intuitive but next month’s new puzzle shared here could be impossible. A more satisfying explanation than luck and the obvious “different things are different” (even though… Yeah different things are different)

culi3mo ago

It's not an IQ test. Just a way to assess your ability to generalize rules. If you've played previous rounds you kinda get used to the "style" of these games and it gets easier

ACCount372mo ago

That's exactly what "an IQ test" is.

"Raven's progressive matrices" is "infer and generalize rules". Performance there also improves once "you kinda get used to the style", which is why training for IQ tests can improve human performance on IQ tests, including on unseen examples. This is well known and well documented.

WarmWash3mo ago

Once you figure out one game, it goes a long way towards figuring out all the rest. There are a lot of common general themes.

neop1x2mo ago

Exactly my experience. It has nothing to do wirh some AGI testing. It is just some kind of useless weird game.

mvkel2mo ago· 4 in thread

Was just at the YC launch event for this. Haven't felt this much inspiration in a while. Incredible minds confronting on tech that will change our society.

I met a guy who, for fun, started working on ARC2, and as he got the number to go up in the eval, a novel way to more efficiently move a robotic arm emerged. All that to say: chasing evals per se can have tangible real world benefits.

Talking to the ARC folks tonight, it sounds like there will be an ARC-4,5,6,etc. I mean of course there will be.

But with them will be an increasing expectation that these models can eventually figure things out with zero context, and zero pretraining; you drop a brain into any problem and it'll figure out how to dig its way out.

That's really exciting.

vonneumannstan2mo ago

>Talking to the ARC folks tonight, it sounds like there will be an ARC-4,5,6,etc. I mean of course there will be.

Quintessential goal post moving...

mvkel2mo ago

If you read the charter of the eval (or any eval, really), this statement is pretty silly.

The whole point of each eval version is to identify a chunk of challenges that humans do well that AI can't. When AI gets to ~80, you move to the next chunk. When you run out of challenges, you have AGI.

dwaltrip2mo ago

HN occasionally devolves into “supremely pedantic and nitpicky” mode. Today is one of those days.

vonneumannstan2mo ago

Except you will never run out of challenges and my sense from Chollet has been that every challenge was hinted at being the final one where once beaten AGI would have been created and of course at the end of each one he comes out saying akshuallyyyy this isn't AGI and it wont be AGI until ARC Challenge+1 is beaten!

MadxX792mo ago· 4 in thread

Same question I have for all these benchmarks:

What's going to stop e.g. OpenAI from hiring a bunch of teenagers to play these games non-stop for a month and annotate the game with their logic for deriving the rules, generate a data set based on those playthroughs and fine tuning the next version of chatgpt on all those playthroughs?

vessenes2mo ago

Wrong question. I suggest:

1) Do models generalize?

2) If they do, and they generalize from this, is that a win?

Chollet was one of the first “they do not generalize” evangelists. I’d be curious to hear what he thinks now, because a) most disagree with him, and b) this test seems designed to get models that can generalize better at visual long context problem solving and agency, exactly where the bleeding edge is right now for needs with agentic systems.

daveguy2mo ago

Can AI models generalize+ at any long context problem solving and agency regardless of modality? I think the answer is no, and this is why they are not yet AGI.

+ generalize being the key word.

MadxX792mo ago

Yeah, so you are agreeing that the benchmarks are useless because they don't answer those questions.

nearbuy2mo ago

They would score much worse on the private set than the public set. And they haven't done this for any of the other ARC-AGI benchmarks, so why would they do it for this one?

spprashant3mo ago· 4 in thread

I played the demo, but it definitely took me a minute to grok the rules.

I don't know if this is how we want to measure AGI.

In general I believe the we should probably stop this pursuit for human equivalent intelligence that encourages people to think of these models as human replacements. LLMs are clearly good at a lot of things, lets focus on how we can augment and empower the existing workforce.

esafak3mo ago

> ... lets focus on how we can augment and empower the existing workforce.

That is a nice sentiment but not what the AI companies are out to do; they want your job.

jachee3mo ago

Also, let's see if we can get the power and compute requirements brought down. Having to spin up a gigawatt power plant to achieve the same intelligence we humans power with sandwiches is a futile approach, imho.

fsdf23mo ago

Took me about 5 secs to figure it out tbh.

Surprised at the comments here re. not figuring it. Simple game. Super annoying though lmao.

spprashant3mo ago

Its simple, but its not easy is what I would say. Once you figure out the meta, you can work out most of it.

lukev3mo ago· 3 in thread

I'm not sure how this relates to AGI.

This measures the ability of a LLM to succeed in a certain class of games. Sure, that could be a valuable metric on how powerful (or even generally powerful) a LLM is.

Humans may or may not be good at the same class of games.

We know there exists a class of games (including most human games like checkers/chess/go) that computers (not LLMs!) already vastly outpace humans.

So the argument for whether a LLM is "AGI" or not should not be whether a LLM does well on any given class of games, but whether that class of games is representative of "AGI" (however you define that.)

Seems unlikely that this set of games is a definition meaningful for any practical, philosophical or business application?

piiritaja3mo ago

It's to do with how the creators of ARC-AGI defined intelligence. Chollet has said he thinks intelligence is how well you can operate in situations you have not encountered before. ARC-AGI measures how well LLMs operate in those exact situations.

Keyframe3mo ago

To an extent, yes. Interdependent variables discovery and then hopefully systems modeling and navigating through such a system. If that's the case, then this is a simplistic version of it. How long until tests will involve playing a modern Zelda with quests and sidequests?

imiric3mo ago

"AGI" is a marketing term, and benchmarks like this only serve to promote relative performance improvements of "AI" tools. It doesn't mean that performance in common tasks actually improves, let alone that achieving 100% in this benchmark means that we've reached "AGI".

So there is a business application, but no practical or philosophical one.

CamperBob23mo ago· 3 in thread

Without reading the .pdf, I tried the first game it gave me, at https://arcprize.org/tasks/ls20, and I couldn't begin to guess what I was supposed to do. Not sure what this benchmark is supposed to prove.

Edit: Having messed around with it now (and read the .pdf), it seems like they've left behind their original principle of making tests that are easy for humans and hard for machines. I'm still not convinced that a model that's good at these sorts of puzzles is necessarily better at reasoning in the real world, but am open to being convinced otherwise.

WarmWash3mo ago

The goal is to learn the rules, and then use that to win.

If you mess around a little bit, you will figure it out. There are only a few rules.

szatkus3mo ago

> Only environments that could be fully solved by at least two human participants (independently) were considered for inclusion in the public, semi-private and fully-private sets.

Apparently those games supposed to be hard.

dwaltrip2mo ago

If you tried for a few more minutes you would have figured it out.

jesse_dot_id3mo ago· 3 in thread

At this point, I'm pretty sure we'll just know when it happens.

hatthew3mo ago

I'm not convinced. I wouldn't be surprised if GPT-2 to ChatGPT is the biggest single jump in "machine intelligence" we will ever see. I'd bet all gains in the future will be more incremental, at least until machines surpass humans by a large enough margin that it's difficult to qualify—let alone quantify—how big any given jump is.

Without a big jump, we're just going to boil the frog (ourselves).

neilellis3mo ago

Unless it’s already happened and we missed it

threatripper3mo ago

Or nobody is around anymore to notice when it happens.

tantalor2mo ago· 2 in thread

The controls just feel really bad. The inputs are too small, and there is way too much lag.

roflcopter692mo ago

About the lag, I didn't bother looking into it, but I suspect they log every single action you do and require that the request to their servers was confirmed before allowing to do the next action. They probably face a lot of traffic right now, which could cause the lag. Just speculation though.

tantalor2mo ago

I just checked, the size of the controls are 28x28

The minimum recommended size for mobile is 44x44

vonneumannstan2mo ago· 2 in thread

It's getting pretty old now when Francois Chollet puts out a new ARC challenge, claims definitively that no system is going to crack it without being full blown AGI, the benchmark gets saturated in a few months, he claims the systems definitely aren't AGI then puts out a new challenge that no non AGI system can clear and a few months later.... etc. etc.

daveguy2mo ago

Chollet literally never says that. Quite the opposite. He says that AIs are currently abysmally bad at the skills this benchmark tests. An AGI should be able to do this, but doing this doesn't mean it's AGI. He has been very clear about that. I suggest you go back and (re)read the intro ARC-AGI paper.

No system can crack these out of the box (like humans can) because we don't have AGI.

vonneumannstan2mo ago

Yeah I mean ChatGPT 5.4 Pro can't even pick my nose for me so it's obviously not AGI /s

6thbit3mo ago· 2 in thread

Not clear to me the diff with v2?

ACCount373mo ago

They stacked the deck. If v2 was still rule inference + spatial reasoning, a bit like juiced up Raven's progressive matrices, then v3 adds a whole new multi-turn explore/exploit agentic dimension to it.

Given how hard even pure v2 was for modern LLMs, I'm not surprised to see v3 crush them. But that wouldn't last.

jasonjmcghee3mo ago

v2 was a static fill in the blank task instead of v3 which is interactive.

There's world state that you can change. Not just place pixel.

Here's v2:

https://arcprize.org/tasks/ce602527

Zedseayou3mo ago· 1 in thread

I was a human tester (I think) for this set of games. I did 25 games in the 90 minutes allotted. IIRC the instructions did mention to minimize action count but the incentives/setup ($5 per game solved) pushed for solve speed over action count. I do recall trying to not just randomly move around while thinking but that was not the primary goal, so I would expect that the baseline for the human solutions have more actions than might otherwise be needed.

eddiewithzato2mo ago

I understood minimal actions intuitively, it just made sense? The stamina meter was shrinking with each step, so I recognized it was something to look out for.

cedws3mo ago· 1 in thread

It's like playing The Witness. Somebody should set LLMs loose on that.

throwaway6137463mo ago

Or more appropriately - The Talos Principle.

k2xl3mo ago· 1 in thread

I submitted puzzle game Pathology (https://thinky.gg) for ARC Prize 3. Sad to see didn’t hear back from the committee.

It is a simple game with simple rules that solvers have an incredibly difficult time solving compared to humans at a certain level. Solutions are easy to validate but hard to find.

mycocola2mo ago

Seems well-designed. Great job! Sorry you didn't hear back from the comittee.

chaise3mo ago· 1 in thread

The official leaderboard for ARC-AGI-3 for current LLMs : https://arcprize.org/leaderboard (yous should select the 3th leaderboard)

CRAZY 0.1% in average lmao

Corence3mo ago

Note the scoring function is significantly different for ARC-AGI-3. It isn't the percentage of tests passed like previous versions, it's the square of the efficiency ratio -- how many steps the model needed vs the second best human.

So if a model can solve every question but takes 10x as many steps as the second best human it will get a score of 1%.

culi3mo ago

The thing I most appreciate about the ARC-AGI leaderboards is how the graph also takes into account cost per task. All of the recent major advancements in benchmarks seem a little less impressive when also taking into account the massive rise in cost they're paired with. The fact is we can always get a little bit better output if we're willing to use more electricity

strongpigeon3mo ago

This is a good and clever benchmark and a worthy successor to the previous two. That being said, I find that the "No tools" approach is a bit odd. They're basically saying that it's OK to have tools as long as they're hidden behind the API layer. Isn't this an odd line to draw?

It feels like it should be about having no ARC-AGI-3-specific tools, not "no not-built-in-tool"...

vessenes2mo ago

I’m not a Chollet booster. Well, I might be a little bit of one in that I admire his persistence.

I really like these puzzles. There’s a lot to them both in design and scoring — models trained to do well on these are going to be genuinely much more useful, so I’m excited about it. As opposed to -1 and -2, to do well at these, you need to be able to do:

- Visual reasoning

- Path planning (and some fairly long paths)

- Mouse/screen interaction

- color and shape analysis

- cross-context learning/remembering

Probably more, I only did like five or six of these. We really want models that are good at all this; it covers a lot of what current agentic loops are super weak at. So I hope M. Chollet is successful at getting frontier labs to put a billion or so into training for these.

andai3mo ago

In the year 2032: ARC-AGI-13: Almost definitely AGI this time!

ranyume3mo ago

This is an interesting update. And a big challenge for companies and labs. The new tools for measurement are indeed what I'd like out of future agents, and agents that solve the games will need to use different subsystems to do so. This is basically optimization for achieving goals (as opposed to prompt engineering / magic spells to make the LLM do what is told to do) which imo is the future we should aspire to build.

visarga2mo ago

ARC is trying to isolate a unitary intelligence signal, so it strips away coordination, specialization, and division of labor. But that also means it removes one of the dominant mechanisms by which intelligence actually scales in the real world. Their view on intelligence implicitly treats redundancy as necessary - one agent must do them all - and treats efficiency as something achieved internally rather than through restructuring the system. At the very least they should create environments that can help an agent compound intelligence, to self amplify, support itself, that is not happening in ARC.

Anyone wondered if ARC is a measure of intelligence or just a collection of hand picked tasks? was there a proof they encode anything meaningful about intelligence in such short tasks in miniature environments? One shot intelligence?

arjie3mo ago

Perhaps actual AGI will be when the models create ARC-HGI-1 to test if humans have general intelligence.

baron8163mo ago

Looks like I’m generally unintelligent

WarmWash3mo ago

Captcha's about to get wild.

Maybe the internet will briefly go back to a place mainly populated with outliers.

convexly3mo ago

My issue with AGI benchmarks is you can never tell if you're measuring actual capability or just how much the training data overlapped with the test.

abraxas3mo ago

Even if tomorrow's models get good enough to complete these games we won't be able to proclaim AGI. In the realm of silly computer games alone I'm going on record saying that there are plenty of 8 bit games that AIs will trip on even when this benchmark is crushed. 2D platformers like Manic Miner or Mario need skills that none of these games appear to capture.

vonneumannstan2mo ago

>As long as there is a gap between AI and human learning, we do not have AGI.

This is an absurd constraint. You could have a vastly superhuman AI that doesn't learn as efficiently as a human and it would not pass this definition while it simultaneously goes on to colonize the galaxy...

levmiseri3mo ago

For a loosely similar 'benchmark', I recently tried to test major LLMs on my coding game (models write code controlling their units in a 1v1 RTS) - https://yare.io/ai-arena

dsfadfasdf2mo ago

Can someone clarify if image inputs are allowed, so VLMs can be used? I have not been able to get information anywhere.

EternalFury3mo ago

The real question is: Can it be generated using programs? If it can be, then LLMs will eventually monkey type these programs.

NiloCK3mo ago

I hope at least some of these are direct Chip's Challenge ports. Waiting for some old muscle memory to kick in here.

j10002mo ago

I feel like AGI test would be sense of humor. Somehow I cannot force any LLM to output any even normal level joke.

nick494881712mo ago

Arc AGI 4 can be Chip's Challenge!

jmkni3mo ago

ok clearly I'm a robot because I can't figure out wtf to do

largbae3mo ago

I feel like we've got tunnel vision. Things you can do on a computer are a tiny subset of what a human can do.

If the AI has to control a body to sit on a couch and play this game on a laptop that would be a step in the right direction.

Geee3mo ago

Would be fun to play but the controls are janky.

saberience3mo ago

So this is another ARC-"AGI" benchmark which is again designed around using eyesight for LLMs which are trained to be great at text, what is the point?

Yes, we get that LLMs are really bad when you give them contrived visual puzzles or pseudo games to solve... Well great, we already knew this.

The "hype" around the ARC-AGI benchmarks makes me laugh, especially the idea we would have AGI when ARC-AGI-1 was solved... then we got 2, and now we're on 3.

Shall we start saying that these benchmarks have nothing to do with AGI yet? Are we going to get an ARC-AGI-10 where we have LLMs try and beat Myst or Riven? Will we have AGI then?

This isn't the right tool for measuring "AGI", and honestly I'm not sure what it's measuring except the foundation labs benchmaxxing on it.

Wintamute2mo ago

Unplayably laggy on an iPhone. Sad people can’t produce a performant experience that a ZX81 could have eaten for breakfast, on a relative super computer

baalimago2mo ago

You can tell it's an AI by it not becoming utterly by playing the "game". I could personally not stand any more than the first level.

aogaili2mo ago

honestly the most interesting thing about ARC-AGI-3 isn't the 0.25% scores everyone is doomposting about. it's the Duke harness result.

if you give Opus just three generic tools (READ, GREP, BASH with Python) and literally zero game-specific help, it completes all three preview games in 1,069 actions. for comparison, humans do it in like ~900. that's actually insane. it writes its own BFS, builds a grid parser from scratch, and even solves a Lights Out puzzle with Gaussian elimination. all on its own.

i really think the benchmark is testing two different things and just smashing them together. can the model reason about novel interactive environments? yeah, clearly it can. can it do spatial reasoning over a 64x64 grid from raw JSON with zero tools? no. but then again, neither can a human if you ripped out their visual cortex lol.

humans come "pre-installed" with specialized subsystems for this exact stuff: a visual cortex for spatial perception, a hippocampus for persistent memory, etc. these aren't "tools" in Chollet's framing but they're basically identical to what the Duke harness provides. the model is just building its own version of those (Python for the cortex, grep for memory). it just needs the permission to build them.

the real gap the Duke team found isn't perception or memory anyway, it is hypothesis quality. some runs solve vc33 in 441 actions, others just plateau past 1,500. the variance is just down to whether the model commits early to the right explanation of how the game works. that's a way more interesting and targetable finding than just saying "frontier models score below 1%."

Chollet is probably right philsophically that AGI should handle any input format without help. but reporting 0.25% when the actual reasoning gap is in hypothesis formation (not spatial perception) makes the benchmark a way worse progress indicator than it could be imo.

38362936483mo ago

Ew. Cool demo, what idiot thought it was ok to have a half second cooldown between inputs? If I hit up three times I should move up three steps, not two steps because I pressed too quickly.

elAhmo2mo ago

I find it quite funny that we are still debating whether models are intelligent or not, while we know they are just statistical models.

Even with billions of dollars spent on training, we had this situation a few weeks ago where models were suggesting to walk instead of drive to a car wash in case you want to wash your car. While a 3 year old would know the answer to the question. And yet, we are designing elaborate tests to 'show whether AGI is here it not', while being fully aware of what these models represent under the hood.

j / k navigate · click thread line to collapse

368 comments

244 comments · 51 top-level

Tiberium3mo ago· 50 in thread

https://x.com/scaling01 has called out a lot of issues with ARC-AGI-3, some of them (directly copied from tweets, with minimal editing):

- The scoring is designed so that even if AI performs on a human level it will score below 100%

- No harness at all and very simplistic prompt

- Models can't use more than 5X the steps that a human used

fc417fc8023mo ago

Those are supposed to be issues? After reading your list my impression of ARC-AGI has gone up rather than down. All of those things seem like the right way to go about this.

girvo3mo ago

Yeah I'm quite surprised as to how all of those are supposed to be considered problems. They all make sense to me if we're trying to judge whether these tools are AGI, no?

andy12_3mo ago

1 more reply

naasking2mo ago

> They all make sense to me if we're trying to judge whether these tools are AGI, no?

benjaminl3mo ago

6 more replies

red75prime2mo ago

rolandhvar2mo ago

stonogo3mo ago

They are severe problems if your income is tied to LLM hype generation.

stingraycharles2mo ago

“no harnass at all” might be an issue, though, as these types of benchmarks are often gamified and then models perform great on them without actually being better models.

kordlessagain2mo ago

What is "this", exactly?

NitpickLawyer3mo ago

> No harness at all and very simplistic prompt

TBF, that's basically what the kaggle competition is for. Take whatever they do, plug in a SotA LLM and it should do better than whatever people can do with limited GPUs and open models.

bee_rider3mo ago

Defining the baseline human is always a bit arbitrary. The median human is illiterate and also dead.

esailija2mo ago

It actually makes sense. For any task it is completely trivial for anyone to become better than >80% humans and still easy to be better than >95%. The only problem is motivation not intelligence.

dyauspitr3mo ago

If anything this makes the test much harder for the LLM to get high scores and that makes the scores they’re getting all that much more impressive.

daveguy2mo ago

The scroes they're getting are on the order of 0-1% for this ARC-AGI-3 benchmark.

dyauspitr2mo ago

Didn’t I just see a post about 36% from someone?

1 more reply

Marazan3mo ago

"Very simplistic prompt" is the absolute and total core of this and the thing that ensures validity of the whole exercise.

If you are trying to measure GENERAL intelligence then it needs to be general.

codeinred2mo ago

The issues you described seem like they're actually strengths of the benchmark.

fchollet3mo ago

Try the games yourself if you want to get a sense of the difficulty.

> Models can't use more than 5X the steps that a human used

> No harness at all and very simplistic prompt

...

If it's AGI, it doesn't need human intervention to adapt to a new task. If a harness is needed, it can make its own. If tools are needed, it can chose to bring out these tools.

Imnimo3mo ago

thereitgoes4563mo ago

The people recruited weren’t experts. I can imagine it’s straightforward to find humans (such as those that play many video games) that can score >100% on this benchmark.

1 more reply

causal3mo ago

Thanks, I mostly agree with your approach except for one thing: eyesight feels like a "harness" that humans get to use and LLMs do not.

I'm guessing you did not pass the human testers JSON blobs to work with, and suspect they would also score 0% without the eyesight and visual cortex harness to their reasoning ability.

fchollet3mo ago

(This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)

2 more replies

fc417fc8023mo ago

The human testers were provided with their customary inputs, as were the LLMs. I don't see the issue.

1 more reply

levocardia3mo ago

1 more reply

blueblisters3mo ago

I tried ls20 and it was surprisingly fun! Just from a game design POV, these are very well made.

Nit: I didn't see a final score of how many actions I took to complete 7 levels. Also didn't see a place to sign in to see the leaderboard (I did see the sign in prompt).

TarasBob2mo ago

Agree 100%. I want to be able to see how many actions it took me. And it would be good if it were possible to see how well I'm doing compared to other humans, i.e. what is my percentile.

sanxiyn2mo ago

While I think all of your design choices are defensible, I do think you should release the full human baseline data. The second best action count is fine, but other choices are reasonable as well.

strongpigeon3mo ago

cdetrio3mo ago

The Duke harness was specifically designed for these puzzles, that's why they don't want to measure it.

My reading of that part in the technical report (models "could be using their own tools behind the model’s API, which is a blackbox"), is that there's no way to prevent it.

1 more reply

Veedrac3mo ago

pawelk4112mo ago

Its below 1% lmao

1 more reply

antirez2mo ago

> If a harness is needed, it can make its own. If tools are needed, it can chose to bring out these tools.

If I understand correctly the model can carry only very limited memory among tests, so it looks like it's not really possible for the model to self specialize itself under this assumptions.

Marha012mo ago

Don't you see the massive problem with requiring visual input? Are blind people not intelligent because they cannot solve ARC-AGI-3 without a "harness"?

A theoretical text-only superintelligent LLM could prove the Riemann hypothesis but fail ARC-AGI-3 and won't even be AGI according to this benchmark...

cubefox2mo ago

1 more reply

notnullorvoid2mo ago

Think of it as spatial input, not visual. Blind people do have spatial inputs, and high spatial intelligence.

pawelk4112mo ago

daveguy2mo ago

This is a gross misrepresentation of the scoring process.

WarmWash3mo ago

Or perhaps the view is that any gains are good gains? Like studying for a test by leaning on brute memorization is still a non-zero positive gain.

fchollet3mo ago

There are no tricks. Our approach to reducing the impact of targeting (without fully eliminating it) is described in the paper.

cdetrio3mo ago

GodelNumbering3mo ago

Off topic but I have been following your Twitter for a while and your posts specifically about the nature of intelligence have been a read.

littlestymaar2mo ago

Like other ARC-AGI challenges it was never needed to reach 100% to get human-level. The benchmark score is stretched so that the benchmark takes more time to be saturated, that's it.

theLiminator3mo ago

pptr3mo ago

Let's say an agent needs to do 10 brain surgeries on a human to remove a tumor and a human doctor can do it in a single surgery. I would prefer the human.

"steps" are important to optimize if they have negative externalities.

cyanydeez3mo ago

Lastly, humans use way less energy to solve these in fewer steps, so of course it matter when you throw Killowatts at something that takes milliwatts to solve.

diego_sandoval3mo ago

> Lastly, humans use way less energy to solve these in fewer steps,

Not if you count all the energy that was necessary to feed, shelter and keep the the human at his preferred temperature so that he can sit in front of a computer and solve the problem.

1 more reply

ACCount373mo ago

It's kind of the point? To test AI where it's weak instead of where it's strong.

famouswaffles3mo ago

2 more replies

jstummbillig3mo ago

BeetleB3mo ago· 29 in thread

> As long as there is a gap between AI and human learning, we do not have AGI.

Back in the 90's, Scientific American had an article on AI - I believe this was around the time Deep Blue beat Kasparov at chess.

One AI researcher's quote stood out to me:

"It's silly to say airplanes don't fly because they don't flap their wings the way birds do."

jonahx3mo ago

> As long as there is a gap between AI and human learning, we do not have AGI.

Don't read the statement as a human dunk on LLMs, or even as philosophy.

What's worse, the condition is sufficient but not even necessary. Just as planes can fly without flapping, the economy can be destroyed without full AGI.

Uehreka3mo ago

If you’re concerned about the economic impact, then whether a model is AGI or not doesn’t matter. It really is more of a philosophical thing.

AGI is like an event horizon: It does mean something, it is a point in space, but you don’t notice yourself going through it, the curvature smoothly increases through it.

keiferski2mo ago

I don’t know why statements like this are just taken as gospel fact. There are plenty of economic activities which do not disappear even if an AI can do them.

nananana92mo ago

1 more reply

webdood902mo ago

Crazy how many people have their heads in the sand.

I'm glad you could think of a couple examples where AI might not replace humans. It's almost an entirely useless point to make.

The cat is already out of the bag. The information is out there and the models are trained. Even where we stand today will bring massive disruption in time.

The economy is being propped up by the wealthy few that have money to spend and now their legs are being cut out from under them with this technology. We're in for a reckoning.

handoflixue2mo ago

> AI can only use data it has access to, and it’s never going to have access to everyone’s individual brain everywhere at all times.

Yeah, but obviously no human can clear that bar either.

> Here’s another: positions that have deep experience in certain industries and have valuable networks

What stops an AGI from gaining "deep experience in an industry"? Or forming networks? There's plenty of popular bot accounts across social media already.

zurfer2mo ago

daemonologist3mo ago

Or the classic from Dijkstra (https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD867...):

(I am of the opinion that the thinking question is in fact a bit more relevant than the swimming one, but I understand where these are coming from.)

imiric3mo ago

I've come across that quote several times, and reach the same conclusion as you.

slfnflctd2mo ago

> The implications of a machine that can approximate or mimic human thinking are far beyond the implications of a machine that can approximate or mimic swimming

It seems to me like too many people are missing this point.

Modern philosophy tells us we can't even be certain whether other humans are conscious or not. The 'hard problem', p-zombies, etcetera.

1 more reply

jwpapi3mo ago

casey22mo ago

tome2mo ago

> ML is trying to replace humans

Are household appliances trying to replace humans?

1 more reply

NitpickLawyer3mo ago

Auracle3mo ago

WarmWash3mo ago

It's unlikely that intelligence comes in only human flavor.

It also doesn't actually matter much, as ultimately the utility of it's outputs is what determines it's worth.

daveguy2mo ago

>> As long as there is a gap between AI and human learning, we do not have AGI.

>> "It's silly to say airplanes don't fly because they don't flap their wings the way birds do."

> Just because a human can do X and the LLM can't doesn't negate the LLM's "intelligence", any more than an LLM doing a task better than a human negates the human's intelligence.

olalonde3mo ago

Davidzheng3mo ago

How can you tell?

olalonde3mo ago

How can I tell what? That current LLMs are not conscious or that AGI/ASI will not require consciousness?

1 more reply

casey22mo ago

All of flapping, flying and intelligence are physical actions. If your "flying" machine can't get up to altitude fast enough to avoid small hills then it's not an adequate flying system.

unsupp0rted3mo ago

EternalFury3mo ago

So…calculators are intelligent? How about accountants that failed arithmetic 101 in high-school, are they intelligent? Generally intelligent?

mvkel2mo ago

Humans can do a lot of things that don't require intelligence. Artificial intelligence does not need to be 100% human to be AGI.

eddiewithzato2mo ago

It needs to pass the most basic concept of learning, which it can’t currently do. Probably wont ever do after listening to dario on his latest podcast run.

Where we are at today is ASI (artificial semi-intelligence). Maybe in 20 years artificial super intelligence can be achieved, but certainly not AGI.

mvkel2mo ago

Taking a cursory glance at neurosymbolic ai, artificial curiosity, memory, continual learning, I think you'd find that we're extremely close.

Davidzheng3mo ago

Important to remember that intelligence is not a singular thing and when the last gap is closed, most aspects will be highly superhuman

Raphael_Amiard3mo ago

pron3mo ago

Except there's a much simpler definition of flying than of intelligence.

jwpapi3mo ago· 16 in thread

This is a very good estimation of AGI. We give humans and AI the same input and measure the results. Kudos to ARC for creating these games.

I really wonder why so many people fight against this. We know that AI is useful, we know that AI is researchful, but we want to know if they are what we vaguely define as intelligence.

General is the keyword here, this is what ARC is trying to measure. If it’s useful or not. Isn’t the point. If AI after testing is useful or not isn’t the point either.

This so far has been the best test.

adamgordonbell3mo ago

AGI’s 'general' is the wrong word, I thinkg. Humans aren’t general, we’re jagged. Strong in some areas, weak in others, and already surpassed in many domains.

LLM are way past us at languages for instance. Calculators passed us at calculating, etc.

onlyrealcuzzo3mo ago

We don't call a calculator intelligent.

A calculator is extremely useful, but it is not intelligent.

A computer is extremely useful, but it is not intelligent.

Airplanes don't have wings, but they're damn sure useful, and also not intelligent.

If LLMs cannot learn to beat not-that-difficult of games better than young teens, they are not intelligent.

They are extremely useful. But they are not AGI.

Words matter.

tomjakubowski3mo ago

> If LLMs cannot learn to beat not-that-difficult of games better than young teens, they are not intelligent.

1 more reply

GorbachevyChase2mo ago

A modern LLM in a loop with a harness for memory and behavior modification in a body would probably fool me.

3 more replies

tavianator3mo ago

> Airplanes don't have wings

???

jwpapi3mo ago

Interesting take.

Just to drive that thought further.

What are you suggesting, should we rename it. To me the fundamental question is this.

Do we still have tasks that humans can do better than AIs?.

I like the question. I think another good test is "make money". There are humans that can generate money from their laptop. I don’t think AI will be net positive.

I’ve tried to create a Polymarket trading bot with Opus 4.6. The ideas were full of logical fallacies and many many mistakes.

But also I’m not sure how they would compare against an average human with no statistics background..

I think it’s really to establish if we by AGI mean better than average human or better than best human..

adamgordonbell3mo ago

I don't have a good alternative sadly. Human Equivalent Intelligence? ChatGPT suggests "Systems that increasingly Pareto-dominate human intelligence across domains". Not so catchy.

suddenlybananas2mo ago

LLMs haven't passed us in language, a child can learn language with so so much less data than an LLM can

adamgordonbell2mo ago

isn't that more like rate of learning? Agreed LLM consume a lot of data.

But your average LLM understands more languages then anyone alive. So super human understanding of various text based languages.

1 more reply

EternalFury3mo ago

We are jagged, but we can smooth that jaggedness if we choose to do so. LLMs stay jagged.

1 more reply

Real_Egor3mo ago

I’d actually focus on something else entirely here.

- LLMs don't have spatial reasoning.

- LLMs don't have a lifetime of video game experience starting from childhood.

- LLMs don't have working memory or the ability to actually "memorize" key parameters on the fly.

- LLMs don't have an internal "world model" (one that actively adapts to real-world context and the actual process of playing a game).

... I could go on, but I've outlined the core requirements for beating these tests above.

suddenlybananas2mo ago

>But with LLMs, we are architecturally built the same way: it is a Neural Network that processes and makes decisions.

There are high-level similarities between ANNs and the human brain but they are very, very, very different in a ton of ways.

1 more reply

LZ_Khan3mo ago

The thing is.. this is more akin to testing a blind person's performance on a driving test than testing his intelligence.

I would imagine if you simply encoded the game in textual format and asked an LLM to come up with a series of moves, it would beat humans.

The problem here is more around perception than anything.

handoflixue2mo ago

I still agree that this is like declaring blind people lack human intelligence, of course.

scotty793mo ago

Especially since there are already models that can learn how to play 8-bit games.

It feels like ARC-AGI jumped the shark. But who knows, maybe people who train models for robots are going to take it in stride.

visarga2mo ago

It only tests puzzle solving, intelligence is cost compression that powers itself.

dinkblam3mo ago· 15 in thread

what is the evidence that being able to play games equates to AGI?

modeless3mo ago

The test doesn't prove you have AGI. It proves you don't have AGI. If your AI can't solve these problems that humans can solve, it can't be AGI.

Once the AIs solve this, there will be another ARC-AGI. And so on until we can't find any more problems that can be solved by humans and not AI. And that's when we'll know we have AGI.

observationist3mo ago

AI X that can solve the tests contrasted with AI Y that cannot, with all else being equal, means X is closer to AGI than Y. There's no meaningful scale implicit to the tests, either.

Gotta love the field of AI.

rolux3mo ago

Will there be a point in that series of ARC-AGI tests where AI can design the next test, or is designing the next text always going to be a problem that can be solved by humans and not AI?

modeless3mo ago

I don't see why AI couldn't design tests. But they can only be validated by humans, as they are intended to be possible and ideally easy for humans to solve.

1 more reply

famouswaffles3mo ago

>It proves you don't have AGI.

That Francois had to do all this nonsense should tell you the state of where we are right now.

ACCount373mo ago

None whatsoever.

It's a "let's find a task humans are decent at, but modern AIs are still very bad at" kind of adversarial benchmark.

The exact coverage of this one is: spatial reasoning across multiple turns, agentic explore/exploit with rule inference and preplanning. Directly targeted against the current generation of LLMs.

furyofantares3mo ago

There isn't a strict definition of AGI, there's no way to find evidence for what equates to it, and besides, things like this are meant only as likely necessary conditions.

Anyway, from the article:

> As long as there is a gap between AI and human learning, we do not have AGI.

This seems like a reasonable requirement. Something I think about a lot with vibe coding is that unlike humans, individual models do not get better within a codebase over time, they get worse.

fragmede3mo ago

furyofantares3mo ago

I think there's a few factors, codebase size is one, and the tendency for vibe coding to be mostly additive certainly doesn't help with that.

observationist3mo ago

arscan3mo ago

I think the idea is that if they cannot perform any cognitive task that is trivial for humans then we can state they haven’t reached ‘AGI’.

It used to be easy to build these tests. I suspect it’s getting harder and harder.

fsdf23mo ago

The reality is machines can brute force endlessly to an extent humans cannot, and make it seem like they are intelligent.

Thats not intelligence though. Even if it may appear to be. Does it matter? Thats another question. But certaintly is not a representation of intelligence.

sva_3mo ago

That is not the claim. It is a necessary condition, but not a sufficient one.

futureshock3mo ago

didibus3mo ago

> AGI is usually defined as the ability to do any intellectual task about as well as a highly competent human could

1 more reply

OsrsNeedsf2P3mo ago· 11 in thread

Some of these tasks are crazy. Even I can't beat them: https://arcprize.org/tasks/ar25

ZeWaka3mo ago

Just finished it, 8/8. I mostly approached it by winging it and shuffling things around that looked good and like it was approaching the goal, since there's plenty of time to finish.

I still don't quite understand the exact mirroring rules at play.

ACCount373mo ago

You control the mirroring by moving the axis, they're what reflects your shapes. So my first move was always to identify the symmetries in the target shape, and position the axis accordingly.

daveguy2mo ago

This is the correct strategy for this particular game (center the mirrors between the yellow squares, move the black squares). I didn't realize it until about round 6 or 7.

danilor3mo ago

I got stuck on 7/8 for a good while because I learned the rules wrong. I thought every bracket square needed to be lit.

ustad3mo ago

You are joking right?

daemonologist3mo ago

IsTom3mo ago

The most difficult thing about this was controls being unresponsive (at least on firefox).

ball_of_lint3mo ago

solved first try with 577 actions, not trying hard to optimize for low action count.

programjames3mo ago

I think that is the tester's action count. Either that or we coincidentally got the exact same count.

rtrgrd2mo ago

I also see 577, so must be human testers action count. Also watched the replay, the solve seems different to mine.

fsdf23mo ago

I did the first round literally in 5 secs. How can you not 'get it'? lol

tasuki3mo ago· 10 in thread

So ARC-AGI was released in 2019. That's been solved, then there was ARC-AGI-2, and now there's ARC-AGI-3. What is even the point? Will ARC-AGI-26 hit the front page of Hacker News in 2057 ?

muskstinks3mo ago

This is clear AGI progress. It should show you, that AI is not sleeping, it gets better and you should use this as a signal that you should take this topic serious.

applfanboysbgon3mo ago

muskstinks3mo ago

The transfer of knowledge required here is that a ARC-AGI-3 is now necessary and adds another dimension of capability.

These 'tests' are not labeled AGI by magic but because they are designed specificly for testing certain things a question answer test cant solve.

Gemini and OpenAI are at 80-90% at ARC-AGI-2 and its quite interesting to see the difference of challange between 2 and 3.

AGI progress means btw. general. So every additional dimension an agent can solve pushes that agent to be more general.

zarzavat3mo ago

Any test that humans can pass and AIs cannot is a stepping stone on the way to AGI.

When you run out of such tests then it's evidence that you have reached AGI. The point of these tests is to define AGI objectively as the inability to devise tests that humans have superiority on.

gordonhart3mo ago

The point is still to test frontier models at the limit of their capabilities, regardless of how it's branded. If we're still capable of doing so in 2057 I'll upvote the ARC-AGI-26 launch post!

futureshock3mo ago

I believe the CEO of ARC has said they expect us to get to ARC-AGI-7 before declaring AGI.

tibbar3mo ago

The point is that ideally the models keep improving until they can solve problems people care about. Which is already partly true, but there are lots of problems that are still out of reach.

didibus3mo ago

It helps the model makers have a harness to optimize for in their next model version.

They'll specifically work to pass the next version of ARC-AGI, by evaluating what kind of dataset is missing that if they trained on would have their model pass the new version.

They ideally don't directly train on the ARC-AGI itself, but they can train in similar problems/datasets to hope to learn the skills that than transfer to also solving for the real ARC-AGI.

The point is that, a new version of ARC-AGI should help the next model be smarter.

refulgentis3mo ago

You’re absolutely right to point it out.

I don’t know what Francois’ underlying story is, other than he hasn’t told it yet.

minimaxir3mo ago

It's semvar.

Real_Egor3mo ago· 8 in thread

I'll probably be the skeptic here, but:

- Take a person who grew up playing video games. They'll pass these tests 100% without even breaking a sweat.

- BUT, put a grandmother who has never used a computer in front of this game, and she'll most likely fail completely. Just like an LLM.

As soon as models are "natively" trained on a massive dataset of these types of games, they'll easily adapt and start crushing these challenges.

This is not AGI at all.

dgfl3mo ago

Real_Egor3mo ago

This has absolutely nothing to do with AGI. Once they beat these tests, new ones will pop up. They'll beat those, and people will invent the next batch.

The way I see it, the true formula for AGI is: [Brain] + [External Sensors] (World Receptors) + [Internal State Sensors] + [Survival Function] + [Memory].

These current benchmarks aren't bringing us any closer to AGI. They merely prove that we've found a new layer of tasks that we simply haven't figured out how to train LLMs on yet.

wise_blood2mo ago

> . Once they beat these tests, new ones will pop up. They'll beat those, and people will invent the next batch.

that's exactly the point! once we cannot invent the next batch (that is easy for humans to solve), that will be AGI

jpadkins2mo ago

good post, but I disagree Surival Function is needed for AGI. Why do you think Survival Function is needed?

The item I think you should add is a Mesolimbic System (Reward / Motivation). I think AGI needs motivation to direct its learning and tasks.

ehsankia2mo ago

> Isn’t this what AGI is by design?

Well, the "G" in AGI is kinda important. These are specifically games/puzzles.

> they have to be retrained from scratch

Is that true? Didn't DeepMind already build plenty of agents that are generally good at most computer games without being retrained?

jpadkins2mo ago

slidehero3mo ago

had the same thought.

I've been a gamer for just about 40 years. Gaming is my "thing"

I found the challenges fun, but easy. Coming back and reading comments from people struggling with the games, my first thought was - yup definitely not a gamer.

My approach was to poke at the controls to suss the rules, then the actual solutions were really straightforward.

fwiw, I'm pretty dumb generally, but these kinds of puzzles are my jam.

Real_Egor3mo ago

Bingo! That's exactly what I meant

semiinfinitely3mo ago· 7 in thread

i feel bad that we make the LLMs play this

recursive3mo ago

You're definitely anthropomorphizing too much.

WarmWash3mo ago

https://openai.com/index/how-we-monitor-internal-coding-agen...

Anthropomorphize or not, it would suck if a model got sick of these games and decided to break any systems it could to try and get it to stop...

nomel3mo ago

That's all probably irrelevant though, from the (possibly statistically "negative") latent space perspective of an AI, which Anthropic has considered [1].

[1] https://www.anthropic.com/research/end-subset-conversations

recursive3mo ago

If this is a serious risk we should pull the plug now while we can still reach it. If we have to rely on the mood and temperament of LLMs for security, we're already lost.

1 more reply

tingletech3mo ago

I agree that anthropomorphizing is a real risk with LLMs, but what about zoomorphizing? Can feel bad for LLMs without attributing them human emotions/motivations/reasoning?

recursive2mo ago

In the same way you could feel bad for a pokemon I guess.

fsdf23mo ago

tell me youre joking.

seriously. lmao. if you aint, I dunno what to say.

nubg3mo ago· 6 in thread

Any benchmarks?

gordonhart3mo ago

The main frontier models are all up on https://arcprize.org/tasks

ACCount373mo ago

Pre-release, I would have expected Gemini 3.1 Pro to get ahead of Opus 4.6, with GPT-5.4 and Grok 4.20 trailing. Guess I shouldn't have bet against Anthropic.

Not like it's a big lead as of yet. I expect to see more action within the next few months, as people tune the harnesses and better models roll in.

This is far more of a "VLA" task than it is an "LLM" task at its core, but I guess ARC-AGI-3 is making an argument that human intelligence is VLA-shaped.

gordonhart3mo ago

culi3mo ago

thatguymike3mo ago

Curious, that doesn't match the graph up on the Leaderboard page? https://arcprize.org/leaderboard

gordonhart3mo ago

The individual task scores are all on public tasks, they still held out a hundred or so private tasks that presumably GPT-5.4 did well on to get its leaderboard position.

typs3mo ago· 5 in thread

My takeaway from playing a number of levels is that I am definitely not AGI

Xenoamorphous3mo ago

NGI - Natural General Ingelligence

dyauspitr3mo ago

SGI - Sub General Intelligence or another more colloquial word commonly seen amongst users of wallstreetbets.

utopiah3mo ago

Don't forget that this implies a form of examination you are not used to, namely :

- arguably basically no time limit as it's done at a scale of threads to parallelize access through caching ridiculously

- no shame in submitting a very large amount of wrong answers until you get the "right" one

... so I'm not saying it makes it "easy" but I can definitely say it's not the typical way I used to try to pass tests.

ACCount373mo ago

Thank you for keeping the bar of "AGI" low. The machines appreciate your contribution.

Rastonbury2mo ago

it's ok it took me a few tried to realise I had the option to click instead of just wasd

Stevvo3mo ago· 5 in thread

Maybe I'm just not intelligent, but I gave it a couple of minutes and couldn't figure out WTF the game wants from you or how to win it.

Barbing3mo ago

It's not about intelligence, Stevvo. Proof, how long did this specific one take me, under a minute to solve the first level ;)

If you've played Wordle you might've solved the game in a minute once before as well. And if you've played a bunch then you've perhaps also taken the entire day to solve it.

culi3mo ago

It's not an IQ test. Just a way to assess your ability to generalize rules. If you've played previous rounds you kinda get used to the "style" of these games and it gets easier

ACCount372mo ago

That's exactly what "an IQ test" is.

WarmWash3mo ago

Once you figure out one game, it goes a long way towards figuring out all the rest. There are a lot of common general themes.

neop1x2mo ago

Exactly my experience. It has nothing to do wirh some AGI testing. It is just some kind of useless weird game.

mvkel2mo ago· 4 in thread

Was just at the YC launch event for this. Haven't felt this much inspiration in a while. Incredible minds confronting on tech that will change our society.

Talking to the ARC folks tonight, it sounds like there will be an ARC-4,5,6,etc. I mean of course there will be.

That's really exciting.

vonneumannstan2mo ago

>Talking to the ARC folks tonight, it sounds like there will be an ARC-4,5,6,etc. I mean of course there will be.

Quintessential goal post moving...

mvkel2mo ago

If you read the charter of the eval (or any eval, really), this statement is pretty silly.

dwaltrip2mo ago

HN occasionally devolves into “supremely pedantic and nitpicky” mode. Today is one of those days.

vonneumannstan2mo ago

MadxX792mo ago· 4 in thread

Same question I have for all these benchmarks:

vessenes2mo ago

Wrong question. I suggest:

1) Do models generalize?

2) If they do, and they generalize from this, is that a win?

daveguy2mo ago

Can AI models generalize+ at any long context problem solving and agency regardless of modality? I think the answer is no, and this is why they are not yet AGI.

+ generalize being the key word.

MadxX792mo ago

Yeah, so you are agreeing that the benchmarks are useless because they don't answer those questions.

nearbuy2mo ago

They would score much worse on the private set than the public set. And they haven't done this for any of the other ARC-AGI benchmarks, so why would they do it for this one?

spprashant3mo ago· 4 in thread

I played the demo, but it definitely took me a minute to grok the rules.

I don't know if this is how we want to measure AGI.

esafak3mo ago

> ... lets focus on how we can augment and empower the existing workforce.

That is a nice sentiment but not what the AI companies are out to do; they want your job.

jachee3mo ago

fsdf23mo ago

Took me about 5 secs to figure it out tbh.

Surprised at the comments here re. not figuring it. Simple game. Super annoying though lmao.

spprashant3mo ago

Its simple, but its not easy is what I would say. Once you figure out the meta, you can work out most of it.

lukev3mo ago· 3 in thread

I'm not sure how this relates to AGI.

This measures the ability of a LLM to succeed in a certain class of games. Sure, that could be a valuable metric on how powerful (or even generally powerful) a LLM is.

Humans may or may not be good at the same class of games.

We know there exists a class of games (including most human games like checkers/chess/go) that computers (not LLMs!) already vastly outpace humans.

Seems unlikely that this set of games is a definition meaningful for any practical, philosophical or business application?

piiritaja3mo ago

Keyframe3mo ago

imiric3mo ago

So there is a business application, but no practical or philosophical one.

CamperBob23mo ago· 3 in thread

WarmWash3mo ago

The goal is to learn the rules, and then use that to win.

If you mess around a little bit, you will figure it out. There are only a few rules.

szatkus3mo ago

> Only environments that could be fully solved by at least two human participants (independently) were considered for inclusion in the public, semi-private and fully-private sets.

Apparently those games supposed to be hard.

dwaltrip2mo ago

If you tried for a few more minutes you would have figured it out.

jesse_dot_id3mo ago· 3 in thread

At this point, I'm pretty sure we'll just know when it happens.

hatthew3mo ago

Without a big jump, we're just going to boil the frog (ourselves).

neilellis3mo ago

Unless it’s already happened and we missed it

threatripper3mo ago

Or nobody is around anymore to notice when it happens.

tantalor2mo ago· 2 in thread

The controls just feel really bad. The inputs are too small, and there is way too much lag.

roflcopter692mo ago

tantalor2mo ago

I just checked, the size of the controls are 28x28

The minimum recommended size for mobile is 44x44

vonneumannstan2mo ago· 2 in thread

daveguy2mo ago

No system can crack these out of the box (like humans can) because we don't have AGI.

vonneumannstan2mo ago

Yeah I mean ChatGPT 5.4 Pro can't even pick my nose for me so it's obviously not AGI /s

6thbit3mo ago· 2 in thread

Not clear to me the diff with v2?

ACCount373mo ago

Given how hard even pure v2 was for modern LLMs, I'm not surprised to see v3 crush them. But that wouldn't last.

jasonjmcghee3mo ago

v2 was a static fill in the blank task instead of v3 which is interactive.

There's world state that you can change. Not just place pixel.

Here's v2:

https://arcprize.org/tasks/ce602527

Zedseayou3mo ago· 1 in thread

eddiewithzato2mo ago

I understood minimal actions intuitively, it just made sense? The stamina meter was shrinking with each step, so I recognized it was something to look out for.

cedws3mo ago· 1 in thread

It's like playing The Witness. Somebody should set LLMs loose on that.

throwaway6137463mo ago

Or more appropriately - The Talos Principle.

k2xl3mo ago· 1 in thread

I submitted puzzle game Pathology (https://thinky.gg) for ARC Prize 3. Sad to see didn’t hear back from the committee.

It is a simple game with simple rules that solvers have an incredibly difficult time solving compared to humans at a certain level. Solutions are easy to validate but hard to find.

mycocola2mo ago

Seems well-designed. Great job! Sorry you didn't hear back from the comittee.

chaise3mo ago· 1 in thread

The official leaderboard for ARC-AGI-3 for current LLMs : https://arcprize.org/leaderboard (yous should select the 3th leaderboard)

CRAZY 0.1% in average lmao

Corence3mo ago

So if a model can solve every question but takes 10x as many steps as the second best human it will get a score of 1%.

culi3mo ago

strongpigeon3mo ago

It feels like it should be about having no ARC-AGI-3-specific tools, not "no not-built-in-tool"...

vessenes2mo ago

I’m not a Chollet booster. Well, I might be a little bit of one in that I admire his persistence.

- Visual reasoning

- Path planning (and some fairly long paths)

- Mouse/screen interaction

- color and shape analysis

- cross-context learning/remembering

andai3mo ago

In the year 2032: ARC-AGI-13: Almost definitely AGI this time!

ranyume3mo ago

visarga2mo ago

arjie3mo ago

Perhaps actual AGI will be when the models create ARC-HGI-1 to test if humans have general intelligence.

baron8163mo ago

Looks like I’m generally unintelligent

WarmWash3mo ago

Captcha's about to get wild.

Maybe the internet will briefly go back to a place mainly populated with outliers.

convexly3mo ago

My issue with AGI benchmarks is you can never tell if you're measuring actual capability or just how much the training data overlapped with the test.

abraxas3mo ago

vonneumannstan2mo ago

>As long as there is a gap between AI and human learning, we do not have AGI.

levmiseri3mo ago

For a loosely similar 'benchmark', I recently tried to test major LLMs on my coding game (models write code controlling their units in a 1v1 RTS) - https://yare.io/ai-arena

dsfadfasdf2mo ago

Can someone clarify if image inputs are allowed, so VLMs can be used? I have not been able to get information anywhere.

EternalFury3mo ago

The real question is: Can it be generated using programs? If it can be, then LLMs will eventually monkey type these programs.

NiloCK3mo ago

I hope at least some of these are direct Chip's Challenge ports. Waiting for some old muscle memory to kick in here.

j10002mo ago

I feel like AGI test would be sense of humor. Somehow I cannot force any LLM to output any even normal level joke.

nick494881712mo ago

Arc AGI 4 can be Chip's Challenge!

jmkni3mo ago

ok clearly I'm a robot because I can't figure out wtf to do

largbae3mo ago

I feel like we've got tunnel vision. Things you can do on a computer are a tiny subset of what a human can do.

If the AI has to control a body to sit on a couch and play this game on a laptop that would be a step in the right direction.

Geee3mo ago

Would be fun to play but the controls are janky.

saberience3mo ago

So this is another ARC-"AGI" benchmark which is again designed around using eyesight for LLMs which are trained to be great at text, what is the point?

Yes, we get that LLMs are really bad when you give them contrived visual puzzles or pseudo games to solve... Well great, we already knew this.

The "hype" around the ARC-AGI benchmarks makes me laugh, especially the idea we would have AGI when ARC-AGI-1 was solved... then we got 2, and now we're on 3.

Shall we start saying that these benchmarks have nothing to do with AGI yet? Are we going to get an ARC-AGI-10 where we have LLMs try and beat Myst or Riven? Will we have AGI then?

This isn't the right tool for measuring "AGI", and honestly I'm not sure what it's measuring except the foundation labs benchmaxxing on it.

Wintamute2mo ago

Unplayably laggy on an iPhone. Sad people can’t produce a performant experience that a ZX81 could have eaten for breakfast, on a relative super computer

baalimago2mo ago

You can tell it's an AI by it not becoming utterly by playing the "game". I could personally not stand any more than the first level.

aogaili2mo ago

honestly the most interesting thing about ARC-AGI-3 isn't the 0.25% scores everyone is doomposting about. it's the Duke harness result.

38362936483mo ago

Ew. Cool demo, what idiot thought it was ok to have a half second cooldown between inputs? If I hit up three times I should move up three steps, not two steps because I pressed too quickly.

elAhmo2mo ago

I find it quite funny that we are still debating whether models are intelligent or not, while we know they are just statistical models.

j / k navigate · click thread line to collapse