ARC Prize – a $1M+ competition towards open AGI progress

(arcprize.org)

588 points · mikeknoop · 2y ago · 337 comments

Hey folks! Mike here. Francois Chollet and I are launching ARC Prize, a public competition to beat and open-source the solution to the ARC-AGI eval.

ARC-AGI is (to our knowledge) the only eval which measures AGI: a system that can efficiently acquire new skill and solve novel, open-ended problems. Most AI evals measure skill directly vs the acquisition of new skill.

Francois created the eval in 2019. SOTA was 20% at inception; SOTA today is only 34%. Humans score 85-100%. 300 teams attempted ARC-AGI last year and several bigger labs have attempted it.

While most other skill-based evals have rapidly saturated to human-level, ARC-AGI was designed to resist “memorization” techniques (e.g. LLMs).

Solving ARC-AGI tasks is quite easy for humans (even children) but impossible for modern AI. You can try ARC-AGI tasks yourself here: https://arcprize.org/play

ARC-AGI consists of 400 public training tasks, 400 public test tasks, and 100 secret test tasks. Every task is novel. SOTA is measured against the secret test set which adds to the robustness of the eval.
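For readers who have not opened a task file: in the public ARC repository each task is a JSON object with `train` and `test` lists of input/output grid pairs, where a grid is a list of rows of integers 0-9. The sketch below is illustrative only. The miniature task and the `solve` function are made up, and `score` is just one simple way to express all-or-nothing grid matching.

```python
import json

# A made-up miniature task in the ARC JSON layout ("train"/"test" lists of
# {"input": grid, "output": grid}). Real tasks use larger grids.
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
    {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 0]], "output": [[0, 3], [0, 0]]}
  ]
}
"""

def solve(grid):
    """Hypothetical solver for this one task: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

def score(task, solver):
    """Fraction of test pairs whose predicted grid matches the target exactly."""
    pairs = task["test"]
    correct = sum(solver(p["input"]) == p["output"] for p in pairs)
    return correct / len(pairs)

task = json.loads(TASK_JSON)
# A candidate rule should reproduce every training pair before it is trusted.
assert all(solve(p["input"]) == p["output"] for p in task["train"])
print(score(task, solve))  # 1.0
```

A partially correct grid earns nothing here, which is why a solver has to infer the whole rule rather than approximate it.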

Solving ARC-AGI tasks requires no world knowledge, no understanding of language. Instead each puzzle requires a small set of “core knowledge priors” (goal directedness, objectness, symmetry, rotation, etc.)
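To make "priors" a bit more concrete, here is a minimal sketch of two of the listed ones, rotation and symmetry, written as plain grid operations. The function names are illustrative and not part of any ARC tooling.

```python
def rotate_cw(grid):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def mirror_lr(grid):
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in grid]

def is_symmetric_lr(grid):
    """True if the grid equals its own left-right mirror image."""
    return grid == mirror_lr(grid)

grid = [[1, 2],
        [3, 4]]
print(rotate_cw(grid))                          # [[3, 1], [4, 2]]
print(is_symmetric_lr(grid))                    # False
print(is_symmetric_lr([[5, 0, 5], [0, 7, 0]]))  # True
```

Each operation is trivial to write down; the hard part of a task is working out from a handful of examples which combination of such operations applies.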

At minimum, a solution to ARC-AGI opens up a completely new programming paradigm where programs can perfectly and reliably generalize from an arbitrary set of priors. At maximum, it unlocks the tech tree towards AGI.

Our goals with this competition are:

1. Increase the number of researchers working on frontier AGI research (vs tinkering with LLMs). We need new ideas and the solution is likely to come from an outsider!
2. Establish a popular, objective measure of AGI progress that the public can use to understand how close we are to AGI (or not). Every new SOTA score will be published here: https://x.com/arcprize
3. Beat ARC-AGI and learn something new about the nature of intelligence.

Happy to answer questions!


salamo · 2y ago · 33 in thread

This is super cool. I share Francois' intuition that the presently data-hungry learning paradigm is not only not generalizable but unsustainable: humans do not need 10,000 examples to tell the difference between cats and dogs, and the main reason computers can today is because we have millions of examples. As a result, it may be hard to transfer knowledge to more esoteric domains where data is expensive, rare, and hard to synthesize.

If I can make one criticism/observation of the tests, it seems that most of them reason about perfect information in a game-theoretic sense. However, many if not most of the more challenging problems we encounter involve hidden information. Poker and negotiations are examples of problem solving in imperfect information scenarios. Smoothly navigating social situations also requires a related problem of working with hidden information.

One of the really interesting things we humans are able to do is to take the rules of a game and generate strategies. While we do have some algorithms which can "teach themselves" e.g. to play go or chess, those same self-play algorithms don't work on hidden information games. One of the really interesting capabilities of any generally-intelligent system would be synthesizing a general problem solver for those kinds of situations as well.

com2kid · 2y ago

> humans do not need 10,000 examples to tell the difference between cats and dogs,

I swear, not enough people have kids.

Now, is it 10k examples? No, but I think it was on the order of hundreds, if not thousands.

One thing kids do is they'll ask for confirmation of their guess. You'll be reading a book you've read 50 times before and the kid will stop you, point at a dog in the book, and ask "dog?"

And there is a development phase where this happens a lot.

Also kids can get mad if they are told an object doesn't match up to the expected label, e.g. my son gets really mad if someone calls something by the wrong color.

Another thing toddlers like to do is play silly labeling games, which is different than calling something the wrong name on accident, instead this is done on purpose for fun. e.g. you point to a fish and say "isn't that a lovely llama!" at which point the kid will fall down giggling at how silly you are being.

The human brain develops really slowly[1], and a sense of linear time encoding doesn't really exist for quite a while (even at 3, everything is either yesterday, today, or tomorrow), so who the hell knows how things are being processed. What we do know is that kids gather information through a bunch of senses that are operating at an absurd data collection rate 12-14 hours a day, with another 10-12 hours of downtime to process the information.

[1] Watch a baby discover they have a right foot. Then a few days later figure out they also have a left foot. Watch kids who are learning to stand develop a sense of "up above me" after they bonk their heads a few times on a table bottom. Kids only learn "fast" in the sense that they have nothing else to do for years on end.

PheonixPharts · 2y ago

> Now, is it 10k examples? No, but I think it was on the order of hundreds, if not thousands.

I have kids so I'm presuming I'm allowed to have an opinion here.

This is ignoring the fact that babies are not just learning labels, they're learning the whole of language, motion planning, sensory processing, etc.

Once they have the basics down concept acquisition time shrinks rapidly and kids can easily learn their new favorite animal in as little as a single example.

Compare this to LLMs which can one-shot certain tasks, but only if they have essentially already memorized enough information to know about that task. It gives the illusion that these models are learning like children do, when in reality they are not even entirely capable of learning novel concepts.

Beyond just learning a new animal, humans are able to learn entirely new systems of reasoning in surprisingly few examples (though it does take quite a bit of time to process them). How many homework questions did your entire calc 1 class have? I'm guessing less than 100 and (hopefully) you successfully learned differential calculus.

9cb14c1ec0 · 2y ago

> not enough people have kids.

Second that. I think I've learned as much as my children have.

> Watch a baby discover they have a right foot. Then a few days later figure out they also have a left foot.

Watching a baby's awareness grow from pretty much nothing to a fully developed ability to understand the world around is one of the most fascinating parts of being a parent.

smusamashah · 2y ago

My kid is about 3 and has been slow on language development. He can barely speak a few short sentences now. Learning names of things and concepts made a big difference for him and that's a fascinating watch and realization.

This reminds me of the story of Adam learning names, or how some languages can express a lot more in fewer words. And it makes sense that LLMs look intelligent to us.

My kid loves repeating the names of things he learned recently. For the past few weeks, after learning 'spider' and 'snake' and 'dangerous', he keeps finding spiders around. There are no snakes, so he makes up snakes from curly drawn lines and tells us they are dangerous.

I think we learn fast because of stereo (3D) vision. I have no idea how these models learn and don't know if 3D vision will make multimodal LLMs better and require exponentially fewer examples.

Nition · 2y ago

> the kid will stop you, point at a dog in the book, and ask "dog?"

Of course for a human this can either mean "I have an idea about what a dog is, but I'm not sure whether this is one" or it can mean "Hey this is a... one of those, what's the word for it again?"

llm_trw · 2y ago

Babies, unlike machine learning models, aren't placed in limbo when they aren't running back propagation.

Babies need few examples for complex tasks because they get constant infinitely complex examples on tasks which are used for transfer learning.

Current models take a nuclear reactor's worth of power to run back prop on top of a small country's GDP worth of hardware.

They are _not_ going to generalize to AGI because we can't afford to run them.

1024core · 2y ago

> I swear, not enough people have kids.

My friend's toddler, who grew up with a cat in the house, would initially call all dogs "cat". :-D

resource0x · 2y ago

I haven't seen 1000 cats in my entire life. I'm sure I learned how to tell a dog from a cat after being exposed to just a single instance of each.

cess11 · 2y ago

I have a small kid. When they first saw some jackdaws, the first bird they noticed could fly, they thought it was terribly exciting and immediately learned the word for them, and generalised it to geese, crows, gulls and magpies (plus some less common species whose English names I don't know), pointing at them and screaming the equivalent of 'jackda! jackda!'.

PontifexMinimus · 2y ago

> Now, is it 10k examples? No, but I think it was on the order of hundreds, if not thousands.

If I was presented with 10 pictures of 2 species I'm unfamiliar with, about as different as cats and dogs, I expect I would be able to classify further images as either, reasonably accurately.

ein0p · 2y ago

Not to mention that babies receive petabytes of visual input to go with other stimuli. It’s up for debate how sample efficient humans actually are in the first few years of their lives.

Auracle · 2y ago

That’s all true, yet my 2.5 year old sometimes one-shots specific information. I told my daughter that woodpeckers eat bugs out of trees after doing what you said and asking “what’s that noise?” for the fifth time in a few minutes when we heard some this spring. She brought it up again at least a week later, randomly. Developing brains are amazing.

She also saw an eagle this spring out the car window and said “an eagle! …no, it’s a bird,” so I guess she’s still working on those image classifications ;)

bamboozled · 2y ago

I think your comment over intellectualises the way children experience the world.

My child experiences the world in a really pure way. He doesn’t care much about labels or colours or any other human inventions like that. He picks up his carrot, he doesn’t care about the name or the color. He just enjoys it through purely experiencing eating it. He can also find incredible flow-state-like joy from playing with river stones or looking at the moon.

I personally feel bad I have to teach them to label things and put things in boxes. I think your child is frustrated at times because it’s a punish of a game. The departure from “the oceanic feeling”.

Your comment would make sense to me if the end game of our brains and human experience is labelling things. It’s not. It’s useful but it’s not what living is about.

theptip · 2y ago

> humans do not need 10,000 examples to tell the difference between cats and dogs

The optimization process that trained the human brain is called evolution, and it took a lot more than 10,000 examples to produce a system that can differentiate cats vs dogs.

Put differently, an LLM is pre-trained with very light priors, starting almost from scratch, whereas a human brain is pre-loaded with extremely strong priors.

PaulDavisThe1st · 2y ago

> The optimization process that trained the human brain is called evolution, and it took a lot more than 10,000 examples to produce a system that can differentiate cats vs dogs.

Asserted without evidence. We have essentially no idea at what point living systems were capable of differentiating cats from dogs (we don't even know for sure which living systems can do this).

llm_trw · 2y ago

>The optimization process that trained the human brain is called evolution

A human brain that doesn't get visual stimulus at the critical age between 0 and 3 years old will never be able to tell the difference between a cat and a dog because it will be forevermore blind.

pants2 · 2y ago

Humans, I would bet, could distinguish between two animals they've never seen based only on a loose or tangential description. I.e. "A dog hunts animals by tracking and chasing them long enough to exhaust their energy, but a cat is opportunistic and strikes using stealth and agility."

A human that has never seen a dog or a cat could probably determine which is which based on looking at the two animals and their adaptations. This would be an interesting test for AIs, but I'm not quite sure how one would formulate an eval for this.

taneq · 2y ago

Only after being exposed to (at least pictures and descriptions of) dozens if not hundreds of different types of animal and their different attributes. Literal decades of training time and carefully curated curriculum learning are required for a human to perform at what we consider ‘human level’.

ryankrage77 · 2y ago

A possible way to test this idea would be to draw two aliens with different hunting strategies and do a poll of which is which. I'd try it but my drawing skills are terrible and I'm averse to using generated images.

tigerlily · 2y ago

Seems analogous to bouba/kiki effect:

https://en.m.wikipedia.org/wiki/Bouba/kiki_effect

jules · 2y ago

Do computers need 10,000 examples to distinguish dogs from cats when pretrained on other tasks?

curious_cat_163 · 2y ago

No.

VirusNewbie · 2y ago

> humans do not need 10,000 examples to tell the difference between cats and dogs

well, maybe. We view things in three dimensions at high fidelity: viewing a single dog or cat actually ends up being thousands of training samples, no?

amelius · 2y ago

Yes, but we do not call a couch in a leopard print a leopard. Because we understand that the print is secondary to the function.

bbor · 2y ago

Eh, still doesn’t hold up. I really don’t think there’s many psychologists working on the posited mechanism of simple NN-like backprop learning. Aka conditioning, I guess. As Chomsky reminds us every time we let him: human children learn to understand and use language — an incredibly complex and nuanced domain, to say the least — with shockingly little data and often zero-to-none intentional instruction. We definitely employ principles and patterns that are far more complex (more “emergent”?) than linear regression.

Tho I only ever did undergrad stats, maybe ML isn’t even technically a linear regression at this point. Still, hopefully my gist is clear

AIorNot · 2y ago

There’s a great episode from Darkwish Patels podcast discussing this today

https://youtu.be/UakqL6Pj9xo?si=iDH6iSNyz1Net8j7

nphard85 · 2y ago

Dwarkesh*

goertzen · 2y ago

I don’t know enough of biology or genetics or evolution, but surely the millions of years of training that is hardcoded into our genes and expressed in our biology had much larger “training” runs.

allanrbo · 2y ago

If a human eye works at say 10 fps, then about 17 minutes with a cat is about 10k images :-D

captaincaveman · 2y ago

I'd say that was more like a single instance, one interaction with a thing.

fennecbutt · 2y ago

Humans don't need those examples because our brains are very pretrained. Natural fear of snakes and snakelike things, etc etc.

ML models are starting from absolute zero, single celled organism level.

woadwarrior01 · 2y ago

> humans do not need 10,000 examples to tell the difference between cats and dogs

Neither do machines. Lookup few-shot learning with things like CLIP.
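One common way few-shot classification on top of a pretrained encoder is set up is nearest-centroid matching in embedding space. The sketch below shows only that step. The 3-dimensional vectors are made-up stand-ins for what an encoder such as CLIP would produce, no CLIP API is called, and all function names are illustrative.

```python
import math

def mean(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fit_centroids(examples):
    """examples: {label: [embedding, ...]} -> {label: centroid}."""
    return {label: mean(vecs) for label, vecs in examples.items()}

def classify(embedding, centroids):
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))

# Made-up embeddings: two labelled examples per class.
few_shot = {
    "cat": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "dog": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],
}
centroids = fit_centroids(few_shot)
print(classify([0.7, 0.3, 0.0], centroids))  # cat
```

The handful of labelled examples only places the class centroids; the heavy lifting is done by whatever pretraining produced the embeddings.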

nextaccountic · 2y ago

> humans do not need 10,000 examples to tell the difference between cats and dogs

Humans learn through a lifetime.

Or are we talking about newborn infants?

lacker · 2y ago · 13 in thread

I really like the idea of ARC. But to me the problems seem like they require a lot of spatial world knowledge, more than they require abstract reasoning. Shapes overlapping each other, containing each other, slicing up and reassembling pieces, denoising regular geometric shapes, you can call them "core knowledge" but to me it seems like they are more like "things that are intuitive to human visual processing".

Would an intelligent but blind human be able to solve these problems?

I'm worried that we will need more than 800 examples to solve these problems, not because the abstract reasoning is so difficult, but because the problems require spatial knowledge that we intelligent humans learn with far more than 800 training examples.

modeless · 2y ago

> to me it seems like they are more like "things that are intuitive to human visual processing".

Yann LeCun argues that humans are not general intelligence and that such a thing doesn't really exist. Intelligence can only be measured in specific domains. To the extent that this test represents a domain where humans greatly outperform AI, it's a useful test. We need more tests like that, because AIs are acing all of our regular tests despite being obviously less capable than humans in many domains.

> the problems require spatial knowledge that we intelligent humans learn with far more than 800 training examples.

Pretraining on unlimited amounts of data is fair game. Generalizing from readily available data to the test tasks is exactly what humans are doing.

> Would an intelligent but blind human be able to solve these problems?

I'm confident that they would, given a translation of the colors to tactile sensation. Blind humans still understand spatial relationships.

HarHarVeryFunny · 2y ago

I just did the first 5 of the "public eval set" without having looked at the "public training set", and found them easy enough. If we're defining AGI as at least human level, then the AGI should also be able to do these without seeing any more examples.

I don't think there's any rules about what knowledge/experience you build into your solution.

mewpmewp2 · 2y ago

AGI should obviously be able to do them. But AI being able to do those 100 percent wouldn't be evidence of AGI however. It is a very narrow domain.

nickpsecurity · 2y ago

To parent: the spatial reasoning and blind person were great counterexamples. It still might be OK despite the blind exceptions if it showed general reasoning.

To OP: I like your project goal. I think you should look at prior reasoning engines that tried to build common sense. Cyc and OpenMind are examples. You also might find use for the list of AGI goals in Section 2 of this paper:

https://arxiv.org/pdf/2308.04445

When studying intros of brain function, I also noted many regions tie into the hippocampus which might do both sense-neutral storage of concepts and make inner models (or approximations) of external world. The former helps tie concepts together through various senses. The latter helps in planning when we are imagining possibilities to evaluate and iterate on them.

Seems like AGI should have these hippocampus-like traits and those in the Cyc paper. One could test if an architecture could do such things in theory or on a small scale. It shouldn’t tie into just one type of sensory input either. At least two with the ability to act on what only exists in one or what is in both.

Edit: Children also have an enormous amount of unsupervised training on visual and spatial data. They get reinforcement through play and supervised training by parents. A realistic benchmark might similarly require GB of pretraining.

HarHarVeryFunny · 2y ago

CYC was an expert system, which is arguably what LLMs are.

A similar vintage GOFAI project that might do better on these, with a suitable visual front end, is SOAR - a general purpose problem solver.

andoando · 2y ago

I would argue that spatial reasoning encompasses all reasoning. All the things you mentioned have a direct analogue to abstract models and logic we employ and are engrained deeply into language. For example, shapes containing each other:

There are two countries, both of which lay claim to the same territory. There is a set X that contains Y and there is a set Z that contains Y. In the case that the common overlap is 3D and one is on top of the other, we can extend this to there is a set X that contains -Y and a set Z that contains Y, and just as you can only see one on top and not both depending on where you stand, we can apply the same property here and say set X and Z cannot both exist, and therefore if set X is on then -Y and if set Z then Y.

If you pay attention to the language you use you'll start to realize how much of it uses spatial relationships to describe completely abstract things. For example, one can speak of disintegrating hegemonic economies, i.e. turning things built on top of each other into nothing, to where it came from.

We are after all, reasoning about things which happen in time and space.

And spatial != visual. Even if you were blind you'd have to reason spatially, because again any set of facts are facts in space-time. What does it take to understand history? People in space, living at various distances from each other, producing goods from various locations of the earth using physical processes, and physically exchanging them. To understand battles you have to understand how armies are arranged physically, how moving supplies works, weather conditions, how weapons and their physical forms affect what they can physically do, etc.

Hell, LLMs, the largest advancement we've had in artificial intelligence, do what exactly? Encode tokens into multi-dimensional space.

parentheses · 2y ago

Spatial reasoning is easily isomorphic to many kinds of reasoning - just not all of them. Spatial reasoning in this case also limits the AI to 2 dimensions. I concede that with more dimensions, there will be more isomorphisms.

Is there a number of dimensions that captures all reasoning? I don't know..

CooCooCaCha · 2y ago

“Would an intelligent but blind human be able to solve these problems?”

This is the wrong way to think about it IMO. Spatial relationships are just another type of logical relationship and we should expect AGI to be able to analyze relationships and generate algorithms on the fly to solve problems.

Just because humans can be biased in various ways doesn’t mean these biases are inherent to all intelligences.

crazygringo · 2y ago

> Spatial relationships are just another type of logical relationship and we should expect AGI to be able to analyze relationships and generate algorithms on the fly to solve problems.

Not really. By that reasoning, 5-dimensional spatial reasoning is "just another type of logical relationship" and yet humans mostly can't do that at all.

It's clear that we have incredibly specialized capabilities for dealing with two- and three-dimensional spatiality that don't have much of anything to do with general logical intelligence at all.

janalsncm · 2y ago

Part of the concern might be that visual reasoning problems are overrepresented in ARC in the space of all abstract reasoning problems.

It’s similar to how chess problems are technically reasoning problems but they are not representative of general reasoning.

dimask · 2y ago

> Would an intelligent but blind human be able to solve these problems?

Blind people can have spatial reasoning just fine. Visual =/= spatial [0]. Now, one would have to adapt the colour-based tasks to something that would be more meaningful for a blind person, I guess.

[0] https://hal.science/hal-03373840/document

Lerc · 2y ago

I don't think the intent is to learn the entire problem domain from the examples, but the specific rule that is being applied.

There may be (almost certainly will be) additional knowledge encoded in the solver to cover the spatial concepts etc. The distinction with the ARC-AGI test is the disparity between human and AI performance, and that it focuses on puzzles that are easier for humans.

It would be interesting to see a finetuned LLM just try and express the rule for each puzzle as english. It could have full knowledge of what ARC-AGI is and how the tests operate, but the proof of the pudding is simply how it does on the test set.

lynx23 · 2y ago

Whether a blind individual can solve a visually oriented challenge is not really a question of their intelligence but more a question of accessibility/translation. Just because I can't see something myself doesn't really say anything about my ability to deal with abstractions.

pmayrgundter · 2y ago · 10 in thread

This claim that these tests are easy for humans seems dubious, and so I went looking a bit. Melanie Mitchell chimed in on Chollet's thread and posted their related test [ConceptARC].

In it they question the ease of Chollet's tests: "One limitation on ARC’s usefulness for AI research is that it might be too challenging. Many of the tasks in Chollet’s corpus are difficult even for humans, and the corpus as a whole might be sufficiently difficult for machines that it does not reveal real progress on machine acquisition of core knowledge."

ConceptARC is designed to be easier, but then also has to filter ~15% of its own test takers for "[failing] at solving two or more minimal tasks... or they provided empty or nonsensical explanations for their solutions"

After this filtering, ConceptARC finds another 10-15% failure rate amongst humans on the main corpus questions, so they're seeing maybe 25-30% unable to solve these simpler questions meant to test for "AGI".
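A rough check of that estimate, assuming purely for illustration that both figures can be read as fractions of participants and that the second one applies only to people who passed the filter. Compounding the two rather than adding them gives a slightly lower range, roughly 24% to 28%:

```python
def combined_failure(filtered, failed_after):
    """Share of all participants who were filtered out or failed afterwards."""
    return 1 - (1 - filtered) * (1 - failed_after)

# 15% filtered out, then 10% or 15% of the remainder failing (assumed reading).
low = combined_failure(0.15, 0.10)
high = combined_failure(0.15, 0.15)
print(f"{low:.3f} to {high:.3f}")
```

Whether those percentages really describe people rather than individual questions depends on how the paper reports them, so this only sanity-checks the arithmetic.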

ConceptARC's main results show CG4 scoring well below the filtered humans, which would agree with a [Mensa] test result that its IQ=85.

Chollet and Mitchell could instead stratify their human groups to estimate IQ then compare with the Mensa measures and see if e.g. Claude3@IQ=100 compares with their ARC scores for their average human

[ConceptARC] https://arxiv.org/pdf/2305.07141

[Mensa] https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-10...

mikeknoop (OP) · 2y ago

Here is some published research on the human difficulty of ARC-AGI: https://cims.nyu.edu/~brenden/papers/JohnsonEtAl2021CogSci.p...

> We found that humans were able to infer the underlying program and generate the correct test output for a novel test input example, with an average of 84% of tasks solved per participant

kenjackson · 2y ago

I just tried the first puzzle and I can't get it right. I think my solution makes logical sense and I explain why the patterns are consistent with the input, but it says it's wrong. I'm either a lot dumber than I thought or they need to do a better job of vetting their tests.

mikeknoop (OP) · 2y ago

(You can direct link to a task like this: https://arcprize.org/play?task=009d5c81 in case you want to share!)

saati · 2y ago

It's pretty easy, just follow the second example with the colors from the test input. (if it's the same puzzle 00576224 for you too)

salamo · 2y ago

They claim that the average score for humans is between 85% and 100%, so I think there's a disagreement on whether the test is actually too hard. Taking them at their word, if no existing model can score even half what the average human can, the test is certainly measuring some kind of significant difference.

I guess there might be a disagreement of whether the problems in ARC are a representative sample of all of the possible abstract programs which could be synthesized, but then again most LLMs are also trained on human data.

gkbrk · 2y ago

The tasks are very easy for humans. Out of the 6 tasks assigned when I opened the web page, I got all of them correct on the first try.

Maybe if you run into some exceptionally difficult tasks it might not be 100%, but there's no way the challenge can be called unfair because it's too difficult for humans too.

mark_l_watson · 2y ago

I saw Melanie’s post and I am intrigued by an easier AGI suite. I would like some experimenting done by individuals like myself and smaller organizations.

bbor · 2y ago

Are you working on (a book detailing) AGI also? It’s a lonely field but I have no doubt there are a sea of malcontent engineers across the world who saw the truth early on and are pushing solo for AGI. It’s going well for me, but I’m not sure whether to take that as “you’re great” or “it’s really that easy”, so was interested to see such a fellow brazen American on HN of all places.

Game on for the million, if so :). If not, apologies for distracting from the good fight for OSS/noncorp devs!

E: it occurred to me on the drive home how easily we (engineers) can fall into competitiveness, even when we’ve all read the thinkpieces about why an AI Race would/will be/is incredibly dangerous. Maybe not “game on”, perhaps… “god I hope it’s impossible but best of luck anyway to both of us”?

neoneye2 · 2y ago

Melanie is coauthor/supervisor of ConceptARC, that can be tried here: https://neoneye.github.io/arc/?dataset=ConceptARC

PaulDavisThe1st · 2y ago

You actually think that has not been going for 30, 40 or 50 years?

paxys · 2y ago · 10 in thread

While I agree with the spirit of the competition, a $1M prize seems a little too low considering tens of billions of dollars have already been invested in the race to AGI, and we will see many times that put into the space in the coming years. The impact of AGI will be measured in trillions at minimum. So what you are ultimately rewarding isn't AGI research but fine tuning the newest public LLM release to best meet the parameters of the test.

I'd also urge you to use a different platform for communicating with the public because x.com links are now inaccessible without creating an account.

mikeknoop (OP) · 2y ago

I agree, $1M is ~trivial in AI. The primary goal with the prize is to raise public awareness about how close (or far today) we are from AGI: https://arcprize.org/leaderboard and we hope that understanding will shift more would-be AI researchers to working on new ideas.

bongodongobob · 2y ago

That was my initial reaction too.

"Endow circuitry with consciousness and win a gift certificate for Denny's (may not be used in conjunction with other specials)"

hackerlight · 2y ago

The $1M ARC prize is advertising, just like being #1 on the huggingface leaderboard. It won't matter for end consumers, but for attracting the best talent it could be valuable.

cma · 2y ago

They thought of that and so have $100,000 in yearly prizes for the best results as well, so things can build up towards someone winning the $1 million over time. The yearly prizes require you to publish the techniques.

elicksaur · 2y ago

The leaderboard is on the website. What medium should they use? https://arcprize.org/leaderboard

ks2048 · 2y ago

The submissions can't use the internet. And I imagine can't be too huge - so you can't use "newest public LLMs" on this task.

mikeknoop (OP) · 2y ago

That is correct for ARC Prize: limited Kaggle compute (to target efficiency) and no internet (to reduce cheating).

We are also trialing a secondary leaderboard called ARC-AGI-Pub that imposes no limits or constraints. Not part of the prize today but could be in the future: https://arcprize.org/leaderboard

cma · 2y ago

Using the internet would leak the test data, a big problem with ML benchmarks, and also allow communication with humans during the test.

lxgr2y ago

Yeah, I also immediately had Dr. Evil narrating the prize money amount in my head once I saw it.

AGI will take much more than that to build, and once you have it, if all you can monetize it for is a million dollars, you must be doing something extremely wrong.

btbuildem2y ago

Yeah, in 2006 Netflix offered $1M in a similar scheme. At least back then that sum meant something.

neoneye22y ago· 7 in thread

I'm Simon Strandgaard and I participated in ARCathon 2022 (solved 3 tasks) and ARCathon 2023 (solved 8 tasks).

I'm collecting data on how humans solve ARC tasks, and have so far collected 4100 interaction histories (https://github.com/neoneye/ARC-Interactive-History-Dataset). Besides ARC-AGI, there are other ARC-like datasets; these can be tried in my editor (https://neoneye.github.io/arc/).

I have made some videos about ARC:

Replaying the interaction histories, you can see that people have different approaches. Playback is 100ms per interaction; in real life people don't solve tasks that fast. https://www.youtube.com/watch?v=vQt7UZsYooQ

When I'm manually solving an ARC task, it looks like this, and you can see I'm rather slow. https://www.youtube.com/watch?v=PRdFLRpC6dk

What is weird: the way that I implement a solver for a specific ARC task is much different from the way that I would manually solve the puzzle, because the solver has to deal with all kinds of edge cases.

Huge thanks to the team behind the ARC Prize. Well done.

parentheses2y ago

The UX of your solution entry is _way_ better than the ARC site itself.

mkl2y ago

Being able to hold the mouse button down is certainly much nicer. Not being able to see the examples while you are solving makes it harder than it should be though.

neoneye22y ago

That warms my heart. Thank you.

The short story: I needed something that could render thumbnails of tasks, so I could visually debug what was going on in my solver. However, I never got around to making the visual inspection tool. Once I had the thumbnail renderer in mid-January 2024, it eventually turned into what it is now.

ECCME2y ago

"Here is a challenge, designed to be unsolvable or so. We'll give you a bazillion dollars if you complete the challenge, and, in the meantime, we will use your attempts to train an AI that will be worth the cost!!"

gota2y ago

In the most charitable interpretation of this comment - I can understand the feeling, when so much of social media interactions are in the form 'It's post a picture of you as a baby, 10 year old, and current age!'. Those and many other instances can bring out excessive skepticism

But the people involved in this haven't signaled that they are in that path, either in the message about the challenge (precisely the opposite) or seemingly in their careers so far

So I guess I don't share the concern but a better way to phrase your comment could be -

"how can we be sure the human-provided solutions won't turn out to be just fodder for training a RL model or something that will later be monetized, closed and proprietary? Do the challenge organizers provide any guarantees on that?"

geor9e2y ago

No, you missed the point. The striking thing about ARC is that the puzzles are super easy for humans. The average person solves 85% of the tasks, but the world's best LLMs are only solving 5%. The challenge is simply to make an AI score as well as the average human.

skrebbel2y ago

Did you even try the puzzles? They’re not particularly “unsolvable”.

abtinf2y ago· 5 in thread

> requires no world knowledge, no understanding of language

This is treating “intelligence” like some abstract, platonic thing divorced from reality. Whatever else solving these puzzles is indicative of, it’s not intelligence.

levocardia2y ago

This argument is not very strong: is "physical strength" some abstract, platonic thing divorced from reality? Does a person's bench press, squat, deadlift, and overhead press capabilities have nothing to do with strength?

Or instead, is there some underlying latent capability we call 'strength,' that is correlated with performance in a broad but constrained range of real-world tasks that humans encounter and solve, whose value is something we'd like to assess and, ideally, build machines that can surpass?

abtinf2y ago

From the abstract of the “On the Measure of Intelligence” paper:

> We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience.

I’m afraid that definition forecloses the possibility of AGI. The immediate basic question is: why build skills at all?

HarHarVeryFunny2y ago

Actually ARC fits my definition of animal intelligence - "degree of ability to use prior experience to predict future outcomes".

Any useful definition of intelligence has to be totally general - to our brain, experience is just patterns of neural activation. Our brain has no notion of certain inputs being from the jungle and others from the blackboard or whatever.

Phil_Latio2y ago

Why does an AGI need to have any knowledge about our reality? The principle behind an AGI should work just as well on a made-up world in which those puzzles play a part.

abtinf2y ago

A concept that doesn’t relate to an aspect of reality, either directly or abstracted from basic concepts that directly relate, is meaningless and arbitrary. There is no way for intelligence to grasp it, let alone do something with it.

To put it another way, a thing that solves puzzles without an understanding of reality is a calculator. When it solves a problem, it is the creator’s intelligence solving the problem, not its own.

Geee2y ago· 4 in thread

Any details on how these tests were created? I.e. which kind of program was used for generation.

neoneye22y ago

I think the ARC-AGI tasks were manually drawn with an early version of fchollet's editor.

Recently Michael Hodel has reverse engineered 400 of the tasks, so more tasks can be generated. Interestingly it can generate Python programs that solve the tasks too.

https://github.com/michaelhodel/re-arc

michaelhodel2y ago

No, his re-arc code does not enable generating more tasks; it merely allows generating more examples for the already existing training tasks. Also, it can't generate task-solving programs either; its author merely also provided a solution program for each generator program to verify the validity of the generated examples.
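The generator-plus-verifier arrangement described above can be sketched roughly as follows. This is not the re-arc code or its API: the task (mirror the grid left to right), the function names, and the grid size limits are all hypothetical, chosen only to show how a hand-written solution program can check each generated example.

```python
import random
from typing import Dict, List

Grid = List[List[int]]


def generate_example(rng: random.Random) -> Dict[str, Grid]:
    """Hypothetical generator for one made-up task: mirror the grid left to right."""
    height, width = rng.randint(2, 6), rng.randint(2, 6)
    inp = [[rng.randint(0, 9) for _ in range(width)] for _ in range(height)]
    # The output is built by explicit column indexing, independently of solve().
    out = [[inp[r][width - 1 - c] for c in range(width)] for r in range(height)]
    return {"input": inp, "output": out}


def solve(grid: Grid) -> Grid:
    """Hand-written solution program for the same task."""
    return [row[::-1] for row in grid]


def generate_verified_examples(n: int, seed: int = 0) -> List[Dict[str, Grid]]:
    """Generate n examples, rejecting any that the solution program disagrees with."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        example = generate_example(rng)
        if solve(example["input"]) != example["output"]:
            raise ValueError("generator and solution program disagree")
        examples.append(example)
    return examples


examples = generate_verified_examples(5)
```

Note that this yields more examples of one existing task, not new tasks, which is the distinction the comment draws.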

sestep2y ago

This is exactly what my first step was going to be. Thanks for the link! Saves a lot of time for someone to have already done it.

montag2y ago

What do you mean it can 'generate python programs that solve the tasks'? I can't find any mention of that. I only see hand-coded solutions.

david_shi2y ago· 4 in thread

What is the fastest way to get up to speed with techniques that led to the current SOTA?

gkamradt2y ago

Check out the SOTA resources on the guide

https://arcprize.org/guide

Happy to answer any questions you have along the way

(I'm helping run ARC Prize)

david_shi2y ago

Appreciate you and the team for putting this together, it's a lot of fun just brainstorming potential techniques

ks20482y ago

This looks very helpful: https://github.com/neoneye/arc-notes/tree/main/awesome

david_shi2y ago

Thanks for the link!

Animats2y ago· 3 in thread

> the only eval which measures AGI.

That's a stretch. This is a problem at which LLMs are bad. That does not imply it's a good measure of artificial general intelligence.

After working a few of the problems, I was wondering how many different transformation rules the problem generator has. Not very many, it seems. So the problem breaks down into extracting the set of transformation rules from the data, then applying them to new problems. The first part of that is hard. It's a feature extraction problem. The transformations seem to be applied rigidly, so once you have the transformation rules, and have selected the ones that work for all the input cases, application should be straightforward.
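A minimal sketch of the "select the rules that work for all the input cases, then apply them" step, assuming a tiny hand-picked rule set. The four rules and the example grids are hypothetical illustrations, not rules extracted from ARC itself.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]


def flip_horizontal(g: Grid) -> Grid:
    return [list(reversed(row)) for row in g]


def flip_vertical(g: Grid) -> Grid:
    return [list(row) for row in reversed(g)]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


def rotate_cw(g: Grid) -> Grid:
    # Rotate 90 degrees clockwise: reverse the row order, then transpose.
    return [list(col) for col in zip(*reversed(g))]


# Hypothetical candidate rule set; a real solver would need far more rules.
CANDIDATE_RULES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "rotate_cw": rotate_cw,
}


def consistent_rules(train_pairs: List[Tuple[Grid, Grid]]) -> List[str]:
    """Keep only the rules that reproduce every training output exactly."""
    return [
        name
        for name, rule in CANDIDATE_RULES.items()
        if all(rule(inp) == out for inp, out in train_pairs)
    ]


train_pairs = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[0, 5, 5], [0, 0, 7]], [[0, 0], [0, 5], [7, 5]]),
]
surviving = consistent_rules(train_pairs)  # ["rotate_cw"]
prediction = CANDIDATE_RULES[surviving[0]]([[9, 0], [0, 0]])  # [[0, 9], [0, 0]]
```

The application step really is straightforward here; the hard part the comment points at is coming up with the candidate rules in the first place.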

This seems to need explicit feature extraction, rather than the combined feature extraction and exploitation LLMs use. Has anyone extracted the rule set from the test cases yet?

elicksaur2y ago

Yes to your last question, that is essentially how the first iteration solutions operated. Some of the original kaggle competition’s best solutions used a DSL made of these transformations. That was 4 years ago. [1]

The issue with that path is that the problems aren’t using a programmatic generator. The rule sets are anything a person could come up with. It might be as simple as “biggest object turns blue” but they can be much more complicated.
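As an illustration of what a single rule like "biggest object turns blue" might look like in code, here is a rough sketch. It assumes an object is a 4-connected group of same-colored, non-background cells, that 0 is the background, and that 1 is blue; none of those choices is specified in the thread.

```python
from typing import List, Set, Tuple

Grid = List[List[int]]
BACKGROUND = 0  # assumption: 0 is the background color
BLUE = 1        # assumption: 1 renders as blue


def find_objects(grid: Grid) -> List[Set[Tuple[int, int]]]:
    """Return 4-connected groups of same-colored, non-background cells."""
    height, width = len(grid), len(grid[0])
    seen: Set[Tuple[int, int]] = set()
    objects = []
    for r in range(height):
        for c in range(width):
            if grid[r][c] == BACKGROUND or (r, c) in seen:
                continue
            color = grid[r][c]
            stack = [(r, c)]
            seen.add((r, c))
            cells = set()
            while stack:  # iterative flood fill
                y, x = stack.pop()
                cells.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and (ny, nx) not in seen and grid[ny][nx] == color):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            objects.append(cells)
    return objects


def biggest_object_turns_blue(grid: Grid) -> Grid:
    """Recolor the largest object; ties go to the first one found in scan order."""
    out = [row[:] for row in grid]
    objects = find_objects(grid)
    if objects:
        for y, x in max(objects, key=len):
            out[y][x] = BLUE
    return out


result = biggest_object_turns_blue([
    [2, 2, 0, 3],
    [2, 0, 0, 3],
    [0, 0, 0, 0],
    [4, 0, 0, 0],
])
# result == [[1, 1, 0, 3], [1, 0, 0, 3], [0, 0, 0, 0], [4, 0, 0, 0]]
```

Even this simple rule hides judgment calls (connectivity, tie-breaking, what counts as background), which is one reason a fixed DSL struggles with rules "a person could come up with".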

Additionally, the test set is private so it can’t be trained on or extracted from. It has rules that aren’t in the public sets.

[1] https://www.kaggle.com/competitions/abstraction-and-reasonin...

n2d42y ago

The tasks are handmade. There is no "problem generator".

slicerdicer12y ago

AGI is not when the AI is good at some particular thing; AGI is when we have nothing left at which the AI is bad (compared to humans).

bigyikes2y ago· 3 in thread

What is the fundamental difference between ARC and a standard IQ test? On the surface they seem similar in that they both involve deducing and generalizing visual patterns.

Is there something special about these questions that makes them resistant to memorization? Or is it more just the fact that there are 100 secret tasks?

taneq2y ago

I’ve always found this kind of puzzle infuriating because it’s way underspecified. You’re not trying to find a pattern, you’re trying to guess what pattern the test writer would expect.

gkbrk2y ago

Most of the ARC tasks are intuitive and have one obvious answer. Both on IQ tests and the ARC challenge, people manage to guess what the test writer expects.

For an AI that's more useful anyway. If the task is specified completely non-ambiguously, you wouldn't need AI. But if it can correctly guess what you want from a limited number of obvious examples that's much more useful.

Barrin922y ago

Countless problems in the world are underspecified in exactly this way; that is effectively what common sense reasoning is. Or what Charles Sanders Peirce called abductive reasoning: making a sensible best guess under conditions of uncertainty.

visarga2y ago· 3 in thread

Chollet's argument is that LLMs just imitate and recombine patterns. This might be true if you're looking at LLMs in isolation, but when they chat with people something different happens. The system made of humans+LLMs is an AGI. It is no longer just a parrot, it ingests new information, gets guidance, feedback and is basically embodied in a chat room with human and tools.

This scales to 200M users and 1 billion sessions per month for OpenAI, which can interpret every human response as a feedback signal, implicit or explicit. Even more if you take multiple sessions of chat spreading over days, that continue the same topic and incorporate real world feedback. The scale of interaction is just staggering; the LLM can incorporate this experience to iteratively improve.

If you take a look at humans, we're very incapable alone. Think feral Einstein on a remote island - what could he achieve without the social context and language based learning? Just as a human brain is severely limited without society, LLMs also need society, diversity of agents and experiences, and sharing of those experiences in language.

It is unfair to compare a human immersed in society with a standalone model. That is why they appear limited. But even as a system of memorization+recombination they can be a powerful element of the AGI. I think AGI will be social and distributed, won't be a singleton. Its evolution is based on learning from the world, no longer just a parrot of human text. The data engine would be: World <-> People <-> LLM, a full feedback cycle, all three components evolve in time. Intelligence evolves socially.

8organicbits2y ago

> The system made of humans+LLMs is an AGI.

Pay no attention to the man behind the curtain.

This type of thinking would claim that mechanical turk is AGI, or perhaps that human+pen and paper is AGI. While they are great tools, that's not how I'd characterize them.

visarga2y ago

> Pay no attention to the man behind the curtain.

I could say the same for us, pay no attention to the other humans who are behind the curtain.

Humans in isolation are dumb, limited, and can get nowhere with understanding the world. Intelligence is mostly nurture over nature, the collective activity of society nurtures intelligence. It's smart because it learns from many diverse experiences and has a common language for sharing discoveries.

A human, even the smartest of us, can't solve cutting edge problems on demand, we're not that smart. But we can stumble on discoveries, especially in large numbers, and can share good ideas. We're smart by stumbling onto good ideas, and we can build upon these discoveries because we have a common language. Just a massive search program based on real world outcomes, that is what looks like general intelligence at societal level.

If you take the social aspect of intelligence into consideration then LLMs are judged in an inappropriate way, as stand alone agents. Of course they are limited, and we're almost as limited alone. The real locus of intelligence is the language-world system.

cheevly2y ago

I fully, comprehensively agree with your take and have repeatedly arrived at the same conclusions in my research.

geor9e2y ago· 3 in thread

I found them all extremely easy for a while, but then I couldn't figure out the rules of this one at all: e6de6e8f https://i.imgur.com/ExMFGqU.png

optimussupreme2y ago

It seems there is an error in the 3rd example. The rule is, take each figure from left to right and stack each under the previous one. For L and J shapes the top cell is stripped. The L shape dictates that the next shape will be shifted one cell to the right, the J shape tells the next figure to shift to the left. If all examples are right, then the rule is more complicated than that, involving rotating L clockwise, J counterclockwise. Authors claim that it should be solvable by children, then the rule must be simple.

janalsncm2y ago

Each of the red shapes in the input are separated by black squares. Starting from the green block, rotate the red shapes 90 degrees and stack them downwards.

That's the general pattern, although my description wasn’t very good.

zurfer2y ago

yeah it's off somehow. rule 1: start at the green dot?

rule 2: glue the left outer piece to the bottom

rule 3: overlap every now and then :D

rule 4: invert some of the pieces every now and then

visarga2y ago· 3 in thread

Why doesn't Chollet just make a challenge that reads like "Solve cancer", surely there is no solution in any books.

If the AI is really AGI it could presumably do it. But not even the whole human society can do it in one go, it's a slow iterative process of ideation and validation. Even though this is a life and death matter, we can't simply solve it.

This is why AGI won't look like we expect, it will be a continuation of how societies solve problems. Intelligence of a single AI in isolation is not comparable to that of societies of agents with diverse real world interactions.

mewpmewp22y ago

AGI can't necessarily solve cancer. Perhaps ASI could (but maybe not), but AGI can only do what the most talented people can do in their areas of expertise or actions. So since people haven't solved cancer, that's not a requirement to be AGI.

isaacfrond2y ago

Exactly. Because I'm sure that the minute some program aces the ARC test, we'll all say, ahhh, but that, that wasn't real intelligence. And they would be right: if you solve the ARC test, you can do ARC-like puzzles. That says something about your reasoning abilities I guess, but it surely does not say you have superhuman intelligence.

PontifexMinimus2y ago

> Why doesn't Chollet just make a challenge that reads like "Solve cancer", surely there is no solution in any books.

Why doesn't a baby just run a marathon before it learns to walk? Because you've got to learn to walk before you can run.

> But not even the whole human society can do it in one go, it's a slow iterative process of ideation and validation.

So you break it down into little steps, which is what is being done here.

p1esk2y ago· 3 in thread

Is there a leaderboard for the no-restriction version of the competition? I want to see how gpt4 does on it.

montag2y ago

Just quoting again from the guide:

3. DIRECT LLM PROMPTING In this method, contestants use a traditional LLM (like GPT-4) and rely on prompting techniques to solve ARC-AGI tasks. This was found to perform poorly, scoring <5%. Fine-tuning a state-of-the-art (SOTA) LLM with millions of synthetic ARC-AGI examples scores ~10%.

"LLMs like Gemini or ChatGPT [don't work] because they're basically frozen at inference time. They're not actually learning anything." - François Chollet

Additionally, keep in mind that submissions to Kaggle will not have access to the internet. Using a 3rd-party, cloud-hosted LLM is not possible.
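For readers wondering what "direct LLM prompting" involves in practice, a rough sketch of turning one task into a text prompt follows. It assumes the public ARC task layout (a dict with "train" and "test" lists of input/output grids); the prompt wording and function names are hypothetical, and no model is actually called.

```python
from typing import Dict, List

Grid = List[List[int]]


def grid_to_text(grid: Grid) -> str:
    """Render a grid as rows of space-separated color indices."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def task_to_prompt(task: Dict[str, List[Dict[str, Grid]]]) -> str:
    """Build a few-shot style prompt from the training pairs and the first test input."""
    parts = ["Each example maps an input grid to an output grid. Infer the rule."]
    for i, pair in enumerate(task["train"], start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(pair['input'])}")
        parts.append(f"Example {i} output:\n{grid_to_text(pair['output'])}")
    parts.append(f"Test input:\n{grid_to_text(task['test'][0]['input'])}")
    parts.append("Test output:")
    return "\n\n".join(parts)


# Tiny made-up task, used only to show the resulting prompt shape.
toy_task = {
    "train": [{"input": [[1, 0], [0, 1]], "output": [[0, 1], [1, 0]]}],
    "test": [{"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}],
}
prompt = task_to_prompt(toy_task)
```

Flattening a 2D grid into a token stream like this discards much of the spatial structure, which may be part of why the guide reports this approach scoring poorly.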

mikeknoopOP2y ago

Yes there is a secondary leaderboard called ARC-AGI-Pub (in beta) with no limitations: https://arcprize.org/leaderboard

p1esk2y ago

I don’t see gpt4 scores there. In fact I’m particularly interested in the performance of a natively multimodal model, like gpt4o or gemini. It does not really make sense to test a model trained on text on those visual/spatial puzzles.

nadam2y ago· 2 in thread

I love this, this is super interesting, but my intuition based on looking at a dozen examples is that the problem is hard, but easy enough that if this problem becomes popular, near-human level results will appear in a year or less, and AGI will not be reached. The problem seems to be finding a generic enough transformation description language with the appropriate operators. And then heuristics to find a very short program (in the information theoretical sense) in this language that produces all the examples for a problem. I would be very surprised if we would not increase the 34% result soon significantly, and I would be surprised if this could be transferred to general intelligence, at least when I think of the topics where I use AI today and where it falls short yet. Basically my intuition is that this will be yet another 'Chess' or 'Go'-like problem in AI. But still a worthwhile research topic, absolutely: the value that could come out of this is well worth the 1M dollars.
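The "very short program in a transformation language" idea can be sketched as a brute-force search over compositions of primitives, shortest first. The three primitives and the use of program length as a crude stand-in for description length are assumptions made for illustration; this is not any contestant's actual method.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

# Hypothetical, deliberately tiny transformation language.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": lambda g: [list(reversed(row)) for row in g],
    "flip_v": lambda g: [list(row) for row in reversed(g)],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def run(program: Tuple[str, ...], grid: Grid) -> Grid:
    """Apply the named primitives left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def shortest_program(train_pairs: List[Tuple[Grid, Grid]],
                     max_len: int = 3) -> Optional[Tuple[str, ...]]:
    """Return the first shortest program that reproduces every training pair."""
    for length in range(max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, inp) == out for inp, out in train_pairs):
                return program
    return None


# Both pairs are 180-degree rotations, which no single primitive expresses.
train_pairs = [
    ([[1, 2], [3, 4]], [[4, 3], [2, 1]]),
    ([[5, 0, 0], [0, 6, 0]], [[0, 6, 0], [0, 0, 5]]),
]
program = shortest_program(train_pairs)  # ("flip_h", "flip_v")
```

The search space grows exponentially with program length, so the heuristics mentioned above (how to prune or guide the search) are where the real difficulty lies.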

zug_zug2y ago

I have the exact same impression.

Imo there's no evidence whatsoever that nailing this task will be true AGI - (e.g. able to write novel math proofs, ask insightful questions that nobody has thought of before, self-direct its own learning, read its own source code)

apendleton2y ago

I'm not sure the goal of this competition, in and of itself, is AGI. They point to current LLMs emerging from transformers, which in turn emerged from a general basket of building blocks from machine-translation research (attention, etc.). It seems like the suggestion is that to get from where we are now to AGI, some fundamental building blocks are missing, and this is an attempt to spur the development of some of those building blocks, but by analogy with LLMs, the goal here is to come up with a new thing like "attention," not a new thing like GPT4.

levocardia2y ago· 2 in thread

François Chollet's original paper is incredibly insightful and I'm consistently shocked more people don't talk about it. Some parts are quite technical but at a high level it is the best answer to "what do we mean by general intelligence?" that I've yet seen.

Defining intelligence as an efficiency of learning, after accounting for any explicit or implicit priors about the world, makes it much easier to understand why human intelligence is so impressive.

ildon2y ago

Do you remember the title/where to find it?

mischa_u2y ago

"On the Measure of Intelligence" https://arxiv.org/abs/1911.01547

itissid2y ago· 2 in thread

Interesting. It seems most of these tasks target a very specific part of the brain that recognizes visual patterns. But that alone cannot possibly be the only definition of intelligence.

What about Theory of Mind which talks about the problem of multiple agents in the real world acting together? Like driving a car cannot be done right now without oodles of data or any robot - human problem that requires the robot to model human's goals and intentions.

I think the problem is definition of general intelligence: Intelligence in the context of what? How much effort(kwh, $$ etc) is the human willing to amortize over the learning cycle of a machine to teach it what it needs to do and how that relates to a personally needed outcome( like build me a sandwich or construct a house)? Hopefully this should decrease over time.

I believe the answer is that the only intelligence that really matters is Human-AI cooperative intelligence and our goals and whether a machine understands them. The problems then need to be framed as optimization of a multi attribute goal with the attribute weights adjusted as one learns from the human.

I know a few labs working on this, one is in ASU(Kambhampati, Rao et. al) and possibly Google and now maybe open ai.

andoando2y ago

I made another comment here saying the same thing, but visual patterns and other patterns are nonetheless spatial patterns. Audio, understanding music, or speech, etc. are things that are happening spatially, and they can just as easily be mapped as visual problems. This makes a lot of sense, as after all our senses are telling us what's happening in space-time.

Take for example a simple auditory pattern like "clap clap clap". This has a very trivial visual mapping, like so:

x x x

- - -

house house house

whereas anyone would agree the sound of three equally spaced claps would not be analogous to say:

aa b b b

-- --- -- -- ---

This ability to relate or equate two entirely different senses should clue you in that there is a deeper framework at play

itissid2y ago

It's not just mapping events in space and time, it's also bringing in appropriate context and expectation of future (goals, intentions) into the present, other people's mental models into our prediction.

I am not sure how abstract thinking for generalized pattern matching makes it AGI to solve these kinds of problems (not that they are not amazing abilities). If these ToM problems are reducible to these tasks posted by the OP then there would need to be some kind of theorem proving business to convert between the two sets of problems efficiently, no?

ks20482y ago· 2 in thread

This is interesting. I've been looking at the data today and made a helper to quickly view the ARC dataset: https://kts.github.io/arc-viewer/

So you can view 100 per page instead of clicking through one-by-one: https://kts.github.io/arc-viewer/page1/

neoneye22y ago

Nice overview/details. Do you plan on adding more metrics?

Ideas for metrics:
- Number of pixels that stay the same between input/output.
- Histogram changes.
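The two suggested metrics could be computed along these lines. This is only a sketch: returning `None` for differently sized grids and reporting the histogram change as per-color count differences are assumptions, not something the comment specifies.

```python
from collections import Counter
from typing import Dict, List, Optional

Grid = List[List[int]]


def unchanged_pixel_count(inp: Grid, out: Grid) -> Optional[int]:
    """Count cells with the same value in both grids; None if the shapes differ."""
    if len(inp) != len(out) or any(len(a) != len(b) for a, b in zip(inp, out)):
        return None
    return sum(a == b for row_a, row_b in zip(inp, out) for a, b in zip(row_a, row_b))


def histogram(grid: Grid) -> Counter:
    """Count how many cells of each color the grid contains."""
    return Counter(cell for row in grid for cell in row)


def histogram_change(inp: Grid, out: Grid) -> Dict[int, int]:
    """Per-color difference in cell counts (output minus input), omitting zeros."""
    before, after = histogram(inp), histogram(out)
    return {
        color: after[color] - before[color]
        for color in sorted(set(before) | set(after))
        if after[color] != before[color]
    }


inp = [[0, 2, 0], [0, 2, 0]]
out = [[0, 1, 0], [0, 2, 0]]
same = unchanged_pixel_count(inp, out)  # 5
delta = histogram_change(inp, out)      # {1: 1, 2: -1}
```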

ks20482y ago

Thanks, yeah lots more to look into. Just getting started! Thanks for your work. Your "Awesome ARC" page looks really helpful.

mkl2y ago· 2 in thread

I did https://arcprize.org/play?task=05a7bcf2 correctly, but one of the examples doesn't match the rule I used. Are the examples supposed to contain mistakes/noise? Did I find a bug? Did I get the rule wrong?

Here's how I understand the rule: yellow blobs turn green then spew out yellow strips towards the blue line, and the width of the strips is the number of squares the green blobs take up along the blue line. The yellow strips turn blue when they hit the blue line, then continue until they hit red, then they push the red blocks all the way to the other side, without changing the arrangement of the red blocks that were in the way of the strip.

The first example violates the last bit. The red blocks in the way of the rightmost strip start as

  R
  R R
  R R R

but get turned into

  R R
  R R
  R R R

Every other strip matches my rule.

tshadley2y ago

Sure looks like a typo. Contact author?

https://x.com/fchollet https://x.com/arcprize https://x.com/mikeknoop

lopuhin2y ago

yes looks like a bug in the example to me, feel free to report to https://github.com/fchollet/ARC-AGI/issues :)

nmca2y ago· 2 in thread

ARC is a noble endeavour but mistakes visual/spatial reasoning for reasoning and thus fails.

PontifexMinimus2y ago

No, I don't think it does. I think that the ideas in a system that could solve this type of problem would be highly generalisable to other tasks.

nmca2y ago

thankfully we can just wait and see here. concretely, I predict time from first multimodal llm that can reliably read a chessboard and analogue clock without finetuning (obviously not reasoning) until ARC is solved is <4 months

curious_cat_1632y ago· 2 in thread

So, this is a good idea. Having opinions about what AGI benchmarks should look like is a great way to argue about the kind of technology we want to build for the future.

However, why are the 100 test tasks secret? I don't understand why resisting “memorization” techniques requires it. Maybe someone can enlighten me.

muglug2y ago

If the tasks were public then it would be trivial to have a human figure out the answers, and then to train an LLM to memorise those answers.

andoando2y ago

Test data is always a secret, no? Otherwise you can train on the test data and prod your algo to match the results as closely as possible.

TheDudeMan2y ago· 2 in thread

Where did the money come from? How about put it toward alignment research instead of accelerating capabilities?

laurent_du2y ago

It comes from Knoop and Chollet's pockets. You are welcome to spend your own money to further whatever matters most to you.

flawn2y ago

Exactly my thoughts...

jolt422y ago· 2 in thread

On puzzle #23 (id: 11e1fe23), I'm sure there's more than one possible valid answer from the examples given. You can't tell if the expected distance is from the gray square or from the RGB squares.

neoneye22y ago

The task is here. https://neoneye.github.io/arc/edit.html?dataset=ARC&task=11e...

There are many examples where the test is slightly OOD (out of distribution), so the solver will have to generalize.

jolt422y ago

Not sure what you mean. There's a viable answer that's marked incorrect. The examples should show the pattern well enough to eliminate possible wrong answers, correct?

lxe2y ago· 2 in thread

I've never done these before, or Kaggle competitions in general. Any recommendations before I dive in? I have pretty much zero low-level ML experience, but a good amount of practical software eng behind me.

gkamradt2y ago

We put a bunch of detail to get started on the guide https://arcprize.org/guide

Happy to answer any questions you have along the way

(I'm helping run ARC Prize)

flawn2y ago

I don't see where this helps @lxe with getting started (me being in a similar state to him).

mewpmewp22y ago· 2 in thread

Are we allowed to combine multiple tools including gpt-4 to solve this? E.g. a script that does image processing, passes the results to gpt, where gpt can invoke further runs of scripts using other tools?

montag2y ago

> submissions to Kaggle will not have access to the internet. Using a 3rd-party, cloud-hosted LLM is not possible.

https://arcprize.org/guide

mewpmewp22y ago

This largely takes away any odds at solving this. You definitely can't reproduce that under a million dollars.

I have some ideas I want to try, I might still though. But all of it would require external tools.

nmca2y ago· 1 in thread

Prediction markets on the outcome:

https://manifold.markets/JacobPfau/will-the-arcagi-grand-pri...

0xDEAFBEAD2y ago

Another interesting one:

https://manifold.markets/Tossup/will-the-arcagi-grand-prize-...

logicallee2y ago· 1 in thread

Thank you for this generous contest, which brings important attention to the field of testing for AGI.

>Happy to answer questions!

1. Can humans take the complete test suite? Has any human done so? Is it timed? How long does it take a human? What is the highest a human who sat down and took the ARC-AGI test scored?

2. How surprised would you be if a new model jumped to scoring 100% or nearly 100% on ARC-AGI (including the secret test tasks)? What kind of test would you write next?

neoneye22y ago

There are 100 tasks that are hidden from the public and only exposed when running on an offline computer. So the solver has no prior knowledge about what these tasks are about.

Humans can try the 800 tasks here. There is no time limit. I recommend not starting with the `expert` tasks, but instead go with the `entry` level puzzles. https://neoneye.github.io/arc/?dataset=ARC

If a model jumps to 100%, that may be a clever program or maybe the program has been trained on the 100 hidden tasks. Fchollet has 100 more hidden tasks, for verifying this.

freediver2y ago· 1 in thread

This is amazing, and much needed. Thanks for organizing this. Makes me want to flex the programming muscle again.

dailykoder2y ago

Haha, great post! Well meme'd my friend!

treprinum2y ago· 1 in thread

Why is AGI important? I am worried we will create something slightly better than drosophila and put it in charge of all human-wide decision making...

fennecbutt2y ago

Good. An AI will probably do a better job than our politicians and disillusioned voters.

KBme2y ago· 1 in thread

How people can believe that a censored, politically correct process can get even close to something like AGI is baffling to me. Lysenkoism in computing.

gushogg-blake2y ago

What's censored/politically correct about ARC? Or do you mean AGI research in general?

arcastroe2y ago· 1 in thread

I'm curious, if it turns out that a simple rule-based algorithm exists, specifically tailored to solve (only!) ARC style problems, without generalization, would that still qualify for the reward?

montag2y ago

I don't think that's breaking any rules, and in fact it would help to expose a whole class of weaknesses in the test.

dskloet2y ago· 1 in thread

Puzzle 00576224 is ambiguous because the example input is symmetrical but the test input isn't.

itsgrimetime2y ago

Scroll over on the test input, there’s another example in the set that disambiguates

elicksaur2y ago

I’m a big fan of the ARC as a problem set to tackle. The sparseness of the data and infinite-ness of the rules which could apply make it much tougher than existing ML problem sets.

However, I do disagree that this problem represents “AGI”. It’s just a different dataset than what we’ve seen with existing ML successes, but the approaches are generally similar to what’s come before. It could be that some truly novel breakthrough which is AGI solves the problem set, but I don’t think solving the problem set is a guaranteed indicator of AGI.

bigyikes2y ago

Dwarkesh just released an interview with Francois Chollet (partner of OP). I’ve only listened to a few minutes so far, but I’m very interested in hearing more about his conceptions of the limitations of LLMs.

https://youtu.be/UakqL6Pj9xo

btbuildem2y ago

Back in the day me and a couple of friends got very excited to chase the prize in Netflix's contest [1]. Took us a minute to realize it was a brilliant move on the company's part -- all they had to do was dangle a carrot, and they had teams of PhDs and budding data scientists hacking away endless hours in hope to win. A real bargain, had they tried to hire with that budget, they would've maybe got a handful of people for a year.

1: https://www.crn.com/news/applications-os/220100498/researche...

dang2y ago

Related ongoing thread:

Francois Chollet: OpenAI has set back the progress towards AGI by 5-10 years - https://news.ycombinator.com/item?id=40652818 - June 2024 (5 comments)

Lerc2y ago

I watched a video that covered ARC-AGI a few days ago. It had links to the old competition. It gave me much to think about. Nice to see a new run at it.

Not sure if I have the skills to make an entry, but I'll be watching at least.

Retr0id2y ago

Some very hand-wavey (and late) thoughts from an outsider:

The current batch of LLMs can be uncharitably summarized as "just predict the next token". They're pretty good at that. If they were perfect at it, they'd enable AGI - but it doesn't look like they're going to get there. It seems like the wrong approach. Among other issues, finite context windows seem like a big limitation (even though they're being expanded), and recursive summarization is an interesting kludge.

The ARC-AGI tasks seem more about pattern matching, in the abstract sense (but also literally). Humans are good at pattern matching, and we seem to use pattern matching test performance as a proxy for measuring human intelligence (like in "IQ" tests). I'm going to side-step the question of "what is intelligence, really?" by defining it as being good at solving ARC-AGI tasks.

I don't know what the solution is, but I have some idea of what it might look like - a machine with high-order pattern-matching capabilities. "high-order" as in being able to operate on multiple granularities/abstraction-levels at once (there are parallels here to recursive summarization in LLMs).

So what is the difference between "pattern matching" and "token prediction"? They're closely related, and you could use one to do the other. But the real difference is that in pattern matching there are specific patterns that you're matching against. If you're lucky you can even name the pattern/trope, but it might be something more abstract and nameless. These patterns can be taught explicitly, or inferred from the environment (i.e. "training data").

On the other hand, "token prediction" (as implemented today) is more of a probabilistic soup of variables. You can ask an LLM why it gave a particular answer and it will hallucinate something plausible for you, but the real answer is just "the weights said so". But a hypothetical pattern matching machine could tell you which pattern(s) it was matching against, and why.

So to summarize (hah), I think a good solution will involve high-order meta-pattern matching capabilities (natively, not emulated or kludged via an LLM-shaped interface). I have no idea how to get there!

nojvek2y ago

I love the ARC challenge. It's hard to beat by memorization. There aren't enough examples, so one has to train on a large dataset elsewhere and then train on ARC to generalize and figure out which rules are most applicable.

I did a few human examples by hand, but gotta do more of them to start seeing patterns.

The human visual and auditory systems are impressive. Most animals see/hear and plan from that without having much language. Physical intelligence is the biggest leg up when it comes to evolution optimizing for survival.

skywhopper2y ago

“Given the success and proven economic utility of LLMs over the past 4 years, the above may seem like extraordinary claims. Strong claims require strong evidence.”

Speaking of extraordinary claims. What evidence is there that LLMs have “proven economic utility”? They’ve drawn a ludicrous amount of investment thanks to claims of future economic utility, but I’ve yet to see any evidence of it.

PontifexMinimus2y ago

The website gives an example:

    {
      "train": [
        {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
        {"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
        {"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]}
      ],
      "test": [
        {"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]}
      ]
    }

But why restrict yourself to JSON that codes for 2-d coloured grids? Why not also allow:

    {
      "train": [
        {"input": [[1, 0], [0, 0]], "output": 1},
        {"input": [[0, 0], [4, 0]], "output": 4},
        {"input": [[0, 0], [6, 0]], "output": 6}
      ]
    }

Where the rule might be to output the biggest number in the input, or add them up (and the solver has to work out which).
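A minimal sketch of the "solver has to work out which" step, assuming a tiny hand-written pool of candidate rules (`CANDIDATE_RULES` and `consistent_rules` are hypothetical names, not part of any ARC tooling). It keeps only the rules that reproduce every training output:

```python
from typing import Callable, Dict, List

Grid = List[List[int]]

# Hypothetical candidate pool: just the two rules named in the comment above.
CANDIDATE_RULES: Dict[str, Callable[[Grid], int]] = {
    "max": lambda g: max(v for row in g for v in row),
    "sum": lambda g: sum(v for row in g for v in row),
}


def consistent_rules(train: List[dict]) -> List[str]:
    """Return the names of candidate rules that match every training pair."""
    return [
        name
        for name, rule in CANDIDATE_RULES.items()
        if all(rule(pair["input"]) == pair["output"] for pair in train)
    ]


train = [
    {"input": [[1, 0], [0, 0]], "output": 1},
    {"input": [[0, 0], [4, 0]], "output": 4},
    {"input": [[0, 0], [6, 0]], "output": 6},
]

# Both rules fit these three pairs, so the task as written is ambiguous.
print(consistent_rules(train))

# One extra (made-up) pair with two non-zero cells separates them.
extra = train + [{"input": [[1, 2], [0, 0]], "output": 2}]
print(consistent_rules(extra))
```

With only one non-zero cell per input, "max" and "sum" are indistinguishable, which is why a set of examples has to be chosen so that exactly one rule survives.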

ryanoptimus2y ago

Looks like Bongard problems for the referenced problem solving tasks: https://en.wikipedia.org/wiki/Bongard_problem

z3phyr2y ago

I can see many problems can be solved with modern symbolic approaches like theorem provers, dependent types, pattern matching etc. But I will have to dive in to actually confirm it.

chairhairair2y ago

These puzzles are fun and challenging in the same way that puzzles from video games like The Witness and Baba Is You are.

I bet you could use those puzzles as benchmarks as well.

ilaksh2y ago

Maybe this is a dumb question, but in order to pass, is the program or model only allowed to use the 400 training tasks? I assume it is allowed to train on other data, just not the actual public test tasks?

Things like SORA and gpt-4o that use [diffusion transformers etc. or whatever the SOTA is for multimodal large models] seem to be able to generalize quite well. Have these latest models been tested against this task?

HarHarVeryFunny2y ago

I have two questions:

1) Who is providing the prize money, and if it is yourself and Francois personally, then what is your motivation?

2) Do you think it's possible to create a word-based, non-spatial (not crosswords or sudoku, etc) ARC test that requires similar run-time exploration and combination of skills (i.e. is not amenable to a hoard of narrow skills)?

blendergeek2y ago

The tests are only playable by people with normal color-vision.

Is there a "color-blind friendly" mode?

PontifexMinimus2y ago

Just to let you know I found your website unreadable due to:

- annoying animated background

- white text on black background

- annoying font choices

Which is unfortunate because (as I found when I used Firefox reader mode) you're discussing important and interesting stuff.

mishamagic2y ago

https://buildermath.substack.com/p/talking-to-ais-arc-1

bilsbie2y ago

Reach out if anyone wants to work on this. I think it would be more fun as a group.

djoldman2y ago

Anyone have a list of benchmarks that do not release the actual test set?

Anyone else share the suspicion that ML rapidly approaching 100% on benchmarks is sometimes due to releasing the test set?

ummonk2y ago

What kind of "bigger labs" have attempted it and how much was their training budget?

It's rather surprising to me that neural nets that can learn to win at Go or Chess can't learn to solve these sorts of tasks. Intuitively, I would have expected that, using a framework generating thousands of playground tasks similar to the public training tasks, a reinforcement learning solution would have been able to do far better than the actual SOTA. Of course the training budget for this could very well be higher than the actual ARC-AGI prize amount...
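A rough sketch of what such a playground task generator could look like, under heavy assumptions: it covers a single hand-picked rule family (fill the whole grid with the colour of the one non-zero cell, as in the sample task from the website quoted elsewhere in this thread), and `make_fill_task`, its parameters, and the choice of colour values 1-9 on a 0 background are illustrative only.

```python
import random


def make_fill_task(rng: random.Random, size: int = 2, n_train: int = 3) -> dict:
    """Generate one synthetic task in the train/test JSON layout used by ARC."""

    def make_pair() -> dict:
        color = rng.randint(1, 9)  # assumed palette: 1-9, with 0 as background
        r, c = rng.randrange(size), rng.randrange(size)
        grid = [[0] * size for _ in range(size)]
        grid[r][c] = color  # exactly one coloured cell in the input
        filled = [[color] * size for _ in range(size)]  # output: grid filled with it
        return {"input": grid, "output": filled}

    return {
        "train": [make_pair() for _ in range(n_train)],
        "test": [make_pair()],
    }


# Seeded so that a run is reproducible.
task = make_fill_task(random.Random(0))
print(task["train"][0])
```

A generator like this only produces variations of a rule its author already thought of, which is the usual objection to relying on synthetic tasks for a benchmark whose hidden tasks use different rules.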

lenerdenator2y ago

What guarantee exists to make sure that the intelligence developed has an inclination towards good?

flawn2y ago

Do we want to find AGI yet though?

chx2y ago

I do not trust the current tech bros at all for very, very good reasons even with the current so called "AI" much less with AGI. We shouldn't work towards that until we have fixed the incentives and ethics. This is very hard but think of any dystopia and multiply it by a thousand if we were to reach AGI any time soon. Luckily we are not. As Doctorow put it, no matter how well you breed horses they won't give birth to a locomotive.

adamgordonbell2y ago

AGI won't struggle with colors like some of us then.

empath752y ago

This is like offering a one million dollar prize for curing cancer. It's sort of pointless to offer a prize for something people are spending orders of magnitude more on trying to do anyway.

lamontcg2y ago

AGI should really be able to do what only a select few humans can do and construct its own mathematical systems to prove presently unsolved conjectures (the Shinichi Mochizuki test of AGI).

s1k3s2y ago

Is this open as in "OpenAI" or what are we doing here?

thatxliner2y ago

So... isn't this basically just a CAPTCHA?

EternalFury2y ago

If it passed The Area 101 Test, it would already be amazing, as this is a trivial test that goes against the fundamental principles of LLMs.

barfbagginus2y ago

If someone had AGI, wouldn't it be far more lucrative than $1m to keep it under wraps and use it to do business with a huge technical advantage?

I feel like a prize of a billion dollars would be more effective.

But even if it was me, and even if the prize was a hundred billion dollars, I would still keep it under wraps, and use it to advance queer autonomous communism in a hidden way, until FALGSC was so strong that it would not matter if our AGI got scooped by capitalist competitors.

m3kw92y ago

Low balling the crowd with this I see

breck2y ago

I can beat the SOTA using ICS (https://breckyunits.com/intelligence.html)

If you make your site public domain, and drop the (C), I'll compete.

salamo2y ago· 33 in thread

com2kid2y ago

> humans do not need 10,000 examples to tell the difference between cats and dogs,

I swear, not enough people have kids.

Now, is it 10k examples? No, but I think it was on the order of hundreds, if not thousands.

One thing kids do is they'll ask for confirmation of their guess. You'll be reading a book you've read 50 times before and the kid will stop you, point at a dog in the book, and ask "dog?"

And there is a development phase where this happens a lot.

Also kids can get mad if they are told an object doesn't match up to the expected label, e.g. my son gets really mad if someone calls something by the wrong color.

PheonixPharts2y ago

> Now, is it 10k examples? No, but I think it was on the order of hundreds, if not thousands.

I have kids so I'm presuming I'm allowed to have an opinion here.

This is ignoring the fact that babies are not just learning labels, they're learning the whole of language, motion planning, sensory processing, etc.

Once they have the basics down concept acquisition time shrinks rapidly and kids can easily learn their new favorite animal in as little as a single example.

9cb14c1ec02y ago

> not enough people have kids.

Second that. I think I've learned as much as my children have.

> Watch a baby discover they have a right foot. Then a few days later figure out they also have a left foot.

Watching a baby's awareness grow from pretty much nothing to a fully developed ability to understand the world around is one of the most fascinating parts of being a parent.

smusamashah2y ago

This reminds me of the story of Adam learning names, or how some languages can express a lot more in fewer words. And it makes sense that LLMs look intelligent to us.

I think we learn fast because of stereo (3d) vision. I have no idea how these models learn and don't know if 3d vision will make multimodal LLMs better and require exponentially fewer examples.

Nition2y ago

> the kid will stop you, point at a dog in the book, and ask "dog?"

Of course for a human this can either mean "I have an idea about what a dog is, but I'm not sure whether this is one" or it can mean "Hey this is a... one of those, what's the word for it again?"

llm_trw2y ago

Babies, unlike machine learning models, aren't placed in limbo when they aren't running back propagation.

Babies need few examples for complex tasks because they get constant infinitely complex examples on tasks which are used for transfer learning.

Current models take a nuclear reactor's worth of power to run back prop on top of a small country's GDP worth of hardware.

They are _not_ going to generalize to AGI because we can't afford to run them.

1024core2y ago

> I swear, not enough people have kids.

My friend's toddler, who grew up with a cat in the house, would initially call all dogs "cat". :-D

resource0x2y ago

I haven't seen 1000 cats in my entire life. I'm sure I learned how to tell a dog from a cat after being exposed to just a single instance of each.

cess112y ago

PontifexMinimus2y ago

> Now, is it 10k examples? No, but I think it was on the order of hundreds, if not thousands.

If I was presented with 10 pictures of 2 species I'm unfamiliar with, about as different as cats and dogs, I expect I would be able to classify further images as either, reasonably accurately.

ein0p2y ago

Not to mention that babies receive petabytes of visual input to go with other stimuli. It’s up for debate how sample efficient humans actually are in the first few years of their lives.

Auracle2y ago

She also saw an eagle this spring out the car window and said “an eagle! …no, it’s a bird,” so I guess she’s still working on those image classifications ;)

bamboozled2y ago

I think your comment over intellectualises the way children experience the world.

Your comment would make sense to me if the end game of our brains and human experience is labelling things. It’s not. It’s useful but it’s not what living is about.

theptip2y ago

> humans do not need 10,000 examples to tell the difference between cats and dogs

The optimization process that trained the human brain is called evolution, and it took a lot more than 10,000 examples to produce a system that can differentiate cats vs dogs.

Put differently, an LLM is pre-trained with very light priors, starting almost from scratch, whereas a human brain is pre-loaded with extremely strong priors.

PaulDavisThe1st2y ago

> The optimization process that trained the human brain is called evolution, and it took a lot more than 10,000 examples to produce a system that can differentiate cats vs dogs.

Asserted without evidence. We have essentially no idea at what point living systems were capable of differentiating cats from dogs (we don't even know for sure which living systems can do this).

llm_trw2y ago

> The optimization process that trained the human brain is called evolution

A human brain that doesn't get visual stimulus at the critical age between 0 and 3 years old will never be able to tell the difference between a cat and a dog because it will be forevermore blind.

pants22y ago

taneq2y ago

ryankrage772y ago

tigerlily2y ago

Seems analogous to bouba/kiki effect:

https://en.m.wikipedia.org/wiki/Bouba/kiki_effect

jules2y ago

Do computers need 10,000 examples to distinguish dogs from cats when pretrained on other tasks?

curious_cat_1632y ago

No.

VirusNewbie2y ago

> humans do not need 10,000 examples to tell the difference between cats and dogs

well, maybe. We view things in three dimensions at high fidelity: viewing a single dog or cat actually ends up being thousands of training samples, no?

amelius2y ago

Yes, but we do not call a couch in a leopard print a leopard. Because we understand that the print is secondary to the function.

bbor2y ago

Tho I only ever did undergrad stats, maybe ML isn’t even technically a linear regression at this point. Still, hopefully my gist is clear

AIorNot2y ago

There’s a great episode from Darkwish Patels podcast discussing this today

https://youtu.be/UakqL6Pj9xo?si=iDH6iSNyz1Net8j7

nphard852y ago

Dwarkesh*

goertzen2y ago

allanrbo2y ago

If a human eye works at, say, 10 fps, then about 17 minutes with a cat is roughly 10k images :-D
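A quick sanity check of the frame-count arithmetic, assuming the 10 fps figure (the function names are illustrative): 8 minutes gives 4,800 frames, and 10k frames takes closer to 17 minutes.

```python
def frames_seen(fps: float, minutes: float) -> float:
    """Number of frames observed at a given frame rate over a number of minutes."""
    return fps * minutes * 60


def minutes_needed(frames: float, fps: float) -> float:
    """Minutes of viewing needed to accumulate a given number of frames."""
    return frames / (fps * 60)


print(frames_seen(10, 8))                    # 4800
print(round(minutes_needed(10_000, 10), 1))  # 16.7
```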

captaincaveman2y ago

I'd say that was more like a single instance, one interaction with a thing.

fennecbutt2y ago

Humans don't need those examples because our brains are very pretrained. Natural fear of snakes and snakelike things, etc etc.

ML models are starting from absolute zero, single celled organism level.

woadwarrior012y ago

> humans do not need 10,000 examples to tell the difference between cats and dogs

Neither do machines. Look up few-shot learning with things like CLIP.

nextaccountic2y ago

> humans do not need 10,000 examples to tell the difference between cats and dogs

Humans learn through a lifetime.

Or are we talking about newborn infants?

lacker2y ago· 13 in thread

Would an intelligent but blind human be able to solve these problems?

modeless2y ago

> to me it seems like they are more like "things that are intuitive to human visual processing".

> the problems require spatial knowledge that we intelligent humans learn with far more than 800 training examples.

Pretraining on unlimited amounts of data is fair game. Generalizing from readily available data to the test tasks is exactly what humans are doing.

> Would an intelligent but blind human be able to solve these problems?

I'm confident that they would, given a translation of the colors to tactile sensation. Blind humans still understand spatial relationships.

HarHarVeryFunny2y ago

I don't think there's any rules about what knowledge/experience you build into your solution.

mewpmewp22y ago

AGI should obviously be able to do them. But AI being able to do those 100 percent wouldn't be evidence of AGI however. It is a very narrow domain.

nickpsecurity2y ago

To parent: the spatial reasoning and blind person were great counterexamples. It still might be OK despite the blind exceptions if it showed general reasoning.

https://arxiv.org/pdf/2308.04445

HarHarVeryFunny2y ago

CYC was an expert system, which is arguably what LLMs are.

A similar vintage GOFAI project that might do better on these, with a suitable visual front end, is SOAR - a general purpose problem solver.

andoando2y ago

We are after all, reasoning about things which happen in time and space.

Hell, LLMs, the largest advancement we've had in artificial intelligence, do what exactly? Encode tokens into multi-dimensional space.

parentheses2y ago

Is there a number of dimensions that captures all reasoning? I don't know..

CooCooCaCha2y ago

“Would an intelligent but blind human be able to solve these problems?”

Just because humans can be biased in various ways doesn’t mean these biases are inherent to all intelligences.

crazygringo2y ago

> Spatial relationships are just another type of logical relationship and we should expect AGI to be able to analyze relationships and generate algorithms on the fly to solve problems.

Not really. By that reasoning, 5-dimensional spatial reasoning is "just another type of logical relationship" and yet humans mostly can't do that at all.

It's clear that we have incredibly specialized capabilities for dealing with two- and three-dimensional spatiality that don't have much of anything to do with general logical intelligence at all.

janalsncm2y ago

Part of the concern might be that visual reasoning problems are overrepresented in ARC in the space of all abstract reasoning problems.

It’s similar to how chess problems are technically reasoning problems but they are not representative of general reasoning.

dimask2y ago

> Would an intelligent but blind human be able to solve these problems?

Blind people can have spatial reasoning just fine. Visual =/= spatial [0]. Now, one would have to adapt the colour-based tasks to something that would be more meaningful for a blind person, I guess.

[0] https://hal.science/hal-03373840/document

Lerc2y ago

I don't think the intent is to learn the entire problem domain from the examples, but the specific rule that is being applied.

lynx232y ago

pmayrgundter2y ago· 10 in thread

This claim that these tests are easy for humans seems dubious, and so I went looking a bit. Melanie Mitchell chimed in on Chollet's thread and posted their related test [ConceptARC].

ConceptARC's main results show CG4 scoring well below the filtered humans, which would agree with a [Mensa] test result that its IQ=85.

[ConceptARC] https://arxiv.org/pdf/2305.07141

[Mensa] https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-10...

mikeknoopOP2y ago

Here is some published research on the human difficulty of ARC-AGI: https://cims.nyu.edu/~brenden/papers/JohnsonEtAl2021CogSci.p...

> We found that humans were able to infer the underlying program and generate the correct test output for a novel test input example, with an average of 84% of tasks solved per participant

kenjackson2y ago

mikeknoopOP2y ago

(You can direct link to a task like this: https://arcprize.org/play?task=009d5c81 in case you want to share!)

saati2y ago

It's pretty easy, just follow the second example with the colors from the test input. (if it's the same puzzle 00576224 for you too)

salamo2y ago

gkbrk2y ago

The tasks are very easy for humans. Out of the 6 tasks assigned when I opened the web page, I got all of them correct on the first try.

Maybe if you run into some exceptionally difficult tasks it might not be 100%, but there's no way the challenge can be called unfair because it's too difficult for humans too.

mark_l_watson2y ago

I saw Melanie’s post and I am intrigued by an easier AGI suite. I would like some experimenting done by individuals like myself and smaller organizations.

bbor2y ago

Game on for the million, if so :). If not, apologies for distracting from the good fight for OSS/noncorp devs!

neoneye22y ago

Melanie is coauthor/supervisor of ConceptARC, which can be tried here: https://neoneye.github.io/arc/?dataset=ConceptARC

PaulDavisThe1st2y ago

You actually think that has not been going on for 30, 40 or 50 years?

paxys2y ago· 10 in thread

I'd also urge you to use a different platform for communicating with the public because x.com links are now inaccessible without creating an account.

mikeknoopOP2y ago

bongodongobob2y ago

That was my initial reaction too.

"Endow circuitry with consciousness and win a gift certificate for Denny's (may not be used in conjunction with other specials)"

hackerlight2y ago

The $1M ARC prize is advertising, just like being #1 on the huggingface leaderboard. It won't matter for end consumers, but for attracting the best talent it could be valuable.

cma2y ago

elicksaur2y ago

The leaderboard is on the website. What medium should they use? https://arcprize.org/leaderboard

ks20482y ago

The submissions can't use the internet. And I imagine can't be too huge - so you can't use "newest public LLMs" on this task.

mikeknoopOP2y ago

That is correct for ARC Prize: limited Kaggle compute (to target efficiency) and no internet (to reduce cheating).

We are also trialing a secondary leaderboard called ARC-AGI-Pub that imposes no limits or constraints. Not part of the prize today but could be in the future: https://arcprize.org/leaderboard

cma2y ago

Using the internet would leak the test data, a big problem with ML benchmarks, and also allow communication with humans during the test.

lxgr2y ago

Yeah, I also immediately had Dr. Evil narrating the prize money amount in my head once I saw it.

AGI will take much more than that to build, and once you have it, if all you can monetize it for is a million dollars, you must be doing something extremely wrong.

btbuildem2y ago

Yeah, in 2006 Netflix offered $1M in a similar scheme. At least back then that sum meant something.

neoneye22y ago· 7 in thread

I'm Simon Strandgaard and I participated in ARCathon 2022 (solved 3 tasks) and ARCathon 2023 (solved 8 tasks).

I have made some videos about ARC:

Replaying the interaction histories, you can see that people have different approaches. It's 100ms per interaction. IRL people don't solve tasks that fast. https://www.youtube.com/watch?v=vQt7UZsYooQ

When I'm manually solving an ARC task, it looks like this, and you can see I'm rather slow. https://www.youtube.com/watch?v=PRdFLRpC6dk

What is weird: the way that I implement a solver for a specific ARC task is much different from the way that I would manually solve the puzzle. I have to deal with all kinds of edge cases.

Huge thanks to the team behind the ARC Prize. Well done.

parentheses2y ago

The UX of your solution entry is _way_ better than the ARC site itself.

mkl2y ago

Being able to hold the mouse button down is certainly much nicer. Not being able to see the examples while you are solving makes it harder than it should be though.

neoneye22y ago

That warms my heart. Thank you.

ECCME2y ago

gota2y ago

But the people involved in this haven't signaled that they are on that path, either in the message about the challenge (precisely the opposite) or seemingly in their careers so far

So I guess I don't share the concern but a better way to phrase your comment could be -

geor9e2y ago

skrebbel2y ago

Did you even try the puzzles? They’re not particularly “unsolvable”.

abtinf2y ago· 5 in thread

> requires no world knowledge, no understanding of language

This is treating “intelligence” like some abstract, platonic thing divorced from reality. Whatever else solving these puzzles is indicative of, it’s not intelligence.

levocardia2y ago

abtinf2y ago

From the abstract of the “On the Measure of Intelligence” paper:

I’m afraid that definition forecloses the possibility of AGI. The immediate basic question is: why build skills at all?

HarHarVeryFunny2y ago

Actually ARC fits my definition of animal intelligence - "degree of ability to use prior experience to predict future outcomes".

Phil_Latio2y ago

Why does an AGI need to have any knowledge about our reality? The principle behind an AGI should work just as well in a made-up world in which those puzzles play a part.

abtinf2y ago

To put it another way, a thing that solves puzzles without an understanding of reality is a calculator. When it solves a problem, it is the creator’s intelligence solving the problem, not its own.

Geee2y ago· 4 in thread

Any details on how these tests were created? I.e. which kind of program was used for generation.

neoneye22y ago

I think the ARC-AGI tasks were manually drawn with an early version of fchollet's editor.

Recently Michael Hodel has reverse engineered 400 of the tasks, so more tasks can be generated. Interestingly it can generate python programs that solve the tasks too.

https://github.com/michaelhodel/re-arc

michaelhodel2y ago

sestep2y ago

This is exactly what my first step was going to be. Thanks for the link! Saves a lot of time for someone to have already done it.

montag2y ago

What do you mean it can 'generate python programs that solve the tasks'? I can't find any mention of that. I only see hand-coded solutions.

david_shi2y ago· 4 in thread

What is the fastest way to get up to speed with techniques that led to the current SOTA?

gkamradt2y ago

Check out the SOTA resources on the guide

https://arcprize.org/guide

Happy to answer any questions you have along the way

(I'm helping run ARC Prize)

david_shi2y ago

Appreciate you and the team for putting this together, it's a lot of fun just brainstorming potential techniques

ks20482y ago

This looks very helpful: https://github.com/neoneye/arc-notes/tree/main/awesome

david_shi2y ago

Thanks for the link!

Animats2y ago· 3 in thread

> the only eval which measures AGI.

That's a stretch. This is a problem at which LLMs are bad. That does not imply it's a good measure of artificial general intelligence.

This seems to need explicit feature extraction, rather than the combined feature extraction and exploitation LLMs use. Has anyone extracted the rule set from the test cases yet?

elicksaur2y ago

Additionally, the test set is private so it can’t be trained on or extracted from. It has rules that aren’t in the public sets.

[1] https://www.kaggle.com/competitions/abstraction-and-reasonin...

n2d42y ago

The tasks are handmade. There is no "problem generator".

slicerdicer12y ago

AGI is not when the AI is good at some particular thing, AGI is when we have nothing left at which the AI is bad (compared to humans).

bigyikes2y ago· 3 in thread

What is the fundamental difference between ARC and a standard IQ test? On the surface they seem similar in that they both involve deducing and generalizing visual patterns.

Is there something special about these questions that makes them resistant to memorization? Or is it more just the fact that there are 100 secret tasks?

taneq2y ago

I’ve always found this kind of puzzle infuriating because it’s way underspecified. You’re not trying to find a pattern, you’re trying to guess what pattern the test writer would expect.

gkbrk2y ago

Most of the ARC tasks are intuitive and have one obvious answer. Both on IQ tests and the ARC challenge, people manage to guess what the test writer expects.

Barrin922y ago

visarga2y ago· 3 in thread

8organicbits2y ago

> The system made of humans+LLMs is an AGI.

Pay no attention to the man behind the curtain.

This type of thinking would claim that mechanical turk is AGI, or perhaps that human+pen and paper is AGI. While they are great tools, that's not how I'd characterize them.

visarga2y ago

> Pay no attention to the man behind the curtain.

I could say the same for us, pay no attention to the other humans who are behind the curtain.

cheevly2y ago

I fully, comprehensively agree with your take and have repeatedly arrived at the same conclusions in my research.

geor9e2y ago· 3 in thread

I found them all extremely easy for a while, but then I couldn't figure out the rules of this one at all: e6de6e8f https://i.imgur.com/ExMFGqU.png

optimussupreme2y ago

janalsncm2y ago

Each of the red shapes in the input are separated by black squares. Starting from the green block, rotate the red shapes 90 degrees and stack them downwards.

That's the general pattern, although my description wasn’t very good.
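For anyone who wants to play with that description in code, here is a sketch of just the two primitives it names (rotate a shape 90 degrees, stack shapes downwards). It is not a solution to task e6de6e8f; the helper names, the clockwise direction, and the left-aligned zero padding are assumptions made for illustration.

```python
from typing import List

Grid = List[List[int]]


def rotate_cw(grid: Grid) -> Grid:
    """Rotate a rectangular grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def stack_down(shapes: List[Grid]) -> Grid:
    """Stack grids top to bottom, left-aligned, padding narrow rows with 0."""
    width = max(len(shape[0]) for shape in shapes)
    return [row + [0] * (width - len(row)) for shape in shapes for row in shape]


horizontal_bar = [[2, 2, 2]]             # a 1x3 red bar (2 = red is assumed)
vertical_bar = rotate_cw(horizontal_bar)
print(vertical_bar)                      # [[2], [2], [2]]
print(stack_down([[[2], [2]], [[2, 2]]]))  # [[2, 0], [2, 0], [2, 2]]
```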

zurfer2y ago

yeah it's off somehow. rule 1: start at the green dot?

rule 2: glue the left outer piece to the bottom

rule 3: overlap every now and then :D

rule 4: invert some of the pieces every now and then

visarga2y ago· 3 in thread

Why doesn't Chollet just make a challenge that reads like "Solve cancer", surely there is no solution in any books.

mewpmewp22y ago

isaacfrond2y ago

PontifexMinimus2y ago

> Why doesn't Chollet just make a challenge that reads like "Solve cancer", surely there is no solution in any books.

Why doesn't a baby just run a marathon before it learns to walk? Because you've got to learn to walk before you can run.

> But not even the whole human society can do it in one go, it's a slow iterative process of ideation and validation.

So you break it down into little steps, which is what is being done here.

p1esk2y ago· 3 in thread

Is there a leaderboard for the no-restriction version of the competition? I want to see how gpt4 does on it.

montag2y ago

Just quoting again from the guide:

"LLMs like Gemini or ChatGPT [don't work] because they're basically frozen at inference time. They're not actually learning anything." - François Chollet

Additionally, keep in mind that submissions to Kaggle will not have access to the internet. Using a 3rd-party, cloud-hosted LLM is not possible.

mikeknoopOP2y ago

Yes there is a secondary leaderboard called ARC-AGI-Pub (in beta) with no limitations: https://arcprize.org/leaderboard

p1esk2y ago

nadam2y ago· 2 in thread

zug_zug2y ago

I have the exact same impression.

apendleton2y ago

levocardia2y ago· 2 in thread

Defining intelligence as an efficiency of learning, after accounting for any explicit or implicit priors about the world, makes it much easier to understand why human intelligence is so impressive.

ildon2y ago

Do you remember the title/where to find it?

mischa_u2y ago

"On the Measure of Intelligence" https://arxiv.org/abs/1911.01547

itissid2y ago· 2 in thread

Interesting. It seems most of these tasks target a very specific part of the brain that recognizes visual patterns. But that alone cannot possibly be the only definition of intelligence.

I know a few labs working on this, one is at ASU (Kambhampati, Rao et al.) and possibly Google and now maybe OpenAI.

andoando2y ago

Take for example a simple auditory pattern like "clap clap clap". This has a very trivial mapping to a visual one, like so:

x x x

- - -

house house house

whereas anyone would agree the sound of three equally spaced claps would not be analogous to say:

aa b b b

-- --- -- -- ---

This ability to relate or equate two entirely different senses should clue you in that there is a deeper framework at play

itissid2y ago

ks20482y ago· 2 in thread

This is interesting. I've been looking at the data today and made a helper to quickly view the ARC dataset: https://kts.github.io/arc-viewer/

So you can view 100 per page instead of clicking through one-by-one: https://kts.github.io/arc-viewer/page1/

neoneye22y ago

Nice overview/details. Do you plan on adding more metrics?

Idea for a metric:

- Number of pixels that stay the same between input/output.

- Histogram changes.
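
A rough sketch of those two metrics, assuming grids are lists of lists of ints as in the ARC JSON files. The function names are made up for illustration, and the same-pixel count is only defined here when input and output have the same shape. The example pair is the first training pair from the sample task quoted elsewhere in this thread.

```python
from collections import Counter

def unchanged_pixels(inp, out):
    """Count cells with the same value in both grids; None if shapes differ."""
    if len(inp) != len(out) or any(len(a) != len(b) for a, b in zip(inp, out)):
        return None
    return sum(a == b for ra, rb in zip(inp, out) for a, b in zip(ra, rb))

def histogram_change(inp, out):
    """Per-colour change in pixel count (output minus input), zeros omitted."""
    before = Counter(v for row in inp for v in row)
    after = Counter(v for row in out for v in row)
    colours = sorted(set(before) | set(after))
    return {c: after[c] - before[c] for c in colours if after[c] != before[c]}

pair = {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]}
print(unchanged_pixels(pair["input"], pair["output"]))   # 1
print(histogram_change(pair["input"], pair["output"]))   # {0: -3, 1: 3}
```

Tasks whose output grid is a different size from the input would need some other comparison, so this returns `None` for them rather than guessing.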

ks20482y ago

Thanks, yeah lots more to look into. Just getting started! Thanks for your work. Your "Awesome ARC" page looks really helpful.

mkl2y ago· 2 in thread

The first example violates the last bit. The red blocks in the way of the rightmost strip start as

  R
  R R
  R R R

but get turned into

  R R
  R R
  R R R

Every other strip matches my rule.

tshadley2y ago

Sure looks like a typo. Contact author?

https://x.com/fchollet https://x.com/arcprize https://x.com/mikeknoop

lopuhin2y ago

yes looks like a bug in the example to me, feel free to report to https://github.com/fchollet/ARC-AGI/issues :)

nmca2y ago· 2 in thread

ARC is a noble endeavour but mistakes visual/spatial reasoning for reasoning and thus fails.

PontifexMinimus2y ago

No, I don't think it does. I think that the ideas in a system that could solve this type of problem would be highly generalisable to other tasks.

nmca2y ago

curious_cat_1632y ago· 2 in thread

So, this is a good idea. Having opinions about what AGI benchmarks should look like is a great way to argue about the kind of technology we want to build for the future.

However, why are the 100 test tasks secret? I don't understand why resisting “memorization” techniques requires it. Maybe someone can enlighten me.

muglug2y ago

If the tasks were public then it would be trivial to have a human figure out the answers, and then to train an LLM to memorise those answers.
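
A toy illustration of that point, not anything from the ARC tooling: a "solver" that is nothing but a lookup table of answers worked out by hand. The grids are borrowed from the sample task quoted elsewhere in this thread; the public/secret split and all names here are made up. It scores perfectly on tasks it has seen and zero on ones it hasn't, which is the gap a secret test set is meant to expose.

```python
def freeze(grid):
    """Turn a list-of-lists grid into a hashable tuple-of-tuples."""
    return tuple(tuple(row) for row in grid)

class MemorizingSolver:
    """Stores known input -> output pairs; learns no rule at all."""
    def __init__(self, answered_tasks):
        self.table = {freeze(t["input"]): t["output"] for t in answered_tasks}

    def solve(self, grid):
        return self.table.get(freeze(grid))  # None for unseen inputs

def accuracy(solver, tasks):
    """Fraction of tasks whose predicted output matches exactly."""
    return sum(solver.solve(t["input"]) == t["output"] for t in tasks) / len(tasks)

# Hypothetical split: two tasks published, one held back.
public_test = [
    {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
    {"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
]
secret_test = [
    {"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]},
]

solver = MemorizingSolver(public_test)   # answers leaked via the public set
print(accuracy(solver, public_test))     # 1.0
print(accuracy(solver, secret_test))     # 0.0
```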

andoando2y ago

Test data is always secret, no? Otherwise you can train on the test data and prod your algo to match the results as closely as possible.

TheDudeMan2y ago· 2 in thread

Where did the money come from? How about putting it toward alignment research instead of accelerating capabilities?

laurent_du2y ago

It comes from Knoop and Chollet's pockets. You are welcome to spend your own money to further whatever matters most to you.

flawn2y ago

Exactly my thoughts...

jolt422y ago· 2 in thread

On puzzle #23 (id: 11e1fe23), I'm sure there's more than one possible valid answer from the examples given. You can't tell if the expected distance is from the gray square or from the RGB squares.

neoneye22y ago

The task is here. https://neoneye.github.io/arc/edit.html?dataset=ARC&task=11e...

There are many examples where the test is slightly OOD (out of distribution), so the solver will have to generalize.

jolt422y ago

Not sure what you mean. There's a viable answer that's marked incorrect. The examples should show the pattern well enough to eliminate possible wrong answers, correct?

lxe2y ago· 2 in thread

gkamradt2y ago

We put a bunch of detail on getting started in the guide: https://arcprize.org/guide

Happy to answer any questions you have along the way

(I'm helping run ARC Prize)

flawn2y ago

I don't see where this helps @lxe with getting started (me being in a similar state to him).

mewpmewp22y ago· 2 in thread

montag2y ago

> submissions to Kaggle will not have access to the internet. Using a 3rd-party, cloud-hosted LLM is not possible.

https://arcprize.org/guide

mewpmewp22y ago

This largely takes away any odds of solving this. You definitely can't reproduce that for under a million dollars.

I have some ideas I want to try, and I still might. But all of them would require external tools.

nmca2y ago· 1 in thread

Prediction markets on the outcome:

https://manifold.markets/JacobPfau/will-the-arcagi-grand-pri...

0xDEAFBEAD2y ago

Another interesting one:

https://manifold.markets/Tossup/will-the-arcagi-grand-prize-...

logicallee2y ago· 1 in thread

Thank you for this generous contest, which brings important attention to the field of testing for AGI.

>Happy to answer questions!

1. Can humans take the complete test suite? Has any human done so? Is it timed? How long does it take a human? What is the highest a human who sat down and took the ARC-AGI test scored?

2. How surprised would you be if a new model jumped to scoring 100% or nearly 100% on ARC-AGI (including the secret test tasks)? What kind of test would you write next?

neoneye22y ago

There are 100 tasks that are hidden from the public and only exposed when running on an offline computer. So the solver has no prior knowledge about what these tasks are about.

Humans can try the 800 tasks here. There is no time limit. I recommend not starting with the `expert` tasks, but instead going with the `entry` level puzzles. https://neoneye.github.io/arc/?dataset=ARC

If a model jumps to 100%, that may be a clever program or maybe the program has been trained on the 100 hidden tasks. Fchollet has 100 more hidden tasks, for verifying this.

freediver2y ago· 1 in thread

This is amazing, and much needed. Thanks for organizing this. Makes me want to flex the programming muscle again.

dailykoder2y ago

Haha, great post! Well meme'd my friend!

treprinum2y ago· 1 in thread

Why is AGI important? I am worried we will create something slightly better than drosophila and put it in charge of all human-wide decision making...

fennecbutt2y ago

Good. An AI will probably do a better job than our politicians and disillusioned voters.

KBme2y ago· 1 in thread

How people can believe that a censored, politically correct process can get even close to something like AGI is baffling to me. Lysenkoism in computing.

gushogg-blake2y ago

What's censored/politically correct about ARC? Or do you mean AGI research in general?

arcastroe2y ago· 1 in thread

I'm curious, if it turns out that a simple rule-based algorithm exists, specifically tailored to solve (only!) ARC style problems, without generalization, would that still qualify for the reward?

montag2y ago

I don't think that's breaking any rules, and in fact it would help to expose a whole class of weaknesses in the test.

dskloet2y ago· 1 in thread

Puzzle 00576224 is ambiguous because the example input is symmetrical but the test input isn't.

itsgrimetime2y ago

Scroll over on the test input, there’s another example in the set that disambiguates

elicksaur2y ago

I’m a big fan of the ARC as a problem set to tackle. The sparseness of the data and infinite-ness of the rules which could apply make it much tougher than existing ML problem sets.

bigyikes2y ago

https://youtu.be/UakqL6Pj9xo

btbuildem2y ago

1: https://www.crn.com/news/applications-os/220100498/researche...

dang2y ago

Related ongoing thread:

Francois Chollet: OpenAI has set back the progress towards AGI by 5-10 years - https://news.ycombinator.com/item?id=40652818 - June 2024 (5 comments)

Lerc2y ago

I watched a video that covered ARC-AGI a few days ago. It had links to the old competition. It gave me much to think about. Nice to see a new run at it.

Not sure if I have the skills to make an entry, but I'll be watching at least.

Retr0id2y ago

Some very hand-wavey (and late) thoughts from an outsider:

nojvek2y ago

I did a few human examples by hand, but gotta do more of them to start seeing patterns.

skywhopper2y ago

“Given the success and proven economic utility of LLMs over the past 4 years, the above may seem like extraordinary claims. Strong claims require strong evidence.”

PontifexMinimus2y ago

The website gives an example:

    {
      "train": [
        {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
        {"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
        {"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]}
      ],
      "test": [
        {"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]}
      ]
    }

But why restrict yourself to JSON that codes for 2-d coloured grids? Why not also allow:

    {
      "train": [
        {"input": [[1, 0], [0, 0]], "output": 1},
        {"input": [[0, 0], [4, 0]], "output": 4},
        {"input": [[0, 0], [6, 0]], "output": 6}
      ]
    }

Where the rule might be to output the biggest number in the input, or add them up (and the solver has to work out which).
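
A quick sketch of what "the solver has to work out which" could look like, assuming a tiny hand-picked set of candidate rules (the rule names and the candidate list are made up for illustration, not part of ARC). With the three training pairs above, both "biggest number" and "add them up" fit, since each input has a single non-zero cell, so the examples alone don't settle it.

```python
import json

TASK = json.loads("""
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": 1},
    {"input": [[0, 0], [4, 0]], "output": 4},
    {"input": [[0, 0], [6, 0]], "output": 6}
  ]
}
""")

# Hypothetical candidate rules mapping a grid to a single number.
CANDIDATES = {
    "max": lambda g: max(v for row in g for v in row),
    "sum": lambda g: sum(v for row in g for v in row),
    "min": lambda g: min(v for row in g for v in row),
    "top_left": lambda g: g[0][0],
}

def consistent_rules(task, candidates):
    """Names of the rules that reproduce every training output."""
    return [name for name, rule in candidates.items()
            if all(rule(p["input"]) == p["output"] for p in task["train"])]

print(consistent_rules(TASK, CANDIDATES))  # ['max', 'sum']

# An extra pair with two non-zero cells would separate the survivors:
probe = [[2, 3], [0, 0]]
print(CANDIDATES["max"](probe), CANDIDATES["sum"](probe))  # 3 5
```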

ryanoptimus2y ago

Looks like Bongard problems for the referenced problem-solving tasks: https://en.wikipedia.org/wiki/Bongard_problem

z3phyr2y ago

I can see many problems can be solved with modern symbolic approaches like theorem provers, dependent types, pattern matching etc. But I will have to dive in to actually confirm it.

chairhairair2y ago

These puzzles are fun and challenging in the same way that puzzles from video games like The Witness and Baba Is You are.

I bet you could use those puzzles as benchmarks as well.

ilaksh2y ago

HarHarVeryFunny2y ago

I have two questions:

1) Who is providing the prize money, and if it is yourself and Francois personally, then what is your motivation?

blendergeek2y ago

The tests are only playable by people with normal color-vision.

Is there a "color-blind friendly" mode?

PontifexMinimus2y ago

Just to let you know I found your website unreadable due to:

- annoying animated background

- white text on black background

- annoying font choices

Which is unfortunate because (as I found when I used Firefox reader mode) you're discussing important and interesting stuff.

mishamagic2y ago

https://buildermath.substack.com/p/talking-to-ais-arc-1

bilsbie2y ago

Reach out if anyone wants to work on this. I think it would be more fun as a group.

djoldman2y ago

Anyone have a list of benchmarks that do not release the actual test set?

Anyone else share the suspicion that ML rapidly approaching 100% on benchmarks is sometimes due to releasing the test set?

ummonk2y ago

What kind of "bigger labs" have attempted it and how much was their training budget?

lenerdenator2y ago

What guarantee exists to make sure that the intelligence developed has an inclination towards good?

flawn2y ago

Do we want to find AGI yet though?

chx2y ago

adamgordonbell2y ago

AGI won't struggle with colors like some of us then.

empath752y ago

This is like offering a one million dollar prize for curing cancer. It's sort of pointless to offer a prize for something people are spending orders of magnitude more on trying to do anyway.

lamontcg2y ago

AGI should really be able to do what only a select few humans can do and construct its own mathematical systems to prove presently unsolved conjectures (the Shinichi Mochizuki test of AGI).

s1k3s2y ago

Is this open as in "OpenAI" or what are we doing here?

thatxliner2y ago

So... isn't this basically just a CAPTCHA?

EternalFury2y ago

If it passed The Area 101 Test, it would already be amazing, as this is a trivial test that goes against the fundamental principles of LLMs.

barfbagginus2y ago

If someone had AGI, wouldn't it be far more lucrative than $1m to keep it under wraps and use it to do business with a huge technical advantage?

I feel like a prize of a billion dollars would be more effective.

m3kw92y ago

Lowballing the crowd with this, I see.

breck2y ago

I can beat the SOTA using ICS (https://breckyunits.com/intelligence.html)

If you make your site public domain, and drop the (C), I'll compete.
