~$3400 per single task to meet human performance on this benchmark is a lot. Also, it shows the bullets as "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (e.g. via the API they showed off last week), so even more compute went into this task.
We can compare this roughly to a human doing ARC-AGI puzzles, where a human will take (high variance, in my subjective experience) between 5 seconds and 5 minutes to solve a task. (So I'd argue a human is at $0.03-$1.67 per puzzle at $20/hr, and their document quotes an average Mechanical Turker at about $2 per task.)
Going the other direction: I am interpreting this result as human-level reasoning now costing (approximately) $41k/hr to $2.5M/hr with current compute.
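For anyone who wants to check that arithmetic, a minimal sketch in Python (assuming the ~$3400/task figure and the 5-second-to-5-minute human range above):

    # back-of-envelope: $3,400 per task divided by a human-time-equivalent of 5 min or 5 s per task
    cost_per_task = 3400
    print(cost_per_task / (5 / 60))    # ~40,800 $/hr for 5-minute tasks ("41k/hr")
    print(cost_per_task / (5 / 3600))  # ~2,448,000 $/hr for 5-second tasks ("2.5M/hr")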
Super exciting that OpenAI pushed the compute out this far so we could see the o-series scaling continue and intersect humans on ARC. Now we get to work towards making this economical!
So, considering that the $3400/task system isn't able to compete with a STEM college grad yet, we still have some room (but it is shrinking; I expect even more compute will be thrown at this and we'll see these barriers broken in the coming years).
Also, some other back of envelope calculations:
The gap in cost between o3 High and avg. Mechanical Turkers (humans) is roughly 10^3. Pure GPU cost improvement (~doubling every 2-2.5 years) puts us at 20-25 years to close it.
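Rough math behind that 20-25 year figure, assuming the ~10^3 gap and the 2-2.5 year doubling period stated above:

    import math
    doublings = math.log2(1e3)                 # ~10 doublings needed to close a 10^3 cost gap
    print(doublings * 2, doublings * 2.5)      # ~20 to ~25 years at one doubling per 2-2.5 years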
The question now is: can we close this "to human" gap (10^3) quickly with algorithms, or are we stuck waiting 20-25 years for GPU improvements? (I think it feels obvious: this is new technology, things are moving fast, the chance for algorithmic innovation here is high!)
I also personally think that we need to adjust our efficiency priors and start looking not at "humans" as the bar to beat, but at theoretical computable limits (which show gaps much larger, ~10^9-10^15, for modest problems). Though it may simply be the case that tool/code use + AGI at near-human cost covers a lot of that gap.
it's even more exciting than that. the fact that you even can use more compute to get more intelligence is a breakthrough. if they spent even more on inference, would they get even better scores on arc agi?
Simple turn-based games such as chess turned out to be too far away from anything practical and chess-engine-like programs were never that useful. It is entirely possible that this will end up in a similar situation. ARC-like pattern matching problems or programming challenges are indeed a respectable challenge for AI, but do we need a program that is able to solve them? How often does something like that come up really? I can see some time-saving in using AI vs StackOverflow in solving some programming challenges, but is there more to this?
To move beyond that, the thing has to start thinking for itself, some auto feedback loop, training itself on its own thoughts. Interestingly, this could plausibly be vastly more efficient than training on external data because it's a much tighter feedback loop and a smaller dataset. So it's possible that "nearly AGI" leads to ASI pretty quickly and efficiently.
Of course it's also possible that the feedback loop, while efficient as a computation process, isn't efficient as a learning / reasoning / learning-how-to-reason process, and the thing, while as intelligent as a human, still barely competes with a worm in true reasoning ability.
Interesting times.
On a very simple, toy task, which ARC-AGI basically is. ARC-AGI tests are not hard per se; LLMs just find them hard. We do not know how this scales to more complex, real-world tasks.
report says it is $17 per task, and $6k for whole dataset of 400 tasks.
Fundamentally it's a search through some enormous state space. Advancements are "tricks" that let us find useful subsets more efficiently.
Zooming way out, we have a bunch of social tricks, hardware tricks, and algorithmic tricks that have resulted in a super useful subset. It's not the subset that we want though, so the hunt continues.
Hopefully it doesn't require revising too much in the hardware & social bag of tricks; those are a lot more painful to revisit...
You compare this to "a human" but also admit there is a high variation.
And I would say there are a lot of humans being paid ~$3400 per month. Not for a single task, true, but honestly for no value-creating task at all. Just for their time.
So what about we think in terms of output rather than time?
YT timestamped link: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for the fixed link @photonboom)
Updated: I gave the task to Claude 3.5 Sonnet and it worked first shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-faa5aa...
Though, of course, one can argue that lots of human-written code is not much different from this.
I've been doing similar stuff in Claude for months and it's not that impressive when you see how limited they really are once you go beyond boilerplate.
A lot of people have criticized ARC as not being relevant or indicative of true reasoning, but I think it was exactly the right thing. The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.
It's obvious to everyone that these models can't perform as well as humans on everyday tasks despite blowout scores on the hardest tests we give to humans. Yet nobody could quantify exactly the ways the models were deficient. ARC is the best effort in that direction so far.
We don't need more "hard" benchmarks. What we need right now are "easy" benchmarks that these models nevertheless fail. I hope Francois has something good cooked up for ARC 2!
LLMs are below human evaluation, as I last looked, but it doesn't get much attention.
Once it is passed, I'd like to see one that is solving the mystery in a mystery book right before it's revealed.
We'd need unpublished mystery novels to use for that benchmark, but I think it gets at what I think of as reasoning.
Not sure I understand how this follows. The fact that a certain type of model does well on a certain benchmark means that the benchmark is relevant for a real-world reasoning? That doesn't make sense.
Models have regularly made progress on it, this is not new with the o-series.
Doing astoundingly well on it, and having a mutually shared PR interest with OpenAI in this instance, doesn't mean a pile of visual puzzles is actually AGI or some well thought out and designed benchmark of True Intelligence(tm). It's one type of visual puzzle.
I don't mean to be negative, but to inject a memento mori. Real story is some guys get together and ride off Chollet's name with some visual puzzles from ye olde IQ test, and the deal was Chollet then gets to show up and say it proves program synthesis is required for True Intelligence.
Getting this score is extremely impressive but I don't assign more signal to it than any other benchmark with some thought to it.
This benchmark has done a wonderful job with marketing by picking a great name. It's largely irrelevant for LLMs despite the fact it's difficult.
Consider how much of the model is just noise for a task like this given the low amount of information in each token and the high embedding dimensions used in LLMs.
This[1] is currently the most challenging benchmark. I would like to see how O3 handles it, as O1 solved only 1%.
Once a model recognizes a weakness through reasoning with CoT when posed to a certain problem and gets the agency to adapt to solve that problem that's a precursor towards real AGI capability!
One might also interpret that as "the fact that models which are studying to the test are getting better at the test" (Goodhart's law), not that they're actually reasoning.
I wonder how well the latest Claude 3.5 Sonnet does on this benchmark and if it's near o1.
This means we have an algorithm to get to human level performance on this task.
If you think this task is an eval of general reasoning ability, we have an algorithm for that now.
There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.
Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!
[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1
But, still, this is incredibly impressive.
It really suggests one of two things.
1. You don't know what you're talking about.
2. You have a perverse incentive to believe this such that you will preach it to others and elevate some job salary range or stock.
Either way, not a good look.
What has been lacking so far in frontier LLMs is the ability to reliably deal with the right level of abstraction for a given problem. Reasoning is useful but often comes out lacking if one cannot reason at the right level of abstraction. (Note that many humans can't either when they deal with unfamiliar domains, although that is not the case with these models.)
ARC has been challenging precisely because solving its problems often requires:
1) using multiple different *kinds* of core knowledge [1], such as symmetry, counting, color, AND
2) using the right level(s) of abstraction
Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforces, AIME, and Frontier Math, suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it. Yes, this includes out-of-distribution problems that most humans can solve. It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could. But not many humans can either.
[1] https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...
ADDED:
Thanks to the link to Chollet's posts by lswainemoore below. I've analyzed some easy problems that o3 failed at. They involve spatial intelligence, including connection and movement. This skill is very hard to learn from textual and still image data.
I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross. (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)
LLMs aren't really capable of "learning" anything outside their training data. Which I feel is a very basic and fundamental capability of humans.
Every new request thread is a blank slate utilizing whatever context you provide for the specific task, and after the thread is done (or the context limit runs out) it's like it never happened. Sure, you can use databases, do web queries, etc., but these are inflexible band-aid solutions, far from what's needed for AGI.
I believe about 90% of the tasks were estimated by humans to take less than one hour to solve, so we aren't talking about very complex problems. To boot, the contamination factor is huge: o3 (or any big model) will have in-depth knowledge of the internals of these projects, and often even know about the individual issues themselves (e.g. you can ask what GitHub issue #4145 in project foo was, and there's a decent chance it can tell you exactly what the issue was about!)
We are nowhere close to what Sam Altman calls AGI and transformers are still limited to what uniform-TC0 can do.
As an example the Boolean Formula Value Problem is NC1-complete, thus beyond transformers but trivial to solve with a TM.
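To make the contrast concrete, here is a minimal sketch of why the Boolean Formula Value Problem is trivial for an ordinary program: a toy recursive evaluator over fully parenthesized 0/1 formulas (the syntax here is just an illustration, not a standard encoding):

    def eval_formula(s: str) -> bool:
        """Evaluate a fully parenthesized boolean formula over 0/1 with ~, & and |."""
        def parse(i):
            if s[i] == '~':                      # negation of whatever follows
                v, i = parse(i + 1)
                return (not v), i
            if s[i] == '(':                      # '(' expr op expr ')'
                left, i = parse(i + 1)
                op = s[i]
                right, i = parse(i + 1)
                value = (left and right) if op == '&' else (left or right)
                return value, i + 1              # skip ')'
            return s[i] == '1', i + 1            # constant 0 or 1

        value, _ = parse(0)
        return value

    print(eval_formula("((1&0)|~0)"))            # True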
As it is now proven that the frame problem is equivalent to the halting problem, even if we can move past uniform-TC0 limits, novelty is still a problem.
I think the advancements are truly extraordinary, but unless you set the bar very low, we aren't close to AGI.
Heck we aren't close to P with commercial models.
The article notes, "o3 still fails on some very easy tasks". What explains these failures if o3 can solve "any problem" at the human level? Do these failed cases require some essential knowledge that has eluded the massive OpenAI training set?
For example, if they produce millions of examples of the type of problems o3 still struggles on, it would probably do better at similar questions.
Perhaps the private data set is different enough that this isn't a problem, but the ideal situation would be unveiling a truly novel dataset, which it seems like ARC aims to do.
I think that's where most hardware startups will specialize in the coming decades: different industries with different needs.
Every human does this dozens, hundreds or thousands of times ... during childhood.
If you look at the ARC tasks failed by o3, they're really not well suited to humans. They lack the living context humans thrive on, and have relatively simple, analytical outcomes that are readily processed by simple structures. We're unlikely to see AI as "smart" until it can be asked to accomplish useful units of productive professional work at a "seasoned apprentice" level. Right now they're consuming ungodly amounts of power just to pass some irritating, sterile SAT questions. Train a human for a few hours a day over a couple weeks and they'll ace this no problem.
It works the same with humans. If they spend more time on the puzzle they are more likely to solve it.
While beyond current models, that would be the final test of AGI capability.
I think this is a mistake.
Even if very high costs make o3 uneconomic for businesses, it could be an epoch defining development for nation states, assuming that it is true that o3 can reason like an averagely intelligent person.
Consider the following questions that a state actor might ask itself: What is the cost to raise and educate an average person? Correspondingly, what is the cost to build and run a datacenter with a nuclear power plant attached to it? And finally, how many person-equivalent AIs could be run in parallel per datacenter?
There are many state actors, corporations, and even individual people who can afford to ask these questions. There are also many things that they'd like to do but can't because there just aren't enough people available to do them. o3 might change that despite its high cost.
So if it is true that we've now got something like human-equivalent intelligence on demand - and that's a really big if - then we may see its impacts much sooner than we would otherwise intuit, especially in areas where economics takes a back seat to other priorities like national security and state competitiveness.
“SO IS IT AGI?
ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.
Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”
The high compute variant sounds like it cost around *$350,000*, which is kinda wild. Lol, the blog post specifically mentioned how OpenAI asked ARC-AGI not to disclose the exact cost for the high compute version.
Also, one odd thing I noticed is that the graph in their blog post shows the top 2 scores as "tuned" (this was not displayed in the live demo graph). This suggests that in those cases the model was trained to better handle these types of questions, so I do wonder about data / answer contamination in those cases...
Something I missed until I scrolled back to the top and reread the page was this
> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set
So yeah, the results were specifically from a version of o3 trained on the public training set
Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There really aren't any spoken rules though, just pattern observation. Thus, if you want to teach the AI how to play the game, you must train it.
On the other hand though, I don't think the o1 models nor Claude were trained on the dataset, in which case it isn't a completely fair competition. If I had to guess, you could probably get 60% on o1 if you trained it on the public dataset as well.
The key point is that scaling test-time compute will likely be key to achieving AGI/ASI. Costs will definitely come down, as evidenced by precedent: Moore's law, o3-mini being cheaper than o1 with improved performance, etc.
https://a16z.com/llmflation-llm-inference-cost/
Look at the log scale slope, especially the orange MMLU > 83 data points.
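A hedged back-of-envelope on that slope: if the roughly 10x-per-year price decline the a16z post charts were to hold (a big assumption), a ~10^3 per-task cost gap closes in about three years; at a slower 3x per year it takes roughly twice as long:

    import math
    gap = 1e3
    print(math.log(gap, 10))   # ~3 years at a 10x/year cost decline
    print(math.log(gap, 3))    # ~6.3 years at 3x/year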
that's why everyone's thinking about compute expense. but I guess in terms of a "lifetime expense of a person" even someone who costs $10/hr isn't actually all that cheap, considering what it takes to grow a human into a fully functioning person that's able to just do stuff
Of course, o3 looks strong on other benchmarks as well, and sometimes "spend a huge amount of compute for one problem" is a great feature to have available if it gets you the answer you needed. So even if there's some amount of "ARC-AGI wasn't quite as robust as we thought", o3 is clearly a very powerful model.
from reading Dennett's philosophy, I'm convinced that that's how human intelligence works - for each task that "only a human could do that", there's a trick that makes it easier than it seems. We are bags of tricks.
But I don't much agree that it is any meaningful step towards AGI. Maybe it's a nice proof point that AI can solve simple problems presented in intentionally opaque ways.
If humans were given the json as input rather than the images, they’d have a hard time, too.
We shine light in text patterns at humans rather than inject the text directly into the brain as well, that is extremely unfair! Imagine how much better humans would be at text processing if we injected and extracted information from their brains using the neurons instead of eyes and hands.
I am not really trying to downplay O3. But this would be a simple test as to whether O3 is truly "a system capable of adapting to tasks it has never encountered before" versus novel ARC-AGI tasks it hasn't encountered before.
Taking this a level of abstraction higher, I expect that in the next couple of years we'll see systems like o3 given a runtime budget that they can use for training/fine-tuning smaller models in an ad-hoc manner.
From this it makes sense why the original models did poorly and why iterative chain of thought is required - the challenge is designed to be inherently iterative such that a zero shot model, no matter how big, is extremely unlikely to get it right on the first try. Of course, it also requires a broad set of human-like priors about what hypotheses are “simple”, based on things like object permanence, directionality and cardinality. But as the author says, these basic world models were already encoded in the GPT 3/4 line by simply training a gigantic model on a gigantic dataset. What was missing was iterative hypothesis generation and testing against contradictory examples. My guess is that O3 does something like this:
1. Prompt the model to produce a simple rule to explain the nth example (randomly chosen)
2. Choose a different example, ask the model to check whether the hypothesis explains this case as well. If yes, keep going. If no, ask the model to revise the hypothesis in the simplest possible way that also explains this example.
3. Keep iterating over examples like this until the hypothesis explains all cases. Occasionally, new revisions will invalidate already solved examples. That’s fine, just keep iterating.
4. Induce randomness in the process (through next-word sampling noise, example ordering, etc) to run this process a large number of times, resulting in say 1,000 hypotheses which all explain all examples. Due to path dependency, anchoring and consistency effects, some of these paths will end in awful hypotheses - super convoluted and involving a large number of arbitrary rules. But some will be simple.
5. Ask the model to select among the valid hypotheses (meaning those that satisfy all examples) and choose the one that it views as the simplest for a human to discover.
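A runnable skeleton of that guessed loop (to be clear: this is the commenter's hypothesis, not anything OpenAI has disclosed; `propose`, `explains`, `revise`, and `simplest` are hypothetical stand-ins for LLM calls):

    import random

    def search(examples, propose, explains, revise, simplest, n_paths=1000, max_steps=50):
        candidates = []
        for _ in range(n_paths):
            order = random.sample(examples, len(examples))   # step 4: randomness via example order
            hypothesis = propose(order[0])                   # step 1: explain one example
            for _ in range(max_steps):                       # steps 2-3: iterate until all fit
                failing = [ex for ex in order if not explains(hypothesis, ex)]
                if not failing:
                    candidates.append(hypothesis)
                    break
                hypothesis = revise(hypothesis, failing[0])  # minimal revision for a failing case
        return simplest(candidates)                          # step 5: pick the simplest valid one

    # toy demo: a "hypothesis" is an integer threshold; an example is explained if it is below it
    print(search([3, 7, 2, 9],
                 propose=lambda ex: ex + 1,
                 explains=lambda h, ex: ex < h,
                 revise=lambda h, ex: ex + 1,
                 simplest=min))                              # prints 10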
Took me less time to figure out the 3 examples than it took to read your post.
I was honestly a bit surprised to see how visual the tasks were. I had thought they were text based. So now I'm quite impressed that o3 can solve this type of task at all.
My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.
I know my skepticism here is identical to moving goalposts. More and more I am shifting my personal understanding of general intelligence as a phenomenon we will only ever be able to identify with the benefit of substantial retrospect.
As it is with any sufficiently complex program, if you could discern the result beforehand, you wouldn't have had to execute the program in the first place.
I'm not trying to be a downer on the 12th day of Christmas. Perhaps because my first instinct is childlike excitement, I'm trying to temper it with a little reason.
All it needs to be is useful. Reading constant comments about LLMs can't be general intelligence or lack reasoning etc, to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings (a large portion of the population held that point of view back then).
It doesn't need to be general intelligence for the rapid advancement of LLM capabilities to be the most societal shifting development in the past decades.
Once you look at it that way, the approach really doesn't look like intelligence that's able to generalize to novel domains. It doesn't pass the sniff test. It looks a lot more like brute-forcing.
Which is probably why, in order to actually qualify for the leaderboard, they stipulate that you can't use more than $10k of compute. Otherwise, it just sounds like brute-forcing.
Could anyone confirm if this is the only kind of question in the benchmark? If yes, how come there is such a direct connection to "oh this performs better than humans" when LLMs can be quite a bit better than us at understanding and forecasting patterns? I'm just curious, not trying to stir up controversy.
But isn’t it interesting to have several benchmarks? Even if it’s not about passing the Turing test, benchmarks serve a purpose—similar to how we measure microprocessors or other devices. Intelligence may be more elusive, but even if we had an oracle delivering the ultimate intelligence benchmark, we'd still argue about its limitations. Perhaps we'd claim it doesn't measure creativity well, and we'd find ourselves revisiting the same debates about different kinds of intelligences.
Humans clearly don't know, unambiguously, what intelligence is. There's also no divinely ordained objective dictionary that one can point at to reference what true intelligence is. A deep reflection on trying to pattern-associate different human cognitive abilities indicates human cognitive capabilities aren't that spectacular, really.
Maybe it would help to include some human results in the AI ranking.
I think we'd find that Humans score lower?
Indistinguishable from goalpost moving like you said, but also no true Scotsman.
I'm curious what would happen in your eyes if we misattributed general intelligence to an AI model? What are the consequences of a false positive and how would they affect your life?
It's really clear to me how intelligence fits into our reality as part of our social ontology. The attributes and their expression that each of us uses to ground our concept of the intelligent predicate differs wildly.
My personal theory is that we tend to have an exemplar-based dataset of intelligence, and each of us attempts to construct a parsimonious model of intelligence, but like all (mental) models, they can be useful but wrong. These models operate in a space where the trade off is completeness or consistency, and most folks, uncomfortable saying "I don't know" lean toward being complete in their specification rather than consistent. The unfortunate side-effect is that we're able to easily generate test data that highlights our model inconsistency - AI being a case in point.
If they use a model API, then surely OpenAI has access to the private test set questions and can include it in the next round of training?
(I am sure I am missing something.)
In practice I assume they just gave them the benchmarks and took it on the honor system they wouldn't cheat, yeah. They can always cook up a new test set for next time, it's only 10% of the benchmark content anyway and the results are pretty close.
There’s a fully private test set too as I understand it, that o3 hasn’t run on yet.
Basically, it's got the dumbest and simplest things in it. Stuff like a lock and key, a glass of water and jug, common units of currency, a zipper, etc. It tests if you can do any of those common human tasks. Like pouring a glass of water, picking up coins from a flat surface (I chew off my nails so even an able person like me fails that), zip up a jacket, lock your own door, put on lipstick, etc.
We had hand prosthetics that could play Mozart at 5x speed on a baby grand, but could not pick up a silver dollar or zip a jacket even a little bit. To the patients, the hands were therefore about as useful as a metal hook (a common solution with amputees today, not just pirates!).
Again, a total aside here, but your comment just reminded me of that brown briefcase. Life, it turns out, is a lot more complex than we give it credit for. Even pouring the OJ can be, in rare cases, transcendent.
Getting to LLMs that could talk to us turned out to be a lot easier than making something that could control even a robotic arm without precise programming, let alone a humanoid.
> Rodney Brooks explains that, according to early AI research, intelligence was "best characterized as the things that highly educated male scientists found challenging", such as chess, symbolic integration, proving mathematical theorems and solving complicated word algebra problems. "The things that children of four or five years could do effortlessly, such as visually distinguishing between a coffee cup and a chair, or walking around on two legs, or finding their way from their bedroom to the living room were not thought of as activities requiring intelligence."
I think a lot about carpentry. From the outside, it's pretty easy: Just make the wood into the right shape and stick it together. But as one progresses, the intricacies become more apparent. Variations in the wood, the direction of the grain, the seasonal variations in thickness, joinery techniques that are durable but also time efficient.
The way this information connects is highly multisensory and multimodal. I now know which species of wood to use for which applications. This knowledge was hard won through many, many mistakes and trials that took place at my home, the hardware store, the lumberyard, on YouTube, from my neighbor Steve, and in books written by experts.
Like in your test
a hand grenade and a pin - don't pull the pin.
Or maybe a mousetrap? but maybe that would be defused?
in the ai test...
or Global Thermonuclear War, the only winning move is...
I must be missing something, how can they be able to play Mozart at 5x speed with their prosthetics but not zip a jacket? They could press keys but not do tasks requiring feedback?
Or did you mean they used to play Mozart at 5x speed before they became amputees?
(nothing wrong with it! I'm just trying to prune the top subthread)
In that sense, the goalposts haven’t moved in a long time despite claims from AI enthusiasts that people are constantly moving goalposts.
I'd love to know more about this.
It's not AGI, obviously, in the sense that you still need to do some problem framing and initialization to kickstart the reasoning path simulations.
"Well, yeah, but its kind of expensive" -- this guy
Semi-private eval (100 tasks): 75.7% at $2,012 total (~$20/task), with just 6 samples per task and 33M tokens processed, at ~1.3 min/task.
The “low-efficiency” setting with 1024 samples scored 87.5% but required 172x more compute.
If we assume compute spent and cost are proportional, then OpenAI might have just spent ~$346,064 for the low efficiency run on the semi-private eval.
On the public eval they might have spent ~$1,148,444 to achieve 91.5% with the low efficiency setting. (High-efficiency mode: $6,677.)
OpenAI just spent more money to run an eval on ARC than most people spend on a full training run.
I double checked with some flop estimates (P100 for 12 hours = Kaggle limit, they claim ~100-1000x for O3-low, and x172 for O3-high) so roughly on the order of 10^22-10^23 flops.
In another way, using H100 market price $2/chip -> at $350k, that's ~175k hours. Or 10^24 FLOPs in total.
So, huge margin, but 10^22 - 10^24 flop is the band I think we can estimate.
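For anyone who wants to sanity-check the band, a sketch with rounded assumptions (~2e13 FLOP/s for a P100-class GPU, ~1e15 FLOP/s for an H100, $2/GPU-hour):

    kaggle_budget = 2e13 * 12 * 3600                        # P100-class GPU for 12 hours ~ 8.6e17 FLOP
    o3_low = (kaggle_budget * 1e2, kaggle_budget * 1e3)     # claimed 100-1000x the Kaggle budget
    o3_high = tuple(x * 172 for x in o3_low)                # ~1.5e22 to ~1.5e23 FLOP
    h100_route = 350_000 / 2 * 3600 * 1e15                  # $350k at $2/hr and 1e15 FLOP/s ~ 6e23
    print(o3_high, f"{h100_route:.1e}")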
These are the scale of numbers that show up in the chinchilla optimal paper, haha. Truly GPT-3 scale models.
I think soon we'll be pricing any kind of tasks by their compute costs. So basically, human = $50/task, AI = $6,000/task, use human. If AI beats human, use AI? Ofc that's considering both get 100% scores on the task
The thing is, given what we've seen from distillation and tech, even if it's $6,000/task... that will come down drastically over time through optimization and just... faster, more efficient processing hardware and software.
That makes something like this competitive in ~3 years
From what I can see, presuming o3 is a progression of o1 and has a good level of accountability bubbling up during 'inference' (i.e. "Thinking about ___"), then I'd say it's just using up millions of old-school tokens (the 44 million tokens that are referenced). So not latent thinking per se.
Sonnet 3.5 remains the king of the hill by quite some margin
That said, I think its code style is arguably better, more concise and has better patterns -- Claude needs a fair amount of prompting and oversight to not put out semi-shitty code in terms of structure and architecture.
In my mind: going from Slowest to Fastest, and Best Holistically to Worst, the list is:
1. o1-pro 2. Claude 3.5 3. Gemini 2 Flash
Flash is so fast, that it's tempting to use more, but it really needs to be kept to specific work on strong codebases without complex interactions.
In a way it almost feels like it's become too good at following instructions and simply just takes your direction more literally. It doesn't seem to take the initiative of going the extra mile of filling in the blanks from your lazy input (note: many would see this as a good thing). Claude on the other hand feels more intuitive in discerning intent from a lazy prompt, which I may be prone to offering it at times when I'm simply trying out ideas.
However, if I take the time to write up a well thought out prompt detailing my expectations, I find I much prefer the code o1 creates. It's smarter in its approach, offers clever ideas I wouldn't have thought of, and generally cleaner.
Or put another way, I can give Sonnet a lazy or detailed prompt and get a good result, while o1 will give me an excellent result with a well thought out prompt.
What this boils down to is I find myself using Sonnet while brainstorming ideas, or when I simply don't know how I want to approach a problem. I can pitch it a feature idea the same way a product owner might pitch an idea to an engineer, and then iterate through sensible and intuitive ways of looking at the problem. Once I get a handle on how I'd like to implement a solution, I type up a spec and hand it off to o1 to crank out the code I'd intend to implement.
https://myswamp.substack.com/p/benchmarking-llms-against-com...
For coding, o1 is marvelous at Leetcode questions. I think it is the best teacher I could ever afford to teach me leetcoding, but I don't find myself having a lot of other use cases for o1 that are complex and require a really long reasoning chain.
For example I used the prompt "As an astronaut in China, would I be able to see the great wall?" and since the training data for all LLMs is full of text dispelling the common myth that the great wall is visible from space, LLMs do not notice the slight variation that the astronaut is IN China. This has been a sobering reminder to me as discussion of AGI heats up.
Carefully analyze questions to not overlook subtle details. Take each question "as-is", don't guess what they mean -- interpret them as any reasonable person would.
Really want to see the number of training pairs needed to achieve this score. If it only takes a few pairs, say 100, I would say it is amazing!
> ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI
It feels so insensitive to do that right before a major holiday, when the likely outcome is a lot of people feeling less secure in their career/job/life.
Thanks again openAI for showing us you don’t give a shit about actual people.
What a weird way to react to this.
Many of us look forward to what a future with AGI can do to help humanity and hopefully change society for the better, mainly to achieve a post scarcity economy.
>> Second, you need the ability to recombine these functions into a brand new program when facing a new task – a program that models the task at hand. Program synthesis.
"Program synthesis" is here used in an entirely idiosyncratic manner, to mean "combining programs". Everyone else in CS and AI for the last many decades has used "Program Synthesis" to mean "generating a program that satisfies a specification".
Note that "synthesis" can legitimately be used to mean "combining". In Greek it translates literally to "putting [things] together": "Syn" (plus) "thesis" (place). But while generating programs by combining parts of other programs is an old-fashioned way to do Program Synthesis, in the standard sense, the end result is always desired to be a program. The LLMs used in the article to do what F. Chollet calls "Porgram Synthesis" generate no code.
Combining programs should be straightforward for DNNs, ordering, mixing, matching concepts by coordinates and arithmetic in learned high-dimensional embedded-space. Inference-time combination is harder since the model is working with tokens and has to keep coherence over a growing CoT with many twists, turns and dead-ends, but with enough passes can still do well.
The logical next step to improvement is test-time training on the growing CoT, using reinforcement-fine-tuning to compress and organize the chain-of-thought into parameter-space--if we can come up with loss functions for "little progress, a lot of progress, no progress". Then more inference-time with a better understanding of the problem, rinse and repeat.
I hypothesize that something similar is going on here. OpenAI has not published (or I have not seen) the number of reasoning tokens it took to solve these - we do know that each task cost thousands of dollars. If "a picture is worth a thousand words", could we make AI systems that can reason visually, with much better performance?
I wonder if anyone has experimented with having some sort of "visual" scratchpad instead of the "text-based" scratchpad that CoT uses.
I don't understand this mindset. We have all experienced that LLMs can produce words never spoken before. Thus there is recombination of knowledge at play. We might not be satisfied with the depth/complexity of the combination, but there isn't any reason to believe something fundamental is missing. Given more compute and enough recursiveness we should be able to reach any kind of result from the LLM.
The linked article says that LLMs are like a collection of vector programs. It has always been my thinking that computations in vector space are easy to make turing complete if we just have an eigenvector representation figured out.
That was always true for NNs in general, yet it took a very specific structure to get to where we are now. (..with a certain amount of time and resources.)
> thinking that computations in vector space are easy to make turing complete if we just have an eigenvector representation figured out
Sounds interesting, would you elaborate?
* The upward bound of compute/performance gains as we continue to iterate on LLMs. It simply isn't going to be feasible for a lot of engineers and businesses to run/train their own LLMs. This means an inherent reliance on cloud services to bridge the gap (something MS is clearly betting on), and engineers to build/maintain the integration from these services to whatever business logic their customers are buying.
* Skilled knowledge workers continuing to be in-demand, even factoring in automation and new-grad numbers. Collectively, we've built a better hammer; it still takes someone experienced enough to know where to drive the nail. These tools WILL empower the top N% of engineers to be more productive, which is why it will be more important than ever to know _how_ to build things that drive business value, rather than just how to churn through JIRA tickets or turn a pretty Figma design into React.
In general, with the technology advancing as rapidly as it is, and the trillions of dollars oriented towards replacing knowledge work, I don't see a future in this field. And that's despite me being on a very promising path myself! I'm 25, in the middle of a CS PhD in Germany, with an impressive CV behind me. My head may be the last on the chopping block, but I'd be surprised if it buys me more than a few years once programmer obsolescence truly kicks in.
Indeed, what I think are safe jobs are jobs with fundamental human interaction. Nurses, doctors, kindergarten teachers. I myself have been considering pivoting to becoming a skiing teacher.
Maybe one good thing that comes out of this is breaking my "wunderkind" illusion. I spent my teens writing C++ code instead of going out socializing and making friends. Of course, I still did these things, but I could've been far less of a hermit.
I mirror your sentiment of spending these next few years living life; Real life. My advice: Stop sacrificing the now for the future. See the world, go on hikes with friends, go skiing, attend that bouldering thing your friends have been telling you about. If programming is something you like doing, then by all means keep going and enjoy it. I will likely keep programming too, it's just no longer the only thing I focus on.
Edit: improve flow of last paragraph
Feels like I hit the real world just a couple years too late to get situated in a solid position. Years of obsession in attempt to catch up to the wizards, chasing the tech dream. But this, feels like this is it. Just watching the timebomb tick. I'd love to work on what feels like the final technology, but I'm not a freakshow like what these labs are hiring. At least I get to spectate the creation of humanity's greatest invention.
This announcement is just another gut punch, but at this point I should expect its inevitable. A Jason Voorhees AGI, slowly but surely to devour all the talents and skills information workers have to offer.
Apologies for the rambly and depressing post, but this is reality for anyone recently out or still in school.
Everything seems so uncertain, and the pace of technological advancement makes long-term planning feel almost impossible. Your plan to move to a slower-paced area and enjoy the outdoors sounds incredibly grounding - it's something I've been considering myself.
So, next step in reasoning is open world reasoning now?
If we're inferring the answers of the block patterns from minimal or no additional training, it's very impressive, but how much time have they had to work on O3 after sharing puzzle data with O1? Seems there's some room for questionable antics!
Why is the ARC challenge difficult but coding problems are easy? The two examples they give for ARC (border width and square filling) are much simpler than pattern awareness I see simple models find in code everyday.
What am I misunderstanding? Is it that one is a visual grid context which is unfamiliar?
Assuming for a moment that the cost per task has a linear relationship with compute, then it costs a little more than $1 million to get that score on the public eval.
The results are cool, but man, this sounds like such a busted approach.
You'll know AGI is here when traditional captchas stop being a thing due to their lack of usefulness.
Some folks say we could fix this with universal basic income, where everyone gets enough money to live on, but I'm not optimistic that it'll be an easy transition. Plus, there's this possibility that whoever controls these 'AGI' systems basically controls everything. We definitely need to figure this stuff out before it hits us, because once these changes start happening, they're probably going to happen really fast. It's kind of like we're building this awesome but potentially dangerous new technology without really thinking through how it's going to affect regular people's lives. I feel like we need a parachute before we attempt a skydive. Some people feel pretty safe about their jobs and think they can't be replaced. I don't think that will be the case. Even if AI doesn't take your job, you now have a lot more unemployed people competing for the same job that is safe from AI.
It makes sense because tree search can be endlessly optimized. In a sense, LLMs turn the unstructured, open system of general problems into a structured, closed system of possible moves. Which is really cool, IMO.
It's good for games with clear signal of success (Win/Lose for Chess, tests for programming). One of the blocker for AGI is we don't have clear evaluation for most of our tasks and we cannot verify them fast enough.
A bit puzzling to me. Why does it matter ?
> My mental model for LLMs is that they work as a repository of vector programs. When prompted, they will fetch the program that your prompt maps to and "execute" it on the input at hand. LLMs are a way to store and operationalize millions of useful mini-programs via passive exposure to human-generated content.
I found this such an intriguing way of thinking about it.
Serious question. I've browsed around, looked for the official release, but it seems to be just hear-say for now, except for the few little bits in the ARC-AGI article.
So some of the reactions seem quite far-fetched. I was quite amazed at first seeing the benchmarks, but then I actually read the ARC-AGI article and a few other things about how it worked, learned a bit more about the different benchmarks, and realised we've no proper idea yet how o3 is working under the hood; the thing isn't even released.
It could be doing the same thing that chess engines do, except in several specific domains. Which would be very cool, but not necessarily "intelligent" or "generally intelligent" in any sense whatsoever! Whether that kind of model will lead to finding novel mathematical proofs, or to actually "reasoning" or "thinking" in any way similar to a human, remains entirely uncertain.
(I know very little about the guts of LLMs or how they're tested, so the distinction between "raw" output and the more deterministic engineering work might be incorrect)
But I gotta say, we must be saturating just about any zero-shot reasoning benchmark imaginable at this point. And we will still argue about whether this is AGI, in my opinion because these LLMs are forgetful and it's very difficult for an application developer to fix that.
Models will need better ways to remember and learn from doing a task over and over. For example, let's look at code agents: the best we can do, even with o3, is to cram as much of the code base as we can fit into a context window. And if it doesn't fit we branch out to multiple models to prune the context window until it does fit. And here's the kicker – the second time you ask for it to do something this all starts over from zero again. With this amount of reasoning power, I'm hoping session-based learning becomes the next frontier for LLM capabilities.
(There are already things like tool use, linear attention, RAG, etc that can help here but currently they come with downsides and I would consider them insufficient.)
Was it zero-shot at least, and Pass@1? I guess it was not zero-shot, since it shows examples of other similar problems and their solutions. It also sounds like it was fine-tuned on that specific task.
Look, maybe this shows that it could soon be used to replace some MTurk-style workers, but I don't know that that counts as AGI. To me, AGI needs to be able to solve novel problems, to adapt to all situations without fine-tuning, and to operate at much larger dimensions; like, don't make it a grid of pixels, make it 4k images at least.
Now I am wondering what Anthropic will come up with. Exciting times.
The Kaggle SOTA performs 2x as well as o1 high at a fraction of the cost
> Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy. Meanwhile o3 requires $17-20 per task in the low-compute mode.
If low-compute Kaggle solutions already does 81% - then why is o3's 75.7% considered such a breakthrough?
Please share. I’m compiling a list.
85% is just the (semi-arbitrary) threshold for the winning the prize.
o3 actually beats the human average by a wide margin: 64.2% for humans vs. 82.8%+ for o3.
...
Here's the full breakdown by dataset, since none of the articles make it clear --
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
    def letter_count(string, letter):
        if string == "strawberry" and letter == "r":
            return 3
        ...

> This is significant, but I am doubtful it will be as meaningful as people expect aside from potentially greater coding tasks. Without a 'world model' that has a contextual understanding of what it is doing, things will remain fundamentally throttled.
I don't think this is AGI; nor is it something to scoff at. Its impressive, but its also not human-like intelligence. Perhaps human-like intelligence is not the goal, since that would imply we have even a remotely comprehensive understanding of the human mind. I doubt the mind operates as a single unit anyway, a human's first words are "Mama," not "I am a self-conscious freely self-determining being that recognizes my own reasoning ability and autonomy." And the latter would be easily programmable anyway. The goal here might, then, be infeasible: the concept of free will is a kind of technology in and of itself, it has already augmented human cognition. How will these technologies not augment the "mind" such that our own understanding of our consciousness is altered? And why should we try to determine ahead of time what will hold weight for us, why the "human" part of the intelligence will matter in the future? Technology should not be compared to the world it transforms.
"low compute" mode: Uses 6 samples per task, Uses 33M tokens for the semi-private eval set, Costs $17-20 per task, Achieves 75.7% accuracy on semi-private eval
The "high compute" mode: Uses 1024 samples per task (172x more compute), Cost data was withheld at OpenAI's request, Achieves 87.5% accuracy on semi-private eval
Can we just extrapolate $3kish per task on high compute? (wondering if they're withheld because this isn't the case?)
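The naive linear extrapolation (just an assumption that cost scales with sample count, which OpenAI hasn't confirmed):

    low_cost_per_task = (17, 20)                    # dollars per task, 6 samples
    print([c * 172 for c in low_cost_per_task])     # 1024 samples ~ 172x -> roughly $2,900-$3,400/task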
Doesn't seem like such a massive breakthrough when they are throwing so much compute at it, particularly as this is test-time compute; it just isn't practical at all. You are not getting this level with a ChatGPT subscription, even the new $200 a month option.
No, we won't. All that will tell us is that the abilities of the humans who have attempted to discern the patterns of similarity among problems difficult for auto-regressive models have once again failed us.
But only OpenAI really knows how the cost would scale for different tasks. I'm just making (poor) speculation
"Our programs compilation (AI) gave 90% of correct answers in test 1. We expect that in test 2 quality of answers will degenerate to below random monkey pushing buttons levels. Now more money is needed to prove we hit blind alley."
Hurray ! Put limited version of that on everybody phones !
That may be a feature. If AI becomes too cheap, the over-funded AI companies lose value.
(1995 called. It wants its web design back.)
> Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.
So while these tasks get greatest interest as a benchmark for LLMs and other large general models, it doesn't yet seem obvious those outperform human-designed domain-specific approaches.
I wonder to what extent the large improvement comes from OpenAI training deliberately targeting this class of problem. That result would still be significant (since there's no way to overfit to the private tasks), but would be different from an "accidental" emergent improvement.
There's a ~3 month delay between o1's launch (Sep 12) and o3's launch (Dec 20). But, it's unclear when o1 and o3 each finished training.
I believe that we should explore pretraining video completion models that explicitly have no text pairings. Why? We can train unsupervised like they did for GPT series on the text-internet but instead on YouTube lol. Labeling or augmenting the frames limits scaling the training data.
Imagine using the initial frames or audio to prompt the video completion model. For example, use the initial frames to write out a problem on a whiteboard, then watch the output generate, in the next frames, the solution being worked out.
I fear text pairings with CLIP or OCR constrain a model too much and confuse it.
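A minimal sketch of what label-free video-completion pretraining could look like (purely illustrative, not any lab's disclosed architecture; just a toy autoregressive next-frame predictor in PyTorch):

    import torch
    import torch.nn as nn

    class TinyVideoCompleter(nn.Module):
        """Toy causal transformer over flattened frames: predict frame t+1 from frames <= t."""
        def __init__(self, frame_dim=1024, d_model=512, n_layers=4, n_heads=8):
            super().__init__()
            self.embed = nn.Linear(frame_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, frame_dim)

        def forward(self, frames):                               # frames: (batch, time, frame_dim)
            mask = nn.Transformer.generate_square_subsequent_mask(frames.size(1))
            h = self.encoder(self.embed(frames), mask=mask)
            return self.head(h)                                  # next-frame prediction at each step

    model = TinyVideoCompleter()
    frames = torch.randn(2, 16, 1024)                            # a toy unlabeled "video" batch
    pred = model(frames)
    loss = nn.functional.mse_loss(pred[:, :-1], frames[:, 1:])   # regress frame t+1 from step t
    loss.backward()

Prompting with initial frames is then just conditioning on a prefix and rolling the prediction forward frame by frame.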
Can machines be more human-like in their pattern recognition? O3 met this need today.
While this is some form of accomplishment, it's nowhere near the scientific and engineering problem solving needed to call something truly artificial (human-like) intelligence.
What’s exciting is that these reasoning models are making significant strides in tackling eng and scientific problem-solving. Solving the ARC challenge seems almost trivial in comparison to that.
https://news.ycombinator.com/item?id=42344336
And that answers my question about fchollet's assurances that LLMs without TTT (Test Time Training) can't beat ARC AGI:
[me] I haven't had the chance to read the papers carefully. Have they done ablation studies? For instance, is the following a guess or is it an empirical result?
[fchollet] >> For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy.
And not the other way around as some comments here seem to confuse necessary and sufficient conditions.
Here's my AGI test - Can the model make a theory of AGI validation that no human has suggested before, test itself to see if it qualifies, iterate, read all the literature, and suggest modifications to its own network to improve its performance?
That's what a human-level performer would do.
State actors like Russia, US and Israel will probably be fast to adopt this for information control, but I really don’t want to live in a world where the average scammer has access to this tech.
I have no idea what to specialize in, what skills I should master, or where I should be spending my time to build a successful career.
Seems like we’re headed toward a world where you automate someone else’s job or be automated yourself.
Isn’t that the premise behind the CAPTCHA?
That would be intelligent. Everything else is just stupid and more of the same shit.
> Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
> Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).
I spend 100% of my work time working on a GenAI project, which is genuinely useful for many users, in a company that everyone has heard about, yet I recognize that LLMs are simply dogshit.
Even the current top models are barely usable, hallucinate constantly, are never reliable and are barely good enough to prototype with while we plan to replace those agents with deterministic solutions.
This will just be an iteration on dogshit, but it's the very tech behind LLMs that's rotten.
Make it possible->Make it fast->Make it Cheap
the eternal cycle of software.
Make no mistake - we are on the verge of the next era of change.
TLDR: The cacophony of fools is so loud now. Thank goodness it won't last.
But his "below them, above them, around them" quote on OpenAI may haunt him in 2025/2026.
OAI or someone else will approach AGI-like capabilities (however nebulous the term), fostering the conditions to contest Microsoft's straitjacket.
Of course, OAI is hemorrhaging cash and may fail to create a sustainable business without GPU credits, but the possibility of OAI escaping Microsoft's grasp grows by the day.
Coupled with research and hardware trends, OAI's product strategy suggests the probability of a sustainable business within 1-3 years is far from certain but also higher than commonly believed.
If OAI becomes a $200b+ independent company, it would be against incredible odds given the intense competition and the Microsoft deal. PG's cannibal quote about Altman feels so apt.
It will be fascinating to see how this unfolds.
Congrats to OAI on yet another fantastic release.
https://codeforces.com/blog/entry/133094
That means... this benchmark is just saying o3 can write code faster than most humans (in a very time-limited contest, like 2 hours for 6 tasks). Beauty, readability or creativity is not rated. It's essentially a "how fast can you make the unit tests pass" kind of competition.
Edit: it also tests the new knowledge, it has concepts such as trusting a source, verifying it etc. If I can just gaslight it into unlearning python then it's still too dumb.
I = E / K
where I is the intelligence of the system, E is the effectiveness of the system, and K is the prior knowledge.
For example, a math problem is given to two students, each solving the problem with the same effectiveness (both get the correct answer in the same amount of time). However, student A happens to have more prior knowledge of math than student B. In this case, the intelligence of B is greater than the intelligence of A, even though they have the same effectiveness. B was able to "figure out" the math, without using any of the "tricks" that A already knew.
Now back to the question of whether or not prior knowledge is required. As K approaches 0, intelligence approaches infinity. But when K=0, intelligence is undefined. Tada! I think that answers the question.
Most LLM benchmarks simply measure effectiveness, not intelligence. I conceptualize LLMs as a person with a photographic memory and a low IQ of 85, who was given 100 billion years to learn everything humans have ever created.
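Toy numbers, just to illustrate the ratio (and the undefined K=0 case); the values are made up:

    E = 1.0                      # same effectiveness for both students
    for name, K in [("student A (more prior knowledge)", 2.0), ("student B", 0.5)]:
        print(name, E / K)       # A scores 0.5, B scores 2.0 -> B counts as more "intelligent"
    # as K -> 0 the ratio grows without bound; at K == 0 it is undefined (ZeroDivisionError)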
IK = E
low intelligence * vast knowledge = reasonable effectiveness
Joking aside, better than ever before at any cost is an achievement, it just doesn't exactly scream "breakthrough" to me.
So much for a plateau lol.
That's the most plausible definition of AGI i've read so far.
It has effectively replaced journalists and artists, and is on its way to replacing both junior and senior engineers. The ultimate intention of "AGI" is that it is going to replace tens of millions of jobs. That is it, and you know it.
It will only accelerate and we need to stop pretending and coping. Instead lets discuss solutions for those lost jobs.
So what is the replacement for these lost jobs? (It is not UBI or "better jobs" without defining them.)
That’s the true litmus test. Everything else? It’s just fine-tuning weights, playing around the edges. Until it starts cutting through the fat and reshaping how organizations really operate, all of this is just more of the same.
Hopefully my skepticism will end up being unwarranted, but how confident are we that the queries are not routed to human workers behind the API? This sounds crazy but is plausible for the fake-it-till-you-make-it crowd.
Also, given the prohibitive compute costs per task, typical users won't be using this model, so the scheme could go on for quite some time before the public knows the truth.
They could also come out in a month and say o3 was so smart it'd endanger the civilization, so we deleted the code and saved humanity!