~=$3400 per single task to meet human performance on this benchmark is a lot. Also, it labels the results as "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (e.g. via the API they showed off last week), so even more compute went into this task.
We can compare this roughly to a human doing ARC-AGI puzzles, where a human will take (high variance in my subjective experience) between 5 seconds and 5 minutes to solve a task. (So I'd argue a human is at $0.03-$1.67 per puzzle at $20/hr, and their document cites an average mechanical turker at $2 per task.)
Going the other direction: I am interpreting this result as human-level reasoning now costing roughly $41k/hr to $2.5M/hr with current compute.
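The arithmetic behind those bands, as a quick sanity check (all numbers are the assumptions above: ~$3400/task, 5 seconds to 5 minutes of human time):

```python
# Implied hourly cost of o3-level reasoning, assuming ~$3400/task and a
# human solve time between 5 seconds and 5 minutes (numbers from above).
COST_PER_TASK_USD = 3400

def implied_hourly_cost(seconds_per_task: float) -> float:
    """Cost per hour if each task replaces this many seconds of human work."""
    tasks_per_hour = 3600 / seconds_per_task
    return COST_PER_TASK_USD * tasks_per_hour

print(f"${implied_hourly_cost(5 * 60):,.0f}/hr")  # slow human -> $40,800/hr
print(f"${implied_hourly_cost(5):,.0f}/hr")       # fast human -> $2,448,000/hr
```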
Super exciting that OpenAI pushed the compute out this far so we could see the O-series scaling continue and intersect humans on ARC. Now we get to work towards making this economical!
So, considering that the $3400/task system can't yet compete with a STEM college grad, we still have some room (but it is shrinking; I expect even more compute will be thrown at this and we'll see these barriers broken in the coming years).
Also, some other back of envelope calculations:
The gap in cost is roughly 10^3 between o3 high and avg. mechanical turkers (humans). Pure GPU cost improvement (~doubling every 2-2.5 years) puts us at 20-25 years.
The question is now: can we close this "to human" gap (10^3) quickly with algorithms, or are we stuck waiting 20-25 years for GPU improvements? (I think it feels obvious: this is new technology, things are moving fast, the chance for algorithmic innovation here is high!)
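A quick sketch of that doubling math, using the figures above (~10^3 gap, doubling every 2-2.5 years):

```python
import math

# Years to close a ~10^3 cost gap via hardware improvement alone,
# assuming GPU cost-efficiency doubles every 2-2.5 years.
gap = 1e3
doublings = math.log2(gap)  # ~9.97 doublings needed
for years_per_doubling in (2.0, 2.5):
    print(f"{doublings * years_per_doubling:.0f} years")  # -> 20, then 25
```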
I also personally think that we need to adjust our efficiency priors, and start looking not at "humans" as the bar to beat, but at theoretical computable limits (which show gaps much larger, ~10^9-10^15, for modest problems). Though, it may simply be the case that tool/code use + AGI at near-human cost covers a lot of that gap.
it's even more exciting than that. the fact that you even can use more compute to get more intelligence is a breakthrough. if they spent even more on inference, would they get even better scores on arc agi?
Simple turn-based games such as chess turned out to be too far away from anything practical and chess-engine-like programs were never that useful. It is entirely possible that this will end up in a similar situation. ARC-like pattern matching problems or programming challenges are indeed a respectable challenge for AI, but do we need a program that is able to solve them? How often does something like that come up really? I can see some time-saving in using AI vs StackOverflow in solving some programming challenges, but is there more to this?
To move beyond that, the thing has to start thinking for itself, some auto feedback loop, training itself on its own thoughts. Interestingly, this could plausibly be vastly more efficient than training on external data because it's a much tighter feedback loop and a smaller dataset. So it's possible that "nearly AGI" leads to ASI pretty quickly and efficiently.
Of course it's also possible that the feedback loop, while efficient as a computation process, isn't efficient as a learning / reasoning / learning-how-to-reason process, and the thing, while as intelligent as a human, still barely competes with a worm in true reasoning ability.
Interesting times.
On a very simple, toy task, which ARC-AGI basically is. ARC-AGI tests are not hard per se; LLMs just find them hard. We do not know how this scales to more complex, real-world tasks.
The report says it is $17 per task, and $6k for the whole dataset of 400 tasks.
Fundamentally it's a search through some enormous state space. Advancements are "tricks" that let us find useful subsets more efficiently.
Zooming way out, we have a bunch of social tricks, hardware tricks, and algorithmic tricks that have resulted in a super useful subset. It's not the subset that we want though, so the hunt continues.
Hopefully it doesn't require revising too much in the hardware & social bag of tricks; those are a lot more painful to revisit...
You compare this to "a human" but also admit there is a high variation.
And, I would say there are a lot of humans being paid ~$3400 per month. Not for a single task, true, but honestly for no value-creating task at all. Just for their time.
So what if we think in terms of output rather than time?
YT timestamped link: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for the fixed link @photonboom)
Updated: I gave the task to Claude 3.5 Sonnet and it worked first shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-faa5aa...
Though, of course one can argue, that lots of human written code is not much different from this.
I've been doing similar stuff in Claude for months and it's not that impressive when you see how limited they really are once you go beyond boilerplate.
A lot of people have criticized ARC as not being relevant or indicative of true reasoning, but I think it was exactly the right thing. The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.
It's obvious to everyone that these models can't perform as well as humans on everyday tasks despite blowout scores on the hardest tests we give to humans. Yet nobody could quantify exactly the ways the models were deficient. ARC is the best effort in that direction so far.
We don't need more "hard" benchmarks. What we need right now are "easy" benchmarks that these models nevertheless fail. I hope Francois has something good cooked up for ARC 2!
LLMs are below human evaluation, as I last looked, but it doesn't get much attention.
Once it is passed, I'd like to see one that is solving the mystery in a mystery book right before it's revealed.
We'd need unpublished mystery novels to use for that benchmark, but I think it gets at what I think of as reasoning.
Not sure I understand how this follows. The fact that a certain type of model does well on a certain benchmark means that the benchmark is relevant for a real-world reasoning? That doesn't make sense.
Models have regularly made progress on it, this is not new with the o-series.
Doing astoundingly well on it, and having a mutually shared PR interest with OpenAI in this instance, doesn't mean a pile of visual puzzles is actually AGI or some well thought out and designed benchmark of True Intelligence(tm). It's one type of visual puzzle.
I don't mean to be negative, but to inject a memento mori. Real story is some guys get together and ride off Chollet's name with some visual puzzles from ye olde IQ test, and the deal was Chollet then gets to show up and say it proves program synthesis is required for True Intelligence.
Getting this score is extremely impressive but I don't assign more signal to it than any other benchmark with some thought to it.
This benchmark has done a wonderful job with marketing by picking a great name. It's largely irrelevant for LLMs despite the fact it's difficult.
Consider how much of the model is just noise for a task like this given the low amount of information in each token and the high embedding dimensions used in LLMs.
This[1] is currently the most challenging benchmark. I would like to see how O3 handles it, as O1 solved only 1%.
Once a model recognizes a weakness through reasoning with CoT when posed to a certain problem and gets the agency to adapt to solve that problem that's a precursor towards real AGI capability!
One might also interpret that as "the fact that models which are studying to the test are getting better at the test" (Goodhart's law), not that they're actually reasoning.
I wonder how well the latest Claude 3.5 Sonnet does on this benchmark and if it's near o1.
This means we have an algorithm to get to human level performance on this task.
If you think this task is an eval of general reasoning ability, we have an algorithm for that now.
There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.
Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!
[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1
But, still, this is incredibly impressive.
- 64.2% for humans vs. 82.8%+ for o3.
...
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
It really calls into question two things.
1. You don't know what you're talking about.
2. You have a perverse incentive to believe this such that you will preach it to others and elevate some job salary range or stock.
Either way, not a good look.
What has been lacking so far in frontier LLMs is the ability to reliably deal with the right level of abstraction for a given problem. Reasoning is useful but often comes out lacking if one cannot reason at the right level of abstraction. (Note that many humans can't either when they deal with unfamiliar domains, although that is not the case with these models.)
ARC has been challenging precisely because solving its problems often requires:
1) using multiple different *kinds* of core knowledge [1], such as symmetry, counting, color, AND
2) using the right level(s) of abstraction
Achieving human-level performance on the ARC benchmark, as well as top human performance in GPQA, Codeforces, AIME, and Frontier Math, suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it. Yes, this includes out-of-distribution problems that most humans can solve. It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could. But not many humans can either.
[1] https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...
ADDED:
Thanks to the link to Chollet's posts by lswainemoore below. I've analyzed some easy problems that o3 failed at. They involve spatial intelligence, including connection and movement. This skill is very hard to learn from textual and still image data.
I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross. (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)
LLMs aren't really capable of "learning" anything outside their training data. Which I feel is a very basic and fundamental capability of humans.
Every new request thread is a blank slate utilizing whatever context you provide for the specific task and after the tread is done (or context limit runs out) it's like it never happened. Sure you can use databases, do web queries, etc. but these are inflexible bandaid solutions, far from what's needed for AGI.
I believe about 90% of the tasks were estimated by humans to take less than one hour to solve, so we aren't talking about very complex problems, and to boot, the contamination factor is huge: o3 (or any big model) will have in-depth knowledge of the internals of these projects, and often even know about the individual issues themselves (e.g. you can ask what GitHub issue #4145 in project foo was about, and there's a decent chance it can tell you exactly what the issue was!)
We are nowhere close to what Sam Altman calls AGI, and transformers are still limited to what uniform-TC0 can do.
As an example, the Boolean Formula Value Problem is NC1-complete, thus beyond transformers but trivial to solve with a TM.
As it is now proven that the frame problem is equivalent to the halting problem, even if we can move past uniform-TC0 limits, novelty is still a problem.
I think the advancements are truly extraordinary, but unless you set the bar very low, we aren't close to AGI.
Heck we aren't close to P with commercial models.
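To make the parent's example concrete: evaluating a fully parenthesized Boolean formula is a few lines of sequential, linear-time code, even though the problem is NC1-complete. This is a minimal sketch with a made-up grammar (`0`, `1`, `~`, `&`, `|`), not tied to any particular formalization of the problem:

```python
# Boolean Formula Value Problem: evaluate a fully parenthesized formula.
# NC1-complete (so, per the claim above, beyond constant-depth uniform-TC0
# circuits), yet trivially linear-time for a sequential machine.
# Grammar: formula := '0' | '1' | '~' formula | '(' formula ('&'|'|') formula ')'

def eval_formula(s: str, i: int = 0):
    """Return (value, index just past the parsed formula)."""
    c = s[i]
    if c in "01":
        return c == "1", i + 1
    if c == "~":
        v, j = eval_formula(s, i + 1)
        return (not v), j
    assert c == "(", f"unexpected character {c!r}"
    left, j = eval_formula(s, i + 1)
    op = s[j]
    right, k = eval_formula(s, j + 1)
    assert s[k] == ")"
    value = (left and right) if op == "&" else (left or right)
    return value, k + 1

print(eval_formula("((1&0)|~0)")[0])  # -> True
```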
The article notes, "o3 still fails on some very easy tasks". What explains these failures if o3 can solve "any problem" at the human level? Do these failed cases require some essential knowledge that has eluded the massive OpenAI training set?
For example, if they produce millions of examples of the type of problems o3 still struggles on, it would probably do better at similar questions.
Perhaps the private data set is different enough that this isn’t a problem, but the ideal situation would be unveiling a truly novel dataset, which it seems like arc aims to do.
i think that's where most hardware startups will specialize with in the coming decades, different industries with different needs.
Every human does this dozens, hundreds or thousands of times ... during childhood.
I think this is a mistake.
Even if very high costs make o3 uneconomic for businesses, it could be an epoch defining development for nation states, assuming that it is true that o3 can reason like an averagely intelligent person.
Consider the following questions that a state actor might ask itself: What is the cost to raise and educate an average person? Correspondingly, what is the cost to build and run a datacenter with a nuclear power plant attached to it? And finally, how many person-equivalent AIs could be run in parallel per datacenter?
There are many state actors, corporations, and even individual people who can afford to ask these questions. There are also many things that they'd like to do but can't because there just aren't enough people available to do them. o3 might change that despite its high cost.
So if it is true that we've now got something like human-equivalent intelligence on demand - and that's a really big if - then we may see its impacts much sooner than we would otherwise intuit, especially in areas where economics takes a back seat to other priorities like national security and state competitiveness.
“SO IS IT AGI?
ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.
Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”
The high-compute variant sounds like it cost around *$350,000*, which is kind of wild. Lol, the blog post specifically mentions that OpenAI asked ARC-AGI not to disclose the exact cost for the high-compute version.
Also, one odd thing I noticed: the graph in their blog post shows the top 2 scores as "tuned" (this was not displayed in the live demo graph). This suggests that in those cases the model was trained to better handle these types of questions, so I do wonder about data/answer contamination there...
Something I missed until I scrolled back to the top and reread the page was this
> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set
So yeah, the results were specifically from a version of o3 trained on the public training set
Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There really aren't any spoken rules though, just pattern observation. Thus, if you want to teach the AI how to play the game, you must train it.
On the other hand though, I don't think the o1 models nor Claude were trained on the dataset, in which case it isn't a completely fair competition. If I had to guess, you could probably get 60% on o1 if you trained it on the public dataset as well.
The key part is that scaling test-time compute will likely be a key to achieving AGI/ASI. Costs will definitely come down as is evidenced by precedents, Moore’s law, o3-mini being cheaper than o1 with improved performance, etc.
Of course, o3 looks strong on other benchmarks as well, and sometimes "spend a huge amount of compute for one problem" is a great feature to have available if it gets you the answer you needed. So even if there's some amount of "ARC-AGI wasn't quite as robust as we thought", o3 is clearly a very powerful model.
from reading Dennett's philosophy, I'm convinced that that's how human intelligence works - for each task that "only a human could do that", there's a trick that makes it easier than it seems. We are bags of tricks.
But I don't much agree that it is any meaningful step towards AGI. Maybe it's a nice proof point that AI can solve simple problems presented in intentionally opaque ways.
If humans were given the json as input rather than the images, they’d have a hard time, too.
I am not really trying to downplay O3. But this would be a simple test as to whether O3 is truly "a system capable of adapting to tasks it has never encountered before" versus novel ARC-AGI tasks it hasn't encountered before.
Taking this a level of abstraction higher, I expect that in the next couple of years we'll see systems like o3 given a runtime budget that they can use for training/fine-tuning smaller models in an ad-hoc manner.
From this it makes sense why the original models did poorly and why iterative chain of thought is required - the challenge is designed to be inherently iterative such that a zero shot model, no matter how big, is extremely unlikely to get it right on the first try. Of course, it also requires a broad set of human-like priors about what hypotheses are “simple”, based on things like object permanence, directionality and cardinality. But as the author says, these basic world models were already encoded in the GPT 3/4 line by simply training a gigantic model on a gigantic dataset. What was missing was iterative hypothesis generation and testing against contradictory examples. My guess is that O3 does something like this:
1. Prompt the model to produce a simple rule to explain the nth example (randomly chosen)
2. Choose a different example, ask the model to check whether the hypothesis explains this case as well. If yes, keep going. If no, ask the model to revise the hypothesis in the simplest possible way that also explains this example.
3. Keep iterating over examples like this until the hypothesis explains all cases. Occasionally, new revisions will invalidate already solved examples. That’s fine, just keep iterating.
4. Induce randomness in the process (through next-word sampling noise, example ordering, etc) to run this process a large number of times, resulting in say 1,000 hypotheses which all explain all examples. Due to path dependency, anchoring and consistency effects, some of these paths will end in awful hypotheses - super convoluted and involving a large number of arbitrary rules. But some will be simple.
5. Ask the model to select among the valid hypotheses (meaning those that satisfy all examples) and choose the one that it views as the simplest for a human to discover.
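The guessed-at loop above can be sketched in code. To be clear, everything here is speculative: `propose`, `check`, and `revise` stand in for LLM calls nobody outside OpenAI has seen, and only the control flow of steps 1-5 is being illustrated, with a toy numeric task swapped in so it actually runs:

```python
import random
from math import gcd

def search(examples, propose, check, revise, pick_simplest, n_runs=100):
    """Speculative sketch of steps 1-5: propose a rule from one example,
    revise it against the others until it explains all of them, repeat
    with randomized orderings, then keep the 'simplest' survivor."""
    hypotheses = []
    for _ in range(n_runs):
        order = random.sample(examples, len(examples))  # step 4: vary the path
        hyp = propose(order[0])                 # step 1: rule for one example
        stable = False
        while not stable:                       # step 3: iterate to a fixpoint
            stable = True
            for ex in order[1:]:
                if not check(hyp, ex):          # step 2: test on another case
                    hyp = revise(hyp, ex)       # minimal revision
                    stable = False              # revisions may break old cases
        hypotheses.append(hyp)
    return pick_simplest(hypotheses)            # step 5: prefer simple rules

# Toy instantiation so the loop runs: hypothesis = "every example is a
# multiple of k"; revising against a counterexample takes the gcd.
rule = search(
    examples=[12, 18, 30],
    propose=lambda ex: ex,
    check=lambda k, ex: ex % k == 0,
    revise=lambda k, ex: gcd(k, ex),
    pick_simplest=max,
)
print(rule)  # -> 6
```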
Took me less time to figure out the 3 examples than it took to read your post.
I was honestly a bit surprised to see how visual the tasks were. I had thought they were text based. So now I'm quite impressed that o3 can solve this type of task at all.
My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.
I know my skepticism here is identical to moving goalposts. More and more I am shifting my personal understanding of general intelligence as a phenomenon we will only ever be able to identify with the benefit of substantial retrospect.
As it is with any sufficiently complex program, if you could discern the result beforehand, you wouldn't have had to execute the program in the first place.
I'm not trying to be a downer on the 12th day of Christmas. Perhaps because my first instinct is childlike excitement, I'm trying to temper it with a little reason.
All it needs to be is useful. Reading constant comments about LLMs can't be general intelligence or lack reasoning etc, to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings (a large portion of the population held that point of view back then).
It doesn't need to be general intelligence for the rapid advancement of LLM capabilities to be the most societal shifting development in the past decades.
Once you look at it that way, the approach really doesn't look like intelligence that's able to generalize to novel domains. It doesn't pass the sniff test. It looks a lot more like brute-forcing.
Which is probably why, in order to actually qualify for the leaderboard, they stipulate that you can't use more than $10k of compute. Otherwise, it just sounds like brute-forcing.
Could anyone confirm if this is the only kind of question in the benchmark? If yes, how come there is such a direct connection to "oh, this performs better than humans" when LLMs can be quite a bit better than us at understanding and forecasting patterns? I'm just curious, not trying to stir up controversy.
But isn’t it interesting to have several benchmarks? Even if it’s not about passing the Turing test, benchmarks serve a purpose—similar to how we measure microprocessors or other devices. Intelligence may be more elusive, but even if we had an oracle delivering the ultimate intelligence benchmark, we'd still argue about its limitations. Perhaps we'd claim it doesn't measure creativity well, and we'd find ourselves revisiting the same debates about different kinds of intelligences.
Humans clearly don't know what intelligence is, unambiguously. There's also no divinely ordained objective dictionary that one can point at to reference what true intelligence is. A deep reflection on trying to pattern-associate different human cognitive abilities indicates human cognitive capabilities aren't that spectacular, really.
Maybe it would help to include some human results in the AI ranking.
I think we'd find that Humans score lower?
Indistinguishable from goalpost moving like you said, but also no true Scotsman.
I'm curious what would happen in your eyes if we misattributed general intelligence to an AI model? What are the consequences of a false positive and how would they affect your life?
It's really clear to me how intelligence fits into our reality as part of our social ontology. The attributes and their expression that each of us uses to ground our concept of the intelligent predicate differs wildly.
My personal theory is that we tend to have an exemplar-based dataset of intelligence, and each of us attempts to construct a parsimonious model of intelligence, but like all (mental) models, they can be useful but wrong. These models operate in a space where the trade off is completeness or consistency, and most folks, uncomfortable saying "I don't know" lean toward being complete in their specification rather than consistent. The unfortunate side-effect is that we're able to easily generate test data that highlights our model inconsistency - AI being a case in point.
If they use a model API, then surely OpenAI has access to the private test set questions and can include it in the next round of training?
(I am sure I am missing something.)
Basically, it's got the dumbest and simplest things in it. Stuff like a lock and key, a glass of water and jug, common units of currency, a zipper, etc. It tests if you can do any of those common human tasks. Like pouring a glass of water, picking up coins from a flat surface (I chew off my nails so even an able person like me fails that), zip up a jacket, lock your own door, put on lipstick, etc.
We had hand prosthetics that could play Mozart at 5x speed on a baby grand, but could not pick up a silver dollar or zip a jacket even a little bit. To the patients, the hands were therefore about as useful as a metal hook (a common solution with amputees today, not just pirates!).
Again, a total aside here, but your comment just reminded me of that brown briefcase. Life, it turns out, is a lot more complex than we give it credit for. Even pouring the OJ can be, in rare cases, transcendent.
Getting to LLMs that could talk to us turned out to be a lot easier than making something that could control even a robotic arm without precise programming, let alone a humanoid.
> Rodney Brooks explains that, according to early AI research, intelligence was "best characterized as the things that highly educated male scientists found challenging", such as chess, symbolic integration, proving mathematical theorems and solving complicated word algebra problems. "The things that children of four or five years could do effortlessly, such as visually distinguishing between a coffee cup and a chair, or walking around on two legs, or finding their way from their bedroom to the living room were not thought of as activities requiring intelligence."
I think a lot about carpentry. From the outside, it's pretty easy: Just make the wood into the right shape and stick it together. But as one progresses, the intricacies become more apparent. Variations in the wood, the direction of the grain, the seasonal variations in thickness, joinery techniques that are durable but also time efficient.
The way this information connects is highly multisensory and multimodal. I now know which species of wood to use for which applications. This knowledge was hard won through many, many mistakes and trials that took place at my home, the hardware store, the lumberyard, on YouTube, from my neighbor Steve, and in books written by experts.
Like in your test
a hand grenade and a pin - don't pull the pin.
Or maybe a mousetrap? but maybe that would be defused?
in the ai test...
or Global Thermonuclear War, the only winning move is...
I must be missing something, how can they be able to play Mozart at 5x speed with their prosthetics but not zip a jacket? They could press keys but not do tasks requiring feedback?
Or did you mean they used to play Mozart at 5x speed before they became amputees?
(nothing wrong with it! I'm just trying to prune the top subthread)
In that sense, the goalposts haven’t moved in a long time despite claims from AI enthusiasts that people are constantly moving goalposts.
I'd love to know more about this.
semi-private eval (100 tasks): 75.7% at $2,012 total (~$20/task), with just 6 samples and 33M tokens processed, in ~1.3 min/task
The “low-efficiency” setting with 1024 samples scored 87.5% but required 172x more compute.
If we assume compute spent and cost are proportional, then OpenAI might have just spent ~$346,064 for the low-efficiency run on the semi-private eval.
On the public eval they might have spent ~$1,148,444 to achieve 91.5% with the low-efficiency setting. (high-efficiency mode: $6,677)
OpenAI just spent more money to run an eval on ARC than most people spend on a full training run.
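The arithmetic behind those figures, assuming (as above) that dollar cost scales linearly with the 172x compute factor:

```python
# Scaling the reported high-efficiency costs by the 172x compute factor,
# assuming dollar cost is proportional to compute spent.
FACTOR = 172
semi_private_low_eff = 2012 * FACTOR   # -> $346,064
public_low_eff = 6677 * FACTOR         # -> $1,148,444
print(semi_private_low_eff, public_low_eff)
```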
I double-checked with some FLOP estimates (P100 for 12 hours = the Kaggle limit; they claim ~100-1000x that for o3-low, and 172x o3-low for o3-high), so roughly on the order of 10^22-10^23 FLOPs.
Another way: at the H100 market price of $2/hr per chip, $350k buys ~175k GPU-hours, or ~10^24 FLOPs in total.
So, huge margins, but 10^22-10^24 FLOPs is the band I think we can estimate.
These are the scale of numbers that show up in the Chinchilla-optimal paper, haha. Truly GPT-3 scale models.
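Redoing the H100 side of that estimate (the $2/hr price and ~1e15 FLOP/s sustained per GPU are both rough assumptions, as noted above):

```python
# H100-based back-of-envelope: $350k spend at an assumed $2/hr per GPU,
# with an assumed ~1e15 FLOP/s sustained per GPU.
total_cost_usd = 350_000
gpu_hours = total_cost_usd / 2           # -> 175,000 H100-hours
total_flops = gpu_hours * 3600 * 1e15    # seconds x FLOP/s
print(f"{gpu_hours:,.0f} GPU-hours, ~{total_flops:.1e} FLOPs")
```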
I think soon we'll be pricing all kinds of tasks by their compute costs. So basically: human = $50/task, AI = $6,000/task, use the human. If the AI beats the human, use the AI? Of course, that's assuming both score 100% on the task.
The thing is, given what we've seen from distillation and hardware, even if it's $6,000/task, that will come down drastically over time through optimization and just faster, more efficient processing hardware and software.
That makes something like this competitive in ~3 years
Sonnet 3.5 remains the king of the hill by quite some margin
That said, I think its code style is arguably better, more concise and has better patterns -- Claude needs a fair amount of prompting and oversight to not put out semi-shitty code in terms of structure and architecture.
In my mind: going from Slowest to Fastest, and Best Holistically to Worst, the list is:
1. o1-pro 2. Claude 3.5 3. Gemini 2 Flash
Flash is so fast, that it's tempting to use more, but it really needs to be kept to specific work on strong codebases without complex interactions.
In a way it almost feels like it's become too good at following instructions and simply just takes your direction more literally. It doesn't seem to take the initiative of going the extra mile of filling in the blanks from your lazy input (note: many would see this as a good thing). Claude on the other hand feels more intuitive in discerning intent from a lazy prompt, which I may be prone to offering it at times when I'm simply trying out ideas.
However, if I take the time to write up a well thought out prompt detailing my expectations, I find I much prefer the code o1 creates. It's smarter in its approach, offers clever ideas I wouldn't have thought of, and generally cleaner.
Or put another way, I can give Sonnet a lazy or detailed prompt and get a good result, while o1 will give me an excellent result with a well thought out prompt.
What this boils down to is I find myself using Sonnet while brainstorming ideas, or when I simply don't know how I want to approach a problem. I can pitch it a feature idea the same way a product owner might pitch an idea to an engineer, and then iterate through sensible and intuitive ways of looking at the problem. Once I get a handle on how I'd like to implement a solution, I type up a spec and hand it off to o1 to crank out the code I'd intend to implement.
https://myswamp.substack.com/p/benchmarking-llms-against-com...
For coding, o1 is marvelous at Leetcode questions. I think it is the best teacher I could ever afford to teach me leetcoding, but I don't find myself having many other use cases for o1 that are complex and require a really long reasoning chain.
For example I used the prompt "As an astronaut in China, would I be able to see the great wall?" and since the training data for all LLMs is full of text dispelling the common myth that the great wall is visible from space, LLMs do not notice the slight variation that the astronaut is IN China. This has been a sobering reminder to me as discussion of AGI heats up.
Carefully analyze questions to not overlook subtle details. Take each question "as-is", don't guess what they mean -- interpret them as any reasonable person would.
Really want to see the number of training pairs needed to achieve this score. If it only takes a few pairs, say 100, I would say it is amazing!
> ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI
It feels so insensitive to do that right before a major holiday, when the likely outcome is a lot of people feeling less secure in their career/job/life.
Thanks again openAI for showing us you don’t give a shit about actual people.
>> Second, you need the ability to recombine these functions into a brand new program when facing a new task – a program that models the task at hand. Program synthesis.
"Program synthesis" is here used in an entirely idiosyncratic manner, to mean "combining programs". Everyone else in CS and AI for the last many decades has used "Program Synthesis" to mean "generating a program that satisfies a specification".
Note that "synthesis" can legitimately be used to mean "combining". In Greek it translates literally to "putting [things] together": "Syn" (plus) "thesis" (place). But while generating programs by combining parts of other programs is an old-fashioned way to do Program Synthesis, in the standard sense, the end result is always desired to be a program. The LLMs used in the article to do what F. Chollet calls "Program Synthesis" generate no code.
I hypothesize that something similar is going on here. OpenAI has not published (or I have not seen) the number of reasoning tokens it took to solve these; we do know that each task cost thousands of dollars. If "a picture is worth a thousand words", could we make AI systems that reason visually with much better performance?
I wonder if anyone has experimented with having some sort of "visual" scratchpad instead of the "text-based" scratchpad that CoT uses.
I don't understand this mindset. We have all experienced that LLMs can produce words never spoken before. Thus there is recombination of knowledge at play. We might not be satisfied with the depth/complexity of the combination, but there isn't any reason to believe something fundamental is missing. Given more compute and enough recursiveness we should be able to reach any kind of result from the LLM.
The linked article says that LLMs are like a collection of vector programs. It has always been my thinking that computations in vector space are easy to make Turing complete if we just have an eigenvector representation figured out.
So, next step in reasoning is open world reasoning now?
If we're inferring the answers of the block patterns from minimal or no additional training, it's very impressive, but how much time have they had to work on O3 after sharing puzzle data with O1? Seems there's some room for questionable antics!
Why is the ARC challenge difficult but coding problems are easy? The two examples they give for ARC (border width and square filling) are much simpler than pattern awareness I see simple models find in code everyday.
What am I misunderstanding? Is it that one is a visual grid context which is unfamiliar?
While there are those that are excited, the world is not prepared for the level of distress this could put on the average person without critical changes at a monumental level.
"It is ceasing to be a matter of how we think about technics, if only because technics is increasingly thinking about itself. It might still be a few decades before artificial intelligences surpass the horizon of biological ones, but it is utterly superstitious to imagine that the human dominion of terrestrial culture is still marked out in centuries, let alone in some metaphysical perpetuity. The high road to thinking no longer passes through a deepening of human cognition, but rather through a becoming inhuman of cognition, a migration of cognition out into the emerging planetary technosentience reservoir, into 'dehumanized landscapes ... emptied spaces' where human culture will be dissolved. Just as the capitalist urbanization of labour abstracted it in a parallel escalation with technical machines, so will intelligence be transplanted into the purring data zones of new software worlds in order to be abstracted from an increasingly obsolescent anthropoid particularity, and thus to venture beyond modernity. Human brains are to thinking what mediaeval villages were to engineering: antechambers to experimentation, cramped and parochial places to be.
[...]
Life is being phased-out into something new, and if we think this can be stopped we are even more stupid than we seem." [0]
Land is being ostracized for some of his provocations, but it seems pretty clear by now that we are in the Landian Accelerationism timeline. Engaging with his thought is crucial to understanding what is happening with AI, and what is still largely unseen, such as the autonomization of capital.
Sure, there will be growing pains, friction, etc. Who cares? There always is with world-changing tech. Always.
For one, I found AI coding to work best in a small team, where there is an understanding of what to build and how to build it, usually in close feedback loop with the designers / users. Throw the usual managerial company corporate nonsense on top and it doesn't really matter if you can instacreate a piece of software, if nobody cares for that piece of software and it's just there to put a checkmark on the Q3 OKR reports.
Furthermore, there is a lot of software to be built out there, for people who can't afford it yet. A custom POS system for the local baker so that they don't have to interact with a computer. A game where squids eat algae for my nephews at Christmas. A custom photo layout software for my dad who despairs at indesign. A plant watering system for my friend. A local government information website for older citizens. Not only can these be built at a fraction of the cost they were before, but they can be built in a manner where the people using the software are directly involved in creating it. Maybe they can get an 80% hacked version together if they are technically inclined. I can add the proper database backend and deployment infrastructure. Or I can sit with them and iterate on the app as we are talking. It is also almost free to create great documentation; in fact, LLM development is most productive when you turn software engineering best practices up to 11.
Furthermore, I found these tools incredible for actively furthering my own fundamental understanding of computer science and programming. I can now skip the stuff I don't care to learn (is it foobarBla(func, id) or foobar_bla(id, func)) and put the effort where I actually get a long-lived return. I have become really ambitious with the things I can tackle now, learning about all kinds of algorithms and operating system patterns and chemistry and physics etc... I can also create documents to help me with my learning.
Local models are now entering the phase where they are getting to be really useful, definitely > gpt3.5 which I was able to use very productively already at the time.
Writing (creating? manifesting? I don't really have a good word for what I do these days) software that makes me and real humans around me happy is extremely fulfilling, and has alleviated most of my angst around the technology.
Assuming for a moment that the cost per task has a linear relationship with compute, then it costs a little more than $1 million to get that score on the public eval.
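For what it's worth, the back-of-envelope math behind that figure can be sketched as follows (assuming linear cost scaling and a 400-task public eval set; neither assumption is confirmed by OpenAI):

```python
# Back-of-envelope check, assuming cost scales linearly with samples
# and the ARC-AGI public eval set has 400 tasks (both assumptions).
low_cost_per_task = 20        # USD, low-compute mode (6 samples/task)
sample_ratio = 1024 / 6       # high-compute mode uses 1024 samples/task
n_tasks = 400

high_cost_per_task = low_cost_per_task * sample_ratio  # ~ $3,413/task
total_cost = high_cost_per_task * n_tasks              # ~ $1.37M
print(f"${high_cost_per_task:,.0f} per task, ${total_cost:,.0f} total")
```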
The results are cool, but man, this sounds like such a busted approach.
You'll know AGI is here when traditional captchas stop being a thing due to their lack of usefulness.
Some folks say we could fix this with universal basic income, where everyone gets enough money to live on, but I'm not optimistic that it'll be an easy transition. Plus, there's this possibility that whoever controls these 'AGI' systems basically controls everything. We definitely need to figure this stuff out before it hits us, because once these changes start happening, they're probably going to happen really fast. It's kind of like we're building this awesome but potentially dangerous new technology without really thinking through how it's going to affect regular people's lives. I feel like we need a parachute before we attempt a skydive. Some people feel pretty safe about their jobs and think they can't be replaced. I don't think that will be the case. Even if AI doesn't take your job, you now have a lot more unemployed people competing for the same job that is safe from AI.
I'll get concerned when it stops sucking so hard. It's like talking to a dumb robot. Which it unsurprisingly is.
It makes sense because tree search can be endlessly optimized. In a sense, LLMs turn the unstructured, open system of general problems into a structured, closed system of possible moves. Which is really cool, IMO.
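That "structured, closed system of possible moves" framing can be sketched as a best-first search where the model proposes candidate steps and a verifier scores them. This is a toy sketch, not OpenAI's actual method; `propose` and `score` are hypothetical stand-ins for a model call and a value function:

```python
import heapq

# Toy best-first search over LLM-proposed "moves". In a real system,
# propose() would call a model and score() a verifier/value model.
def propose(state):
    # Stand-in: extend the partial solution with candidate steps.
    return [state + [c] for c in ("a", "b")]

def score(state):
    # Stand-in: higher is better (e.g. unit-test pass rate).
    return -len(state)

def best_first_search(start, is_goal, budget=100):
    frontier = [(-score(start), start)]
    while frontier and budget > 0:
        _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state
        for nxt in propose(state):
            heapq.heappush(frontier, (-score(nxt), nxt))
        budget -= 1
    return None

result = best_first_search([], lambda s: len(s) == 3)
print(result)
```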
It's good for games with a clear signal of success (win/lose for chess, passing tests for programming). One of the blockers for AGI is that we don't have clear evaluations for most of our tasks, and we cannot verify them fast enough.
A bit puzzling to me. Why does it matter ?
In reality it seems to be a bit of both - there is some general intelligence based on having been "trained on the internet", but it seems these super-human math/etc skills are very much from them having focused on training on those.
> My mental model for LLMs is that they work as a repository of vector programs. When prompted, they will fetch the program that your prompt maps to and "execute" it on the input at hand. LLMs are a way to store and operationalize millions of useful mini-programs via passive exposure to human-generated content.
I found this such an intriguing way of thinking about it.
Not so sure - but we might need to figure out the inference/search/evaluation strategy in order to provide the data we need to distill to the single forward-pass data fitting.
Of course there is a chance we will find ourselves in Utopia, but yeah, only a chance.
Serious question. I've browsed around, looked for the official release, but it seems to be just hear-say for now, except for the few little bits in the ARC-AGI article.
So some of the reactions seem quite far-fetched. I was quite amazed at first seeing the benchmarks, but then I actually read the ARC-AGI article and a few other things about how it worked, learned a bit more about the different benchmarks, and realised we have no proper idea yet how o3 works under the hood; the thing isn't even released.
It could be doing the same thing that chess engines do, except in several specific domains. That would be very cool, but not necessarily "intelligent" or "generally intelligent" in any sense whatsoever! Whether that kind of model will lead to finding novel mathematical proofs, or to actually "reasoning" or "thinking" in any way similar to a human, remains entirely uncertain.
(I know very little about the guts of LLMs or how they're tested, so the distinction between "raw" output and the more deterministic engineering work might be incorrect)
Even if productivity skyrockets, why would anyone assume the dividends would be shared with the "destroy[ed] middle class"?
All indications will be this will end up like the China Shock: "I lost my middle class job, and all I got was the opportunity to buy flimsy pieces of crap from a dollar store." America lacks the ideological foundations for any other result, and the coming economic changes will likely make building those foundations even more difficult if not impossible.
Unless something changes, if I was a billionaire I would be ecstatic at the moment. Now even the impossible seems potentially possible if this delivers on its promises (e.g. go to Mars, build a utopia for my inner circle, etc). I no longer need other people to have everything. Previously there was no point in money if I didn't have a place to spend it/people to accept it. Now with real assets I can use AI/machines to do what I want - I no longer need "money" or more accurately other people to live a very wealthy life.
Again this is all else being equal. Lots of other things could change, but with increasing surveillance by use of technology I doubt large revolutions/etc will ever get the chance to get off the ground or have the scale to be effective.
Interesting times.
But I gotta say, we must be saturating just about any zero-shot reasoning benchmark imaginable at this point. And we will still argue about whether this is AGI, in my opinion because these LLMs are forgetful and it's very difficult for an application developer to fix that.
Models will need better ways to remember and learn from doing a task over and over. For example, let's look at code agents: the best we can do, even with o3, is to cram as much of the code base as we can fit into a context window. And if it doesn't fit we branch out to multiple models to prune the context window until it does fit. And here's the kicker – the second time you ask for it to do something this all starts over from zero again. With this amount of reasoning power, I'm hoping session-based learning becomes the next frontier for LLM capabilities.
(There are already things like tool use, linear attention, RAG, etc that can help here but currently they come with downsides and I would consider them insufficient.)
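The "cram as much of the code base as we can fit" step described above can be sketched roughly like this; the token counting and relevance scoring here are crude hypothetical stand-ins (real agents use a proper tokenizer and embedding-based retrieval):

```python
# Sketch of greedy context packing under a token budget. Token
# counting (word split) and relevance scoring are hypothetical
# stand-ins for a tokenizer and a retrieval model.
def pack_context(files, budget_tokens, relevance):
    """Greedily pack the most relevant files into a token budget."""
    ranked = sorted(files, key=relevance, reverse=True)
    packed, used = [], 0
    for f in ranked:
        cost = len(f["text"].split())  # crude token estimate
        if used + cost <= budget_tokens:
            packed.append(f["path"])
            used += cost
    return packed

files = [
    {"path": "main.py", "text": "def main(): run()"},
    {"path": "util.py", "text": "helper " * 50},
    {"path": "test.py", "text": "assert main"},
]
chosen = pack_context(files, budget_tokens=10,
                      relevance=lambda f: len(f["text"]))
print(chosen)  # util.py is most "relevant" but too big for the budget
```

The kicker the comment mentions is visible here: nothing persists between calls, so the next request repacks from zero.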
Was it zero-shot at least and Pass@1 ? I guess it was not zero-shot, since it shows examples of other similar problems and their solutions. It also sounds like it was fine-tuned on that specific task.
Look, maybe this shows that it could soon be used to replace some MTurk-style workers, but I don't think that counts as AGI. To me, for AGI, it needs to be able to solve novel problems, to adapt to all situations without fine-tuning, and to operate at much larger dimensions: don't make it a grid of pixels, make it 4k images at least.
Now I am wondering what Anthropic will come up with. Exciting times.
The Kaggle SOTA performs 2x as well as o1 high at a fraction of the cost
I wonder what exactly o3 costs. Does it still spend a terrible amount of time thinking, despite being finetuned to the dataset?
> Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy. Meanwhile o3 requires $17-20 per task in the low-compute mode.
If we feel like we've really "hit the ceiling" RE efficiency, then that's a different story, but I don't think anyone believes this at this time.
If an ensemble of low-compute Kaggle solutions already does 81%, then why is o3's 75.7% considered such a breakthrough?
Please share. I’m compiling a list.
85% is just the (semi-arbitrary) threshold for the winning the prize.
o3 actually beats the human average by a wide margin: 64.2% for humans vs. 82.8%+ for o3.
...
Here's the full breakdown by dataset, since none of the articles make it clear --
Private Eval:
- 85%: threshold for winning the prize [1]
Semi-Private Eval:
- 87.5%: o3 (unlimited compute) [2]
- 75.7%: o3 (limited compute) [2]
Public Eval:
- 91.5%: o3 (unlimited compute) [2]
- 82.8%: o3 (limited compute) [2]
- 64.2%: human average (Mechanical Turk) [1] [3]
Public Training:
- 76.2%: human average (Mechanical Turk) [1] [3]
...
References:
[1] https://arcprize.org/guide
def letter_count(string, letter):
    if string == "strawberry" and letter == "r":
        return 3
    ...

> This is significant, but I am doubtful it will be as meaningful as people expect aside from potentially greater coding tasks. Without a 'world model' that has a contextual understanding of what it is doing, things will remain fundamentally throttled.
I don't think this is AGI; nor is it something to scoff at. It's impressive, but it's also not human-like intelligence. Perhaps human-like intelligence is not the goal, since that would imply we have even a remotely comprehensive understanding of the human mind. I doubt the mind operates as a single unit anyway; a human's first words are "Mama," not "I am a self-conscious freely self-determining being that recognizes my own reasoning ability and autonomy." And the latter would be easily programmable anyway. The goal here might, then, be infeasible: the concept of free will is a kind of technology in and of itself, and it has already augmented human cognition. How will these technologies not augment the "mind" such that our own understanding of our consciousness is altered? And why should we try to determine ahead of time what will hold weight for us, why the "human" part of the intelligence will matter in the future? Technology should not be compared to the world it transforms.
The "low compute" mode: 6 samples per task, 33M tokens for the semi-private eval set, $17-20 per task, 75.7% accuracy on the semi-private eval.
The "high compute" mode: 1024 samples per task (172x more compute), cost data withheld at OpenAI's request, 87.5% accuracy on the semi-private eval.
Can we just extrapolate $3kish per task on high compute? (wondering if they're withheld because this isn't the case?)
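For what it's worth, that extrapolation is just the published low-compute cost scaled by the sample ratio, assuming (without confirmation from OpenAI) that cost is linear in samples:

```python
# Scale the published low-compute per-task cost by the sample ratio.
# Assumes cost is linear in samples, which OpenAI has not confirmed.
low_cost_range = (17, 20)     # USD/task, low-compute mode (6 samples)
sample_ratio = 1024 / 6       # high-compute mode: 1024 samples/task

estimate = tuple(round(c * sample_ratio) for c in low_cost_range)
print(estimate)  # -> (2901, 3413), i.e. roughly "$3k-ish" per task
```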
It doesn't seem like such a massive breakthrough when they are throwing so much compute at it. Because this is test-time compute, it just isn't practical at all: you are not getting this level with a ChatGPT subscription, even the new $200-a-month option.
No, we won't. All that will tell us is that the abilities of the humans who have attempted to discern the patterns of similarity among problems difficult for auto-regressive models has once again failed us.
But only OpenAI really knows how the cost would scale for different tasks; I'm just speculating (poorly).
"Our programs compilation (AI) gave 90% of correct answers in test 1. We expect that in test 2 quality of answers will degenerate to below random monkey pushing buttons levels. Now more money is needed to prove we hit blind alley."
Hurray! Put a limited version of that on everybody's phones!
That may be a feature. If AI becomes too cheap, the over-funded AI companies lose value.
(1995 called. It wants its web design back.)
> Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.
So while these tasks get greatest interest as a benchmark for LLMs and other large general models, it doesn't yet seem obvious those outperform human-designed domain-specific approaches.
I wonder to what extent the large improvement comes from OpenAI training deliberately targeting this class of problem. That result would still be significant (since there's no way to overfit to the private tasks), but would be different from an "accidental" emergent improvement.
There's a ~3 month delay between o1's launch (Sep 12) and o3's launch (Dec 20). But, it's unclear when o1 and o3 each finished training.
I believe that we should explore pretraining video completion models that explicitly have no text pairings. Why? We can train unsupervised like they did for GPT series on the text-internet but instead on YouTube lol. Labeling or augmenting the frames limits scaling the training data.
Imagine using the initial frames or audio to prompt the video completion model. For example, use the initial frames to write out a problem on a white board then watch in output generate the next frames the solution being worked out.
I fear text pairings with CLIP or OCR constrain a model too much and confuse it.
Can machines be more human-like in their pattern recognition? O3 met this need today.
While this is some form of accomplishment, it's nowhere near the scientific and engineering problem solving needed to call something truly artificially (human-like) intelligent.
What’s exciting is that these reasoning models are making significant strides in tackling eng and scientific problem-solving. Solving the ARC challenge seems almost trivial in comparison to that.
https://news.ycombinator.com/item?id=42344336
And that answers my question about fchollet's assurances that LLMs without TTT (Test Time Training) can't beat ARC AGI:
[me] I haven't had the chance to read the papers carefully. Have they done ablation studies? For instance, is the following a guess or is it an empirical result?
[fchollet] >> For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy.
And not the other way around as some comments here seem to confuse necessary and sufficient conditions.
Here's my AGI test - Can the model make a theory of AGI validation that no human has suggested before, test itself to see if it qualifies, iterate, read all the literature, and suggest modifications to its own network to improve its performance?
That's what a human-level performer would do.
State actors like Russia, US and Israel will probably be fast to adopt this for information control, but I really don’t want to live in a world where the average scammer has access to this tech.
I have no idea what to specialize in, what skills I should master, or where I should be spending my time to build a successful career.
Seems like we’re headed toward a world where you automate someone else’s job or be automated yourself.
Isn’t that the premise behind the CAPTCHA?
That would be intelligent. Everything else is just stupid and more of the same shit.
> Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
> Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).
I spend 100% of my work time working on a GenAI project, which is genuinely useful for many users, in a company that everyone has heard about, yet I recognize that LLMs are simply dogshit.
Even the current top models are barely usable, hallucinate constantly, are never reliable and are barely good enough to prototype with while we plan to replace those agents with deterministic solutions.
This will just be an iteration on dogshit, but it's the very tech behind LLMs that's rotten.
Make it possible->Make it fast->Make it Cheap
the eternal cycle of software.
Make no mistake - we are on the verge of the next era of change.
TLDR: The cacophony of fools is so loud now. Thank goodness it won't last.
But his "below them, above them, around them" quote on OpenAI may haunt him in 2025/2026.
OAI or someone else will approach AGI-like capabilities (however nebulous the term), fostering the conditions to contest Microsoft's straitjacket.
Of course, OAI is hemorrhaging cash and may fail to create a sustainable business without GPU credits, but the possibility of OAI escaping Microsoft's grasp grows by the day.
Coupled with research and hardware trends, OAI's product strategy suggests the probability of a sustainable business within 1-3 years is far from certain but also higher than commonly believed.
If OAI becomes a $200b+ independent company, it would be against incredible odds given the intense competition and the Microsoft deal. PG's cannibal quote about Altman feels so apt.
It will be fascinating to see how this unfolds.
Congrats to OAI on yet another fantastic release.
That's how I understand it.
https://codeforces.com/blog/entry/133094
That means... this benchmark is just saying o3 can write code faster than most humans (in a very time-limited contest, like 2 hours for 6 tasks). Beauty, readability, and creativity are not rated. It's essentially a "how fast can you make the unit tests pass" kind of competition.
Edit: it also tests the new knowledge, it has concepts such as trusting a source, verifying it etc. If I can just gaslight it into unlearning python then it's still too dumb.
c6e1b8da is moving rectangular figures by a given vector, 0d87d2a6 is drawing horizontal and/or vertical lines (connecting dots at the edges) and filling figures they touch, b457fec5 is filling gray figures with a given repeating color pattern.
This is pretty straightforward stuff that doesn't require much spatial thinking or keeping multiple things/aspects in memory - visual puzzles from various "IQ" tests are way harder.
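To illustrate how mechanical the first of those transformations is, here's a toy sketch of "move a figure by a given vector" on a color grid (my own simplification for illustration, not the official ARC harness or the actual c6e1b8da spec):

```python
# Toy sketch of an ARC-style transform: move every non-background
# cell (0 = background) by a fixed vector. Hypothetical simplification
# of the "moving rectangular figures" task described above.
def move_figure(grid, dy, dx):
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if grid[y][x] != 0:
                out[y + dy][x + dx] = grid[y][x]
    return out

g = [[0] * 5 for _ in range(5)]
for y in (1, 2):
    for x in (1, 2):
        g[y][x] = 4          # a 2x2 rectangle of color 4

moved = move_figure(g, 2, 1)  # shift down 2, right 1
```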
This said, now I'm curious how SoTA LLMs would do on something like WAIS-IV.
What took me longer was figuring out how the question was arranged, i.e. left input, right output, 3 examples each
I = E / K
where I is the intelligence of the system, E is the effectiveness of the system, and K is the prior knowledge.
For example, a math problem is given to two students, each solving the problem with the same effectiveness (both get the correct answer in the same amount of time). However, student A happens to have more prior knowledge of math than student B. In this case, the intelligence of B is greater than the intelligence of A, even though they have the same effectiveness. B was able to "figure out" the math, without using any of the "tricks" that A already knew.
Now back to the question of whether or not prior knowledge is required. As K approaches 0, intelligence approaches infinity. But when K=0, intelligence is undefined. Tada! I think that answers the question.
Most LLM benchmarks simply measure effectiveness, not intelligence. I conceptualize LLMs as a person with a photographic memory and a low IQ of 85, who was given 100 billion years to learn everything humans have ever created.
IK = E
low intelligence * vast knowledge = reasonable effectiveness
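The metric in this comment can be made concrete with a toy sketch; E and K are hypothetical scalar scores here, and measuring them for real systems is of course the hard part:

```python
# Toy numeric version of the comment's I = E / K. Both E
# (effectiveness) and K (prior knowledge) are made-up scalar scores.
def intelligence(effectiveness, prior_knowledge):
    if prior_knowledge == 0:
        raise ValueError("I is undefined at K = 0")  # the "Tada!" case
    return effectiveness / prior_knowledge

# The two-students example: same effectiveness, different knowledge.
student_a = intelligence(effectiveness=1.0, prior_knowledge=2.0)
student_b = intelligence(effectiveness=1.0, prior_knowledge=0.5)
assert student_b > student_a  # B figured more out with fewer "tricks"
```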
Joking aside, better than ever before at any cost is an achievement, it just doesn't exactly scream "breakthrough" to me.
Probably less disruption than will happen in 1st world countries.
> No one will have chance to be rich anymore
It's strange to reach this conclusion from "look, a massive new productivity increase".
Do you work at one of the frontier labs?
As for the wealth disparity between rich and poor countries, it’s hard to know how politics will handle this one, but it’s unlikely that poor countries won’t also be drastically richer as the cost of basic living drops to basically zero. Imagine the cost of food, energy, etc in an ASI world. Today’s luxuries will surely be considered human rights necessities in the near future.
So much for a plateau lol.
It’s been really interesting to watch all the internet pundits’ takes on the plateau… as if the two years since the release of GPT3.5 is somehow enough data for an armchair ponce to predict the performance characteristics of an entirely novel technology that no one understands.
This is so insane that I can't help but be skeptical. I know FM answer key is private, but they have to send the questions to OpenAI in order to score the models. And a significant jump on this benchmark sure would increase a company's valuation...
Happy to be wrong on this.
These new reasoning models are taking things in a new direction basically by adding search (inference time compute) on top of the basic LLM. So, the capabilities of the models are still improving, but the new variable is how deep of a search you want to do (how much compute to throw at it at inference time). Do you want your chess engine to do a 10 ply search or 20 ply? What kind of real world business problems will benefit from this?
That's the most plausible definition of AGI i've read so far.
Instrumental reason FTW
> Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.
If there are jobs paying $150K just to code (someone else tells you what to code, and you just code it up), then please share!
Generalist junior and senior engineers will need to think of a different career path in less than 5 years as more layoffs will reduce the software engineering workforce.
It looks like it may be the way things are if progress in the o1, o3, oN models and other LLMs continues on.
This I think will happen with programmers. Rote programming will slowly die out, while demand for super high end will go dramatically up in price.
It has already replaced journalists and artists, and it's on its way to replacing both junior and senior engineers. The ultimate intention of "AGI" is that it is going to replace tens of millions of jobs. That is it, and you know it.
It will only accelerate, and we need to stop pretending and coping. Instead, let's discuss solutions for those lost jobs.
So what is the replacement for these lost jobs? (It is not UBI or "better jobs" without defining them.)
Did it, really? Or did it just provide automation for routine no-thinking-necessary text-writing tasks, while still being ultimately bound by the level of the human operator's intelligence? I strongly suspect it's the latter. If it has actually replaced journalists, it must be at junk outlets, where readers' intelligence is negligible and anything goes.
Just yesterday I used o1 and Claude 3.5 to debug a Linux kernel issue (ultimately, a bad DSDT table causing the TPM2 driver to be unable to reserve a memory region for the command response buffer; the solution was to use memmap to remove the NVS flag from the relevant regions) and confirmed once again that LLMs still don't reason at all - they just spew out plausible-looking chains of words. The models were good listeners, and mostly-helpful code generators (when they didn't make the silliest mistakes), but they gave no traces of understanding and no attention to nuance (e.g. the LLM used `IS_ERR` to check the `__request_resource` result, despite me giving it the full source code for that function, where there's even a comment making it obvious it returns a pointer or NULL, not an error code - a misguided-attention kind of mistake).
So, in my opinion, LLMs (as currently available to broad public, like myself) are useful for automating away some routine stuff, but their usefulness is bounded by the operator's knowledge and intelligence. And that means that the actual jobs (if they require thinking and not just writing words) are safe.
When asked about what I do at work, I used to joke that I just press buttons on my keyboard in fancy patterns. Ultimately, LLMs seem to suggest that it's not what I really do.
Ford didn’t support a 40 hour work week out of the kindness of his heart. He wanted his workers to have time off for buying things (like his cars).
I wonder if our AGI industrialist overlords will do something similar for revenue sharing or UBI.
That’s the true litmus test. Everything else? It’s just fine-tuning weights, playing around the edges. Until it starts cutting through the fat and reshaping how organizations really operate, all of this is just more of the same.
Generally with AI, I think the top of society stands to gain a lot more than the middle/bottom of it, for a whole host of reasons. If you think anything different, the framework you're using to reach that conclusion is probably wrong, at least IMO.
I don't like saying this but there is a reason why the "AI bros", VC's, big tech CEO's, etc are all very very excited about this and many employees (some commenting here) are filled with dread/fear. The sales people, the managers, the MBA's, etc stand to gain a lot from this. Fear also serves as the best marketing tool; it makes people talk and spread OpenAI's news more so than everything else. Its a reason why targeting coding jobs/any jobs is so effective. I want to be wrong of course.
Hopefully my skepticism will end up being unwarranted, but how confident are we that the queries are not routed to human workers behind the API? This sounds crazy but is plausible for the fake-it-till-you-make-it crowd.
Also, given the prohibitive compute costs per task, typical users won't be using this model, so the scheme could go on for quite some time before the public knows the truth.
They could also come out in a month and say o3 was so smart it'd endanger the civilization, so we deleted the code and saved humanity!