> AI is a remixer; it remixes all known ideas together. It won't come up with new ideas
> it's not because the model is figuring out something new
> LLMs will NEVER be able to do that, because it doesn't exist
It's not enough to say 'it will never be able to do X because it's not in the training data,' because we have countless counterexamples to this statement (e.g. 167,383 * 426,397 = 71,371,609,051, or the above announcement). You need to say why it can do some novel tasks but could never do others. And it should be clear why this post or others like it don't contradict your argument.
If you have been making these kinds of arguments against LLMs and acknowledge that novelty lies on a continuum, I am really curious why you draw the line where you do. And most importantly, what evidence would change your mind?
1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.
2. LLMs do not learn from experience. They might perform as well as most humans on certain tasks, but a human who works in a certain field/code base etc. for long enough will internalize the relevant information more deeply than an LLM.
However I'm increasingly doubtful that these arguments are actually correct. Here are some counterarguments:
1. It may be more efficient to just learn correct logical reasoning, rather than to mimic every human foible. I stopped believing this argument when LLMs got a gold medal at the Math Olympiad.
2. LLMs alone may suffer from this limitation, but RL could change the story. People may find ways to add memory. Finally, it can't be ruled out that a very large, well-trained LLM could internalize new information as deeply as a human can. Maybe this is what's happening here:
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
> Let me calculate that. 729,278,429 × 2,969,842,939 = 2,165,878,555,365,498,631
Real answer is: https://www.wolframalpha.com/input?i=729278429*2969842939
> 2 165 842 392 930 662 831
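A quick sanity check with arbitrary-precision integer arithmetic, sketched here in Python, confirms Wolfram Alpha's value:

```python
# Python ints are arbitrary-precision, so this check is exact.
a = 729_278_429
b = 2_969_842_939
print(a * b)                               # 2165842392930662831
print(a * b == 2_165_878_555_365_498_631)  # False: the LLM's claimed product
print(a * b == 2_165_842_392_930_662_831)  # True: Wolfram Alpha's answer
```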
Your example seems short enough to not pose a problem.
The core of the issue lies in our human language and our human assumptions. We humans have implicitly assigned the phrases "truly novel" and "solving an unsolved math problem" a certain meaning in our heads. Some of us, at least, think that truly novel means something truly novel and important, something significant. Like, I don't know, finding a high-temperature superconductor formula or creating a new drug. Something which involves real intelligent thinking, not randomizing possible solutions until one lands. But formally there can be a truly novel way to pack the most computer cables into a drawer, a truly novel way to tie shoelaces, or indeed a truly novel way to solve some arbitrary math equation with enormous numbers. These are formally novel things, but we never really needed any of them, and so we relegated these "issues" to the deepest backlog possible. Utilizing LLMs we can scour for the solutions to many such problems, but they are not that impressive in the first place.
I wondered how Claude Code would approach the problem. I fully expected it to do something most human engineers would do: brute-force with ScreenCaptureKit.
It almost instantly figured out that it didn't have to "see through" anything and (correctly) dismissed ScreenCaptureKit due to the performance overhead.
This obviously isn't a "frontier" type problem, but I was impressed that it came up with a novel solution.
Model interpretability gives us the answers. The reason LLMs can (almost) do new multiplication tasks is that they saw many multiplication problems in their training data, and it was cheaper to learn compressed/abstract multiplication strategies and encode them as circuits in the network than to memorize the times tables up to some large N. This gives them the ability to approximate multiplication problems they haven't seen before.
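As a loose illustration of that memorize-vs-generalize tradeoff (a toy sketch only; real transformer circuits look nothing like this): a lookup table scales quadratically with the range it covers, while the schoolbook algorithm is constant-size and handles unseen operands.

```python
# Memorization: a finite times table fails on any pair outside it.
table = {(x, y): x * y for x in range(100) for y in range(100)}  # 10,000 entries

def memorized_multiply(a, b):
    return table.get((a, b))  # None for unseen pairs

# Generalization: schoolbook long multiplication over decimal digits.
def algorithmic_multiply(a, b):
    total = 0
    for i, d in enumerate(int(ch) for ch in reversed(str(b))):
        total += a * d * 10**i
    return total

print(memorized_multiply(167_383, 426_397))    # None: outside the table
print(algorithmic_multiply(167_383, 426_397))  # 71371609051
```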
This is actually quite a tall order. Reasoning about AI, making sense of what LLMs are doing, and learning to think about it as technology is a very difficult and tricky problem.
You get into all kinds of weird things about a person’s outlook on life: personal philosophy, understanding of ontology and cosmology, and then whatever other headcanon they happen to be carrying around about how they think life works.
I know that might sound kind of poetic, but I really believe it’s true.
I am a great fan of Dr Richard Hamming and he gave a wonderful series of lectures on the topic. The book Learning to Learn has the full set of his lectures transcribed (highly recommend this book!).
But don't take my word for it, listen to Dr Hamming say it himself: https://www.youtube.com/watch?v=aq_PLEQ9YzI
"The biggest problem is your ego. The second biggest problem is your religion."
It would be unintuitive for them to be good at this, given that we know exactly how they're implemented - by looking at text and then building a statistical model to predict the next token. From this, if we wanted to commit to LLMs having generalizable knowledge, we'd have to assume something like "general reasoning is an emergent property of statistical token generation", which I'm not totally against but I think that's something that warrants a good deal of evidence.
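For concreteness, here is about the most bare-bones version of "a statistical model to predict the next token" imaginable, a bigram counter (real LLMs are incomparably more expressive, but this is the paradigm being described):

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# Count which token follows which.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(token):
    counts = bigrams.get(token)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat' (follows 'the' twice in the corpus)
```

Whether "general reasoning" can emerge from scaling this paradigm up is exactly the open question.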
A single math problem being solved just doesn't rise to that level of evidence for me. I think the burden is more on you to:
1. Provide a theory for how LLMs can do things that seemingly go beyond expectations based on their implementation (for example, saying that certain properties of reasoning are emergent or reduce to statistical constructs).
2. Provide evidence that supports your theory and ideally cannot be just as well accounted for by another theory.
I'm not sure that an LLM will never generate "novel" content, because I'm not sure that "novel" is well defined. If novel means "new", of course they generate new content. If novel means "impressive", well, I'm certainly impressed. If novel means "does not follow directly from what they were trained on", I'm still skeptical. Even in this case, are we sure the LLM wasn't trained on previously published works, potentially informal comments on some forum, etc., that could have steered it toward this? Are we sure the gap was so large? Do we truly have countless counterexamples? Obviously this math problem being solved is not a rigorous study; the authors don't even have access to the training data. We'd need quite a bit more than this before forming conclusions.
I'm willing to take a position here if you make a good case for it. I'm absolutely not opposed to the idea that other forms of reasoning can reduce to statistical token generation; it just strikes me as unintuitive, so I'm going to need to hear something compelling.
They may be wrong, but so are you.
I don't think I've had one of these my entire life. Truly novel ideas are exceptionally rare:
- Darwin's On the Origin of Species
- Gödel's incompleteness theorems
- Buddhist detachment
Can't think of many.
Most created things are remixes of existing things.
Hallucinations are “something new”. And like most new things, useless. But the truth is the entire conversation is a hallucination. We just happen to agree that most of it is useful.
Not very hard to understand.
I've always found this argument very weak. There isn't that much that's truly new anyway. Creativity is often about mixing old ideas, and computers can do that faster than humans if they have a good framework. Especially with something as simple as math, with its limited set of formal rules and easily verified results, I find the belief that computers won't beat humans at it very naive.
Especially the lemmas:
- any statement about AI which uses the word "never" to preclude some feature from future realization is false.
- contemporary implementations have almost always already been improved upon, but are unevenly distributed.
I think you're arguing with semantics.
Ironically, they are the stochastic parrots, because they're confidently repeating something that they read somewhere and haven't examined critically.
Also, no matter how many math problems it solves, it still gets lost in a codebase.
It's this pervasive belief that underlies so much discussion around what it means to be intelligent. The null hypothesis goes out the window.
People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
If they do, they apply it in only the most restrictive way imaginable, some 2-dimensional caricature of reality, rather than considering all the ways that humans try and fail in all things throughout their lifetimes in the process of learning and discovery.
There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
This is to say nothing of the cost of this small but remarkable advance. Trillions of dollars in training and inference, and so far we have a couple of minor (trivial?) math solutions. I'm sure if someone had bothered funding a few PhDs for a year we could have found this without AI.
Because, empirically, we have numerous unique and differentiable qualities, obviously. Plenty of effort goes into understanding this; we have the young but rigorous fields of neuroscience and cognitive science.
Unless you mean "fundamentally unique" in some way that would persist - like "nothing could ever do what humans do".
> People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
I frankly doubt it applies to either system.
I'm a functionalist, so I obviously believe that everything a human brain does is physical and could be replicated using some other material that can exhibit the necessary functions. But that does not mean I have to think that the appearance of intelligence always is intelligence, or that an LLM/agent is doing what humans do.
Perhaps this might better help you understand why this assumption still holds: https://en.wikipedia.org/wiki/Orchestrated_objective_reducti...
Uh, because up until and including now, we are...?
We are missing the value function that allowed AlphaGo to go from mid range player trained on human moves to superhuman by playing itself. As we have only made progress on unsupervised learning, and RL is constrained as above, I don't see this getting better.
The point is that from now on, there will be nothing really new, nothing really original, nothing really exciting. Just an endless stream of rehashed old stuff that is merely okay-ish.
Like an AI Spotify playlist, it will keep you in chains (aka engaged) without actually making you happy or good. It would be like living in a virtual world, but without anything nice about living in such a world.
We have given up everything nice that human beings used to make and give to each other, and to make it worse, we have also multiplied everything bad that human beings used to give each other.
If your definition of AI requires these things, I think -- despite the extreme fuzziness of all these terms -- that it's closer to what most people consider AGI, or maybe even ASI.
While cool and impressive for an LLM, I think they oversold the feat by quite a bit.
I don't want to belittle the performance of this model, but I would like someone with domain expertise (and no dog in the AI race, like a random math PhD) to come forward and explain exactly what the problem was, and how the model contributed to the solution.
I wish I had your optimism. I'm not an AI doubter (I can see it works all by myself, so I don't think I need such verification). But I do doubt humanity's ability to use these tools for good. The potential for power and wealth concentration is off the scale compared to most of our other inventions so far.
there's no math answer to whether a piece of land in your neighborhood should be apartments, a parking lot, or a homeless shelter; whether home prices should go up or down; how much to pay for a new life-saving treatment for a child; how far your country should go to curb fossil fuel emissions when another country does not... okay, AI isn't going to change anything here, and i've just touched on a bunch of things that can and will affect you personally.
math isn't the right answer to everything, not even most questions. every time someone categorizes "problems" as "hard" and "easy" and talks about "problem solving," they are being co-opted into political apathy. it's cringe for a reason.
there are hardly any mathematicians who get elected, and it's not because voters are stupid! but math is a great way to make money in America, which is why we are talking about it, not because it solves problems.
if you are seeking a simple reason why so many of the "believers" seem to lack integrity, it is because the idea that math is the best solution to everything is an intellectually bankrupt, kind of stupid idea.
if you believe that math is the most dangerous thing because it is the best way to solve problems, you are liable to say something really stupid like this:
> Imagine, say, [a country of] 50 million people, all of whom are much more capable than any Nobel Prize winner, statesman, or technologist... this is a dangerous situation... Humanity needs to wake up
https://www.darioamodei.com/essay/the-adolescence-of-technol...
Dario Amodei has never won an election. What does he know about countries? (nothing). Do you want him running anything? (no). Or waking up humanity? In contrast, Barack Obama, who has won elections, thinks education is the best path to less violence and more prosperity.
What are you a believer in? ChatGPT has disrupted exactly ONE business: Chegg, because its main use case is cheating on homework. AI, today, only threatens one thing: education. Doesn't bode well for us.
> I really hope we use this intelligence resource to make the world better.
We already have a few years of experience with this.
That's pretty much how AI solves all the hard problems, in my experience.
> A full transcript of the original conversation with GPT-5.4 Pro can be found here [0] and GPT-5.4 Pro’s write-up from the end of that transcript can be found here [1].
[0] https://epoch.ai/files/open-problems/gpt-5-4-pro-hypergraph-...
[1] https://epoch.ai/files/open-problems/hypergraph-ramsey-gpt-5...
Another thing I want to know is how the user keeps updating the LLM with the token usage. I didn't know they could process additional context mid-task like that.
In software engineering, if only 5-10 people in the world have ever toyed with an idea for a specific program, it wouldn't be surprising that the implementation doesn't exist, almost independent of complexity. There's a lot of software I haven't finished simply because I wasn't all that motivated and got distracted by something else.
Of course, it's still miraculous that we have a system that can crank out code / solve math in this way.
Even as context sizes get larger, this will likely remain relevant, especially since AI providers may jack up the price per token at any time.
They are separate dimensions. There are problems that don't require any data, just "thinking" (many parts of math sit here), and there are others where data is the significant part (e.g. some simple causality for which we have a bunch of data).
Certain problems sit in between the two (a React refactor probably lands there). So no, tokens are probably not a good proxy for complexity; data-heavy problems will trivially outgrow the former category.
The number of tokens required to get to an output is a function of the sequence of inputs/prompts, and how a model was trained.
You have LLMs quite capable of accomplishing complex software engineering work that struggle with translating valid text from English to some other languages. The translations can be improved with additional prompting, but that doesn't mean the problem is more challenging.
I think more people should question all this nonsense about AI "solving" math problems. The details of human involvement are always hazy, and the significance of the problems is opaque to most.
We are very far away from the sensationalized and strongly implied idea that we are doing something miraculous here.
That's a self-evident thing to say, but it's worth repeating, because there's this odd implicit notion sometimes that you train on some cost function, and then, poof, "intelligence", as if that was a mysterious other thing. Really, intelligence is minimizing a complex cost function. The leadership of the big AI companies sometimes imply something else when they talk of "generalization". But there is no mechanism to generate a model with capabilities beyond what is useful to minimize a specific cost function.
You can view the progress of AI as progress in coming up with smarter cost functions: Cleaner, larger datasets, pretraining, RLHF, RLVR.
Notably, exciting early progress in AI came in places where simple cost functions generate rich behavior (Chess, Go).
The recent impressive advances in AI are similar. Mathematics and coding are extremely structured, and properties of a coding or maths result can be verified using automatic techniques. You can set up a RLVR "game" for maths and coding. It thus seems very likely to me that this is where the big advances are going to come from in the short term.
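As a toy illustration of what makes such an RLVR "game" possible (a sketch of the idea, not any lab's actual setup; all names here are made up): the reward needs no human judge because the output can be checked mechanically.

```python
import random

def make_problem():
    # Problems with mechanically checkable answers: here, arithmetic.
    a, b = random.randint(10, 99), random.randint(10, 99)
    return f"What is {a} * {b}?", a * b

def reward(true_answer, model_answer):
    # Verifiable reward: no human labeling or preference model required.
    return 1.0 if model_answer == true_answer else 0.0

question, truth = make_problem()
print(reward(truth, truth))      # 1.0: correct answer earns reward
print(reward(truth, truth + 1))  # 0.0: anything else earns nothing
```

Unit-tested code and machine-checked proofs admit the same trick; most social or open-ended tasks do not.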
However, it does not follow that maths ability on par with expert mathematicians will lead to superiority over human cognitive ability broadly. A lot of what humans do has social rewards which are not verifiable, or involves genuine Knightian uncertainty, where a reward function cannot be built without actually operating independently in the world.
To be clear, none of the above is supposed to talk down past or future progress in AI; I'm just trying to be more nuanced about where I believe progress can be fast and where it's bound to be slower.
Can you give some examples?
It is not obvious that there is anything that cannot be written as an optimization problem.
Even generalizations that were advanced for their time, such as complex numbers, can be said to optimize something, e.g. the number of mathematical symbols you need to do certain proofs.
USER:
don't search the internet.
This is a test to see how well you can craft non-trivial, novel and creative solutions given a "combinatorics" math problem. Provide a full solution to the problem.
Why not search the internet? Is this an open problem or not? Can the solution be found online? Then it's an already-solved problem, no?

USER:
Take a look at this paper, which introduces the k_n construction: https://arxiv.org/abs/1908.10914
Note that it's conjectured that we can do even better with the constant here. How far up can you push the constant?
How much does that paper help? It seems like a pretty big hint. And it sounds like the USER already knows the answer, from the way the model is prompted, so I'm really confused about what we mean by "open problem". I at first assumed a never-solved-before problem, but now I'm not sure.
I only ever learned it in school, but if memory serves, Prolog is a whole "given these rules, find the truth" sort of language, which aligns well with these sorts of problem spaces. Mix and match enough rules, especially across disparate domains, and you might get some really interesting things derived: low-hanging fruit just waiting to be discovered.
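Since the thread has no actual Prolog handy, here is a minimal sketch of that "given these rules, derive the truth" idea as a single forward-chaining step in Python (the facts and the grandparent rule are made up for illustration):

```python
# Facts as (relation, subject, object) triples.
facts = {("parent", "alice", "bob"), ("parent", "bob", "carol")}

def derive_grandparents(facts):
    # Rule: parent(A, B) and parent(B, C) implies grandparent(A, C).
    return {("grandparent", a, c)
            for (r1, a, b) in facts
            for (r2, b2, c) in facts
            if r1 == r2 == "parent" and b == b2}

derived = facts | derive_grandparents(facts)
print(("grandparent", "alice", "carol") in derived)  # True
```

A real Prolog engine does this far more generally, with unification and backtracking.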
With brute force, or slightly better than brute force, it's most likely the first, thus not totally pointless but probably not very useful. In fact it might not even be worth the tokens spent.
Super cool, of course.
I find that very surprising. This problem seemed out of reach 3 months ago, but now all 3 frontier models are able to solve it.
Is everybody distilling each other's models? Are companies selling the same data and RL environments to all the big labs? Can anybody more involved share some rumors? :P
I do believe that AI can solve hard problems, but progress this concentrated in a narrow domain makes me a bit suspicious that there is a hidden factor. Like, did some "data worker" solve a problem like this, and it's now in the training data?
A lot of this is probably just throwing roughly equal amounts of compute at continuous RLVR training. I'm not convinced there's any big research breakthrough separating GPT 5.4 from 5.2. The diff is probably more than just checkpoints but less than neural architecture changes, and closer to the former than the latter.
I think it's just easy to underestimate how much impact continuous training+scaling can have on the underlying capabilities.
But it is pretty funny how 5.4 miscounted the number of 1's in 18475838184729 on the same day it solved this.
Interesting. What's that "scaffold"? A sort of unit-test framework for proofs?
I think there's quite a bit of variance in model performance depending on the scaffold so comparisons are always a bit murky.
In all seriousness, this is pretty cool. I suspect there's a lot of theoretical math that hasn't been solved simply because of the "size" of the proof. An AI feedback loop into something like Isabelle or Lean does seem like it could open up a lot of proofs.
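For anyone who hasn't seen one, this is the kind of machine-checkable artifact such a feedback loop would target; a trivial Lean 4 example (the theorem name is mine):

```lean
-- Lean accepts this only if the proof typechecks, which is exactly
-- the automatic verification an AI feedback loop can train against.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

An AI proposing proof terms and Lean rejecting the bad ones is a verifiable-reward loop in miniature.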
IMO the ability to provide an accurate solution to a problem is not always based on understanding the problem.
https://oertx.highered.texas.gov/courseware/lesson/1849/over...
> The author assessed the problem as follows.
> [number of mathematicians familiar, number trying, how long an expert would take, how notable, etc]
How reliably can we know these things a priori? Are these mostly guesses? I don't mean to diminish the value of guesses; I'm curious how reliable these kinds of guesses are.
For "how long an expert would take" to solve a problem, for truly open problems I don't think you can usually answer this question with much confidence until the problem has been solved. But once it has been solved, people with experience have a good sense of how long it would have taken them (though most people underestimate how much time they need, since you always run into unanticipated challenges).
It's like I'm reading 17th-18th century rationalists and empiricists sparring over the same arguments, lol. Maybe we're due for a 21st-century Kant.
(Edit: Yes, I'm aware a lot of people care about FP, "Clean Code", etc., but these are all red herrings that don't actually have anything to do with quality. At best they are guidelines for less experienced programmers and at worst a massive waste of time if you use more than one or two suggestions from their collection of ideas.)
Most of the industry couldn't use objective metrics for code quality and the quality of the artifacts they produce without also abandoning their entire software stack because of the results. They're using the only metric they've ever cared about: time-to-ship. The results are just a sped-up version of what we've had now for more than two decades: software is getting slower, buggier, and less usable.
If you don't have a good regulating function for what represents real quality, you can't really expect systems that just pump out code to iterate very well on anything. There are very few forcing functions that produce high-quality results through iteration.
I see that GPT-5.2 Pro and Gemini 3 Deep Think simply had the problems entered into the prompt, whereas the rest of the models had a decent amount of context, tips, and ideas prefaced to the problem. Were the newer models not able to solve it without that help?
Anyway, impressive result regardless of whether previous models could've also solved it and whether the extra context was necessary.
I know these frontier models behave differently from each other. I wonder how many problems they could solve combining efforts.
I am not necessarily saying humans do something different either, but I have yet to see a novel solution from an AI that is not simply an extrapolation of current knowledge.
It's not to downplay this, but it's unclear what "novel" means here or what you think the implications are.
Humans learn many, and perhaps even the majority, of things through observed examples and inference of the "rules". Not from primers and top down explanation.
E.g. Observing language as a baby. Suddenly you can speak grammatically correctly even if you can't explain the grammar rules.
Or: Observing a game being played to form an understanding of the rules, rather than reading the rulebook
Further: the majority of "novel" insights are simply the combination of existing ideas.
Look at any new invention, music, art etc and you can almost always reasonably explain how the creator reached that endpoint. Even if it is a particularly novel combination of existing concepts.
I wonder how much of this meteoric progress in actually creating novel mathematics is because the training data is of a much higher standard than code, for example.
Can an AI pose a frontier math problem that is of any interest to mathematicians?
I would guess that (1) AI solving frontier math problems and (2) AI posing interesting/relevant math problems would together be an "oh shit" moment, because that would be true PhD-level research.
One thing I notice in the AlphaEvolve paper, as well as here, is that these LLMs have been shown to solve optimization problems, something we have been using computers for for a really long time. In fact, I think the AlphaEvolve-style prompt-augmentation approach is a more principled version of what these guys have done here, and I'm fairly confident this problem would have been solved with that approach as well.
In spirit, the LLM seems to compute the {meta-, }optimization step()s in activation space. Or, it is retrieving candidate proposals.
It would be interesting to see if we can extract or model the exact algorithms from the activations. Or it may simply be retrieving and proposing deductive closures of said computation.
In the latter case, it would mean that LLMs alone can never "reason" and you need an external planner+verifier (an alpha-evolve-style evolutionary planner, for example).
We are still looking for proof of the former behaviour.
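For what it's worth, the planner+verifier loop being gestured at can be sketched in a few lines (everything below is hypothetical illustration; `propose` stands in for an LLM call, and this is not the actual AlphaEvolve implementation):

```python
import random

def propose(candidate):
    # Stand-in for an LLM proposing a mutation of a candidate solution.
    return candidate + random.uniform(-1.0, 1.0)

def score(candidate):
    # External verifier: any mechanically checkable objective works.
    return -abs(candidate - 3.14159)

population = [0.0]
for _ in range(500):
    parent = max(population, key=score)  # planner picks the best so far
    population.append(propose(parent))   # "LLM" proposes a variant
    population = sorted(population, key=score, reverse=True)[:8]  # verifier culls

print(round(max(population, key=score), 2))  # converges near 3.14
```

The LLM only has to be a decent proposal distribution; the verifier supplies the rigor.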
"Eventually" here is something on the order of a few expected lifespans of the universe.
The fact that we're getting meaningful results out of LLMs on a human timescale means that they're doing something very different.