With the right problem and the right agentic loop, it's clear to me that improvements will speed up.
Four of the top logicians in the world are acknowledging this. It is mind-blowing, and we don't know why.
Several people had problems with Sonnet burning through all their credits grinding on a problem it can't solve. Opus fixes this — it has a confidence threshold below which it exits the task instead of grinding.
"I spent ~$100 last week testing both against multiplication. Sonnet at 37-digit × 37-digit (~10³⁷) never quits — 15+ minutes, 211KB of output, still actively decomposing numbers when I stopped it. Opus will genuinely attempt up to ~50 digits (112K tokens on a real try), starts doubting around 55 digits, and by 80-digit × 80-digit surrenders in 330 tokens / 9 seconds with an empty answer." -- Opus, helping me with the data
The "I don't think this is worth attempting" heuristic is the difference. Sonnet doesn't have it, or has it set much higher. To get Opus and similar models to work on harder problems they assume aren't worth attempting, you have to raise their confidence level.
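As a sketch, the give-up behavior described above might look like the following. Everything here is hypothetical: `estimateConfidence` is a made-up stand-in for whatever internal signal the model uses, not a real API.

```javascript
// Hypothetical sketch of the "worth attempting" gate. The confidence
// function below is invented for illustration: it just decays with the
// operand's digit count.
function estimateConfidence(digits) {
  return Math.max(0, 1 - digits / 60); // assumed: ~0 confidence by 60 digits
}

function attemptMultiplication(digits, confidenceFloor = 0.1) {
  if (estimateConfidence(digits) < confidenceFloor) {
    // Opus-style: exit cheaply with an empty answer instead of grinding.
    return { answer: null, gaveUp: true };
  }
  // Sonnet-style has no floor, so it would keep decomposing here
  // until an external stop (user, timeout, or credit exhaustion).
  return { answer: "long decomposition...", gaveUp: false };
}

console.log(attemptMultiplication(37).gaveUp); // false: worth a real try
console.log(attemptMultiplication(80).gaveUp); // true: exits immediately
```

The only real difference between the two behaviors in this sketch is whether `confidenceFloor` exists at all.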
I'll finish writing this up this week. I'm making flashy data-visualization animations to make the point right now.
The Universe seems against free lunches… if AGI is possible, finding a manager good enough to get the AGI to update its timesheet will not be (in practice).
Also, the “encouragement helps” anecdote seems real in the AlphaEvolve workflow, but I can't see that for public models. Gómez-Serrano says this in Quanta (https://www.quantamagazine.org/the-ai-revolution-in-math-has...rived-20260413/), and the released AlphaEvolve notebooks really do contain prompts like “Good luck, I believe in you...” (https://github.com/google-deepmind/alphaevolve_repository_of...oblems, e.g. https://github.com/google-deepmind/alphaevolve_repository_of...blems/blob/main/experiments/finite_field_kakeya_problem/finite_field_kakeya.ipynb). But those prompts also bundled strong structural hints (“find a general solution”, “better constructions are possible”), so from my reading the evidence is: prompt phrasing matters, especially in an internal search stack, but “pep talks” are not a universal reasoning hack.
Nothing I said contradicts this.
Here is the first attempt of what I'm testing. [0] Haiku can get the correct answer to `floor( (1234567 * 8901234) / 12345 )` or
```
Math.floor(
  (Math.floor(Math.random() * 9000000 + 1000000) *
    Math.floor(Math.random() * 9000000 + 1000000)) /
    Math.floor(Math.random() * 9000000 + 1000000)
)
```
Given this prompt, Haiku gives a correct answer 77.8% of the time. Add or remove a digit and the result is also highly predictable.
That is the WHOLE point. The models are predictable!
Given that prompt, Sonnet at 37-digit × 37-digit (~10³⁷) keeps grinding without quitting a predictable percentage of the time!
And, Opus at 80-digit × 80-digit simply quits after 9 seconds and 333 tokens!
This is the amazing thing people are not discussing. The models are very predictable.
The AI companies are not posting this information because it shows how unreliable the models are. However, I think there is great virtue in the models being consistently unreliable.
[0] https://github.com/adam-s/agent-tuning/blob/main/application...
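A minimal harness for this kind of `floor((a*b)/c)` probe could be sketched as follows. This is my reconstruction, not the linked repo's code: operands are sampled uniformly at a fixed digit count, ground truth is computed exactly with BigInt, and the actual model call is omitted.

```javascript
// Sketch of a predictability probe: generate a random floor((a*b)/c)
// prompt at a fixed digit count, along with its exact expected answer.
function randomNDigit(n) {
  const lo = 10n ** BigInt(n - 1);
  const span = 9n * lo; // values lie in [10^(n-1), 10^n)
  return lo + BigInt(Math.floor(Math.random() * Number(span)));
}

function makeCase(n = 7) {
  const [a, b, c] = [randomNDigit(n), randomNDigit(n), randomNDigit(n)];
  return {
    prompt: `floor( (${a} * ${b}) / ${c} )`,
    // BigInt division truncates, which is floor for positive operands.
    expected: (a * b) / c,
  };
}

const { prompt, expected } = makeCase();
console.log(prompt, "->", expected.toString());
```

Running many such cases per digit count and scoring the model's replies against `expected` is what would produce a stable number like 77.8%.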
Originally LLMs would get stuck in infinite loops generating tokens forever. This is bad, so we trained them to strongly prefer to stop once they reached the end of their answer.
However, training models to stop also gave them "laziness": they might prefer a shorter answer over a meandering one that actually answers the user's question.
Mathematics is unusual because it has an external source of truth (the proof assistant), and also because it requires long meandering thinking that explores many dead ends. This is in tension with what models have been trained to do. So giving them some encouragement keeps them in the right state to actually attempt to solve the problem.
So which one are they?
It's pattern matching on training material. There is almost certainly an overlap between positivity and success in the training material. Positive prompts cause the pattern matching to weight towards positivity and therefore more successful material.
Nice language often sorta does this for whatever model(s) they looked at, and is also something people are likely to try. Probably lots and lots of nonsense token combos would work even better, but who’s gonna try sticking “gerontocratic green giant giraffes” on the end of their prompts to see if it helps?
Positive or negative language likely also prevents pulling the probabilities away from the correct topic, being so generic a thing. The above suggestion might only be ultra-effective if the topic is catalytic converters, for some reason, and push the thing into generating tokens about giraffes otherwise. How would you ever discover the dozens or thousands of more-effective but only-sometimes nonsense token combos? You’d need automation and a lot of brute force, or some better way to analyze the LLM’s database.
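The brute-force search described above can be sketched in a few lines. Everything here is a stand-in: `scoreSuffix` is an arbitrary deterministic stub where a real harness would run the model over an eval set with each suffix and return accuracy.

```javascript
// Sketch of automated suffix search: try candidate prompt suffixes and
// keep whichever scores best. The scoring function is a placeholder.
const CANDIDATES = [
  "You can do it.",
  "Good luck, I believe in you.",
  "gerontocratic green giant giraffes",
];

function scoreSuffix(suffix) {
  // Stub: a real version would measure benchmark accuracy with this
  // suffix appended to the base prompt. This is just deterministic noise.
  return (suffix.length % 7) / 7;
}

function bestSuffix(candidates) {
  return candidates.reduce((best, s) =>
    scoreSuffix(s) > scoreSuffix(best) ? s : best
  );
}

console.log(bestSuffix(CANDIDATES));
```

The point of the sketch is only that the search loop itself is trivial; the expensive part is the thousands of model calls hiding inside a real `scoreSuffix`.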
Models are trained on human outputs. It's not super surprising to me that inputs following encouraging patterns produce better outputs; much of the training material reflects that.
Try to figure it out. You can do it.
The answer is probably more straightforward than we think, e.g. “the user thinks I can do this so I better make sure I didn’t miss anything”
2. Especially early on, the overwhelming majority of the proofs are likely to be uninteresting but novel, simply because actually producing them by hand would take expert time that's better spent elsewhere. That said, as above, over time I expect the interestingness of the proofs to go up until they eventually regularly produce interesting ones. Even then, the vast majority of proofs will remain of no interest to humans, for the simple reason that the vast majority of all proofs are of no interest to humans.
In neither case will I make any particular guesses about a timeline, beyond saying that this seems like the way things will go.
I got out of RLHF work, including games and puzzles, before agents took off, so maybe I have outdated info. But we estimated that RLHF'ing a single hard full-sized Sudoku was ~25 hours of work.
I've seen this mentioned a few times though, so I think maybe it's more complicated than this?
Nevertheless, not having a sense of time makes it really bad at planning anything. I used Gemini Pro.
https://tech.yahoo.com/ai/chatgpt/articles/chatgpt-fails-mis...
- 1 week to prototype: The tool + its accompanying JS sandbox + System prompt updates + context injection
- 11 months of public testing to go through i18n + a11y edge cases & fix them
> When Ryu asked ChatGPT, “it kept giving me incorrect proofs,” [...] he would check its answers, keep the correct parts, and feed them back into the model
So you had a conversational calculator being operated by an actual domain expert.
> With ChatGPT, I felt like I was covering a lot of ground very rapidly
There's no way to convert that feeling into a measurement of any actual value, and we happen to know that domain experts are surprisingly easy to fool outside their own domains.
> “2025 was the year when AI really started being useful for many different tasks,” said Terence Tao
I think I’ll go out on a limb and agree with Terence Tao; I think the dude is well known in the math community, or something
Is AI his specialty?
> I think the dude is well known in the math community, or something
I believe this is called "appeal to authority." Which is why, instead of disagreeing with him, I suggested a more cogent endpoint that could be used to establish the facts the article's title suggests.
I think AI-assisted research will likely have a very negative net impact on mathematics in the long run by lowering the average level of understanding within the community.
Also, research directions are influenced by what people can solve, and this will slowly shift research toward purely algebraic/symbolic manipulations that mathematicians no longer fully keep track of.
We cannot build one.
AI outputting axiomatically valid syntax isn't going to be all that useful. It's possible to generate all axiomatically correct math with a for loop until the machine OOMs.
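To make the "for loop" point concrete, here is a breadth-first enumeration of every theorem of Hofstadter's toy MIU system (axiom `MI`, rules xI→xIU, Mx→Mxx, III→U, UU→ε). Everything it emits is axiomatically valid; essentially none of it is interesting.

```javascript
// Enumerate MIU theorems breadth-first from the single axiom "MI".
function step(s) {
  const next = new Set();
  if (s.endsWith("I")) next.add(s + "U");          // rule: xI  -> xIU
  next.add("M" + s.slice(1) + s.slice(1));         // rule: Mx  -> Mxx
  for (let i = 0; i + 3 <= s.length; i++)
    if (s.slice(i, i + 3) === "III")
      next.add(s.slice(0, i) + "U" + s.slice(i + 3)); // rule: III -> U
  for (let i = 0; i + 2 <= s.length; i++)
    if (s.slice(i, i + 2) === "UU")
      next.add(s.slice(0, i) + s.slice(i + 2));       // rule: UU  -> (empty)
  return next;
}

function enumerateTheorems(limit = 20) {
  const seen = new Set(["MI"]);
  const queue = ["MI"];
  while (queue.length && seen.size < limit) {
    for (const t of step(queue.shift())) {
      if (!seen.has(t)) { seen.add(t); queue.push(t); }
    }
  }
  return [...seen];
}

console.log(enumerateTheorems(10));
```

Validity is free here; the hard part, which this loop does nothing about, is deciding which of the endless valid strings anyone should care about.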
Physics is not math and math is not physics.