> First, the problems were manually translated into formal mathematical language for our systems to understand.
The non-geometry problems which were solved were all of the form "Determine all X such that…", and the resulting theorem statements are all of the form "We show that the set of all X is {foo}". The downloadable solutions from https://storage.googleapis.com/deepmind-media/DeepMind.com/B... don't make it clear whether the set {foo} was decided by a human during this translation step, or whether the computer found it. I want to believe that the computer found it, but I can't find anything to confirm. Anyone know?
> However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.
So didn't actually solve autoformalization, which is why they still needed humans to translate the input IMO 2024 problems.
The reason why formalization is harder than you think is that there is no way to know if you got it right. You can use Reinforcement Learning with proofs and have a clear signal from the proof checker. We don't have a way to verify formalizations the same way.
To you or me, sure. But I think the proof that it isn't for this AI system is that they didn't do it. Asking a modern LLM to "translate" something is a pretty solved problem, after all. That argues strongly that what was happening here is not a "translation" but something else, like a semantic distillation.
If you ask an AI (or a person) to decide the halting problem, they can't. If you "translate" the question into a specific example that does halt, they can run it and find out.
I'm suspicious, basically.
> While the problem statements were formalized into Lean by hand, the answers within the problem statements were generated and formalized by the agent.
However, it's unclear what initial format was given to the agents that allowed this step
To me, this sounds like AlphaProof receives a "problem", whatever that is (how do you formalize "determine all X such that..."? One is asked to show that an abstract set is actually some easily understandable set...). Then it generates candidate theorems, presumably in Lean, i.e. the set is {n : P(n)} for some formula or something. Then it searches for proofs.
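As a toy illustration (a made-up example, not one of the actual IMO formalizations): the answer set has to appear in the theorem statement itself, so either a human or the system must produce it before proof search can even begin.

```lean
-- Hypothetical toy version of a "determine all X" problem:
-- "determine all natural numbers n such that n ^ 2 = n".
-- The candidate answer {0, 1} is part of the statement, so whoever
-- writes this theorem has already "found" the answer; the proof
-- search only certifies it.
example : {n : ℕ | n ^ 2 = n} = {0, 1} := by
  sorry -- the system would have to find both the set and this proof
```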
I think if AlphaProof did not find {foo} but was given it, then it would be outrageous to claim that it solved the problem.
I am also very hyped.
> AlphaProof solved two algebra problems and one number theory problem by determining the answer and proving it was correct.
still good progress nonetheless. won't call the system sufficient by itself tho.
Then he gave me a huge hint to the solution, after which it only took me a couple of hours to solve.
(Formalizing the solution is of course the hardest part, and might serve as a good masters dissertation I think)
But seriously, if you can't ask the LLM to solve the right question, you can't really expect it to give you the right answer unless you're really lucky. "I'm sorry, but I think you meant to ask a different question. You might want to check the homework set again to be sure, but here's what I think you really want."
Some people call this programming
Lean isn't just a formal language; it is also a theorem prover. Could the IMO participants use the nlinarith tactic? Could they use other tactics?
Of course not, they had to show their work!
Could they have mathematicians translate the problem statements into the formal language for them?
Of course not, they had to do it themselves. In "How to solve it" Polya stresses multiple times that formalizing the initial question is an important part of the process.
Then, the actual computational resources expressed in time are meaningless if one has a massive compute cloud.
I'm a bit dissatisfied with the presentation, same as with the AlphaZero comparison to an obsolete Stockfish version that has been debunked multiple times.
That means that "AI solves IMO problems better than 75% of the students", which is probably even more impressive.
But, "minutes for one problem and up to 3 days for each remaining problem" means that this is unfortunately not a true representation either. If these students were given up to 15 days (5 problems at "up to 3 days each") instead of 9h, there would probably be more of them that match or beat this score too.
It really sounds like AI solved only a single problem in the 9h students get, so it certainly would not be even close to the medals. What's the need to taint the impressive result with apples-to-oranges comparison?
Why not be more objective and report that it took longer but was able to solve X% of problems (or scored X out of N points)?
Time is not a very interesting dimension here, because humans don't use the same CPU as huge GPU clusters. The binary "is it able to reach a solution given enough resources?" is more interesting (for GPT/Claude the answer is a clear negative).
If these problems were important to solve, redundantly by thousands or millions of people (like the real work that most people do), far more people would put in the effort to learn how to solve these problems.
It's just a weird comparison. Contests are very artificially structured in ways that don't make sense for comparing to computers.
Now, you may say how time is not a useful dimension here, but really, this is where we are seeing a lot of these advances come from: general researchers today do get access to huge compute capability, allowing them to quickly iterate on different ideas. In a sense, they can be less smart about their use of resources and simply try things out: this does drive the innovation (compared to waiting for their turn on a supercomputer nearby).
And finally, time is essential even for humans: given a couple hundred years, they will find the proof for Fermat's last theorem, but they might not do it in 4.5h. Since we are comparing AI capabilities to humans in the article, it's very possible that increased compute will never allow AI to find novel proofs we have not come up with either. That's where the AI bit comes in: we know that brute searching through the entire possible space of proofs is still too expensive for our compute capabilities, so we need AI to emulate our "intuition" when looking for the direction to narrow down the search.
So there are really two reasons time matters: 1. getting enough compute might still be far away (heck, prime factorization and elliptic curves are still the basis of most of the world's cryptography for that reason) and 2. maybe it's not even enough to increase compute capability to make huge jumps in problem solving capabilities (iow, maybe we are reaching the maximum of where the approach can take us).
In other words, a model which can solve any/most school/college exam problems isn't necessarily smart. It can be just a database, or, in fact, a lookup table. A smarter version can be a lookup plus a one-step test. Not saying it's bad, but it doesn't scale to less formalized domains. BTW, in this case the formalization was done by humans.
In case this confuses anyone: the high school students in question are not a standard sample of high school students. AFAIK, they are teams of the ~6 strongest competitive problem solving high school students from each country.
Most of DeepMind’s research is a cost-centre for the company. These press releases help justify the continued investment both to investors and to the wider public.
The effect of establishing oneself as the thought leader in a field is enormous.
For example, IBM's stock went up 15% the month after they beat Kasparov.
I'd say "welcome to the web" but this was true in 1800s newspapers as well.
By contrast, this kind of open ended formalization is something where progress used to be extremely slow and incremental. I worked in an adjacent field 5 years ago and I cannot stress enough that these results are simply out of reach for traditional automated reasoning techniques.
Real automatic theorem proving is also useful for a lot more than pure mathematics. For example, it's simple to write out an axiomatic semantics for a small programming language in Lean and pose a question of the form "show that there exists a program which satisfies this specification".
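A minimal sketch of what that could look like in Lean 4 (a deliberately tiny toy language, not any real verification setup):

```lean
-- A toy expression language with an evaluation semantics.
inductive Expr
  | lit : Nat → Expr
  | add : Expr → Expr → Expr

def eval : Expr → Nat
  | .lit n   => n
  | .add a b => eval a + eval b

-- The "specification": show there exists a program that evaluates
-- to 4. A prover has to synthesize a witness, i.e. write the program.
example : ∃ e : Expr, eval e = 4 :=
  ⟨.add (.lit 2) (.lit 2), rfl⟩
```

Scaling this from toy expressions to memory safety or concurrency protocols is exactly where the search space blows up and learned guidance becomes interesting.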
If this approach scales it'll be far more important than any other ML application that has come out in the last few years.
The blog post indicates the opposite. The geometry problem in the IMO problem set was solved by AlphaGeometry 2, which is an LLM based on Google's Gemini. LLMs are considered relatively general systems. But the other three solved problems were proved by AlphaProof, which is a narrow RL system that is literally based on AlphaZero, the Go and Chess AI. Only its initial (bootstrapping) human training data (proofs) were formalized and augmented by an LLM (Gemini).
Don't dismiss search, it might be brute force but it goes beyond human level in Go and silver at IMO. Search is also what powers evolution which created us, also by a lot of brute forcing, and is at the core of scientific method (re)search.
That makes it, in principle, similar to or even easier than a championship-level chess move, which often takes more than an hour for a professional human (with more training than an IMO high school student) to solve.
Another interesting concern is that when posing a problem to humans, it's fine to pose an "easy" brute-forceable problem, because humans, being slow brute-searchers, need to find more clever solutions. But if you give such a problem to a computer, it can trivialize it. So to test a computer, you need to pose non-easily-brute-forceable problems, which are harder for the computer than the others, but equally difficult for the humans as the other problems are.
The issue is that to find solutions for useful problems you're often searching through highly complex and often infinite solution spaces.
You don't need to be able to prove very hard problems to do useful work. Proving just simple things is often enough. If I ask a language model to complete a task, organize some entries in a certain way, or schedule this or that, write a code that accomplishes X, the result is typically not trustworthy directly. But if the system is able to translate parts of the problem to logic and find a solution, that might make the system much more reliable.
But MCTS was always promising when married to large NNs and DeepMind/Brain were always in front on it.
I don’t know who fucked up on Gemini and it’s concerning for Alphabet shareholders that no one’s head is on a spike. In this context “too big to fail” is probably Pichai.
But only very foolish people think that Google is lying down on this. It’s Dean and Hassabis. People should have some respect.
This is important for more than Math problems. Making ML models wrestle with proof systems is a good way to avoid bullshit in general.
Hopefully more humans will write types in Lean and similar systems as a much better way of writing prompts.
People always want to knock generative AIs for not being able to reason, yet we've had automated systems that reason perfectly well for decades, but for some reason those don't count as AI to people.
In my understanding, proofs are usually hard to transcribe into Lean, which is why nobody _writes_ proofs using Lean.
What is a nlinarith?
The nlinarith documentation has the example:
0 ≤ x^2 if x : ℝ
which humans simply use without proof. The IMO doesn't challenge participants to prove everything, only the main ideas.
Yes, it is true, but getting onto the country-specific team is itself an arduous journey, involving brutal winnowing at every step, e.g. the regional math olympiad, then the national math olympiad, etc.
this is then followed by further trainings specifically meant for this elite bunch, and maybe further eliminations etc.
suffice it to say, that qualifying to be in a country specific team is imho a big deal. getting a gold/silver from amongst them is just plain awesome !
Source: a friend who got silver on the IMO
I found the work environment to be more important than the subject when it comes to work satisfaction. If you are working on a world changing subject with a team of assholes, you are going to have a bad time, some people really have a skill for sucking the fun out of everything, and office politics are everywhere, especially on world changing subjects.
On the other hand, you can have a most boring subject, say pushing customer data to a database, and have the time of your life: friendly team, well designed architecture, time for experimentation and sharing of knowledge, etc... I have come to appreciate the beauty of a simple thing that just works. It is so rare, maybe even more rare than scientific breakthroughs.
Now, you can also have an awesome work environment and an awesome subject, it is like hitting the jackpot... and a good reason to be envious.
Pretty much all the top AI labs are both intensely competitive and collaborative. They consist of many former IMO and IOI medalists. They don't believe in remote work, either. Even if you work at Google DeepMind, you really need to be in London for this project.
Yet no one cares. Everyone's busy watching Magnus Carlsen.
We are human. This means we care about what other humans do. We only care about machines insofar as it serves us.
This principle is broadly extensible to work and art. Humans will always have a place in these realms as long as humans are around.
The proof will only have value if it's meaningful to us.
Nah, as a consumer it makes no difference to me if a meat packing factory or Amazon warehouse employs 5000 or 5 people. To art, this principle is totally real, but for work, it only applies to some/most of it.
Imagine a machine doing "work" that only serves itself and other machines that does no service to humanity. It would have no economic value. In fact the whole concept of "work" only makes sense if it is assigned economic value by humans.
Actually, I was looking up Elo ratings of the top computer chess players, and learned that it is not that trivial to compare these, due to differences in hardware requirements and whatnot.
Fischer was probably the last great player who was unassisted by tools.
IMHO, the largest contributors to AlphaProof were the people behind Lean and Mathlib, who took on the daunting task of formalizing the entirety of mathematics themselves.
This lack of formalization in math papers is what killed earlier attempts at automation, because AI researchers had to wrestle with the human element of figuring out each author's own notation, implicit knowledge, skipped proof steps...
This seems so weird to me - AGI is undefined as a term, imo, but why would you expect "producing something generally intelligent" (i.e. median-human-level intelligence) to be significantly harder than "this thing is better than Terence Tao at maths"?
On the other hand, navigating the real world mostly consists in employing a ton of heuristics we are still kind of clueless about.
At the end of the day, we won't know before we get there, but I think my reasons are compelling enough to think what I think.
In the meantime, while DeepBlue beat the world chess champion, Kasparov, at chess, our best efforts at generalism - LLMs that many (not me!) think are the path to AGI - struggle to play tic tac toe.
"Better than Terence Tao at solving certain formalized problems" (not necessarily equal to "Better than Terence Tao at maths") isn't.
Note that the 6th question is generally the hardest (“final boss”) and many top performers couldn’t solve it.
I don’t know what Lean is or how to view the AI’s proofs, but an AI system that could explain such a question on par with the YouTuber above would be fantastic!
"... We'll be bringing all the goodness of AlphaProof and AlphaGeometry 2 to our mainstream #Gemini models very soon. Watch this space!" -- Demis Hassabis, CEO of Google DeepMind. https://x.com/demishassabis/status/1816499055880437909
Three days is interesting... Not technically silver medal performance I guess, but let's be real I'd be okay waiting a month for the cure to cancer.
As opposed to the 99.9999% of the human population who can't do it even if their lives depend on it?
Then they can train a network to evaluate intermediate positions (score network) and one to suggest things to try next (policy network).
So they had three days to keep training the model, on synthetic variations of each IMO problem.
Also, there's no reason to affirm that an eventual cure for cancer requires fundamentally new methods. Maybe the current methods are sufficient, it's just that nobody has been "smart" enough to put the pieces together. (disclaimer: not an expert at all)
But even if they aren't as fast as humans yet this is very valuable. Both as a stepping stone, and because at a certain scale compute is much easier to scale than skilled mathematicians.
So either (1) "within minutes" was underselling the abilities of the system, or (2) what they actually meant was that the geometry problem was solved in 19 seconds, one of the others "within minutes" (I'd guess #1 which is definitely easier than the other two they solved), and the others in unspecified times of which the longer was ~3 days.
I'd guess it's the first of those.
(Euclidean geometry has been a kinda-solved domain for some time; it's not super-surprising that they were able to solve that problem quickly.)
As for the long solve times, I would guess they're related to this fascinating remark:
> The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.
Also, for what it's worth, I'm pretty sure that I wouldn't have been able to solve it myself in three days, even if I had access to all of GCP, Azure and AWS (except if I could mine crypto to then pay actual IMO-level mathematicians to solve it for me).
How many cycles of guess-and-check did it take over the course of three days to get the right answer?
If the IMO contestants were allowed to use theorem provers and were given 3 days (even factoring in sleep) would AlphaProof still have gotten silver?
> let's be real I'd be okay waiting a month for the cure to cancer.
I don't think these results suggest that we're on the brink of knowledge coming at a substantially faster rate than before. Humans have been using theorem provers to advance our understanding for decades. Now an LLM has been wired up to one too, but it still took 8x as long to solve the problems as our best humans did without any computer assistance.
First of all, this is not a sport and the point is not to compare AI to humans. The point is to compare AI to IMO-difficulty problems.
Secondly, this is not some hacky trick where brute force and some theorem-prover magic are massaged to solve a select few problems and then you'll never hear about it again. They are building a general pipeline which turns informal natural language mathematics (of which we have ungodly amounts available) into formalized mathematics, and in addition trains a model to prove such kinds of mathematics. This can also work for theory building. This can become a real mathematical assistant that can help a mathematician test an argument, play with variations of a definition, try 100 combinations of some estimates, apply a classic but lengthy technique, etc.
> it still took 8x as long to solve the problems as our best humans did without any computer assistance.
Give it a year and that ratio will be reversed. At least. But also it matters less how long it takes if doubling the number of things reasoning at a best-human level is pronounced "ctrl-c, ctrl-v".
Which means these problems are trivial to solve if you have a computer - you can simply check all possibilities. And is precisely the reason why calculators aren't allowed.
But exhaustive searches are not feasible by hand in the time span the problems are supposed to be solved in - roughly 90 minutes per problem. You are not supposed to use brute force, but to recognize a key insight which simplifies the problem. And I believe even if you did do an exhaustive search, simply giving the answer would not be enough for full points. You would have to give adequate justification.
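For a concrete sense of the asymmetry, here is a made-up toy problem (not an actual IMO problem): a computer disposes of it by exhaustive checking, while a contestant would be expected to find and justify the modular pattern behind it.

```python
# Made-up toy problem: determine all n in 1..100 such that
# n^2 + n + 1 is divisible by 7. Brute force settles it instantly,
# but gives no insight into *why* (n ≡ 2 or 4 mod 7 is what works).
answer = [n for n in range(1, 101) if (n * n + n + 1) % 7 == 0]
print(answer[:6])  # [2, 4, 9, 11, 16, 18]
```

The justification step (showing no other residues work) is exactly the part the brute force never produces.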
Since I think you need an account to read threads now, here's a transcript:
Google DeepMind have produced a program that in a certain sense has achieved a silver-medal performance at this year's International Mathematical Olympiad.
It did this by solving four of the six problems completely, which got it 28 points out of a possible total of 42. I'm not quite sure, but I think that put it ahead of all but around 60 competitors.
However, that statement needs a bit of qualifying.
The main qualification is that the program needed a lot longer than the human competitors -- for some of the problems over 60 hours -- and of course much faster processing speed than the poor old human brain.
If the human competitors had been allowed that sort of time per problem they would undoubtedly have scored higher.
Nevertheless, (i) this is well beyond what automatic theorem provers could do before, and (ii) these times are likely to come down as efficiency gains are made.
Another qualification is that the problems were manually translated into the proof assistant Lean, and only then did the program get to work. But the essential mathematics was done by the program: just the autoformalization part was done by humans.
As with AlphaGo, the program learnt to do what it did by teaching itself. But for that it needed a big collection of problems to work on. They achieved that in an interesting way: they took a huge database of IMO-type problems and got a large language model to formalize them.
However, LLMs are not able to autoformalize reliably, so they got them to autoformalize each problem many times. Some of the formalizations were correct, but even the incorrect ones were useful as training data, as often they were easier problems.
It's not clear what the implications of this are for mathematical research. Since the method used was very general, there would seem to be no obvious obstacle to adapting it to other mathematical domains, apart perhaps from insufficient data.
So we might be close to having a program that would enable mathematicians to get answers to a wide range of questions, provided those questions weren't too difficult -- the kind of thing one can do in a couple of hours.
That would be massively useful as a research tool, even if it wasn't itself capable of solving open problems.
Are we close to the point where mathematicians are redundant? It's hard to say. I would guess that we're still a breakthrough or two short of that.
It will be interesting to see how the time the program takes scales as the difficulty of the problems it solves increases. If it scales with a similar ratio to that of a human mathematician, then we might have to get worried.
But if the function human time taken --> computer time taken grows a lot faster than linearly, then more AI work will be needed.
The fact that the program takes as long as it does suggests that it hasn't "solved mathematics".
However, what it does is way beyond what a pure brute-force search would be capable of, so there is clearly something interesting going on when it operates. We'll all have to watch this space.
Or if AlphaProof used more compute they could have slashed that time to 1/10 or less. It's arbitrary as long as we don't define what is the compute the AI should be entitled to use here.
I wonder if this new model could be integrated with an LLM somehow? I get the feeling that combining those two powers would result in a fairly capable programmer.
Also, perhaps an LLM could do the translation step that is currently manual?
While there are certainly some huge jumps in compute power, theory of data transformation and availability of data to transform, it would surprise me if computers in a 100 years do not still rely on a combination of well-defined and well-understood algorithms and AI-inspired tools that do the same thing but on a much bigger scale.
If not for any other reason, then because there are so many things where you can easily produce a great, always correct result simply by doing very precise, obvious and simple computation.
We've had computers and digital devices for a long while now, yet we still rely heavily on mechanical contraptions. Sure, we improve them with computers (e.g. brushless motors), but I don't think anyone today is surprised that these same devices (hair dryers, lawn mowers, internal combustion engines...) were designed before computers.
Also it's making me think that in 5-10 years almost all tasks involving computer scientists or mathematicians will be done by AI. Perhaps people going into trades had a point.
Is it clear whether the algorithm is actually learning from why previously attempted solutions failed to prove out, or is it statistically generating potential answers similar to an LLM and then trying to apply reasoning to prove out the potential solution?
If the data is synthetic and covers a limited class of problems I would imagine what it's doing mostly reduces to some basic search pattern heuristics which would be of more value to understand than just being told it can solve a few problems in three days.
Wonder what "great promise" entails. Because it's hard to imagine Gemini and other transformer-based models solving these problems with reasonable accuracy, as there is no elimination of hallucination. At least in the generally available products.
They explicitly stated that to achieve the current results, they had to manually translate the problem statements into formal mathematical statements:
> First, the problems were manually translated into formal mathematical language for our systems to understand.
How I understand what they're saying is that they used Gemini to translate the problem statement into formal mathematical language and let DeepMath do its magic after that initial step.
The article claims they have another model that can work without formal languages, and that it looks very promising. But they don't mention how well that model performed. Would that model also perform at silver medal level?
Also note that if the problems are provided in a formal language, you can always find the solution in a finite amount of time (provided a solution exists). You can brute-force over all possible proofs until you find one that proves the statement. This may take a very long time, but it will find the solution eventually. You would then solve all the problems and win the IMO at gold medal level. AlphaProof seems to do something similar, but makes smarter decisions about which possible solutions to try and which ones to skip. What would be the reason they don't achieve gold?
I know speed is just a matter of engineering, but looks like we still have a ways to go. Hold the gong...
(It's not my main point, but it's always worth remembering - even aside from any AI context - that many top mathematicians can't do IMO-type problems, and many top IMO medalists turn out to be unable to solve actual problems in research mathematics. IMO problems are generally regarded as somewhat niche.)
Combinatorics problems are usually simple enough that anyone can understand and try tackling it though, and the solutions in IMO are usually designed to be elegant. I don't think I've ever seen a bad combo problem before.
I wonder if some class of problems will emerge that human competitors are able to solve but are particularly tricky for machines. And which characteristics these problems will have (e.g. they'll require some sort of intuition or visualization that is not easily formalized).
Given how much of a dent LLM's are already making on beginner competitions (AtCoder recently banned using them on ABC rounds [1]), I can't help but think that soon these competitions will be very different.
I feel the reason combinatorics is harder for AI models is the same reason LLMs are not great at reasoning about anything out of distribution. LLMs are good pattern recognizers, and fascinating at this point. But simple tasks like counting intersections in Venn diagrams require more strategy and less pattern recognition. Pure NN-based models seem unlikely to be enough to solve these problems. AI agents and RL are promising.
I don't know anything about Lean, but I am curious whether proofs of combinatorial problems can be represented as well as number theory or algebra. If combinatorial solutions are always closer to natural language, the failure of LLMs is expected, or at least we can assume it might take more time to improve. I am assuming here that solutions to combinatorial IMO problems are more natural-language oriented and rely more on common sense and informal logic compared to geometry or number theory problems.
Anyone know any details?
>because of limitations in reasoning skills and training data
One would assume that mathematical literature and training data would be abundant. Is there a simple example that could help appreciate the Gemini bridge layer mentioned in the blog which produces the input for RL in Lean?
The link you posted has problems with a difficulty between the first and second rounds, which are much easier.
I took a quick look at the recent lists of problems in the first and second rounds. I expect this new AI to get a solid 50/50 points on this test.
I wonder because on one hand they seem very impressive and groundbreaking, on the other it’s hard to imagine why more than a handful of researchers would use them
If you could automatically prove that your concurrency protocol is safe, or that your C program has no memory management mistakes, or that your algorithm always produces the same results as a simpler, more obviously correct but less optimized algorithm, I think that would be a huge benefit for many programmers.
Here, from what I understand, it's instead a theorem prover + LLM backing it. General proofs have a much larger search space than the 2d geometry problems you see on IMO; many former competitors disparage geometry for that reason.
> AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. … Formal languages offer the critical advantage that proofs involving mathematical reasoning can be formally verified for correctness.
Edit: To defend my response, the model definitely knows when it hasn't yet found a correct response, but this is categorically different from knowing that it does not know (and of course monkeys and typewriters etc., can always find a proof eventually if one exists).
For example, suppose a computer is asked to prove the sum of two even numbers is an even number. It could pull up its list of “things it knows about even numbers”, namely that an even number modulo 2 is 0. Assuming the first number is “a” and the second is “b”, then it knows a=2x and b=2y for some x and y. It then knows via the distributive property that the sum is 2(x+y), which satisfies the definition of an even number.
What am I missing that makes this problem so much harder than applying a finite and known set of axioms and manipulations?
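That particular example is indeed mechanical, and it goes through in Lean with Mathlib almost verbatim (Mathlib defines `Even a` as `∃ r, a = r + r` rather than `a = 2x`, but the argument is the same):

```lean
import Mathlib

-- Sum of two even integers is even, following the "a = x + x,
-- b = y + y, so a + b = (x + y) + (x + y)" argument directly.
example (a b : ℤ) (ha : Even a) (hb : Even b) : Even (a + b) := by
  obtain ⟨x, hx⟩ := ha
  obtain ⟨y, hy⟩ := hb
  exact ⟨x + y, by rw [hx, hy]; ring⟩
```

The difficulty is that IMO-level proofs need long chains of such steps plus a non-obvious key idea, so the space of possible step sequences explodes.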
This makes it natural to address these problems using similar techniques, which is what this research team did. The "magic" in their solution is the use of neural nets to make good guesses about which branches of these massive search trees to explore, and about how good any particular branch is even before they reach the end of the branch. These tricks let them (massively) reduce the effective branching factor and depth of the search trees required to produce solutions to math problems or win board games.
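A minimal sketch of that guidance idea, with hand-written heuristics standing in for the learned policy/value networks (learning those networks is, of course, the whole point of the real system):

```python
import heapq

def guided_search(start, goal, expand, value, max_steps=10_000):
    """Best-first search: expand(state) proposes next states (the
    'policy' role); value(state) scores how promising a state looks
    (the 'value' role). Higher value = explored sooner."""
    frontier = [(-value(start), 0, start)]
    seen = {start}
    tie = 1  # tie-breaker so the heap never compares states directly
    for _ in range(max_steps):
        if not frontier:
            return None
        _, _, state = heapq.heappop(frontier)
        if state == goal:
            return state
        for nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-value(nxt), tie, nxt))
                tie += 1
    return None

# Toy domain: reach 42 from 1 with moves +1 and *2; the heuristic
# prefers states numerically close to the goal.
found = guided_search(1, 42,
                      expand=lambda s: [s + 1, s * 2],
                      value=lambda s: -abs(42 - s))
print(found)
```

AlphaZero-style systems replace those lambdas with deep networks trained by self-play and add MCTS-style visit statistics on top, but the structural role of the two guesses is the same.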
The thing is that most math "problems" are not solved, not because they're hard, but because they're not interesting enough to even be discovered by humans.
That is more than half the work of solving them. Headline should read "AI solves the simple part of each IMO problem at silver medal level"
Real, exact quotes from the top comments at 1 PM EST.
"I want to believe that the computer found it, but I can't find anything to confirm."
"Curing cancer will require new ideas"
"Maybe they used 10% of all of GCP [Google compute]"
So it failed at the first step (comprehension) and hence I think we can request a better effort next time.
(probably confused by version numbers)
IMO contestants aren't allowed to bring in paper tables, much less a whole theorem prover. They're given two 4.5 hour sessions (9 hours total) to solve all the problems with nothing but pencils, rulers, and compasses [0].
This model, meanwhile, was wired up to a theorem prover and took three solid days to solve the problems. The article is extremely light on details, but I'm assuming that most of that time was guess-and-check: feed the theorem prover a possible answer, get feedback, adjust accordingly.
If the IMO contestants were given a theorem prover and three days (even counting breaks for sleeping and eating!), how would AlphaProof have ranked?
Don't get me wrong, this is a fun project and an exciting result, but their comparison to silver medalists at the IMO is just feeding into the excessive hype around AI, not accurately representing its current state relative to humanity.
[0] 5.1 and 5.4 in the regulations: https://www.imo-official.org/documents/RegulationsIMO.pdf
A theorem prover is probably more useful for the typical IMO problem than for the typical real research problem, but even so I'd guess that even with a reasonable amount of training most IMO contestants would not do much better for having access to a theorem prover.
Having three days would be a bigger deal, for sure. (But from "computers can't do this" to "computers can do this, but it takes days" is generally a much bigger step than from "computers can do this, but it takes days" to "computers can do this in seconds".)
In 2016, machines defeated a World Go Champion for the first time, using a clever form of "dumb search" that leverages compute, DNNs, reinforcement learning (RL), and self-play. Critics noted that while this fancy form of "dumb search" worked for Go, it might not necessarily be a general strategy applicable to other cognitive tasks.[a]
In 2024, machines solved insanely hard math problems at the Silver Medal level in an International Math Olympiad for the first time, using a clever form of "dumb search" that leverages compute, DNNs, RL, and a formal language. Perhaps "dumb search" over cleverly pruned spaces isn't as dumb as the critics would like it to be?
---
[a] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Today's announcement is also not about proving Lean theorems by "dumb search". The success is about search + neural networks.
You're attacking critics for criticizing the solution that has failed, while confusing it with the solution that works to this day.
And you keep shifting the goalpost on what's called "dumb" here.
It kinda had to be this way, I think. There's a something from nothing problem. Douglas Adams brilliantly starts at this point.
We don't understand something from nothing. We don't even have the language to describe it. Concepts like "complexity" are frustratingly resistant to formalization.
"There is no free will" has recently resurged as a philosophical position... like it did in response to Newton's mechanics.
Matter from void. Life from minerals. Consciousness from electrons. Free will from deterministic components. Smart reasoning & intelligent rationalisation from dumb search, computation, DNNs and such.
I don't think this debate was supposed to be ended by anything short of empirical demonstration.
Endnote: Deep Blue's victory over Garry was a bunch of preprogrammed bulls--t. Rematch!
Then, should we expect less with mathematics, where written language is the normal way knowledge is propagated, and where formal proofs are wanted? An important distinction here is the coupling of search (not an LLM for this one), a formal math language, and theorem proving. Math intelligence may not come merely from the written math corpus, but adding a formal language and theorem proving sounds pretty powerful.
All this still lacks self-directed goals. An intention. For now that's taken care of by the human asking questions.
> In the official competition, students submit answers in two sessions of 4.5 hours each. Our systems solved one problem within minutes and took up to three days to solve the others.
Why not compare with students who are given 3 days to submit an answer ? /s
Not open-sourcing anything.
This is a dead end on which no further research can be built.
It violates pretty much every principle of incremental improvement on which science is based. It's here just for hype, and the 300+ comments prove it.
[...]
"IMO medalists have also gone on to become notable computer scientists. The following IMO medalists have received a Nevanlinna Prize, a Knuth Prize, or a Gödel Prize; these awards recognise research in theoretical computer science."
https://en.wikipedia.org/wiki/List_of_International_Mathemat...
[1]: https://www.aeaweb.org/articles?id=10.1257/aeri.20190457
There are so many competitions that don't have any obvious practical significance. And people are still enjoying competitions where AI completely pwns humans.
Also, this is probably a good time to ask whether you won the Putnam... https://news.ycombinator.com/item?id=35079
Got to be kidding me. We are fucked
I don't really find this that impressive. With enough compute you could just do n-of-10,000 LLM generations to "brute force" a difficult problem and you'll get there eventually.
You realize this holds true for all of math, right? Outside of Gödel-incompleteness potholes, every proof/theorem is a permutation of ZFC. And you can fix the potholes by just filling them in with more Cs.
Sounds perfect for a GPT model, with lots of input training data (problem books and solutions).
The proofs of these problems aren't interesting. They were already known before the AI started work.
What's interesting is how the AI found the proof. The only answer we have is "slurped data into a neural network, matched patterns, and did some brute search".
What were the ideas it brainstormed? What were the dead-end paths? What were the "activations" where the problem seemed similar to a certain piece of input, which led to a guess of a step in the solution?
And how much CO2 was released into earth's atmosphere?