A surfboard is also an amazing tool, but there's more to operating one than telling it which way to go.
Many people want self-driving cars so they can drink in the back seat watching movies. They'll find their jobs replaced by AI, with a poor quality of life because we're a selfish species. In contrast Niki Lauda trusted fellow Formula 1 race car driver James Hunt to race centimeters apart. Some people want AI to help them drive that well. They'll have great jobs as AI evolves.
Gary Kasparov pioneered "freestyle" chess tournaments after his defeat by Big Blue, where the best human players were paired with computers, coining the "centaur" model of human-machine cooperation. This is frequently cited in the finance literature, where it is recognized that AI-guided human judgement can out-perform either humans or machines.
Any math professor knows how to help graduate students confidently complete a PhD thesis, or how to humiliate students in an oral exam. It’s a choice. To accomplish more work than one can complete alone, choose the former. This is the arc of human evolution: we develop tools to enhance our abilities. We meld with an abacus or a slide rule, and it makes us smarter. We learn to anticipate computations, like we’re playing a musical instrument in our heads. Or we pull out a calculator that makes us dumber. The role we see for our tools matters.
Programmers who actually write better code using AI know this. These HN threads are filled with despair over the poor quality of vibe coding. At the same time, Anthropic is successfully coding Claude using Claude.
You can surf the wave, but sooner or later, the wave will come crashing down.
I really don't see why that would necessarily be true. Any task that can be done by a human with a keyboard and a telephone is at risk of being done by an AI - and that includes the task of "translation and verification".
Chess is relatively simple in comparison, as complex as it is.
Nobody ever wins anymore in the ICCF championships (which I believe is the most prestigious centaur chess venue, but am not sure).
This is not an exaggeration. See my comment from several months ago: https://news.ycombinator.com/item?id=45768948
As far as I can tell based on scanning forums, to the extent humans contribute anything to the centaur setup, it is entirely in hardware provisioning and allocating enough server time before matches for chess engines to do precomputation, rather than anything actually chess related, but I am unsure on this point.
I have heard anecdotally from non-serious players (and therefore I cannot be certain that this reflects sentiment at the highest levels although the ICCF results seem to back this up) that the only ways to lose in centaur chess at this point is to deviate from what the computer tells you to do, either intentionally or unintentionally by accidentally submitting the wrong move, or simply by being at a compute disadvantage.
I've got several previous comments on this because this is a topic that interests me a lot, but the two most topical here are the previous one and https://news.ycombinator.com/item?id=33022581.
Magnus himself in 2015 said we’ve known for a long time that engines are much stronger than humans so the engine is not an opponent.
https://stockfishchess.org/blog/2026/stockfish-18/
https://www.dw.com/en/world-chess-champion-magnus-carlsen-th...
There is definitely a gap in academic tooling, where an "association engine" would be very useful for a variety of fields (and for encouraging cross-pollination of ideas between fields), but I don't think LLMs are anywhere near the frontier of what can be accomplished with a given amount of computing power. I would expect simpler algorithms operating over more explicit ontologies to be much more useful. (The main issue is that people haven't made those yet, whereas people have made LLMs.) That said, there's still a lot of credit due to the unreasonable effectiveness of literature searches: it only usually takes me 10 minutes a day for a couple of days to find the appropriate jargon, at which point I gain access to more papers than I know what to do with. LLM sessions that substitute for literature review tend to take more than 20 minutes: the main advantage is that people actually engage with (addictive, gambling-like) LLMs in a way that they don't with (boring, database-like) literature searches.
I think developing the habit of "I'm at a loose end, so I'll idly type queries into my literature search engine" would produce much better outcomes than developing the habit of "I'm at a loose end, so I'll idly type queries into ChatGPT", and that's despite the state-of-the-art of literature search engines being extremely naïve, compared to what we can accomplish with modern technology.
I also agree that neural net LLMs are not the inevitable way to implement AI. I'm most intrigued by the theoretical underpinnings of mathematical proof assistants such as Lean 4. Computer scientists understand the word problem for strings as undecidable. The word problem for typed trees with an intrinsic notion of induction is harder, but constructing proofs is finding paths in this tree space. Just as mechanical computers failed in base ten while at the same time Boole had already developed base two logic, I see these efforts merging. Neural nets struggle to simulate recursion; for proof assistants recursion is baked in. Stare at these tree paths and one sees thought at the atomic level, begging to be incorporated into AI. For now the river runs the other way, using AI to find proofs. That river will reverse flow.
While CoIC has recursion "baked in", HOL does not. It turns out that we can treat structural recursion as a derived property, even over coinductively-defined types. We don't even need a notion of ordinals for this! (See https://www.tcs.ifi.lmu.de/staff/jasmin-blanchette/card.pdf and https://matryoshka-project.github.io/pubs/bindings.pdf.)
Lean 2 used HoTT, which was theoretically interesting, but not enough was known about HoTT at the time (in particular, whether it was a constructive logic – I think we have all the pieces for an explicit construction via cubical type theory now, but I don't know that anyone's put the pieces together), so that direction has been mostly abandoned. I think there's useful work to be done in that direction, but with the current state of HoTT pedagogy, I doubt I'd ever be able to keep on top of it enough to contribute; and with Lean 4 taking so much of the funding, I don't think we'll see much work in this direction until HoTT is easier to learn.
I still think you're overgeneralising. What actual thing does your poetic tree / thought / river analogy correspond to?
"Throw a massive neural network at it" is an extremely inefficient way to get results, and doesn't generalise well – for instance, there's no easy way to get online learning for a transformer model, whereas that capability just falls out of most search engine database systems. (The underlying relational database engines had a lot of work put in to make online CRUD work reliably, but that work has been done now, and we can all build on top of it without a second thought.)
Instead, here you get questions that extremely famous mathematicians (Hairer, Spielman) are telling you (a) are solvable in <5 pages (b) do not have known solutions in the literature. This means that solutions from AI to these problems would perhaps give a clearer signal on what AI is doing, when it works on research math.
Claude is one of the buggiest pieces of shit I have ever used. They had to BUY the creators of bun to fix the damn thing. It is not a good example of your thesis.
Typing out solutions to problems was only part of the job description because there was no other way to code. Now we have a far better way.
Is that why everyone keeps complaining about the quality getting worse?
Can you share more about your architecture & process? Also a researcher involved in math research (though not strictly speaking a mathematician, but I digress). I've often thought about using AI on my notes, but they are messy and even then I can't quite figure out what to ask: prioritization, connecting ideas, lit search, etc.
I'd love to hear what you do.
> the answers are known to the authors of the questions but will remain encrypted for a short time.
Ok. But humans may be able to solve the problems too. What prevents Anthropic or OpenAI from hiring mathematicians, have them write the proof and pass it off as LLM written? I'm not saying that's what they'll do. But shouldn't the paper say something about how they're going to validate that this doesn't happen?
Honest question here. Not trying to start a flame here. Honestly confused how this is going to test what it wants to test. Or maybe I'm just plain confused. Someone help me understand this?
Even if it's not Anthropic or OpenAI paying for the solutions, maybe it'll be someone solving them "for fun" because the paper got popular and posting them online.
It's a futile exercise.
Yep. "possible but unlikely" was my take too. As another person commented, this isn't really a benchmark, and as long as that's clear, it seems fair. My only fear is that some submissions may be AI-assisted rather than fully AI-generated, with crucial insights coming from experienced mathematicians. That's still a real achievement even if it's human + AI collaboration. But I fear that the nuance would be lost on news media and they'll publish news about the dawn of fully autonomous math reasoning.
It seems likely that PhD students in the subfields of the authors are capable of solving these problems. What makes them interesting is that they seem to require fairly high research level context to really make progress.
It’s a test of whether the LLMs can really synthesize results from knowledge that require a human several years of postgraduate preparation in a specific research area.
If the paper would not have had the AI spin, would those 10 questions still have been interesting?
It seems to me that we have here a paper that is solely interesting because of the AI spin -- while at the same time this AI spin is really poorly executed from the point of AI research, where this should be a blog post at most, not an arXiv preprint.
The fact that you find it odd that this landed on arXiv is maybe a cultural thing… mathematicians kinda reflexively throw work up there that they think should be taken seriously. I doubt that they intend to publish it in a peer reviewed journal.
Not because of the problems, and not because this is new methodology.
And once the labs report back, what do we know that we didn't know before? We already know, as humans, the answer to the problems, so that is not it. We already know that LLMs can solve some hard problems, and fail in easy problems, so that is not it either.
So what do we really learn?
aaaaaaand no one cared enough to check
so i think the question is, are those interesting by themselves, or, are they just non interesting problems no one will ever care about except it would be indicative llms are good for solving complex novel problems that do not exists in their training set?
This is what is special about them:
> a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now;
I.e. these are problems of some practical interest, not just performative/competitive maths.
And this is what is know about the solutions:
> the answers are known to the authors of the questions but will remain encrypted for a short time.
I.e. a solution is known, but is guaranteed to not be in the training set for any AI.
Not a mathematician and obviously you guys understand this better than I do. One thing I can't understand is how they're going to judge if a solution was AI written or human written. I mean, a human could also potentially solve the problem and pass it off as AI? You might say why would a human want to do that? Normal mathematicians might not want to do that. But mathematicians hired by Anthropic or OpenAI might want to do that to pass it off as AI achievements?
FrontierMath did this a year ago. Where is the novelty here?
> a solution is known, but is guaranteed to not be in the training set for any AI.
Wrong, as the questions were poses to commercial AI models and they can solve them.
This paper violates basic benchmarking principles.
Why does this matter? As far as I can tell, because the solution is not known this only affects the time constant (i.e. the problems were known for longer than a week). It doesn't seem that I should care about that.
[1] https://github.com/MizarProject/system [2] https://github.com/digama0/mizar-rs [3] https://arxiv.org/pdf/2304.08391v2 [4] https://link.springer.com/article/10.1007/s10817-018-9479-z
> Mizar is indeed "scarcely advertised", but all the information is publicly available, who wants to know knows.
Yes, there indeed seems to be a good bit of information available, especially about the library and its articles. But some parts seem to be scattered about, unless you already know where to look, or know someone who knows. Perhaps it's a matter of taste.
(For comparison, I've recently been dabbling a fair bit with Metamath: it's not really advertised outside of its small circle these days, but the website does a good job at introducing the system, while also offering a complete reference in the form of the Metamath book. From there, the primary challenges to a new user are the fiddly tooling, the cryptic labeling scheme, and the puzzling DV conditions.)
Not going to spend too many more tokens on this.
The field of AI4Math has so many benchmarks that are well executed -- based of the related work section it seems the authors are bit familiar with AI4Math at all.
My belief is that this paper is even being discussed solely because a Fields Medalist, Martin Hairer, is on it.
The authors themselves literally state: "Unlike other proposed math research benchmarks (see Section 3), our question list should not be considered a benchmark in its current form"
Sounds to me to be a benchmark in all but a name. And they failed pretty terribly at achieving what they set out to do.
Why the angst ? If the ai can autonomously solve these problems, isnt that a huge step forward for the field.
Is this known?
But as a non-mathematician I'm not following any of it. How many people are there who are willing to check the generative results? And how much effort is it for a human to check these? How quickly can you even identify math-slop?
Here's the generated proof:
https://github.com/w-m/firstproof_problem_10/blob/2acd1cea85...
I asked Opus 4.6 to look at all the problems and guess which it might be able to solve. It was, coincidentally, most keen on problem 10.
I asked it to try. (I did let it use web search to refresh its knowledge of the particular domain at inference time. Pretty sure that's not unfair compared to how a human expert acts.)
It expressed confidence it had solved it OK after a few minutes thought.
The solution was way beyond my pay-grade.
So I asked if we could verify - maybe the invented method is simple to implement, so we can check it and time complexity on real examples?
It went off and did that.
""" Net assessment: I'd now raise Problem 10 confidence from 85% to 90%.
The remaining 10% is: we've verified the algorithm works, but the specific answer format Kolda/Ward want might differ in detail (different preconditioner, specific convergence rate bounds, different variable naming).
The mathematical substance is solid.
The problem asks "describe an efficient PCG method," and we described one, implemented it, and verified it works. """
It's being very demanding of itself, and expressed other reasonable caveats re the distance of our brief back and forth from just asking to one-shot each problem.
""" The 8 problems I declined would have produced nonsense. Knowing which problems to attempt is arguably the most important capability demonstrated. """
(It reckoned problem 6 was worth attempting too, we didn't try it.)
Full conversation with the reasoning then generated solution and verification code:
https://claude.ai/public/artifacts/c3401a11-b5a8-4dc6-a72a-9...
As it should. Good.
This is a totally independent test not conducted or collaborated by any of the AI companies or employees so that no bias is introduced at all[0].
[0] Unless the researchers are not disclosing if they have any ownership of shares in private AI companies.