- The model is fine-tuned from Qwen-2.5 Instruct, which already includes millions of specially filtered math examples in both pretraining and supervised fine-tuning.
- To generate the perfect 817 math examples for LIMO, they used state-of-the-art models like R1 to filter down from an initial pool of 10 million math problems. In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data. It's not very clear to me whether this is more or less impressive than getting the same result by simply fine-tuning on the initial 10-million pool, but I suppose that would make for a worse headline.
To your question about fine-tuning on the initial 10-million pool: intuitively, it would take a tremendous amount of fine-tuning data to move the needle there - you really won't be able to move the gradients much with just 817 good examples out of 10 million, since the rest of that pool effectively enforces pretty rigid regularization.
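To make the scale concrete, here's a back-of-envelope comparison of optimizer steps under made-up but plausible settings (batch size 32, a few epochs over the curated set versus one epoch over the raw pool - all assumptions, just for illustration):

```python
import math

def optimizer_steps(num_examples, batch_size, epochs):
    """Number of gradient updates seen during fine-tuning."""
    return math.ceil(num_examples / batch_size) * epochs

# Hypothetical settings: batch size 32, three epochs over the curated set
# versus a single epoch over the raw pool.
curated = optimizer_steps(817, 32, 3)            # LIMO-style curated set
full_pool = optimizer_steps(10_000_000, 32, 1)   # the initial 10M pool

print(curated, full_pool)   # 78 vs 312500 gradient updates
```

Either way you slice it, 817 examples buy you on the order of a hundred updates; against the statistics of the full pool, that's a nudge, not a re-training.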
There is now increasing interest in showing that small data plus inference-time scaling provides significant yield. A couple of recent examples:
* TinyZero: https://github.com/Jiayi-Pan/TinyZero
* s1 Simple Test-Time Scaling: https://arxiv.org/abs/2501.19393
You wouldn’t criticize someone’s kombucha because they didn’t piece their SCOBY (symbiotic culture of bacteria and yeast) together microbe by microbe.
But that's not the criticism that I'm often seeing; it's more that there's an "unfair" amount of press coverage towards new models that rely, in the critics' views, more on distillation than on "true" innovation.
It's worth noting that there are many parties with significant motivation to build public sympathy that only "true" innovation should be valued, and it is only their highly-valued investments that can uniquely execute in that space. Cutting-edge models built in caves with a box of their scraps are counter to that narrative. It's worth considering https://paulgraham.com/submarine.html in this context, and understanding whether it is truly "everyone" that is critical in this way.
I've seen the textbook analogy used, but to me it's like a very knowledgeable person reading an advanced textbook to become an expert. Then they claim they're better than other equally knowledgeable people because they read that textbook, and that anyone can start from scratch using it.
So there's nothing wrong with making a more efficient model from an existing one, the issue is concluding you don't need all the data that made the existing one possible in the first place. While that may be true, this is not how you prove it.
The information from the selection criteria isn't available to the model, just the chosen samples.
I'm not knocking the work. They report large improvements using relatively little data. That's good. But let's be clear that this is further training of a good sized LLM that has read far, far more than any human that ever lived already.
Most of the math competitions people are working on are high school math competitions - these have problems from a relatively small set of mathematics, so that high school students can reasonably know the appropriate background.
The paper, and this comment, seem awfully reminiscent of creating a textbook: a curated, "maximally informative and distilled" set of cognitive examples to teach students who already have the foundations the next level of reasoning.
The last few years of LLM progress have shown we can predict human "reasoning" responses to inputs by modeling likely human responses as if they were LLM-generated. Put another way, most responses are not particularly reasoned, but chains of tokgen*.
Sit near someone who "talks to herself" while doing problems and it's even more evident.
---
* tokgen definition: Listen to conversations in a cafeteria. Many are something other than thoughtful: responses that follow the prompts with near-perfect predictability. To differentiate these responses from speech that comes after a pause and reflection, one can use the labels "thought" versus "token generation", or tokgen.
The 800+ training samples, each containing solutions with detailed reasoning steps, were primarily generated by DeepSeek R1 and other advanced models. The reasoning processes within these training solutions are crucial. It's possible that the advanced models have encoded their reasoning processes into the generated samples; given a sufficiently large model, fine-tuning can then restore the corresponding reasoning weights, in effect adding a delta from DeepSeek R1, among others.
Therefore, it's not surprising that, with relatively little fine-tuning data, Qwen 2.5 has achieved such significant improvements.
This is merely a conjecture. Further research is needed to analyze and visualize the changes in network weights before and after fine-tuning.
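As a toy sketch of what that analysis could look like, here's a per-"layer" comparison of weight vectors before and after fine-tuning, reporting the relative size of the delta. The weights below are random stand-ins; in practice you would load the two real checkpoints and iterate over matching parameter tensors:

```python
import math
import random

random.seed(0)

def l2(vec):
    """Plain L2 norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def delta_report(before, after):
    """Relative L2 norm of the fine-tuning delta, per layer."""
    report = {}
    for name in before:
        delta = [a - b for a, b in zip(after[name], before[name])]
        report[name] = l2(delta) / l2(before[name])
    return report

# Hypothetical 3-"layer" model; fine-tuning nudges every weight slightly.
before = {f"layer{i}": [random.gauss(0, 1) for _ in range(256)] for i in range(3)}
after = {k: [w + random.gauss(0, 0.01) for w in v] for k, v in before.items()}

for name, rel in delta_report(before, after).items():
    print(name, round(rel, 4))
```

If the conjecture holds, you'd expect the restored "reasoning delta" to show up as small but systematically structured changes rather than noise of uniform size.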
Sorry, but I don't get the point of your comment as a whole, and of this part in particular. Yes, most human day-to-day conversations are quite predictable, but some people are still capable of generating original thoughts from time to time. And still, how is it related to the comment you are replying to?
Sounds like any textbook. (and generally the process of knowledge compression over generations that made us who we are)
The context right now is that OpenAI, with first-mover advantage, cutting-edge hardware, and tens of billions of dollars of investment, is not getting benchmark performance better than Chinese-developed models trained with cut-down Nvidia GPUs and a lot less money.
Pattern identification and continuation can be applied to evaluate symbolic reasoning. You can see this in e.g. the semantics of a functional programming language if evaluation semantics are defined in terms of rewrite rules.
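As a toy illustration of evaluation by rewriting: expressions as data plus a handful of rewrite rules applied to a fixed point (the rule set below is made up, just to show the shape of the semantics):

```python
# Expressions are nested tuples like ("add", ("mul", 2, 3), 4).
# A small set of rewrite rules is applied innermost-first until nothing
# changes, which is exactly reduction to normal form.

def rewrite_once(expr):
    """Apply one rewrite pass; return expr unchanged if it is normal."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [rewrite_once(a) for a in args]
    expr = (op, *args)
    # Rewrite rules: constant folding and an algebraic identity.
    if op == "mul" and 0 in args:
        return 0                                    # x * 0 -> 0
    if op == "add" and all(isinstance(a, int) for a in args):
        return args[0] + args[1]
    if op == "mul" and all(isinstance(a, int) for a in args):
        return args[0] * args[1]
    return expr

def normalize(expr):
    """Rewrite to a fixed point (normal form)."""
    while True:
        nxt = rewrite_once(expr)
        if nxt == expr:
            return expr
        expr = nxt

print(normalize(("add", ("mul", 2, 3), ("mul", ("add", 1, 1), 0))))  # 6
```

The point is that "pattern identification and continuation" is all this machine does: match a rule, emit the rewritten form, repeat.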
If you have a model which can convert a problem into language that's precise enough to start pattern matching to LLM-encoded generative programs that evaluate logical implications, you can get into a very interesting space. Autoregressive prediction can turn into symbolic progressive evaluation and calculation. The background LLM is still guiding choice of evaluation and goal seeking.
Reinforcing these evaluation rules seems like it should be doable without enormous corpora, as long as the base model already has enough meat on it to cleanly attach to the more precise language.
Theory aside, I would think a good application-side method is to use this general reasoning process to structure a final expression and then pass that through a traditional evaluator. Then the reasoning and training thereof need only go as far as symbol manipulation. This is something like Wolfram Alpha, if its NLP handed off to the evaluator much later in the process.
I do recall someone handcrafting the weights for a transformer and getting some sort of useful algorithm or computation going, so there's that.
Or even better, a simple programmable calculator and/or symbolic calculator.
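For instance, a minimal safe arithmetic evaluator the model could hand its final expression string to - a hedged sketch, not any particular product's API, and a real system would support many more node types:

```python
import ast
import operator

# Whitelisted operations; anything else is rejected.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def evaluate(expr_text):
    """Safely evaluate an arithmetic expression string produced by the model."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp):
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported node: {type(node).__name__}")
    return walk(ast.parse(expr_text, mode="eval"))

print(evaluate("(3 + 4) * 12 / 2"))   # 42.0
```

The model's job ends at producing the expression; the arithmetic itself never touches the sampler.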
1- LLMs can never generalize theorem proving
2- this paper: "This suggests that contemporary LLMs may already possess rich mathematical knowledge in their parameter space, transforming the challenge from knowledge acquisition to knowledge elicitation"
Not sure what is what anymore!
There's simply no way an LLM can even train on all of that, because each bit of true expert knowledge is necessarily, comically underrepresented in any possible training set.
Another way to put this: most students who study the lecture notes for their high-school math already have it within them to get a gold at the olympiad (the math itself is no more advanced than their high-school material), but actually getting a high-school kid to a gold at the olympiad is hard. It might be something similar to P vs NP.
For skeptics in particular: you will be able to use a top-tier LLM and see for yourself - does it do the thing someone is claiming it doesn't do? It often will. If you look at recently submitted papers by skeptics, you will see them making claims about state-of-the-art LLMs but then testing only versions from over a year ago (this has happened recently!^)
The way for you to be sure what is what is to just use the thing for yourself and decide what is true.
Perhaps in a domain like math a smallish number of math-specific reasoning steps will go a long way, but math itself also has many "sub-domains" (algebra, geometry, calculus, topology, etc.), and AFAIK the techniques of one branch are only going to be useful in another to the extent that you can map the problem from one domain to the other.
As an experiment, I hand built a VQA dataset of ~600 examples, which is a vanishingly small number compared to even rudimentary VQA datasets (which tend to be about 10k examples or more). However, I ensured that the dataset was broad and highly varied, and that the queries aggressively exercised both visual and textual understanding.
With only 600 training examples, I finetuned the base JoyCaption model in a handful of minutes, and to my surprise, not only did it gain VQA abilities, it's able to generalize quite far outside of its training set - even to concepts not in the original 800k caption data.
My hypothesis is that if the training data is varied enough, it forces the model to generalize. It isn't given enough examples of any given type of task to learn specialized circuitry for them, so its only option is to learn a broadly generalized set of circuitry. The data keeps it on its toes, so to speak.
Of course, this leans heavily on Llama's existing instruction (text-based) tuning, so it's starting off on good footing there. The surprising bit is being able to generalize so well to a new domain (vision) with so little data.
One caveat is that this model is highly unstable, and the accuracy of its responses is much worse than the accuracy of the base model. It's able to handle all of the tasks I've tested on it, but often requires a few retries to get it right.
Building these datasets is also tedious and intensive. I've yet to successfully train existing AIs to generate useful user queries/instructions/questions, either through prompting or finetuning. So it has to all be done by hand. And every answer was either written by me, or generated by an existing VLM and then edited by me to ensure perfect accuracy and adherence to the request. Since the queries are complex and challenging, this makes the work of writing those answers similarly challenging and time consuming.
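For anyone curious, the bookkeeping looks roughly like this - the JSONL schema, task taxonomy, and threshold below are simplified stand-ins for illustration, not my actual dataset:

```python
import json
from collections import Counter

# Every example is tagged with a task type so I can check that no single
# task dominates the dataset (the "keep it on its toes" property).
examples = [
    {"image": "img_001.jpg", "task": "counting", "query": "How many cups are on the table?", "answer": "Three."},
    {"image": "img_002.jpg", "task": "ocr", "query": "What does the sign say?", "answer": "It reads 'OPEN'."},
    {"image": "img_003.jpg", "task": "spatial", "query": "What is to the left of the lamp?", "answer": "A stack of books."},
    # ... ~600 hand-written examples in practice
]

def diversity_check(rows, max_share=0.25):
    """Return task types whose share of the dataset exceeds max_share."""
    counts = Counter(r["task"] for r in rows)
    total = len(rows)
    return {t: c / total for t, c in counts.items() if c / total > max_share}

with open("vqa_train.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")
```

The hard part isn't the format, obviously - it's hand-writing and hand-verifying the 600 answers behind it.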
As an aside: this training also seems to have broken Llama's alignment. I've had it be remarkably sassy in its responses, and it's much better at simulating more normal human responses.
Another trend I've noticed is that there are already 3 papers reporting similar findings using Qwen-2.5-Instruct. Did they find something interesting about LLMs in general, or something unique to Qwen-2.5-Instruct? I guess we need more experimental results to draw conclusions.
I believe all this shows that the pre-training stage already creates the representations needed for CoT reasoning, so they are very easy to uncover - either with R1-Zero-style pure RL, or with few-shot SFT.
To see a World in a Grain of Sand
And a Heaven in a Wild Flower,
Hold Infinity in the palm of your hand
And Eternity in an hour.

Come in under the shadow of this impure rock
And I will show you something different from either
Your shadow at morning striding behind you
Or your shadow at evening rising to meet you;
I will show you wisdom in a handful of sand.

Also, well - there's the technicality of "you don't 'win' a conversation like you can 'win' at Go", so how would you know to reward the model as you're training it?
https://i.imgur.com/CBmMSqO.png, perhaps
> With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing.
A lot of these we can probably solve, but as others have pointed out, we want a model that humans can converse with, not an AI made for the purpose of other AIs.
That said, it seems like a promising area of research:
> DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community.
AlphaGo came before AlphaGo Zero; it was trained on human games, then improved further via self-play. The later AlphaGo Zero proved that pre-training on human games was not necessary, and the model could learn from scratch (i.e. from zero) just via self-play.
For DeepSeek-R1, or any reasoning model, training data is necessary, but hard to come by. One of the main contributions of the DeepSeek-R1 paper was describing their "bootstrapping" (my term) process whereby they started with a non-reasoning model, DeepSeek-V3, and used a three step process to generate more and more reasoning data from that (+ a few other sources) until they had enough to train DeepSeek-R1, which they then further improved with RL.
DeepSeek-R1-Zero isn't a self-play version of DeepSeek-R1 - it was just the result of the first (0th) step of this bootstrapping process, whereby they used RL to fine-tune DeepSeek-V3 into the (somewhat idiot-savant, one-trick-pony) R1-Zero model that was then capable of generating training data for the next bootstrapping step.
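The loop can be caricatured like this - everything here is a stand-in (a dict for the problem set, a scalar for "skill", a trivial equality check for the verifier), just to show the shape of generate-filter-retrain:

```python
import random

random.seed(1)

PROBLEMS = {f"p{i}": i * i for i in range(20)}   # problem -> checkable answer

def propose(problem, skill):
    """Stand-in for sampling a reasoning trace; higher skill, more often correct."""
    truth = PROBLEMS[problem]
    return truth if random.random() < skill else truth + 1

def bootstrap(rounds=3, skill=0.3):
    corpus = []
    for _ in range(rounds):
        attempts = [(p, propose(p, skill)) for p in PROBLEMS]
        kept = [(p, a) for p, a in attempts if a == PROBLEMS[p]]  # keep only verified traces
        corpus.extend(kept)
        # "Fine-tune" on the kept traces: the model gets a bit better,
        # so the next round yields more usable data.
        skill = min(1.0, skill + 0.1 * len(kept) / len(PROBLEMS))
    return corpus, skill

corpus, final_skill = bootstrap()
print(len(corpus), round(final_skill, 2))
```

The key property is that the verifier, not the model, decides what enters the corpus - which is why this works so much better in math and code than in open-ended conversation.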
Thankfully for mathematics and code this seems plausible due to automated theorem proving.
I mean, you wouldn't use this brand of AI to plot your path to Mars. Well, you could, BUT you'll also want to validate the path or risk dying.
But this AI is good enough for Elon and his ilk. Because Elon's not gonna get into the capsule, you are.
Because you are not the master of this AI, you are the validator.
the word reasoning has been subverted by those pushing these llms, and we all have bought-in. quite a magic trick this illusionist has pulled on us.
There's gonna come a time when Elon's gonna tell the AI to tell us to push all the buttons, just to see if we'll do it. And I'm pretty sure we will.