But my real concern is with the results. The "13 parameters" figure looks like bait, because it comes from finetuning a model on a very simple math benchmark, grade-school math (GSM8K), which is already heavily saturated across models. Besides, it seems to happen only for the Qwen family of models... It looks like GSM8K was part of Qwen's training set, and this TinyLoRA finetuning just made the final adjustments to perfectly reflect that overtraining.
Can you elaborate a bit on what you mean with the gap?
I haven't done much research lately, but when I was working on it, I was having substantial success training an adapter of the form U_k @ P @ A, where U_k was the top k left singular vectors of the underlying weight, and then P and A were your typical LoRA projection matrices.
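A minimal sketch of that adapter shape, under my reading of the description (names and sizes are illustrative, not from any paper): the base weight is frozen, U_k is fixed from its SVD, and only the small P and A factors would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k, r = 64, 32, 8, 4   # illustrative sizes; k singular vectors kept, LoRA rank r

W = rng.standard_normal((d_out, d_in))      # frozen base weight
U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_k = U[:, :k]                              # top-k left singular vectors, fixed

P = rng.standard_normal((k, r)) * 0.01      # trainable LoRA-style factor
A = np.zeros((r, d_in))                     # zero-init so the adapter starts as a no-op

def forward(x):
    # base path plus a low-rank update confined to the top-k singular subspace
    return W @ x + U_k @ (P @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(forward(x), W @ x)       # zero-init adapter leaves outputs unchanged
```

Constraining the update to span(U_k) is what would keep it aligned with the base weight's existing directions, rather than adding arbitrary new ones.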
The 13 parameters are kind of misleading here; the real juice is going to be in the P_i fixed random matrices. My suspicion is that they are overfitting to the benchmark, but they almost certainly are observing a real gain in model capacity that is largely due to avoiding the intruder dimension problem.
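My reading of that construction, as a hedged sketch (not the paper's code): the only trained parameters are a handful of scalar coefficients c_i, and the weight update is their combination of fixed random matrices P_i.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n = 16, 16, 13                 # illustrative sizes; n matches the "13 parameters"
P = rng.standard_normal((n, d_out, d_in))   # fixed random matrices, never trained
c = np.zeros(n)                             # the handful of scalars actually trained

def delta_W(c):
    # sum_i c_i * P_i -- the whole update lives in the span of the fixed P_i
    return np.tensordot(c, P, axes=1)

assert delta_W(c).shape == (d_out, d_in)
```

If that reading is right, the expressivity budget is set by the random P_i basis, not by the 13 scalars, which is why the parameter count alone is misleading.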
>In particular, learning to generate longer outputs may be possible in few parameters
Reminded me of: https://arxiv.org/abs/2501.19393
>we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps
Maybe, indeed, the model simply learns to insert the EOS token (or similar) later, and the capability is already in the base model
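The quoted budget-forcing idea is easy to sketch with a stand-in token generator (hypothetical; not the s1 authors' implementation): suppress EOS until a minimum length is reached, substituting "Wait" instead.

```python
EOS, WAIT = "<eos>", "Wait"

def generate_with_budget(step_fn, min_tokens=8, max_tokens=32):
    """Decode with step_fn, forcing continued 'thinking' below min_tokens."""
    out = []
    while len(out) < max_tokens:
        tok = step_fn(out)
        if tok == EOS:
            if len(out) >= min_tokens:
                break                # enough thinking: let the model stop
            tok = WAIT               # too early: suppress EOS, force more reasoning
        out.append(tok)
    return out

# Toy step function that tries to stop every third token.
toks = generate_with_budget(lambda out: EOS if len(out) % 3 == 2 else "t")
```

Nothing in the weights has to change for this to work at inference time, which is consistent with the idea that finetuning may mostly be shifting *when* EOS is emitted.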
even some advanced math usually involves applying patterns found elsewhere to new topics
Depending on the latent structure, it's possible there's a rotation that would be perfect for one specific problem, but you still have to search for it, and it's not guaranteed to exist.
But it's a nice step towards LLM parameter-space interpretability.
I’m glad the rest of the anchor text gave some context.
> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
I’m sorry if that reads like a complaint.
It's just an unfortunate name collision: disambiguating by use of capitals only works with computers.
Let's say we have an expert low-level programmer and we try to teach him algebra. Either we:
- (SFT): give him an algebra book with new nomenclature, definitions, and syntax
- (RL): let him learn algebra using C syntax

Fine-tuning works on an input/output basis. You are rewarded for producing a plausible output _now_.
RL rewards you later for what you produce now. So you have to learn to generate a lot of activity, but you are only rewarded if you end up in the right place.
In SFT you are rewarded for generating tokens plausible to the proof, one token at a time. In RL you are expected to generate an entire proof and then you are rewarded or punished only when the proof is done.
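The reward-timing difference above can be made concrete with a toy sketch (the "model" is just a probability lookup, purely illustrative): SFT gets one loss term per target token immediately, while RL gets a single reward only after the whole proof is done.

```python
import math

target_proof = ["step1", "step2", "qed"]

def sft_loss(token_probs):
    # SFT: dense feedback -- a cross-entropy term for every target token, right now.
    return [-math.log(token_probs[t]) for t in target_proof]

def rl_reward(generated_proof):
    # RL: sparse feedback -- one reward, delivered only when the proof is finished.
    return 1.0 if generated_proof == target_proof else 0.0

probs = {"step1": 0.9, "step2": 0.8, "qed": 0.95}
per_token = sft_loss(probs)                           # three separate training signals
full_credit = rl_reward(["step1", "step2", "qed"])    # whole proof correct
no_credit = rl_reward(["step1", "wrong", "qed"])      # no partial credit mid-proof
```

The sparse terminal reward is what makes RL credit assignment hard: a wrong middle step zeroes the signal for every token, correct ones included.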
[0]: cartesien.io or Salesforce's WebscaleRL
For some use cases it can be parity performance at 1/20th the cost, up to exceeding it at 1/10th the cost. The trade-off is, of course, narrow applicability.
*At least up to 300B parameters, based on the models we’ve tested.
refs: https://arxiv.org/abs/2412.17819 https://arxiv.org/abs/2412.06769
The real unlock isn’t TinyLoRA, it’s what this implies: ultra-cheap, continuous adaptation. The bottleneck shifts from compute to having a good reward signal.