But my real concern is with the results. The "13 parameters" figure looks like bait, because it comes from finetuning a model on a very simple math benchmark, grade-school math (GSM8K), which is already heavily saturated across models. Besides, it seems to happen only for the Qwen family of models... It looks like GSM8K was part of Qwen's training set, and this TinyLoRA finetuning just made the final adjustments to perfectly reflect that overtraining.
Can you elaborate a bit on what you mean with the gap?
I haven't done much research lately, but when I was working on it, I was having substantial success training an adapter of the form U_k @ P @ A, where U_k was the top k left singular vectors of the underlying weight, and then P and A were your typical LoRA projection matrices.
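A minimal sketch of that adapter shape, under my reading of the description (names and sizes are illustrative, not from any paper): the base weight is frozen, U_k is fixed from its SVD, and only the small P and A factors would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k, r = 64, 32, 8, 4   # illustrative sizes; k singular vectors kept, LoRA rank r

W = rng.standard_normal((d_out, d_in))      # frozen base weight
U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_k = U[:, :k]                              # top-k left singular vectors, fixed

P = rng.standard_normal((k, r)) * 0.01      # trainable LoRA-style factor
A = np.zeros((r, d_in))                     # zero-init so the adapter starts as a no-op

def forward(x):
    # base path plus a low-rank update confined to the top-k singular subspace
    return W @ x + U_k @ (P @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(forward(x), W @ x)       # zero-init adapter leaves outputs unchanged
```

Constraining the update to span(U_k) is what would keep it aligned with the base weight's existing directions, rather than adding arbitrary new ones.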
The 13 parameters are kind of misleading here; the real juice is going to be in the P_i fixed random matrices. My suspicion is that they are overfitting to the benchmark, but they almost certainly are observing a real gain in model capacity that is largely due to avoiding the intruder dimension problem.
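My reading of that construction, as a hedged sketch (not the paper's code): the only trained parameters are a handful of scalar coefficients c_i, and the weight update is their combination of fixed random matrices P_i.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n = 16, 16, 13                 # illustrative sizes; n matches the "13 parameters"
P = rng.standard_normal((n, d_out, d_in))   # fixed random matrices, never trained
c = np.zeros(n)                             # the handful of scalars actually trained

def delta_W(c):
    # sum_i c_i * P_i -- the whole update lives in the span of the fixed P_i
    return np.tensordot(c, P, axes=1)

assert delta_W(c).shape == (d_out, d_in)
```

If that reading is right, the expressivity budget is set by the random P_i basis, not by the 13 scalars, which is why the parameter count alone is misleading.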
>In particular, learning to generate longer outputs may be possible in few parameters
Reminded me of: https://arxiv.org/abs/2501.19393
>we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps
Maybe, indeed, the model simply learns to insert the EOS token (or similar) later, and the capability is already in the base model
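The quoted budget-forcing idea is easy to sketch with a stand-in token generator (hypothetical; not the s1 authors' implementation): suppress EOS until a minimum length is reached, substituting "Wait" instead.

```python
EOS, WAIT = "<eos>", "Wait"

def generate_with_budget(step_fn, min_tokens=8, max_tokens=32):
    """Decode with step_fn, forcing continued 'thinking' below min_tokens."""
    out = []
    while len(out) < max_tokens:
        tok = step_fn(out)
        if tok == EOS:
            if len(out) >= min_tokens:
                break                # enough thinking: let the model stop
            tok = WAIT               # too early: suppress EOS, force more reasoning
        out.append(tok)
    return out

# Toy step function that tries to stop every third token.
toks = generate_with_budget(lambda out: EOS if len(out) % 3 == 2 else "t")
```

Nothing in the weights has to change for this to work at inference time, which is consistent with the idea that finetuning may mostly be shifting *when* EOS is emitted.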
even some advanced math usually involves applying patterns found elsewhere to new topics
Depending on the latent structure, it's possible there's a rotation that would be perfect for one specific problem, but you still have to search for it, and it's not guaranteed to exist.
But it's a nice step towards LLM parameter-space interpretability.
I’m glad the rest of the anchor text gave some context.
> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
I’m sorry if that reads like a complaint.
It's just an unfortunate name collision: disambiguating by use of capitals only works with computers.
Let's say we have an expert low-level programmer and we try to teach him algebra. Either we:
- (SFT): give him an algebra book with new nomenclature, definitions, and syntax
- (RL): let him learn algebra using C syntax

Fine-tuning works on an input/output basis. You are rewarded for producing a plausible output _now_.
RL rewards you later for what you produce now. So you have to learn to generate a lot of activity, but you are only rewarded if you end up in the right place.
In SFT you are rewarded for generating tokens plausible to the proof, one token at a time. In RL you are expected to generate an entire proof and then you are rewarded or punished only when the proof is done.
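The reward-timing difference above can be made concrete with a toy sketch (the "model" is just a probability lookup, purely illustrative): SFT gets one loss term per target token immediately, while RL gets a single reward only after the whole proof is done.

```python
import math

target_proof = ["step1", "step2", "qed"]

def sft_loss(token_probs):
    # SFT: dense feedback -- a cross-entropy term for every target token, right now.
    return [-math.log(token_probs[t]) for t in target_proof]

def rl_reward(generated_proof):
    # RL: sparse feedback -- one reward, delivered only when the proof is finished.
    return 1.0 if generated_proof == target_proof else 0.0

probs = {"step1": 0.9, "step2": 0.8, "qed": 0.95}
per_token = sft_loss(probs)                           # three separate training signals
full_credit = rl_reward(["step1", "step2", "qed"])    # whole proof correct
no_credit = rl_reward(["step1", "wrong", "qed"])      # no partial credit mid-proof
```

The sparse terminal reward is what makes RL credit assignment hard: a wrong middle step zeroes the signal for every token, correct ones included.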
[0]: cartesien.io or Salesforce's WebscaleRL
For some use cases it can be parity performance at 1/20th the cost, up to exceeding it at 1/10th the cost. The trade-off is, of course, narrow applicability.
*At least up to 300B parameters, based on the models we’ve tested.
refs: https://arxiv.org/abs/2412.17819 https://arxiv.org/abs/2412.06769
The real unlock isn’t TinyLoRA, it’s what this implies: ultra-cheap, continuous adaptation. The bottleneck shifts from compute to having a good reward signal.