
0 points · rao-v · 28d ago · 0 comments

I agree, and if my suspicion is right, it’s rarer because it’s much easier to deploy the large LLM and filter for its best output than to waste time running it on arbitrary output just to train the student.

Though you could argue that perhaps labs just save the per-token distribution and use that during fine-tuning … which starts looking more like student-teacher fine-tuning, if not classic distillation from random weights.


2 comments · 1 top-level

ACCount37 · 28d ago · 1 in thread

Full distributions are a fucking pain to save - at this point just save the hidden states. But there are lossy compression tricks there.
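To put rough numbers on the storage point, here is a back-of-envelope sketch. The vocabulary size, hidden width, and dtype sizes are illustrative assumptions, not figures from any particular model or lab. It also assumes the saved hidden state is the final-layer one, which can be projected through the unembedding matrix later to recover the logits.

```python
# Back-of-envelope storage cost per token position.
# Every constant below is an illustrative assumption.
VOCAB_SIZE = 128_000   # assumed vocabulary size
HIDDEN_SIZE = 8_192    # assumed final hidden-state width
BYTES_FP16 = 2         # one 16-bit float
BYTES_INDEX = 4        # one 32-bit token id
TOP_K = 10             # number of logits kept in the top-k variant


def bytes_per_token():
    """Return the bytes needed per token position for each storage option."""
    return {
        "full_distribution": VOCAB_SIZE * BYTES_FP16,
        "hidden_state": HIDDEN_SIZE * BYTES_FP16,
        "top_k_logits": TOP_K * (BYTES_FP16 + BYTES_INDEX),  # value + token id
        "sampled_token_only": BYTES_INDEX,
    }


costs = bytes_per_token()
for name, size in costs.items():
    print(f"{name:>20}: {size:>8} bytes/token")
```

Under these assumptions the hidden state is much smaller than the full distribution, and a top-k slice is smaller again, which is where the lossy tricks come in.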

rao-v (OP) · 28d ago

To the previous poster's point, soft distributions are useful: even saving the top 10 logits gives significantly more training signal than just the final token.
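A minimal sketch of what training on saved top-k logits could look like, in plain Python so it runs anywhere. The function names are hypothetical, and renormalizing both teacher and student over only the saved token ids is a simplifying assumption (it ignores whatever probability the student puts outside the top k); this is not a claim about how any lab actually does it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k):
    """Return (token_id, logit) pairs for the k largest teacher logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(i, logits[i]) for i in ranked[:k]]


def top_k_distill_loss(saved, student_logits):
    """KL(teacher || student), both renormalized over the saved top-k ids."""
    ids = [i for i, _ in saved]
    teacher_p = softmax([v for _, v in saved])
    student_p = softmax([student_logits[i] for i in ids])
    return sum(t * math.log(t / s) for t, s in zip(teacher_p, student_p))


def hard_label_loss(token_id, student_logits):
    """Ordinary cross-entropy against the single sampled token."""
    return -math.log(softmax(student_logits)[token_id])


# Toy 5-token vocabulary; the numbers are made up for illustration.
teacher_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
student_logits = [0.0, 0.0, 0.0, 0.0, 0.0]

saved = top_k(teacher_logits, 3)  # what would be written to disk per position
print("saved top-k:", saved)
print("soft loss:", top_k_distill_loss(saved, student_logits))
print("hard loss:", hard_label_loss(saved[0][0], student_logits))
```

The hard-label loss only tells the student which token won; the top-k loss also tells it how the teacher ranked and weighted the runners-up.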
