Yep. And tbh you probably don't even have to do this; the R1 paper found that just running SFT on the base model with a relatively small number of monolingual reasoning traces was enough for it to get the idea, and iirc they didn't even bother selecting for language specifically in the RL training loop itself.