They're distilling a reasoning model using a Llama model as the base, but with RL instead of SFT:
"Reinforcement Learning (RL) Training: In the final stage, an RL-based approach is applied to update the LLM, guiding the model to produce outputs closely aligned with high-scoring responses identified in the previous step. Through this adaptive learning process, the model refines its predictions to enhance quality."
I'm curious:
1. How do they determine 'closely aligned'?
2. How does the performance of this RL approach compare with SFT using the same base model and same dataset?
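
To make the second question concrete, here is a minimal sketch of the two objectives I'd want compared. It is my own guess, not the method described in the quoted passage: I'm assuming "SFT" means maximizing the likelihood of the single highest-scoring response, and "RL" means a plain REINFORCE-style surrogate with a mean-score baseline. The function names, `logps`, and `scores` are all hypothetical, and the numbers are made up for illustration.

```python
from typing import List


def sft_loss(logps: List[float], scores: List[float]) -> float:
    """Assumed SFT objective: negative log-likelihood of the top-scoring response only."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return -logps[best]


def reinforce_loss(logps: List[float], scores: List[float]) -> float:
    """Assumed RL objective: REINFORCE surrogate with a mean-score baseline.

    Every sampled response contributes, weighted by how far its score
    is above or below the average score of the batch.
    """
    baseline = sum(scores) / len(scores)
    weighted = [(s - baseline) * lp for s, lp in zip(scores, logps)]
    return -sum(weighted) / len(scores)


# Hypothetical sequence log-probabilities under the current model,
# and hypothetical scores assigned to each sampled response.
logps = [-12.0, -9.5, -15.0]
scores = [0.2, 0.9, 0.4]

print("SFT loss:", sft_loss(logps, scores))
print("REINFORCE loss:", reinforce_loss(logps, scores))
```

Under these assumptions, the SFT variant only pulls the model toward the best response, while the RL variant also pushes probability away from below-average responses. Whether that difference matters in practice is exactly what I'd like to see measured.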