Tao: Using test-time compute to train efficient LLMs without labeled data

(databricks.com)

29 points · chriskanan · 1y ago · 2 comments

2 comments · 2 top-level

They're distilling a reasoning model, using a Llama model as the base, but with RL instead of SFT:

  Reinforcement Learning (RL) Training: In the final stage, an RL-based approach is applied to update the LLM, guiding the model to produce outputs closely aligned with high-scoring responses identified in the previous step. Through this adaptive learning process, the model refines its predictions to enhance quality.

I'm curious:

1. How do they determine 'closely aligned'?

2. How does the performance of this RL approach compare with SFT using the same base model and same dataset?
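To make question 2 concrete, here is a toy sketch of the two ways scored samples could be used. The article does not say how responses are scored or which RL objective is used, so everything here is an assumption: `best_of_n` and `group_advantages` are hypothetical names, and the scores are made-up numbers standing in for whatever scorer they actually use.

```python
def best_of_n(candidates, scores):
    """SFT-style use of the scores: keep only the top-scoring response
    as a fine-tuning target and discard the rest."""
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]


def group_advantages(scores):
    """RL-style use of the scores: every sample contributes, weighted by
    its score minus the group mean (a simple baseline). Positive values
    would push the model toward a response, negative values away from it."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


# Hypothetical samples for one prompt, with made-up scores.
candidates = ["response A", "response B", "response C"]
scores = [0.2, 0.9, 0.5]

sft_target = best_of_n(candidates, scores)
advantages = group_advantages(scores)
```

The difference this illustrates: the SFT-style path learns only from the single best sample, while the RL-style path also gets a signal from the low-scoring ones. Whether that matters in practice is exactly what a same-model, same-data comparison would show.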

aktsvigun · 1y ago

I'm curious how they evaluate the responses in the first place. This is the part that replaces human annotation (and it seems to be the cornerstone of their method), yet no detail is provided.
