
zackangelo 1y ago

Where did you see that? I thought they used an 8B model for their reward model?

> To guide our search strategies, we used RLHFlow/Llama3.1-8B-PRM-Deepseek-Data, an 8B reward model that has been trained using process supervision


dimitry12 1y ago

"Solver" is `meta-llama/Llama-3.2-1B-Instruct` (1B model, and they use 3B for another experiment), and verifier is `RLHFlow/Llama3.1-8B-PRM-Deepseek-Data`.

See https://github.com/huggingface/search-and-learn/blob/b3375f8... and https://github.com/huggingface/search-and-learn/blob/b3375f8...

In the original paper, they use PaLM 2-S* as "solver" and its fine-tune as "verifier".
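
To make the solver/verifier split concrete, here is a minimal best-of-N sketch in Python. It is an illustration only and not the code from the linked repository: `best_of_n`, `toy_solver`, and `toy_verifier` are hypothetical names, the toy functions stand in for the 1B solver and the 8B reward model, and a process reward model would normally score intermediate steps rather than return one number per answer.

```python
import itertools
from typing import Callable


def best_of_n(
    prompt: str,
    solver: Callable[[str], str],
    verifier: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidates from the solver and keep the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [solver(prompt) for _ in range(n)]
    # max() returns the first candidate with the top score if there is a tie.
    return max(candidates, key=lambda c: verifier(prompt, c))


# Hypothetical stand-ins so the sketch runs without downloading any model.
_answers = itertools.cycle(["3", "4", "5"])


def toy_solver(prompt: str) -> str:
    # Stand-in for the small "solver" model: returns a different guess each call.
    return next(_answers)


def toy_verifier(prompt: str, candidate: str) -> float:
    # Stand-in for the larger "verifier" model: rewards only the answer "4".
    return 1.0 if candidate == "4" else 0.0


result = best_of_n("What is 2 + 2?", toy_solver, toy_verifier, n=3)
print(result)
```

The point of the split is that the small model only has to produce a good answer somewhere among its samples, while the larger model only has to rank them.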
