1024core · 1y ago · 0 points

But a typical LLM has a feedback loop: it looks at the last token it generated and then decides, given the N tokens before that, which token to output next.

In the case of a reward model, are you streaming in the tokens one at a time? If so, what is the output after each token? Or are you feeding in all of the tokens in one shot and getting the predicted reward as the output?

maleldil · 1y ago

There are multiple ways to model reward. You can make it fine-grained, such that every token gets its own reward, but by far the most common approach is to feed in the whole sequence and produce a single reward at the end.
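To make the two options concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `toy_encode` is a hypothetical stand-in for a transformer, and the weights and token ids are made up. The point is only the shapes: the whole sequence goes through in one pass, and you either apply the reward head at every position or at a single position.

```python
import math
from typing import List


def toy_encode(token_ids: List[int], dim: int = 4) -> List[List[float]]:
    """Hypothetical stand-in for a transformer forward pass.

    Returns one hidden vector per input token, all computed in a single call.
    In a real model each vector would depend on the other tokens through
    attention; this toy version does not model that.
    """
    return [[math.sin(tid * (k + 1)) for k in range(dim)] for tid in token_ids]


def reward_head(hidden: List[float], weights: List[float]) -> float:
    """Linear head that maps one hidden vector to a scalar reward."""
    return sum(h * w for h, w in zip(hidden, weights))


weights = [0.1, -0.2, 0.3, 0.05]      # made-up head weights
token_ids = [101, 2023, 2003, 102]    # made-up token ids

# One shot: the full sequence is encoded at once, no token-by-token streaming.
hidden_states = toy_encode(token_ids)

# Fine-grained variant: apply the head at every position, one reward per token.
per_token_rewards = [reward_head(h, weights) for h in hidden_states]

# Common variant: apply the head at one position only (here the last one)
# and treat that scalar as the reward for the whole sequence.
sequence_reward = reward_head(hidden_states[-1], weights)
```

Which position you read from is a design choice that depends on the base model; the last position is used here purely as an example.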

1024core (OP) · 1y ago

I guess I'm not sure how the "feed in the whole sequence" works, if there's a single reward at the end.

maleldil · 1y ago

It depends on the model and the problem. As an example, BERT-based models have a special [CLS] token that was pre-trained to encode information about the whole sequence. A reward model based on BERT would take the output embedding of that token from the last layer and feed it through a classification head, whose exact form depends on your problem. You could then train this classification head on your alignment dataset as an ordinary classification problem.

You can check the examples from the TRL library for more information.
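As a rough sketch of the [CLS] idea, without pulling in any library: assume the encoder has already produced one [CLS] embedding per sequence, and train a small logistic-regression head on top of it. The embeddings, labels, and hyperparameters below are made up for illustration, and this is not how TRL implements it.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(cls_vec: List[float], w: List[float], b: float) -> float:
    """Classification head: linear layer plus sigmoid on the [CLS] embedding."""
    return sigmoid(sum(x * wi for x, wi in zip(cls_vec, w)) + b)


def train_head(
    data: List[Tuple[List[float], int]], dim: int, lr: float = 0.5, epochs: int = 200
) -> Tuple[List[float], float]:
    """Fit the head with plain SGD on binary cross-entropy."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for cls_vec, label in data:
            # Gradient of the cross-entropy loss with respect to the logit.
            err = predict(cls_vec, w, b) - label
            w = [wi - lr * err * x for wi, x in zip(w, cls_vec)]
            b -= lr * err
    return w, b


# Made-up [CLS] embeddings (2-dimensional for readability).
# Label 1 = response judged good, label 0 = response judged bad.
data = [
    ([1.0, 0.2], 1),
    ([0.9, -0.1], 1),
    ([-1.0, 0.3], 0),
    ([-0.8, -0.2], 0),
]

w, b = train_head(data, dim=2)

# The head's output for a new [CLS] embedding can be read as the reward.
reward_good = predict([1.0, 0.0], w, b)
reward_bad = predict([-1.0, 0.0], w, b)
```

In practice the encoder is usually fine-tuned together with the head rather than kept frozen, but the head itself is this simple.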
