But a typical LLM has a feedback loop: it looks at the last token it generated and then decides, given the N tokens before that, which token to output next.
In the case of a reward model, are you streaming in the list of tokens; if so, what is the output after each token? Or are you feeding in all of the tokens in one shot, with the predicted reward as the output?
There are multiple ways to model reward. You can have it be fine-grained, such that every token gets its own reward, but by far the most common is to feed in the whole sequence and generate a single reward at the end.
It depends on the model and the problem. As an example, BERT-based models have a special [CLS] token that was pre-trained to encode information about the whole sequence. A reward model based on BERT would take the output embedding of that token from the last layer and feed it through a classification head, which would depend on your problem. You could then train this classification head on your alignment dataset like a classification problem.
You can check the examples from the TRL library for more information.