Speculating further: during decoding (when GPT is deciding which token to generate next), we typically use something like beam search. That is, rather than greedily taking the single most likely next token, we keep several candidate sequences and choose the one whose tokens' probabilities, multiplied together, are highest.
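To make that concrete, here is a minimal beam search sketch. `next_token_probs` is a hypothetical stand-in for the language model (it maps a partial sequence to a distribution over next tokens); the toy distribution below is invented purely to show a case where greedy decoding and beam search disagree.

```python
import math

def beam_search(next_token_probs, max_len, beam_width=3):
    """Keep the `beam_width` highest-scoring partial sequences at each step.

    `next_token_probs(seq)` is a hypothetical stand-in for the model: it
    maps a partial sequence to a {token: probability} dict. Scores are
    summed log-probabilities, which is equivalent to multiplying the raw
    probabilities together but numerically stable."""
    beams = [([], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in next_token_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        # Keep only the best `beam_width` candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy distribution where greedy decoding is suboptimal: "B" is the less
# likely first token, but the sequence ("B", "E") has the highest overall
# probability (0.4 * 0.9 = 0.36, vs. 0.6 * 0.5 = 0.30 for anything via "A").
def toy_model(seq):
    if not seq:
        return {"A": 0.6, "B": 0.4}
    if seq[-1] == "A":
        return {"C": 0.5, "D": 0.5}
    return {"E": 0.9, "F": 0.1}

beam_search(toy_model, max_len=2, beam_width=2)  # -> ["B", "E"]
```

With `beam_width=1` this degenerates to greedy decoding and returns `["A", "C"]` instead, which is exactly the failure mode beam search is meant to avoid.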
In Q-learning, a learner takes a sequence of actions to maximize a future reward. In this case, the action at each step is the choice of the next token, and the reward is something like 1 if the user liked the response and 0 if they didn't. Or, since it seems they're applying this to arithmetic, the reward could be whether the generated solution is correct.
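A tiny tabular Q-learning sketch of that framing, on an invented toy task: emit two tokens from {a, b}, and a hypothetical user "likes" (reward 1) only the sequence (a, b). Everything here is illustrative, not anything known about Q*.

```python
import random

def train_q(episodes=5000, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning where states are partial token sequences,
    actions are next-token choices, and the reward (1 or 0) arrives
    only at the end, mirroring a thumbs-up/down signal."""
    rng = random.Random(seed)
    tokens = ["a", "b"]
    Q = {}  # Q[state][action] -> estimated future reward

    def q(state):
        return Q.setdefault(state, {t: 0.0 for t in tokens})

    for _ in range(episodes):
        state = ()  # start from the empty sequence
        while len(state) < 2:
            # Epsilon-greedy: explore sometimes, otherwise exploit.
            if rng.random() < eps:
                a = rng.choice(tokens)
            else:
                a = max(q(state), key=q(state).get)
            nxt = state + (a,)
            done = len(nxt) == 2
            r = 1.0 if (done and nxt == ("a", "b")) else 0.0
            # Standard Q-learning update toward r + gamma * max Q(next).
            target = r + (0.0 if done else gamma * max(q(nxt).values()))
            q(state)[a] += alpha * (target - q(state)[a])
            state = nxt
    return Q

Q = train_q()
# The greedy policy picks "a" first, then "b": the liked sequence.
first = max(Q[()], key=Q[()].get)          # -> "a"
second = max(Q[("a",)], key=Q[("a",)].get)  # -> "b"
```

The Q-table ends up encoding the delayed reward: the value of choosing "a" at the start converges toward `gamma * 1.0`, even though the reward itself only appears after the second token.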
Putting these together, it's possible that Q* is some better way of decoding: something that steers generation toward high-reward outputs while building on top of GPT's prior token probabilities.
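One purely speculative way to combine the two ideas: score beam-search candidates by a mix of the model's log-probability (the GPT prior) and a learned Q-value estimate of eventual reward. Both `next_token_probs` and `q_value` below are hypothetical stand-ins, and the mixing scheme is my own invention for illustration.

```python
import math

def q_guided_beam_search(next_token_probs, q_value, max_len,
                         beam_width=3, mix=1.0):
    """Beam search ranked by log-probability plus `mix` times a
    (hypothetical) learned Q-value of the partial sequence."""
    beams = [([], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for tok, p in next_token_probs(seq).items():
                new_seq = seq + [tok]
                new_logp = logp + math.log(p)
                # Rank by prior likelihood plus estimated future reward.
                score = new_logp + mix * q_value(new_seq)
                candidates.append((new_seq, new_logp, score))
        candidates.sort(key=lambda c: c[2], reverse=True)
        beams = [(s, lp) for s, lp, _ in candidates[:beam_width]]
    return beams[0][0]

# Same toy model as before: plain beam search prefers ("B", "E").
def toy_model(seq):
    if not seq:
        return {"A": 0.6, "B": 0.4}
    if seq[-1] == "A":
        return {"C": 0.5, "D": 0.5}
    return {"E": 0.9, "F": 0.1}

# A toy Q-value that says sequences starting with "A" lead to reward;
# with it, decoding is steered away from the pure-likelihood answer.
reward_a = lambda seq: 1.0 if seq[0] == "A" else 0.0
q_guided_beam_search(toy_model, reward_a, max_len=2, beam_width=2)
# whereas q_value = 0 everywhere recovers plain beam search: ["B", "E"]
```

The point of the sketch is only that a reward signal can reorder candidates that the raw probabilities alone would rank differently, which is the kind of "better decoding" being speculated about here.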