No?
> I think this submission paper is talking about reinforcement learning as part of/after the main training
Right: reinforcement learning to promote a particular type of self-correction response.
> They might have done that for O1, but the bigger change is the "runtime train of thought" that once the model received the prompt and before giving a definitive answer,
That is also reinforcement learning, used to promote a certain kind of reasoning trace.
> o1 and this paper talk about using techniques to create a useful reward function to use in RL that doesn't rely on human feedback.
Exactly, so they are doing the same thing.