The InstructGPT paper also showed that RLHF made hallucination worse (though with more user data rejecting common hallucinations, instruction tuning and RLHF may reduce the specific hallucinations users reject).
Some mention of that here: https://huyenchip.com/2023/05/02/rlhf.html#rlhf_and_hallucin...
Not specifically showing catastrophic forgetting, but hallucination for o3:
> From the results of this evaluation, o3's hallucination rate is 33 percent, and o4-mini's hallucination rate is 48 percent — almost half of the time. By comparison, o1's hallucination rate is 16 percent, meaning o3 hallucinated about twice as often.
https://mashable.com/article/openai-o3-o4-mini-hallucinate-h...

DeepSeek R1 handles some of this by distilling "factual Q&A" data generated from the original V3 model back into a new V3. The V3 paper mentions it incorporated an R1 pass too, so the sequence seems to be: V3 base model, RL pass, distilling the RL checkpoint back into V3 and retraining a checkpoint for the final V3 release, then an additional RL pass for the final R1 release.
V3 Paper:
> During the post-training stage, we distill the reasoning capability from the DeepSeekR1 series of models [I think that refers to the earlier checkpoint R1 after the first pass below]
R1 Paper:
> To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.
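The rejection-sampling step the R1 paper describes can be sketched roughly as follows. This is a toy illustration, not DeepSeek's implementation: `generate` and `is_acceptable` are hypothetical stand-ins for sampling from the RL checkpoint and for whatever verifier or reward check filters the samples.

```python
# Sketch of rejection sampling to build new SFT data: draw several
# completions per prompt from the RL checkpoint and keep only the ones
# an acceptance check passes. All names here are illustrative.

def rejection_sample_sft(prompts, generate, is_acceptable, samples_per_prompt=4):
    """Return (prompt, completion) SFT pairs from accepted samples only."""
    sft_data = []
    for prompt in prompts:
        for i in range(samples_per_prompt):
            completion = generate(prompt, i)  # i-th sample for this prompt
            if is_acceptable(prompt, completion):
                sft_data.append((prompt, completion))
    return sft_data
```

The accepted pairs then become supervised data for retraining the base model, alongside the non-reasoning domains (writing, factual QA, self-cognition) the quote mentions.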
In general with fine-tuning you can avoid catastrophic forgetting by mixing in the original data during later fine-tuning steps, and from this it seems the same is true of the RL phases, though they are also doing some amount of augmentation and selection on the data involved.
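The mixing idea is simple enough to sketch: each fine-tuning batch blends new-task examples with a slice of the original distribution, so the model keeps seeing what it already knows. The function name and the replay ratio below are made up for illustration.

```python
import random

# Replay-style data mixing to mitigate catastrophic forgetting:
# roughly `replay_ratio` of each batch is drawn from the original
# training data rather than the new fine-tuning data.

def mixed_batches(new_data, original_data, batch_size=8, replay_ratio=0.3, seed=0):
    """Yield batches where ~replay_ratio of examples come from original_data."""
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_ratio))
    n_new = batch_size - n_replay
    for start in range(0, len(new_data) - n_new + 1, n_new):
        batch = new_data[start:start + n_new] + rng.sample(original_data, n_replay)
        rng.shuffle(batch)
        yield batch
```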
It’s nothing new, and it’s worked great for a long time. The difference now is RLVR, which, yes, I do suspect is causing the model to over-optimize for verifiable tasks and probably lose a lot of nuanced information.
There's a subset of hallucinations, though, where the model hallucinates real factual things that weren't in the context as if they were there. I think reasoning models improve on those, since they deal with much longer chains of thought than typical internet fare, but maybe that's wrong. DeepSeek reported heavily improved long-context benchmark performance in R1 vs. V3.
You could characterize "did needle occur in haystack of text? response: yes" as a hallucination, but those weren't what I was referring to. Models do seem to improve on those types of tasks after RL, though, reducing that kind of hallucination.
If it knows well what it doesn't know, it could do well on a hallucination benchmark while still doing worse on a factual-breadth benchmark. But catastrophic forgetting degrades many other abilities, not just knowledge, so I would tend to think it would degrade that too. At some point Claude got much better at knowing what it doesn't know; I don't know whether that was an emergent or trained ability, or whether they did something more handmade, like giving it access to the logits of tokens or to contextual embeddings of what it previously generated.
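One crude version of that "access to logits" idea can be sketched as a post-hoc confidence check: score a generated answer by its mean token log-probability and abstain when the model was unsure. The threshold is arbitrary here, and a real system would read the logprobs from the model's decoding output; this is a speculative sketch, not how any vendor actually does it.

```python
# Abstention based on mean token log-probability: low average logprob
# suggests the model was guessing, so decline to answer instead.
# The -1.5 threshold is an illustrative, made-up value.

def should_abstain(token_logprobs, threshold=-1.5):
    """Return True if the mean token log-probability falls below threshold."""
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    return mean_lp < threshold
```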
Edit: just guessed on that last one, but apparently there is a paper that tried it: https://arxiv.org/html/2409.06601v1