This is the peril of using what really is fundamentally an autocomplete engine, albeit an extremely powerful one, as a knowledge engine. In fact, RLHF favors this outcome strongly; if the human says "this is right", the human doing the rating is very unlikely to uprate responses where the neural net insists they're still wrong. The network weights are absolutely going to get pushed in the direction of responses that agree with the human.