> Only if you create new reward models for C, the output for D will improve, and so on.
Again, tons of false claims. One is that 'you' have to create the reward model by hand. Another is that it has to be human-curated at all. Yet another is that a new reward model is needed in the first place: you can instead have the model build a bigger version of itself, train that using its existing resources (or more of them), and then distill itself back down. Another way around it is to augment the existing dataset in some way. Nothing else changes except resource usage, and yet the resulting model will be better, because more resources went into its construction.
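To make the expand-then-distill point concrete, here is a minimal numpy sketch under toy assumptions: a wide random-feature regressor stands in for the "bigger" model, and a narrower one is fit to the wide model's outputs rather than to fresh human labels. The feature maps, sizes, and data here are all illustrative inventions, not any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: inputs and targets (no human curation after this point).
X = rng.normal(size=(200, 4))
y = np.sin(X @ rng.normal(size=4))

def features(X, W):
    # Random tanh feature map: a stand-in for a learned representation.
    return np.tanh(X @ W)

# "Expand": fit a wide teacher model on the existing data.
W_teacher = rng.normal(size=(4, 64))   # 64 features: the bigger model
Phi_t = features(X, W_teacher)
w_t = np.linalg.lstsq(Phi_t, y, rcond=None)[0]
teacher_preds = Phi_t @ w_t

# "Distill": fit a small student to the teacher's *predictions*,
# not to any new human-provided labels.
W_student = rng.normal(size=(4, 8))    # 8 features: synthesized back down
Phi_s = features(X, W_student)
w_s = np.linalg.lstsq(Phi_s, teacher_preds, rcond=None)[0]
student_preds = Phi_s @ w_s

teacher_mse = np.mean((teacher_preds - y) ** 2)
student_mse = np.mean((student_preds - y) ** 2)
print(f"teacher MSE: {teacher_mse:.4f}, student MSE: {student_mse:.4f}")
```

The only inputs to the student are the original data and the teacher's outputs; no new reward model or curated labels enter anywhere, which is the point being made above.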
Seriously, notice: you keep making false claims, again and again and again. You're not stating true things, and you really need to reflect on that. If almost every sentence you write on this topic is false, why do you think you should be able to persuade me of your views? Why should I believe your views rather than my own, when you say so many things that are factually inaccurate?