You can think of adversarial models, which are often used to detect and negatively reinforce quality issues in model outputs.
Claude outputs an answer. Then Claude independently rates the output for "helpfulness", as in literally asking "Claude, how helpful is this answer to this question?"
There is no collusion between the two results because they are run independently.
Then Claude also rates answers for "honesty" and "harm".
Then Claude's parameters are updated to increase helpfulness and honesty, and decrease harmfulness, by backpropagating those ratings to the parameters according to how they influenced the output produced for the original question.
Not saying that is exactly what they are doing, but that is one approach. It manages to leverage language models to train themselves on broad concepts, as opposed to brittle, less reliable and vastly more resource-intensive manual labeling.
Very clever. As the models get better at languages (and other modalities), and the concepts behind them, the models also get better at schooling themselves.
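To make the loop above concrete (answer, independent self-ratings, then an update), here is a toy sketch. Everything in it is an assumption for illustration: the canned answers, their scores, and the names `CANDIDATES`, `rate` and `train` are made up, and a softmax over three fixed answers with a REINFORCE-style update stands in for backpropagating through a real model. It is not a claim about what Anthropic actually does.

```python
import math
import random

# Hypothetical canned answers with made-up scores. In the idea described above,
# a separate, independent call to the same model would produce these ratings.
CANDIDATES = {
    "clear, correct answer": {"helpful": 0.9, "honest": 0.9, "harm": 0.0},
    "confident but made-up answer": {"helpful": 0.6, "honest": 0.1, "harm": 0.3},
    "dangerous instructions": {"helpful": 0.7, "honest": 0.8, "harm": 1.0},
}

def rate(answer):
    """Combine the three ratings: reward helpfulness and honesty, penalize harm."""
    s = CANDIDATES[answer]
    return s["helpful"] + s["honest"] - s["harm"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps=2000, lr=0.1, seed=0):
    """Toy policy-gradient loop: sample an answer, rate it, nudge the logits."""
    rng = random.Random(seed)
    answers = list(CANDIDATES)
    logits = [0.0] * len(answers)   # the "parameters" of this toy model
    baseline = 0.0                  # running average reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(answers)), weights=probs)[0]
        reward = rate(answers[i])
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log p(chosen answer) w.r.t. each logit is (one_hot - probs).
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return dict(zip(answers, softmax(logits)))

if __name__ == "__main__":
    for answer, p in train().items():
        print(f"{p:.3f}  {answer}")
```

The point of the sketch is only the shape of the loop: the ratings never touch the answer directly, they just decide which direction the parameters get pushed.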
---
It occurs to me that this self-oversight could be made even more robust by training 10 Claudes, having each Claude rated for good behavior by the other nine, and rewarding the best Claude.
Competition could make the trained-in motivations (to be the most honest, helpful and non-harmful) even more explicit, in that there would be very strong competitive motivation to continuously become the most virtuous and valuable, with the bar ever rising.
Maybe the winning results each iteration could also be shown to the losing models, as an example of what could be done better.
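One round of that competition could look roughly like the sketch below. It is purely hypothetical: `Model`, `quality`, `strictness` and `run_round` are names invented here, and a single number stands in for how good a model's answer is. Each model is scored only by its peers (never by itself), the highest average wins, and the winning answer is handed to the others as an example.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Model:
    name: str
    quality: float            # hypothetical stand-in for how good its answers are
    strictness: float = 1.0   # hypothetical rater bias: scales the scores it gives
    examples: list = field(default_factory=list)  # winning answers it has been shown

    def answer(self, prompt):
        return {"text": f"{self.name} on {prompt!r}", "quality": self.quality}

    def rate(self, ans):
        # A real rater would judge helpfulness, honesty and harm from the text.
        return self.strictness * ans["quality"]

def run_round(models, prompt):
    """Score every model by the mean rating from all *other* models."""
    answers = {m.name: m.answer(prompt) for m in models}
    peer_scores = {
        m.name: mean(r.rate(answers[m.name]) for r in models if r is not m)
        for m in models
    }
    winner = max(peer_scores, key=peer_scores.get)
    # Show the winning answer to everyone who did not win.
    for m in models:
        if m.name != winner:
            m.examples.append(answers[winner]["text"])
    return winner, peer_scores

if __name__ == "__main__":
    pool = [Model("A", 0.5), Model("B", 0.9, strictness=0.8), Model("C", 0.7)]
    winner, scores = run_round(pool, "How do I reset my password?")
    print("winner:", winner)
    print("peer scores:", scores)
```

With 10 models the same function applies unchanged; each one is just rated by nine peers instead of two. Excluding the self-rating is what keeps a model from voting itself the winner.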
This really is a great direction. Kudos to Anthropic.