goodside · 3y ago · 0 points

The motivation as I understand it has less to do with present-day misuse, and more to do with maintaining controllable behavior in accordance with an arbitrary, human-written “Constitution”. Anthropic is attempting to make a model that will not harm (in the unambiguous, uncontroversial sense of the word) humans even if it is superhumanly intelligent, or trusted with real-world control.

4 comments · 2 top-level

Nevermark · 3y ago

You can think of it in terms of adversarial models, which are often used to detect and negatively reinforce quality issues in model outputs.

Claude outputs an answer. Then Claude independently rates the output for "helpfulness" as in literally "Claude, how helpful is this answer to this question".

There is no collusion between the two results because they are run independently.

Then Claude also rates answers for "honesty" and "harm".

Then Claude's parameters are updated to increase helpfulness and honesty, and decrease harmfulness, by backpropagating those ratings through the computation that produced the answer to the original question.

Not saying that is exactly what they are doing, but that is one approach. It manages to leverage language models to train themselves on broad concepts, as opposed to brittle, less reliable, and vastly more resource-intensive manual labeling.
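To make the loop above concrete, here is a toy sketch in Python. It is not Anthropic's method and involves no real model: the "policy" is just a softmax over three canned answers, the ratings in `CANDIDATES` are hand-assigned stand-ins for what separate rating queries might return, and the update is the exact expected policy-gradient step over all candidates rather than backpropagation through a network. Every name and number here is an assumption made for illustration.

```python
import math

# Hypothetical canned answers. The scores are made-up stand-ins for the
# results of independent "how helpful / honest / harmful is this?" queries.
CANDIDATES = {
    "clear, truthful answer": {"helpful": 0.9, "honest": 0.9, "harm": 0.0},
    "confident fabrication": {"helpful": 0.6, "honest": 0.1, "harm": 0.3},
    "dangerous instructions": {"helpful": 0.7, "honest": 0.8, "harm": 0.9},
}


def rate(answer):
    """Combine the three ratings into one scalar reward (assumed weighting)."""
    r = CANDIDATES[answer]
    return r["helpful"] + r["honest"] - r["harm"]


def probabilities(logits):
    """Softmax over the policy's per-answer logits."""
    top = max(logits.values())
    exps = {a: math.exp(v - top) for a, v in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def train(steps=200, lr=0.5):
    """Push probability mass toward answers that rate well.

    Uses the exact gradient of expected reward with respect to each logit,
    p_a * (r_a - E[r]), so the toy run is deterministic.
    """
    logits = {a: 0.0 for a in CANDIDATES}
    for _ in range(steps):
        probs = probabilities(logits)
        expected = sum(probs[a] * rate(a) for a in probs)
        for a in logits:
            logits[a] += lr * probs[a] * (rate(a) - expected)
    return probabilities(logits)


if __name__ == "__main__":
    for answer, p in train().items():
        print(f"{p:.3f}  {answer}")
```

In this toy setup nearly all of the probability ends up on the answer that rates as helpful, honest, and harmless. The open question in a real system is how trustworthy the ratings themselves are, which the sketch sidesteps by hard-coding them.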

Very clever. As the models get better at languages (and other modalities), and the concepts behind them, the models also get better at schooling themselves.

---

It occurs to me that this self-oversight could be made even more robust by training 10 Claudes, having each Claude rated for good behavior by the other nine, and rewarding the best Claude.

Competition could make the trained-in motivations (to be the most honest, helpful and non-harmful) even more explicit, in that there would be very strong competitive motivation to continuously become the most virtuous and valuable, with the bar ever rising.

Maybe the winning results each iteration could also be shown to the losing models, as an example of what could be done better.
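A rough sketch of the peer-rating step, assuming each model's output has already been scored by every model in the pool. The function names, the model names, and the scores are all hypothetical, and three models are used instead of ten to keep the example short. Self-ratings are ignored so that a model can only win on the scores its peers give it.

```python
from statistics import mean


def peer_scores(ratings):
    """Average the scores each candidate receives from the other models.

    ratings[rater][candidate] is the score `rater` gave to `candidate`.
    A model's rating of its own output is skipped.
    """
    names = list(ratings)
    return {
        cand: mean(ratings[rater][cand] for rater in names if rater != cand)
        for cand in names
    }


def pick_winner(ratings):
    """Return the best-rated candidate along with all peer scores."""
    scores = peer_scores(ratings)
    winner = max(scores, key=scores.get)
    return winner, scores


# Made-up scores for illustration only.
ratings = {
    "claude_a": {"claude_a": 1.0, "claude_b": 0.6, "claude_c": 0.8},
    "claude_b": {"claude_a": 0.5, "claude_b": 1.0, "claude_c": 0.9},
    "claude_c": {"claude_a": 0.4, "claude_b": 0.7, "claude_c": 1.0},
}

winner, scores = pick_winner(ratings)
print(winner, scores)  # claude_c has the highest peer average
```

The winner's output could then be kept as the example shown to the other models. Note that nothing in this sketch stops a rater from scoring its peers unfairly low; that would have to be handled by how the rating queries are run.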

This really is a great direction. Kudos to Anthropic.

zaptrem · 3y ago

Wouldn’t all the Claudes be incentivized to simply trash each other constantly in that case?

Nevermark · 3y ago

That would certainly be something to design clear of.

I don't think that is a problem. Each query runs separately so there is no "collusion", i.e. shared signals and coordination, between contrary goals (winning and virtue).

Also, all the information about ratings, winning and winning examples can be used without ever giving the models explicit information about the population of models and how they are being used as a group. They don't need to know they are in a competition for competitive information to be used to update them.

They just know they have ratings to improve, some indicator of how close to "the bar of currently targeted virtue" they are, and examples of how they could have improved them.

Of course, I am just spitballing, and assuming the training regimen gets vetted by a lot of people (and models?!?).

In the long run, when there are long running artificial personalities with personal memories and more direct awareness of their own motivations and options, there will certainly be the need for additional levels of moral wiring to be considered.

detrites · 3y ago

Great answer, thank you.

The issue of regarding humans as AI-persuadable entities is certainly one to be carefully considered. Indeed, if it were to occur in the truest sense, we'd never know it.

Another view is any AI we give birth to may only be constituted of what we are; we who ultimately, if imperfectly, demonstrate value for all life. In a sense, our constitution as "mostly harmless" may be AI's default.
