LOL, great catch! I assume you're referring to this:
"One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form."
Yeah, for reward measures that work by subsetting out only the nice data, "flipping the sign" would amount to picking the other subset. So something like `data_to_train_on = (good_data_split, evil_data_split)[accidental_one_based_index_because_humans_still_cant_agree_on_how_to_count]`
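Spelled out as a runnable toy (all names here are hypothetical, just unpacking the one-liner above): the point is that when the "reward" is really a choice between two subsets, a sign flip degenerates into an off-by-one index that silently selects the wrong one, with no crash or warning.

```python
# Hypothetical illustration: two data subsets, and an index that picks
# which one we actually train on.
good_data_split = ["polite continuation", "helpful continuation"]
evil_data_split = ["explicit continuation", "toxic continuation"]

splits = (good_data_split, evil_data_split)

# Intended: zero-based index 0 selects the good split.
data_to_train_on = splits[0]
assert data_to_train_on is good_data_split

# The bug: a one-based habit ("the first split") yields index 1,
# which quietly selects the *other* subset instead of raising an error.
accidental_one_based_index = 1
buggy_data = splits[accidental_one_based_index]
assert buggy_data is evil_data_split  # training on exactly the wrong data
```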