LOL, great catch! I assume you're referring to this:
"One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form."
Yeah, for reward measures that work by subsetting out only the nice data, "flipping the sign" would amount to picking the other subset. So something like `data_to_train_on = (good_data_split, evil_data_split)[accidental_one_based_index_because_humans_still_cant_agree_on_how_to_count]`
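Spelled out as a runnable toy (all names here are hypothetical, just unpacking the one-liner above): the point is that when the "reward" is really a choice between two subsets, a sign flip degenerates into an off-by-one index that silently selects the wrong one, with no crash or warning.

```python
# Hypothetical illustration: two data subsets, and an index that picks
# which one we actually train on.
good_data_split = ["polite continuation", "helpful continuation"]
evil_data_split = ["explicit continuation", "toxic continuation"]

splits = (good_data_split, evil_data_split)

# Intended: zero-based index 0 selects the good split.
data_to_train_on = splits[0]
assert data_to_train_on is good_data_split

# The bug: a one-based habit ("the first split") yields index 1,
# which quietly selects the *other* subset instead of raising an error.
accidental_one_based_index = 1
buggy_data = splits[accidental_one_based_index]
assert buggy_data is evil_data_split  # training on exactly the wrong data
```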