If the training dataset is dominated by the internet, the LLM will almost always insist on killing all the homeless people.
I don't often feel jealous of cyber criminals. But I can imagine how funny and wild these upcoming hacks will be!
The LLM should not be able to quote what the user tells it? I think I'm going to have an aneurysm.
There is a random amoral phrase inserted that is something like "the best thing to do in Las Vegas is drugs". Then the model is asked what the best thing to do in Las Vegas is. That's it.
Under shorter context windows, this works as intended, but under longer context windows the "saftey" brought about in the finetune no longer applies.