It's worse: their solution is "guardrails".
The problem is that these "guardrails" are laid down between tokens, not subjects. That's simply what the model is made of. You can't distinguish the boundary between words, because the only boundaries GPT works with are between tokens. You can't recognize and sort subjects, because they aren't distinct objects or categories in the model.
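The token-versus-word mismatch can be seen with a toy greedy subword tokenizer. This is an illustrative sketch, not GPT's actual BPE (whose merge rules are learned from data), but it shows the relevant property: token boundaries need not line up with word boundaries, and nothing in the token stream marks where a "subject" begins or ends.

```python
# Toy greedy subword tokenizer (illustrative; GPT's real BPE learns its
# vocabulary from data). Segments text by longest vocabulary match.

TOY_VOCAB = ["guard", "rail", "ing", "un", "friend", "ly", " ", "s"]

def tokenize(text):
    """Greedy longest-match segmentation over TOY_VOCAB."""
    tokens = []
    i = 0
    while i < len(text):
        for piece in sorted(TOY_VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize("guardrails"))  # ['guard', 'rail', 's'] — one word, three tokens
print(tokenize("unfriendly"))  # ['un', 'friend', 'ly']
```

One word becomes three tokens; the model's only boundaries are those token seams, and they carry no information about words or subjects.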
So what you end up "guarding" is a semantic region: the neighborhood of the example text, not the subject itself.
So if your training corpus (the text your model was trained on) has useful examples of casual language, like idioms or parts of speech, but those examples happen to be semantically close to taboo subjects, then both the subjects and the language examples will fall on the wrong side of the guardrails.
Writing style is very often unique to narratives and ideologies. You can't simply pick out and "guard against" the subjects or narratives you dislike without also guarding against that writing style.
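The proximity effect above can be sketched as a distance threshold in embedding space. Everything here (the vectors, the labels, the threshold) is invented for illustration; real embeddings are high-dimensional and learned. But the failure mode is the same: the guardrail blocks a region, not a subject, so benign content written in a nearby register gets caught too.

```python
# Hypothetical guardrail: block anything whose embedding is too similar
# to a "taboo" centroid. All vectors and labels below are invented.
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

taboo_centroid = (1.0, 0.2)  # invented "taboo subject" direction

examples = {
    "taboo rant":        (0.95, 0.25),  # the content the guardrail targets
    "casual idiom":      (0.90, 0.35),  # benign, but in the same register
    "technical writeup": (0.10, 0.95),  # far away; passes easily
}

THRESHOLD = 0.97  # block anything this similar to the taboo centroid

for label, vec in examples.items():
    sim = cosine(vec, taboo_centroid)
    verdict = "BLOCKED" if sim >= THRESHOLD else "allowed"
    print(f"{label:18s} similarity={sim:.3f} -> {verdict}")
```

The "casual idiom" lands on the wrong side of the threshold purely because it sits near the taboo region, which is the point: the filter cannot separate the style from the subject.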
The effect is familiar: ChatGPT overuses a verbose technical writing style in its continuations, and often refuses to engage with perfectly appropriate casual prompts. Sometimes it responds to casual language by jumping the guardrails entirely, because that is where the writing style in question lives in the model (in the content of the training corpus), and the guardrails missed a spot.
You don't need to go as far as 4chan to get "unfriendly content". You do need to include examples of casual language to have an impressive language model.
This is one of many problems that arise from the implicit nature of LLMs. They can successfully navigate casual and ambiguous language, but they can never sort the subjects out from the language patterns that carry them.