They genuinely don't. There is a LOT of garbage text out there that they don't want. They want to train on every high-quality piece of human-written text they can get their hands on (where the definition of "high quality" is a major piece of the secret sauce that makes some LLMs better than others), but that doesn't mean every piece of human-written text.