This was the source of the "anomalous tokens" phenomenon, where the usernames of prolific counters were yielding weird and unexpected behavior on the OpenAI models.
While definitely an interesting scientific curiosity, is there a reason you'd actually want this in a production model?
EDIT: note that the "tokens" that trigger the "glitch" are not the numbers themselves but the usernames of the people counting on that subreddit (which appear nowhere in the training dataset, due to a cleaning step that removed the "counting" texts).
I think GP agrees with you, and they were being sarcastic to be funny. It's not always easy to tell in a text-based medium.
They mention "When upsampled, we expect SlimPajama to perform equal to or better than RedPajama-1T when training at trillion token scale." But I guess "upsampling" in this case is just explicit duplication of the training data. So the only potential gains would be from the removal of the low-quality data?
> But i guess "upsampling" in this case is just explicit duplication of the training data.
Possibly, but duplication amounts to weighting, which matters for unbalanced training sets and improves results in practice.
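
To make the "duplication is weighting" point concrete, here is a toy sketch in Python. The source names, sizes and repeat factors are made up for illustration and are not taken from SlimPajama; the helper names `upsample` and `source_shares` are hypothetical too.

```python
def upsample(sources, factors):
    """Build a training mix by repeating each source's documents factors[name] times."""
    mix = []
    for name, docs in sources.items():
        mix.extend(docs * factors.get(name, 1))
    return mix


def source_shares(sources, factors):
    """Fraction of the upsampled mix contributed by each source."""
    counts = {name: len(docs) * factors.get(name, 1) for name, docs in sources.items()}
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Hypothetical toy corpus: a large web source and a small second source.
sources = {
    "web": [f"web-{i}" for i in range(90)],
    "books": [f"book-{i}" for i in range(10)],
}
factors = {"web": 1, "books": 3}  # assumption: repeat the small source 3 times

mix = upsample(sources, factors)
shares = source_shares(sources, factors)
print(len(mix), shares)
```

Without upsampling, "books" is 10 of 100 documents (10%). With the factor of 3 it is 30 of 120 (25%), which is the same as drawing from the original sources with per-source sampling weights. No new information is added, but the balance the model sees changes.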