Hi, new guy here.
I'm the kind of deranged lunatic that reads arXiv papers for fun, and about a week ago I landed on this paper.
It says that larger models suffer much harder from model collapse when trained on their own data, and, after a certain point, at a lower percentage of synthetic data.
This got me curious, given how many AI-generated images (which some claim to be in the billions) and how much AI-generated text are already out there.
How would data scrapers be able to avoid indirectly training newer and larger models on their own output? I doubt they personally curate each line of text they train them on. So do they just ignore it?