A (sorta) recent paper about model collapse has got me thinking

(arxiv.org)

3 points · Wheatman · 1y ago · 3 comments

Wheatman (OP) · 1y ago · 2 in thread

Hi, new guy here.

I'm the kind of deranged lunatic that reads arXiv papers for fun, and about a week ago I landed on this paper.

It says that larger models suffer much harder from model collapse when trained on their own data and, after a certain point, from a lower percentage of synthetic data.

This got me curious. With how many AI-generated images (which some claim number in the billions) and how much AI-generated text there is out there, how would data scrapers be able to avoid indirectly training newer and larger models on their own output? I doubt they personally curate each line of text they train on. So do they just ignore it?

pvg · 1y ago

Take a look at this thing https://news.ycombinator.com/newsguidelines.html about titles - random arxiv papers you found interesting along with commentary on why you thought it was interesting are fine things to post, you just can't put the commentary in your post's title.

Wheatman (OP) · 1y ago

Sorry, thanks for telling me.
