Perils and Promises of Synthetic Data in a Self-Generating World

(arxiv.org)

1 point · Wheatman · 1y ago · 1 comment


So, going through arXiv again, I found a paper talking about a very similar issue to the last one I posted about, so I decided to post it here again.

This one goes into more detail about accumulating data, and about what they call "accumulate-subsample", which keeps the amount of training data the same from one model to the next.

(Please note: I'm not an unbiased observer, and I could very well be misreading or misrepresenting the paper, so take my summary with a grain of salt.)

They confirm results that were already somewhat established:

1-Accumulate: leads to little or no degradation (though they don't mention whether performance improves at all [however you might measure that], despite the training set growing with every generation).

2-Replace: the same old story, model collapse happens very quickly.

3-Accumulate-subsample: deteriorates faster than accumulate but slower than replace, and often converges a fair bit higher (i.e. worse) than accumulate.
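
To make the three regimes concrete, here is a rough toy sketch of how I read them. It is not the paper's code or experimental setup: the one-dimensional Gaussian "model", the names `build_training_set` and `run`, and all the sizes are my own assumptions for illustration. Each "model" just fits a mean and standard deviation to its training set and then samples the next synthetic batch from that fit.

```python
import random
import statistics


def build_training_set(regime, real, generations, rng):
    """Return the data the next model trains on (hypothetical helper).

    real:        list of real samples
    generations: list of synthetic batches, one per earlier model, oldest first
    """
    if regime == "replace":
        # Only the newest synthetic batch survives; real data is seen once.
        return list(generations[-1]) if generations else list(real)

    # Both accumulate variants keep real data plus every synthetic batch.
    pool = list(real)
    for batch in generations:
        pool.extend(batch)

    if regime == "accumulate":
        return pool  # training set grows every round
    if regime == "accumulate_subsample":
        # Same pool, but a fixed-size random draw, so the amount of
        # training data stays constant from model to model.
        return rng.sample(pool, len(real))
    raise ValueError(f"unknown regime: {regime}")


def run(regime, n=200, rounds=5, seed=0):
    """Toy loop: fit a Gaussian, sample from it, repeat (assumed setup)."""
    rng = random.Random(seed)
    real = [rng.gauss(0.0, 1.0) for _ in range(n)]
    generations, sizes, sigmas = [], [], []
    for _ in range(rounds):
        train = build_training_set(regime, real, generations, rng)
        mu = statistics.fmean(train)
        sigma = statistics.stdev(train)
        sizes.append(len(train))
        sigmas.append(sigma)
        # The fitted "model" generates the next synthetic batch.
        generations.append([rng.gauss(mu, sigma) for _ in range(n)])
    return sizes, sigmas


if __name__ == "__main__":
    for regime in ("accumulate", "replace", "accumulate_subsample"):
        sizes, sigmas = run(regime)
        print(regime, sizes, [round(s, 3) for s in sigmas])
```

The only thing this really shows is the bookkeeping: accumulate trains on an ever-growing set, while replace and accumulate-subsample both train on a fixed amount of data and differ only in where that data comes from. A handful of rounds of Gaussian fitting says nothing reliable about the collapse behaviour the paper measures.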

I wonder how many of the LLM-generated articles and AI images are properly tagged, and how much processing power is being spent training on low-quality or sometimes synthetic data that could be pruned with better data management. I fear for how many hundreds of SEO articles written by one dude and an LLM already pollute the available training data.
