Model collapse is already happening (opens in new tab)

(cacm.acm.org)

18 pointszdw13h ago18 comments

18 comments

chromacity13h ago

There's some comedy in this article having all the hallmarks of LLM writing.

justonceokay13h ago

Yeah a typo in the subtitle does not especially inspire confidence

niccl13h ago

you've got me. What's the typo?

1 more reply

SunshineTheCat13h ago

I always find articles like this very odd and nebulous because they act as though AI models are just Google.

Type request, get info.

But that's such a narrow/one dimensional view of how LLMs are used. They can gather data or write an article, but that's probably a minority of use cases.

People have casual conversations with them, code written, brainstorming sessions, dictating a voice-recorded note, and the list goes on.

While data its getting trained on is important, the supposition is that this data consists only of what sits out there on the interwebs.

That as oppose to user input/interaction which, I'm guessing, has a pretty large role in training models. Maybe even more so in some cases than AI-written blog spam.

kimi13h ago

I have a pet-peeve with this. As a non-native English speaker, I find it very useful to dictate multiple notes, in different languages, and have the LLM produce clear English prose out of it. The prose may be LLM-generated, but I edit it when needed to make sure that the contents is 100% mine.

It's like dictating to a typist like they did in the 60's - he will make sure that your letter looks professional and will fix your grammar, but you will sign the letter. This is totally different from LLM spam, the kind that inflates a sentence into a three-page article full of nothing.

So - is it a problem if the language reverts to a mean? that is the point of a shared language, right?

mvdtnz10h ago

It's not just the language that reverts to a mean, it's the knowledge embedded in the model. If you're interested in discussing niche topics with ChatGPT, the further the model collapses the less likely you are to get meaningful results from the "tail" - the areas of knowledge that fall at the far ends of the model's bell curve.

kimi3h ago

Actually, both will, as they are not separate within the LLM. The thing is, one is a style issue, the other content. You can express original ideas and still use a lot of em dashes, or produce slop with a lot of typos in it.

FeepingCreature13h ago

Source: a bad study from 2023.

slowmovintarget12h ago

Why is the study bad?

https://www.nature.com/articles/s41586-024-07566-y

FeepingCreature8h ago

Because they exclusively used a model that was about as big as the original GPT-2.

Which, I mean, fair enough within these constraints, but it's cited like it's a universal law.

Really all that can be taken away from the study is "we trained a very small model on data generated from it in a particular way, and this was eventually harmful for the model."

Also note that models are nowadays trained on massively self-generated data (task RL post-training) and it seems to significantly improve their performance.

levocardia13h ago

Evidence: trust me bro. Really, where is the actual evidence that models are "collapsing" from too much AI-generated training material? Evals are up, subjective perception of model usefulness is up (for me, certainly), and if anything the slop levels are down, or at least stable. I find it hard to believe that seven-figure software engineers at top labs aren't being careful about how much post-ChatGPT-era internet content is going into their training data.

jrmg13h ago

I find it hard to believe that seven-figure software engineers at top labs aren't being careful about how much post-ChatGPT-era internet content is going into their training data.

I agree - but as the Internet descends into all-slop-all-the-time (seriously, just do a search for reviews or travel advice or technical questions -or most anything - to see it), where do you expect the high quality training material on future things to come from? I have a hard time imagining it.

ctoth12h ago

Your Claude Code sessions. Every interaction. Every time the model is asked to do something and then gets feedback on that something (this didn't work I got this traceback)

Textbooks, company wikis, news corpora, structured reports of all kinds from far more sources than what is available on the web.

1 more reply

j / k navigate · click thread line to collapse

18 comments

chromacity13h ago

There's some comedy in this article having all the hallmarks of LLM writing.

justonceokay13h ago

Yeah a typo in the subtitle does not especially inspire confidence

niccl13h ago

you've got me. What's the typo?

1 more reply

SunshineTheCat13h ago

I always find articles like this very odd and nebulous because they act as though AI models are just Google.

Type request, get info.

But that's such a narrow/one dimensional view of how LLMs are used. They can gather data or write an article, but that's probably a minority of use cases.

People have casual conversations with them, code written, brainstorming sessions, dictating a voice-recorded note, and the list goes on.

While data its getting trained on is important, the supposition is that this data consists only of what sits out there on the interwebs.

That as oppose to user input/interaction which, I'm guessing, has a pretty large role in training models. Maybe even more so in some cases than AI-written blog spam.

kimi13h ago

So - is it a problem if the language reverts to a mean? that is the point of a shared language, right?

mvdtnz10h ago

kimi3h ago

FeepingCreature13h ago

Source: a bad study from 2023.

slowmovintarget12h ago

Why is the study bad?

https://www.nature.com/articles/s41586-024-07566-y

FeepingCreature8h ago

Because they exclusively used a model that was about as big as the original GPT-2.

Which, I mean, fair enough within these constraints, but it's cited like it's a universal law.

Really all that can be taken away from the study is "we trained a very small model on data generated from it in a particular way, and this was eventually harmful for the model."

Also note that models are nowadays trained on massively self-generated data (task RL post-training) and it seems to significantly improve their performance.

levocardia13h ago

jrmg13h ago

I find it hard to believe that seven-figure software engineers at top labs aren't being careful about how much post-ChatGPT-era internet content is going into their training data.

ctoth12h ago

Your Claude Code sessions. Every interaction. Every time the model is asked to do something and then gets feedback on that something (this didn't work I got this traceback)

Textbooks, company wikis, news corpora, structured reports of all kinds from far more sources than what is available on the web.

1 more reply

j / k navigate · click thread line to collapse