Sort of a cyber-Kessler syndrome, basically. You really don't want AI-generated content in your AI training material; it's probably not generating signal for building future models unless it's undergone further refinement that adds value. An artist iterating on AI artwork is adding signal, and a batch of artist-curated but not iterated AI artworks probably adds a small amount of signal. But unrefined blogspam and trivial "this one looks cool" selections probably reduce signal when you consider the overall output. The AI training process is stable and tolerant of a certain amount of AI content, but if you fed in a large portion of unrefined second- or third-order AI content, you would probably get a worse overall result.
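The degradation from training on your own outputs can be sketched with a toy experiment (purely illustrative, not any lab's actual pipeline): fit a Gaussian to some data, generate "synthetic" samples from the fit, keep only the most typical-looking ones (the "this one looks cool" curation step), refit on those, and repeat. Each generation loses spread:

```python
import random
import statistics

random.seed(0)

def fit(samples):
    """'Train' a trivial model: estimate mean and spread of the data."""
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mu, sigma, n):
    """'Sample' from the trained model."""
    return [random.gauss(mu, sigma) for _ in range(n)]

def curate(samples, keep):
    """Keep the most typical-looking outputs: those closest to the mean."""
    mu = statistics.mean(samples)
    return sorted(samples, key=lambda x: abs(x - mu))[:keep]

# Generation 0: "organic" data from the true distribution N(0, 1).
data = [random.gauss(0.0, 1.0) for _ in range(1000)]
stdevs = []
for generation in range(10):
    mu, sigma = fit(data)
    stdevs.append(sigma)
    # Next generation trains purely on the previous model's curated output.
    data = curate(generate(mu, sigma, 1000), keep=500)

print(f"gen 0 spread: {stdevs[0]:.3f}  gen 9 spread: {stdevs[-1]:.6f}")
```

The diversity of the data collapses toward a point within a handful of generations. It's a one-dimensional cartoon, but the mechanism is the same one that makes unrefined second- and third-order AI content poisonous as training input.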
Watermarking Stable Diffusion output by default looks like an extremely smart move in hindsight. It's trivial to remove, but at least people have to go to the effort of doing so, and that will be a small minority of users. The bigger problem is that you can't really watermark text (again, unless it's called out with a "beep boop I am a robot" tag on Reddit or similar), and you can already see AI-generated text getting picked up by search engines and other aggregators. This is the "debris is flying around and starting to shatter things" stage of the Kessler syndrome.
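For a feel of why an invisible watermark is both useful by default and trivial to strip, here's a minimal least-significant-bit sketch (illustrative only; Stable Diffusion's actual default uses a more robust DWT/DCT-based scheme via the `invisible-watermark` package, and the tag string here is made up):

```python
WATERMARK = "SDV1"  # hypothetical tag to embed

def embed(pixels, tag=WATERMARK):
    """Hide `tag` in the least-significant bits of the first len(tag)*8 pixels."""
    bits = [(byte >> i) & 1 for byte in tag.encode() for i in range(8)]
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit  # overwrite the LSB, invisible to the eye
    return out

def extract(pixels, length=len(WATERMARK)):
    """Read the tag back out of the LSBs."""
    data = bytearray()
    for b in range(length):
        byte = 0
        for i in range(8):
            byte |= (pixels[b * 8 + i] & 1) << i
        data.append(byte)
    return data.decode()

image = [137, 80, 78, 71, 200, 100, 50, 25] * 16  # fake 8-bit pixel buffer
stamped = embed(image)
print(extract(stamped))  # recovers "SDV1"

# ...and the trivial removal: any pass that scrambles LSBs (re-encoding,
# resizing, or just zeroing them) destroys the mark.
stripped = [p & ~1 for p in stamped]
```

The asymmetry is the point: detection is free for everyone downstream, removal costs a deliberate step that most users will never take.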
In the tech world, you already see it with things like those fake review sites that "interpolate" benchmark results without explicitly calling it out as such... people build them because they're cheap and easy to run at scale, and they give you an approximation that is reasonable-ish most of the time for hardware configurations that may never have been explicitly benched. Now imagine that's all content. Wanna search for how to pull a new electrical circuit or fix your washing machine? In the future the top result could easily be AI-generated. Is it right? Maybe...
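The "interpolated benchmark" trick is simple enough to sketch in a few lines (all numbers and configuration names below are invented for illustration):

```python
# Real measured results a fake review site might start from,
# keyed by (GPU tier, resolution): frames per second. Hypothetical data.
measured = {
    (1, 1080): 144.0,
    (3, 1080): 210.0,
    (1, 1440): 96.0,
    (3, 1440): 152.0,
}

def fake_benchmark(tier, resolution):
    """Linearly interpolate an untested GPU tier between two measured tiers."""
    lo = measured[(1, resolution)]
    hi = measured[(3, resolution)]
    t = (tier - 1) / (3 - 1)  # position of the untested tier between the two
    return lo + t * (hi - lo)

# Tier 2 was never actually benched, but the made-up number looks plausible:
print(fake_benchmark(2, 1080))  # 177.0 — reasonable-ish, and entirely invented
```

That's the whole business model: the number is never flagged as an estimate, it's just close enough often enough that nobody notices until it matters.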
Untapped sources of true, organic content are going to become unfathomably valuable, and Archive.org is the trillion-dollar gem. Unfortunately, much like Tumblr, if anybody actually buys it the lawyers are going to have a fit, make them delete everything, and destroy the asset. But the Internet Archive has probably the biggest repository of pre-AI organic content on the planet, and that is your repository of training material. The only things remotely comparable are the Library of Congress and Google's book-scanning project, but those are narrower and focused on specific types of content. You can generally assume almost all content pre-GPT and pre-Stable-Diffusion is organic; since then, generated content is already a significant minority of new content, if not the majority. Like the Kessler syndrome, this is proceeding quickly: mass adoption happened within literally a few years, and the stage is now primed for the cascade event.
The other implication is that people probably need to operate in the mindset that there is an asymptotically bounded amount of provably-organic training content available. It's not that in 10 years we will have 100x the content, because much of that content can't be trusted as input material for further training; a lot of it will be second- or third-order content generated by bots or AI, and that proportion will increase strongly over the next decade. That's not an inherent dealbreaker, but it does have implications for what kinds of training regimes you can build next-next-gen models around. The training set is going to be a lot smaller than people imagine, I think.
The good news is that the internet is relatively good at routing around the shit, for now. And de facto that is something you could apply to your content inputs: what's the PageRank for this content? Actual PageRank, not the advertising/engagement bullshit the search model has turned into. If the AI-generated stuff is correct enough to earn a high PageRank, maybe it's correct enough to be used as an input.
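Actual PageRank, in the original link-structure sense, is small enough to sketch as a quality filter (the link graph and site names below are made up for illustration):

```python
# Hypothetical link graph: who links to whom.
links = {
    "organic-blog":   ["archive-mirror"],
    "archive-mirror": ["organic-blog"],
    "ai-spam-1":      ["ai-spam-2"],
    "ai-spam-2":      ["ai-spam-1"],
    "hub":            ["organic-blog", "archive-mirror", "ai-spam-1"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Classic power-iteration PageRank: rank flows along outgoing links."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Everyone gets the baseline "random jump" share...
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        # ...plus an equal split of each linking page's damped rank.
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new[target] += share
        rank = new
    return rank

ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page:15s} {ranks[page]:.3f}")
```

In this toy graph, the pages the hub and the mirror vouch for end up above the self-linking spam pair, which is the filtering property you'd want from a content-input gate. Of course, a big enough link farm can still game it, which is roughly how we got here.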
But the thing is, there has already been an uptick in ML- or AI-generated content surfacing in searches and other places, and it's not always correct... and honestly the relevance of Google's search results has been noticeably decaying for 10+ years now. Things I know are out there and relevant are no longer being surfaced. Is AI generation contributing to that problem? Maybe. Probably not helping, at least.