I have the absolute tiniest of violins for this given OpenAI's behaviour vs everyone else's terms of service.
And also “Don’t steal our AI!”
These models will converge and plateau because the datasets are only going to get worse as more of their content is incestuous.
Won’t most AI produced content put out into the public be human curated, thus heavily mitigating this degradation effect? If we’re going to see a full length AI generated movie it seems like humans will be heavily involved, hand holding the output and throwing out the AI’s nonsense.
The vast majority of content will be (is) the fastest and easiest to create - AI slop
It seems counterintuitive but there is some research suggesting that using synthetic data might actually be productive.
And if you search for personal information of Android users, including location, sex, political orientation and location data, it is all there. /s