If DeepSeek trained off OpenAI, then it wasn't trained from scratch for "pennies on the dollar" and isn't the Sputnik-like technical breakthrough that we've been hearing so much about. That's the news here. Or rather, the potential news, since we don't know if it's true yet.
Using data from another model won't save you any training time.
It's...not, and it's repeatedly been proven in practice that this is an invalid generalization because it is missing necessary qualifications, and it's funny that this myth keeps persisting.
It's probably a bad idea to use uncurated output from another AI to train a model if you are trying to make a better model rather than a distillation of the first model, and it's definitely (and, ISTR, this is the actual research result from which the false generalization developed) a bad idea to iteratively fine-tune a model on its own unfiltered output. But there has been lots of success using AI models to generate data which is curated and used to train other models, which can be much more efficient than trying to create new material without AI once you've gotten to the point where you've already hoovered up all the readily-accessible low-hanging fruit of premade content relevant to your training goal.
Re: "generally a bad idea", I'd just highlight "generally" ;) Clearly it worked in this case!
Ah. So if I understand this... once the internet becomes completely overrun with AI-generated articles of no particular substance or importance, we should not bulk-scrape that internet again to train the subsequent generation of models.
I look forward to that day.
It proves we _can_ optimize our training data.
Humans have been genetically stable for a long time, yet the quality & structure of information available to a child today vs. that of 2000 years ago makes them more skilled at certain tasks. Math is a good example.
That is not true at all.
We have known how to solve this for at least 2 years now.
All the latest state of the art models depend heavily on training on synthetic data.
Google DeepMind's recent Gemini 2.0 Flash Thinking is also priced at the new DeepSeek level. It's pretty good (unlike previous Gemini models).
If OpenAI trained on the intellectual property of others, maybe it wasn't the creativity breakthrough people claim?
Conversely:
If you say ChatGPT was trained on "whatever data was available", and you say DeepSeek was trained on "whatever data was available", then they sound pretty equivalent.
All the rough consensus language output of humanity is now roughly on the Internet. The various LLMs have roughly distilled that, and the results are naturally going to be tighter and tighter. It's not surprising that companies are going to get better and better at solving the same problem. The situation with DeepSeek isn't so much that it promises future achievements but that it shows that OpenAI's string of announcements is incremental progress that isn't going to reach the AGI that Altman now often harps on.
It's not obvious to me that that is the case.
I.e., do you need a SOTA model to produce a new SOTA model?
And just because a model trains on some ChatGPT data doesn't mean that that data is the majority. It's just another dataset.