We have numerous artifacts to reason about:
- The model code
- The training code
- The fine-tuning code
- The inference code
- The raw training data
- The processed training data (which may differ across pre-training stages and fine-tuning!)
- The resultant weights
- The inference outputs (which also need a license)
- The research papers (hopefully the work is described in the literature!)
- The patents (or lack thereof)
The term "open source" is wholly inadequate here. We need a 10-star grading system for this.
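The ten artifacts above map naturally onto such a rubric: one star per openly released artifact. Here's a minimal sketch of how that tally could work; the artifact names and example scores are my own illustration, not an official assessment of any model:

```python
# One star per openly released artifact, mirroring the list above.
ARTIFACTS = [
    "model code",
    "training code",
    "fine-tuning code",
    "inference code",
    "raw training data",
    "processed training data",
    "weights",
    "inference outputs (usable license)",
    "research papers",
    "patents (or none asserted)",
]

def openness_score(open_artifacts: set) -> str:
    """Count how many of the ten artifacts are openly available."""
    stars = sum(1 for a in ARTIFACTS if a in open_artifacts)
    return f"{stars}/10"

# Hypothetical release that opens code, weights, and papers:
example = {"model code", "fine-tuning code", "inference code",
           "weights", "research papers"}
print(openness_score(example))  # → 5/10
```

A real rubric would likely want partial credit (e.g. weights released under a restrictive license), but even this coarse tally says more than the binary "open source" label.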
This is not your mamma's C library.
AFAICT, DeepSeek scores 7/10, which is better than OpenAI's 0/10 (they don't even let you train on the outputs).
This is more than enough to distill new models from.
Everybody is laundering training data, and it's rife with copyrighted material, PII, and pilfered outputs from other commercial AI systems. Because of that, I don't expect we'll see much legally open training data for some time to come. In fact, the first fully open training corpus of adequate size (not something like LJSpeech) is likely to be 100% synthetic or robotically captured.