Something isn’t open-source because you get everything that went into making it. Something is open-source if you can change it (relatively) easily. The GPL and open-source definition both refer to “the preferred form for making modifications”. The preferred form for modification in the Nvidia driver’s case is the source code. The preferred form for modification in this case is the weights.
Open-source as a concept doesn’t really correspond well with LLMs, but to the extent that it does, access to the training data is not required, because that training data is not the preferred form for making modifications.
> that training data is not the preferred form for making modifications.
I definitely disagree with this.
Yes, you can do some supervised fine-tuning (SFT) on an existing model, but if you want to make specific, substantial, targeted changes (less safety? better performance on math and code at the expense of general knowledge?), your best bet is to change the training mixture, and for that you need the original datasets.
Preferred by whom? Sharing models isn't open source, and we're just going to have to keep having this argument. Letting us download the model is a very nice thing for Facebook to do, but you don't get to call it open source if you're not showing us the source! Explicitly, if we can't see the forced alignment, where the model gets its refusal to talk about Tiananmen Square or how to make meth, or its view on whether The Information is a reputable news source, then it's not open. The preferred form of modification is to take the data and train on it. That some people have been able to take the model and tweak it doesn't make that the preferred form.