What do you do with the fact that no one (including the companies who do the initial training) modifies the training data when they want to modify the work? Are the weights not the preferred form for modifying a model?
People do take and modify training datasets for new models. It's less common for modifications to foundation models where you aren't also changing the architecture, partly because it isn't necessary for efficient additive changes (the most common kind), and partly because training datasets for foundation models are rarely shared. It is commonly done by first parties when the change involves changing the architecture as well: you can't make an additive change to the existing trained model and need to train from scratch, but you also want to address issues with the training data (expanding its scope, improving quality, etc.) without starting over on the data itself. Meanwhile, there is research on fine-tuning for subtractive changes (removing concepts from a trained model) because, at least for third parties, fine-tuning is available but altering the training data and retraining a foundation model from scratch usually isn't an option.
Certainly, people making derivatives of non-foundation models (LoRAs, fine-tunes, etc.) often reuse and modify the training sets used by earlier non-foundation models of the same type, and model-sharing sites with an open-source preference facilitate dataset sharing to support this.
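To make the "additive change" point concrete, here is a minimal sketch of the low-rank-adapter idea behind LoRA, using toy numpy arrays. All shapes, names, and the rank value are illustrative, not taken from any real model: the frozen pretrained weights stay untouched, and only a small low-rank correction is trained on top.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                             # toy model dimension and LoRA rank (r << d)

W = rng.standard_normal((d, d))         # frozen pretrained weight matrix
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized
                                        # so the adapted model starts out identical
                                        # to the base model

def forward(x):
    # base path plus low-rank correction; only A and B would be trained,
    # which is why no one needs the original training data for this
    return x @ W.T + x @ (B @ A).T

x = rng.standard_normal(d)
# with B zero-initialized, the adapted model matches the base model exactly
assert np.allclose(forward(x), x @ W.T)
```

The point of the sketch is that the update touches d·r + r·d numbers instead of d·d, which is why additive changes are cheap and don't require retraining from scratch — but also why they can only layer corrections on top of choices already baked into W.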
Now scale that up and consider: at what point would such a project start being "FOSS" in your book without actually providing its sources under an appropriate license?
The intention behind "preferred form for modification" is to put you as a user on the same level as the copyright holder. This construct works well in a world where compiling is cheap; where it isn't, it may require some refinement to preserve the intention behind it. The copyright holder could decide to modify the learning set before pressing the "start" button, you can't.
At the moment when the copyright holder stops ever recompiling the code from scratch and starts just patching binaries.
We are at that point with LLMs.
> The intention behind "preferred form for modification" is to put you as a user on the same level as the copyright holder.
Exactly. And the copyright holders for these LLMs do not ever "recompile". They create brand new works that aren't derivatives at all, but when it comes to modifying the existing work they invariably fine-tune it rather than retraining it.
So when the copyright holder considers the work done and stops changing it at all, it's now FOSS too?
I'll repeat myself, as you ignored the important part:
> The copyright holder could decide to modify the learning set before pressing the "start" button, you can't.
Even if the copyright holder does not intend to retrain their model, you are not in the same position as they are. The choices they made at initial training are now ingrained in the model, putting them at an advantage over anyone else who can't inspect and change those choices. They got to decide; you did not. Your only option to reach a similar position is to start from scratch.
If you wanted to write a project in Rust you would have needed to be there at the beginning, too. Same if you wanted to make it a web app versus native. There are dozens and dozens of decisions that can only be made at the beginning of a project and will require completely reworking it if you're receiving it later.
If a project needed to put all future users on equal footing with where the copyright holder was at the beginning of the project in order to be open source, there could be no open source at all. The creator of a project invariably makes decisions that cannot be undone later without redoing all the work.