You're applying reproducibility unevenly, though.
The Linux kernel source code cannot feasibly be reproduced, but it can be copied and modified. The Mistral weights cannot feasibly be reproduced, but they can be copied and modified. Why is the kernel code open source while the Mistral weights are not?
Reproducibility is clearly not the deciding factor.
source code -> compile -> kernel binary. That binary is what can be reproduced, given the source code.
We don't have the equivalent for Mistral:
source code (+ training data) -> training -> weights
Training is too expensive for the training data to be the preferred form for making modifications to the work. Given that, the weights themselves are the closest thing these things have to "source code".
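That asymmetry can be sketched in toy Python (my illustration, not anyone's actual pipeline): given only the released weights, you can still modify the model directly; what you cannot do is re-derive those weights without the unreleased training run.

```python
# Toy sketch: the "weights" are just a list of floats standing in for
# the real released artifact. All numbers here are made up.
weights = [0.12, -0.40, 0.88]

# "Fine tuning": modify the released artifact directly.
# Note that no training data is needed for this step.
fine_tuned = [w + 0.01 for w in weights]

# Re-deriving the weights, by contrast, would require the full
# (unreleased, prohibitively expensive) pipeline:
#   training data -> training run -> weights
assert fine_tuned != weights
```

The point of the sketch is only that the weights, like source code, are an artifact you can copy and modify as-is, independently of whatever process produced them.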
And this is where the reproducibility argument falls apart: on what basis can we insist that the preferred form for modifying an LLM (the weights) must be reproducible to be open source but the preferred form for modifying a piece of regular software (the code) can be open sourced as is, with none of the processes used to produce the code?
For the weights to embed all the training data in the model, some information must, by definition, be lost. That data can't be recovered, no matter how much you fine tune the model. And because we can't recover it, we don't know how alignment gets set, or the extent of it.
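A toy analogy for that lossiness (mine, not the actual training math): training compresses many examples into far fewer parameters, like reducing a dataset to a summary statistic. The summary can be used and adjusted, but the originals cannot be recovered from it.

```python
# Hypothetical illustration: compressing data into a summary loses information.
training_data = [3.0, 5.0, 4.0, 8.0]

# The "model": a single summary statistic standing in for the weights.
model = sum(training_data) / len(training_data)

# Many different datasets produce the same "model"...
other_data = [5.0, 5.0, 5.0, 5.0]
assert sum(other_data) / len(other_data) == model

# ...so the original data cannot be recovered from the model alone.
```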
The closest thing these things have to source code is the source code and training data used to create the model, because that's what's used to create the model. How big a system is necessary to train it doesn't factor in. It used to take many days to compile the Linux kernel, and at the time many people didn't have access to systems that could even compile it.
First, licenses matter. Photoshop.exe is closed source first and foremost because the license says so.
Second and more importantly for this discussion, Adobe doesn't prefer to work with hexedit, they prefer to work with the source code.
OpenAI prefers to fine tune their existing models rather than train new ones. They fine tune regularly, and have only trained from scratch four times total, with each of those being a completely new model, not a modification.
That means the weights of an LLM are the preferred form for modification, which meets the GPL's definition of 'source code':
> The “source code” for a work means the preferred form of the work for making modifications to it.
Now I get that “Identical” is a bit more nebulous when it comes to LLMs due to their inherent nondeterminism, but let’s take it to mean the executable itself, not the results produced by the executable.
No, I'm using the strict definition "capable of being reproduced", where reproduce means "to cause to exist again or anew". In and of itself the word doesn't comment on whether you're starting from source code or anything else, it just says that something must be able to be created again.
Yes, in the context of compilation this tends to refer to reproducible builds (which is a whole rabbit hole of its own), but here we're not dealing with two instances of compilation, so it's not appropriate to use the specialized meaning. We're dealing with two artifacts (a set of C files and a set of weights) that were each produced in different ways, and we're asking whether each one can be reproduced exclusively from data that was released alongside the artifact. The answer is that no, neither the source files nor the weights can be reproduced from the data that was released alongside them.
So my question remains: on what basis can we say that the weights are not open source but the C files are? Neither can be reproduced from data released alongside them, and both are the preferred form which the original authors would choose to make modifications to. What distinguishes them?
Now I know it seems like I'm taking the opposite side of my original take here, but come on - you can't genuinely believe that being unable to immediately produce a byte-for-byte copy of the Linux kernel (even one that behaves 99.999% the same) is even remotely the same as not being able to reproduce an "open" LLM.
It's open source because they licensed the preferred form of the work for making modifications under a FOSS license. That's it. Reproducibility of that preferred form from scratch doesn't factor into it.
I think it goes to show how hard it is to make analogies between the two fields.
Maybe it is just not source at all. Open or closed.
It is data. Like a CSV of addresses and coordinates that were collated from different sources that, say, are no longer available.
It is a very philosophical topic. What if machines got faster and you could train Llama in 5 minutes, and an SSD could hold all the training data? Then it would feel more like a compiled artifact than like data. Not releasing the training data would then feel like hiding something.