This was such a helpful way to frame the problem! Something felt off about the "open source models" out there; this highlights the problem incredibly well.
As much as I appreciate Mis(x)tral, I would've loved it even more if they released code for gathering data.
The LLM inference engine (architecture implementation) is like a kernel driver that loads a firmware binary blob, or a virtual machine that loads bytecode. The inference engine is open source. The problem is that the weights (firmware blobs, VM bytecodes) are opaque: you don't have the means to reproduce them.
The Linux community has long argued that drivers that load firmware blobs are cheating: they don't count as open source.
Still, the "open source" LLMs are more open than "API-gated" LLMs. It's a step in the right direction, but I hope we don't stop there.
> The “Corresponding Source” for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work.
...
> You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License
Model weights alone are not Corresponding Source. In order to distribute a model you made under the GPLv3, you would have to give users the model weights and the scripts needed to turn the model weights into a runnable model. That's assuming you only work with the model weights when modifying the model. If, in particular, you retrain the model as part of modifying it, then you would have to provide the training data and initial training scripts as well.
Even though I wrote about a particular free software license that happens to also be an open source license, the open source definition from the Open Source Initiative likewise refers to the preferred form for modifying the work [2]:
> The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
For good measure, here is the relevant excerpt from the free software definition from the Free Software Foundation [3]:
> Obfuscated “source code” is not real source code and does not count as source code.
> Source code is defined as the preferred form of the program for making changes in. Thus, whatever form a developer changes to develop the program is the source code of that developer's version.
> Freedom 1 includes the freedom to use your changed version in place of the original. If the program is delivered in a product designed to run someone else's modified versions but refuse to run yours—a practice known as “tivoization” or “lockdown,” or (in its practitioners' perverse terminology) as “secure boot”—freedom 1 becomes an empty pretense rather than a practical reality. These binaries are not free software even if the source code they are compiled from is free.
The FSF's free software definition requires that the user be practically - not merely theoretically - able to modify the source code and turn it into a running program. Because of that, the free software definition considers build scripts part of the source code. I can't find an explicit analogue of the practically-modifiable requirement in the open source definition, but I think providing the model weights without the scripts needed to turn them into a functioning copy of the existing model would be obfuscation, i.e. a violation of the open source definition.
[1] https://www.gnu.org/licenses/gpl-3.0.en.html
[2] https://opensource.org/osd
[3] https://www.gnu.org/philosophy/free-sw.en.html
Android is not open source.
This analogy is bad. Models are unlike code bases in this way.
What if I wanted to train it using only half of its training set? If the inputs used to generate the released weights are not available, I can't do that: I have the weights and the model structure, but without the training dataset there is no way to redo the training.
To riff on the parent post, I have:
Source + Compiler => Binaries
For the vast majority of open source models I have: [unavailable inputs] + Model Structure => Weights
They’re not exactly the same as the source code/binary scenario because I can still do this (which isn’t generally possible with binaries): Model Structure + Weights + [my own training data] => New Weights
Another way to look at it is that with source code I can modify the code and recompile it from scratch. Maybe I think the model author should have used a deeper CNN layer in the middle of the model; without the inputs I can't do the comparison. The models are the compiler + makefiles. The dataset is the code.
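The last arrow (Weights + my own training data => New Weights) can be sketched with a toy one-parameter model; everything below is illustrative, not any real training setup. I can nudge the released weight with my own data, but I cannot regenerate it, because the original inputs are the missing term on the left-hand side.

```python
def sgd_step(w, data, lr=0.1):
    """One gradient step for a one-parameter linear model y = w*x
    under squared error -- a stand-in for 'fine-tuning'."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

released_w = 1.5                     # the published weight
my_data = [(1.0, 2.0), (2.0, 4.0)]   # my own training pairs (true slope is 2)

w = released_w
for _ in range(50):
    w = sgd_step(w, my_data)
print(round(w, 3))  # converges toward 2.0
```

Starting from someone else's `released_w` works fine, but reproducing `released_w` itself would require the original data and recipe, which is exactly what most "open" models withhold.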
A slightly offtopic complaint, but too often I have seen tutorials for open source stuff (coughopenglcough) where they don't provide the proper commands to compile and link everything required to build it. Figuring it out makes the "getting started" portion even more tedious.
Can the public compete? What percentage of the technical public could we expect to participate, and how much data, compute, and data quality improvement could they bring to the table? I suspect that large corporations are at least an order of magnitude advantaged economically.
On their portal, they don't just dump the data and leave you to it. They've got guides on analysis and the necessary tools (mostly open source stuff like ROOT [3] and even VMs). This means anyone can dive in. You could potentially discover something new or build on existing experiment analyses. This setup, with open data and tools, ticks the boxes for reproducibility. But does it mean people need to recreate the data themselves?
Ideally, yeah, but realistically, while you could theoretically rebuild the LHC (since most technical details are public), it would take an army of skilled people, billions of dollars, and years to do it.
This contrasts with open source models, where you can retrain on the data to reproduce the weights, but getting hold of the data, and the compute cost of reproducing the weights, is usually prohibitive. I get that CERN's approach might seem to counter this, but remember: they're not releasing the raw data (which is mostly noise) but a more refined version. If they did, good luck downloading several petabytes of it. For training something like an LLM, though, you might need the whole dataset, which in many cases has its own problems with copyright, etc.
[1] https://opendata.cern.ch/docs/terms-of-use
[2] https://opendata.cern.ch/docs/lhcb-releases-entire-run1-data...
[3] https://root.cern/
If a human knows a song "by heart" (imperfectly), it is not considered copyright infringement.
If an LLM knows a song as part of its training data, then it is copyright infringement.
But what if you developed a model with no prepared training data and forced it to learn from its own sensory inputs? Instead of shoveling it bits, you played it this particular song and it (imperfectly) recorded the song with its sensory input device, the same way humans listen to and experience music.
Is the latter learning model infringing on the copyright of the song?
No it isn't. You can feed whatever you want into your LLM, including copyrighted data. The issues arise when you start reproducing or distributing copyrighted content.
https://deepdive.opensource.org/
I encourage you to go check out what's already being done here. I promise it's way more nuanced than anything that is going to fit in a tweet.
Anyway, to go on a tangent, some day maybe with zero knowledge proofs we will be able to prove that a given pretrained model was indeed the result of training using a given dataset, in a way that can be verified vastly cheaper than training the model itself from scratch. (This same technique could also be applied to other things like verifying if a binary was compiled from a given source with a given compiler, hopefully verified in a cheaper way than compiling and applying all optimizations from scratch).
If this ever materializes, then we can just demand proofs.
Here's a study on that:
https://montrealethics.ai/experimenting-with-zero-knowledge-...
And here is another:
https://dl.acm.org/doi/10.1145/3576915.3623202
For an AI model, that means the model itself, the dataset, and the training recipe (e.g. the process and hyperparameters), often also released as source code. With that (and a lot of compute) you can train the model to get the weights.
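As a sketch of what "releasing the recipe" could look like, here is a hypothetical artifact that would need to ship alongside the weights; every field and value is illustrative, not drawn from any real release.

```python
# A hypothetical 'training recipe': the dataset identity, the
# preprocessing pipeline, and the hyperparameters, all pinned down
# well enough that someone with the data and compute could rerun it.
recipe = {
    "dataset": {"name": "example-corpus", "sha256": "...", "tokens": 1_000_000_000},
    "preprocessing": ["dedupe", "language-filter:en", "tokenize:bpe-32k"],
    "hyperparameters": {"lr": 3e-4, "batch_size": 1024, "steps": 100_000, "seed": 42},
    "hardware": {"accelerators": 64, "precision": "bf16"},
}
```

Most "open" releases today publish at best the last line of such a recipe; without the first two, the weights cannot be regenerated.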
"it’s hard to verify that the model has no backdoors (eg sleeper agents)" Again given the size of the datasets and the opaque way training works, I am skeptical that anyone would be able tell if there is a backdoor in the training data.
"impossible to verify the data and content filter and whether they match your company policy" I don't totally know what this means. For one, you can/probably should apply company policies to the model outputs, which you can do without access to training data. Is the idea that every company could/should filter input data and train their own models?
"you are dependent on the company to refresh the model" At the current cost, this is probably already true for most people.
"A true open-source LLM project — where everything is open from the codebase to the data pipeline — could unlock a lot of value, creativity, and improve security." I am overall skeptical that this is true in the case of LLMs. If anything, I think this creates a larger surface for bad actors to attack.
>I am overall skeptical that this is true in the case of LLMs
This skepticism seems reasonable. EleutherAI have documentation to reproduce training (https://github.com/EleutherAI/pythia#reproducing-training). So far I haven't seen it leading to anything. Lots of arxiv papers I've seen complain about time and budget constraints even for finetunes, let alone pretraining.
The backdoors I'd think of are sneaky trigger words (maybe not even English) that all of a sudden cause it to emit NSFW outputs. Microsoft's short-lived @TayandYou comes to mind (but I don't think anyone's making that mistake again, where multiple users' sessions are pooled).
https://www.marble.onl/posts/considerations_for_copyrighting...
The equivalent would be someone who gives you only the binary of LibreOffice. That's perfectly fine for editing documents and spreadsheets, but suppose you want to fix a bug in LibreOffice? Having only the binary makes that quite difficult.
Similarly, suppose you find that the model has a bias toward labeling African Americans as criminals, or women as lousy computer programmers. If all you have is the weights of the trained model, how easily can you fix the model? And how does that compare with running emacs on the LibreOffice binary?
What you can't easily do is retrain from scratch using a heavily modified architecture or different training data preconditioning. So yes, it is valuable to have dataset access and compute to do this and this is the primary type of value for LLM providers. It would be great if this were more open — it would also be great if everybody had a million dollars.
I think it's pretty misguided to put down the first type of value and openness when honestly they're pretty independent, and the second type of value and openness is hard for anybody without millions of dollars to access.
That's textbook fine-tuning and is basically trivial. Adding another layer and training that is many orders of magnitude more efficient than retraining the whole model and works ~exactly as well.
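A toy caricature of that freeze-the-base approach, with all numbers illustrative: the released weight is never updated; only a small added "head" is trained on new data.

```python
def finetune_head(base_w, data, steps=100, lr=0.01):
    """Train only a head scale h in y = h * (base_w * x),
    leaving the released base weight untouched (frozen)."""
    h = 1.0
    for _ in range(steps):
        grad = sum(2 * (h * base_w * x - y) * base_w * x
                   for x, y in data) / len(data)
        h -= lr * grad
    return h

base_w = 2.0                      # released weight: frozen, never updated
data = [(1.0, 6.0), (2.0, 12.0)]  # my data follows y = 6x
h = finetune_head(base_w, data)
print(round(h, 3))  # h converges to 3.0, so h * base_w matches the slope 6
```

The gradient only ever flows into `h`, which is why this kind of adaptation is so much cheaper than retraining `base_w` from scratch.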
Models are data, not instructions. Analogies to software are actively harmful. We do not fix bugs in models any more than we fix bugs in a JPEG.
In other words, there should be a reasonable line past which a model is called open source. In the extreme view, it's when the model, the training framework, and the data are all available for free. That would mean an open source model can be trained only on public domain data, which makes the class of open source models very, very limited.
More realistic is to make the code and the weights available, so that with some common knowledge a new model can be trained, or an old one fine-tuned, on available data. Important note: the weights cannot be reproduced even if the original training data is available. It will always be a new model with (slightly) different responses.
When they get bought by Oracle and progress slows to a crawl because it's not profitable enough to interest them, you can't exactly do a LibreOffice. Or they can turn around and say "license change, future versions may not be used for <market that controlling company would like to dominate>" and now you're stuck with whatever old version of the model while they steamroll your project with newer updates.
Open weights are worth nothing in terms of long term security of development, they're a toy that you can play with but you have no assurances of anything for the future.
You may complain that subsequent models are not iterative on the past and so having that old version doesn’t help; but then the data probably changes too so having the old data would largely leave you with the same old model.
These AI/ML models are interesting in that the weights are derived from something else (training set), but if you're modifying them you don't need that. Lots of "how to do fine-tuning" tutorials floating around, and they don't need access to the original training set.
Is training nondeterministic? I know LLM outputs are purposely nondeterministic.
Mamba has a version trained on the publicly available SlimPajama. RedPajama-INCITE was trained on the non-slimmed version of the dataset (it's only one dataset).
I'm not sure if training scripts are available.
Pythia definitely has scripts. However it was trained on the pile, so you have to find books3 on your own.
Also I believe LLM360 is an explicit attempt to do it with llama.
>Is training nondeterministic?
Correct. The Torch documentation has a section on the reproducibility of training.
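One concrete source of that nondeterminism, beyond the Torch docs: floating-point addition is not associative, so parallel reductions that accumulate gradients in whatever order threads happen to finish can give a different sum on every run.

```python
# Summing the same values in two different orders: float addition
# is not associative, so the results differ -- the same effect makes
# parallel gradient accumulation on a GPU nondeterministic.
vals = [0.1, 1e16, -1e16, 0.1]
a = sum(vals)           # -> 0.1: the first 0.1 is absorbed, the last survives
b = sum(sorted(vals))   # -> 0.0: both 0.1s are absorbed by the huge values
print(a, b, a == b)     # prints: 0.1 0.0 False
```

Deterministic training modes work by forcing a fixed reduction order, which is also why they usually cost performance.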
Many don't offer any information; some do offer information but provide no new techniques, having just thrown a bunch of compute and some data at making a sub-par model that shows up on a specific leaderboard.
Everyone is trying to save a card up their sleeve so they can sell it. And showing up on scoreboards is a great advertisement.
Open-source means open source; it does not make reproducibility guarantees. You get the code and you can use the code. Pushed to the extreme, this is like saying Chromium is not open-source because my 4GB laptop can't compile it.
Getting training code for GPT-4 under MIT would be mostly useless, but it would still be open source.
Not really, an analog would be if Chromium shipped LLVM IR as its source but no one could get any version of LLVM to output the exact same IR no matter what configurations they tried, and thus any "home grown" Chromium was a little off.
1. Can I download it?
2. Can I run it on my hardware?
3. Can I modify it?
4. Can I share my modifications with others?
If the answers to those questions are in the affirmative, then I think most people consider it open enough, and it is a huge step for freedom compared to models such as OpenAI's.
The potential challenge arises in the future. Today's models will probably look weak compared to the models we'll have in 1, 3, or 10 years, which means today's models will likely be irrelevant years hence. Every competitive "open" model today is tied closely to a controlling organization, whether it's Meta, Mistral.AI, TII, 01.AI, etc.
If they simply choose not to publish the next iteration of their model and follow OpenAI's path that's the end of the line.
A truly open model could have some life beyond that of its original developer/organization. Of course it would still take great talent, updated datasets, and serious access to compute to keep a model moving forward and developing but if this is done in the "open" community then we'd have some guarantee for the future.
Imagine if Linux was actually owned by a for-profit corporation and they could simply choose not to release a future version AND it was not possible for another organization to fork and carry on "open" Linux?
Similarly, if I have a LLM model with a permissive license, I technically could fine-tune it to modify its behavior, but for some kinds of modifications I'd really rather re-run (parts of) the training differently.