There is still a lot of modifying you can do with a set of weights, and they make great foundations for new stuff, but yeah we may never see a competitive model that's 100% buildable at home.
Edit: mkolodny points out that the model code is shared (under llama license at least), which is really all you need to run training https://github.com/meta-llama/llama3/blob/main/llama/model.p...
Zuck knows this very well, and it does him no honour to speak like this; from his position, it amounts to an attempt to change the present semantics of open source. Of course, others do that too, using the notion of open source to describe something very far from open.
What Meta is doing under his command is better described as releasing the resulting...build, so that it can be freely poked around and even put to work. But the result cannot effectively be reverse engineered.
What's more ridiculous is that these graphical structures can be made available precisely because the result is not the source in its whole form. It is only thanks to the fact that it is not traceable back to the source, which makes the whole game not only closed, but, like... sealed forever. An unfair retelling of humanity's knowledge, tossed around in a very obscure container that nobody can reverse engineer.
How's that even remotely similar to open source?
With LLMs, weights are the binary code: they're how you run the model. But to be able to train the model from scratch, or to collaborate on new approaches, you have to operate at the level of architecture, methods, and training data sets. Those are the source code.
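A toy sketch of that distinction (plain Python, nothing to do with real LLM internals; the two-parameter "model" and the data set are invented for illustration): inference needs only the released weights, but reproducing those weights needs the training loop, the hyperparameters, and the data, which is exactly what stays closed.

```python
def predict(weights, x):
    # "Inference": all a user needs is the released weights.
    return weights["w"] * x + weights["b"]

def train(data, lr=0.05, steps=1000):
    # "Training": requires the data and the hyperparameters (lr, steps).
    # Releasing only the output of this function is releasing the build,
    # not the source.
    w, b = 0.0, 0.0
    for _ in range(steps):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return {"w": w, "b": b}

data = [(1, 3), (2, 5), (3, 7)]   # the hidden "training set" (here: y = 2x + 1)
weights = train(data)             # the "build" step nobody outside can rerun
print(predict(weights, 10))       # close to 21, since the fit approaches y = 2x + 1
```

Given only the `weights` dict you can run `predict` all day, but without `data` and `train` you cannot rebuild or meaningfully retrain it, which is the point being made above.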
There are a bunch of independent, fully open source foundation models from companies that share everything (including all data). AMBER and MAP-NEO for example. But we have yet to see one in the 100B+ parameter category.
Using open data and dclm: https://github.com/mlfoundations/dclm
Linux doesn't ship you the compiler you need to build the binaries either, that doesn't mean it's closed source.
LLMs are fundamentally different to software and using terms from software just muddies the waters.
Linux sources :: dataset that goes into training
Linux sources' build confs and scripts :: training code + hyperparameters
GCC :: Python + PyTorch or whatever they use in training
Compiled Linux kernel binary :: model weights
They're still software, they just don't have source code (yet).
If I self host a project that is open sourced rather than paying for a hosted version, like Sentry.io for example, I don't expect data to come along with the code. Licensing rights are always up for debate in open source, but I wouldn't expect more than the code to be available and reviewable for anything needed to build and run the project.
In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available. I'm not actually sure if Meta does share all that, but training data is separate from open source IMO.
This is in contrast to a compiled binary or obfuscated source image, where alteration may be possible with extraordinary skill and effort but is not expected and possibly even specifically discouraged.
In this sense, weights are entirely like those compiled binaries or obfuscated sources rather than the source code usually associated with "open source".
To be "open source" we would want LLM's where one might be able to manipulate the original training data or training algorithm to produce a set of weights more suited to one's own desires and needs.
Facebook isn't giving us that yet, and very probably can't. They're just trading on the weird boundary state of the term "open source" -- it still carries prestige and garners good will from its original techno-populist ideals, but is so diluted by twenty years of naive consumers who just take it to mean "I don't have to pay to use this" that the prestige and good will is now misplaced.
They only give you a blob of data you can run.
Meta shares the code for inference but not for training, so even if we say it can be open-source without the training data, Meta's models are not open-source.
I can appreciate Zuck's enthusiasm for open-source but not his willingness to mislead the larger public about how open they actually are.
"The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed."
> In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available
The M in LLM is for "Model".
The code you describe is for an LLM harness, not for an LLM. The code for the LLM is whatever a developer needs in order to modify the inputs and then build a modified output LLM (minus standard, generally available tools not custom-created for that product).
Training data is one way to provide this. Another way is some sort of semantic model editor for an interpretable model.
I am still amazed that we can do that.
In a better world, there would be no “I ran some algos on it and now it’s mine” defense.
If you have open data and open-source code, you can reproduce the weights.
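In principle, yes: training is a deterministic function of data, code, and the random seed, so two runs with all three in hand produce the same weights. A toy sketch (the "training" update here is invented for illustration; real large-scale runs add nondeterminism from GPU kernels and parallelism, so bitwise reproduction is harder in practice):

```python
import random

def train(data, seed=0):
    # Toy "training" run: with identical data, code, and seed,
    # two independent runs yield bit-identical weights.
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(3)]  # seeded init
    for x in data:
        # Fake update step, driven only by the data and the seeded RNG.
        weights = [w + 0.1 * x * rng.random() for w in weights]
    return weights

data = [0.5, -1.0, 2.0]
print(train(data) == train(data))  # True: the "build" is reproducible
```

That reproducibility is exactly what is lost when only the weights are published: without the data and the training code, no one else can rerun the build.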
Has that changed?