Why do people keep mislabeling this as Open Source? The whole point of calling something Open Source is that the "magic sauce" of how to build something is publicly available, so I could build it myself if I had the means. But without the training data publicly available, could I train Llama 3.1 even if I had the means? No wonder Zuckerberg doesn't start by defining what Open Source actually means, as then the blog post would have lost all meaning from the get-go.
Just call it "Open Model" or something. As it stands right now, the meaning of Open Source is being diluted by all these companies pretending to do one thing while actually doing something else.
I initially got very excited seeing the title and the domain, but was hopelessly sad after reading through the article and realizing they're still trying to pass their artifacts off as Open Source projects.
I don't think withholding the commit history of a project makes it not Open Source, and this seems like that to me. What's important is that you can download it, run it, modify it, and re-release it. Being able to see how the sausage was made would be interesting, but I don't think Meta has to show their training data any more than they are obligated to release their planning meeting notes for React development.
Edit: I think the restrictions in the license itself are good cause for saying it shouldn't be called Open Source, fwiw.
Right, I'm not talking about the commit history, but rather that anyone (with means) should be able to produce the final artifact themselves, if they want. For weights like this, that requires at least the training script + the training data. Without that, it's very misleading to call the project Open Source, when only the result of the training is released.
> What's important is you can download it, run it, modify it, and re-release it
But I literally cannot download the project, build it, and run it myself. I can only use the binaries (weights) provided by Meta. No one can modify how the artifact is produced, only the already-produced artifact.
That's like saying that Slack is Open Source because if I want to, I could patch the binary with a hex editor and add/remove things as I see fit? No one believes Slack should be called Open Source for that.
You cannot produce the final artifact with the training script + data. Meta also cannot reproduce the current weights with the training script + data. You could produce some other set of weights that are just about as good, but it's not a deterministic process like compiling source code.
> That's like saying that Slack is Open Source because if I want to, I could patch the binary with a hex editor and add/remove things as I see fit? No one believes Slack should be called Open Source for that.
This analogy doesn't work because it's not like Meta can "patch" Llama any more than you can. They can only finetune it like everyone else, or produce an entirely different LLM by training from scratch like everyone else.
The right to release your changes is another difference; if you patch Slack with a hex editor to do some useful thing, you're not allowed to release that changed Slack to others.
If Slack lost their source code, went out of business, and released a decompiled version of the built product into the public domain, that would in some sense be "open source," even if not as good as something like Linux. LLMs though do not have a source code-like representation that is easily and deterministically modifiable like that, no matter who the owner is or what the license is.
If you want to train on top of Llama, there's absolutely nothing stopping you. There are plenty of open source tools for parameter optimization.
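As a toy illustration of what "training on top of" released weights means (all matrices and numbers below are made up, and real tools such as Hugging Face PEFT operate on actual model tensors): LoRA-style fine-tuning freezes the base weight matrix and learns a small low-rank update on top of it, so the released weights really are the artifact you modify.

```python
def matmul(a, b):
    # Plain-Python matrix multiply, enough for this toy example.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Frozen 2x2 "base weight", standing in for a released model matrix.
W = [[1.0, 0.0], [0.0, 1.0]]

# Rank-1 adapter matrices A (2x1) and B (1x2), the part learned
# during fine-tuning; hypothetical values for illustration.
A = [[0.1], [0.2]]
B = [[0.5, -0.5]]

# Effective fine-tuned weight: W + A @ B. The base W is never retrained.
delta = matmul(A, B)
W_tuned = [[W[i][j] + delta[i][j] for j in range(2)] for i in range(2)]
print(W_tuned)  # approximately [[1.05, -0.05], [0.1, 0.9]]
```

The point of the sketch is that the update is tiny (the adapter has far fewer parameters than W) and sits on top of the released weights, which is exactly the kind of modification nothing stops you from doing.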
> is way less valuable than the weights for the vast majority of people
The same is true for most Open Source projects, most people use the distributed binaries or other artifacts from the projects, and couldn't care less about the code itself. But that doesn't warrant us changing the meaning of Open Source just because companies feel like it's free PR.
> If you want to train on top of Llama there's absolutely nothing stopping you.
Sure, but for the intent of Open Source to hold for Llama, I should be able to build the project from scratch. Say I have a farm of 100 A100s: could I reproduce the Llama model from scratch today?
People do typically modify model weights. They are the preferred form in which to modify the model.
Saying you should be able to “build” Llama is just a nonsense comparison to traditional compiled software. “Building Llama” is more akin to taking the raw weights as text and putting them into a nice pickle file, or loading them into an inference engine.
Demanding that you have everything needed to recreate the weights from scratch is like arguing an application cannot be open source unless it also includes the user testing history and design documents.
And of course some idiots don’t understand what a pickled weights file is and claim it’s as useless as a distributed binary if you want to modify the program, just because it is technically compiled; not understanding that the point of the pickled file is convenience and that it unpacks back to the original form. That’s like arguing open source software can’t be distributed in zip files.
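A minimal sketch of that point, using toy nested lists in place of real model tensors: pickling weights is lossless packaging, and the deserialized object is the same fully editable structure you started with, unlike a compiled binary.

```python
import pickle

# Toy "weights": nested lists standing in for real model tensors.
weights = {"layer0.w": [[0.1, -0.2], [0.3, 0.4]], "layer0.b": [0.0, 0.5]}

# "Building" in this sense is just serialization for convenient distribution...
blob = pickle.dumps(weights)

# ...and loading it back yields the original structure, bit for bit.
restored = pickle.loads(blob)
assert restored == weights

# Unlike a compiled binary, the restored weights can be edited directly,
# e.g. a crude hand-patch of a single parameter:
restored["layer0.b"][0] = 0.25
print(restored["layer0.b"])  # [0.25, 0.5]
```

The zip-file analogy holds: the pickle step is pure packaging, and no information about the weights is lost in the round trip.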
> Say I have a farm of 100 A100's, could I reproduce the Llama model from scratch today?
Say you have a piece of paper. Can you reproduce `print(“hello world”)` from scratch?
If we insist on the release of training data with Open models, you might as well kiss the idea of usable Open LLMs goodbye. Most of the content in training datasets like The Pile is not licensed for redistribution in any way, shape, or form. It would jeopardize projects that do use transparent training data, while not offering anything of value to the community compared to the training code. Republishing all training data is an absolute trap.
Does FB even have the capability to do that? I'd assume there's a bunch of data that's not theirs and that they can't even release, let alone data they might not want to admit is in the source.
If that included, e.g., reading all of GitHub for code, I wouldn't expect them to host an entire separate read-only copy of GitHub just because they trained on it and say "this is part of our open source model".
Open model weights are still commendable, but it's a far cry from open-source (or even libre) software!
They could release 50% of their best data, but that would only stop them from attracting the best talent.
(Disclaimer: I work for an IBM subsidiary but not on any of these products)
I guess this is a rhetorical question, but this is a press release from Meta itself. It's just a marketing ploy, of course.