> The majority of improvements and enhancements to AI models now taking place in the community do not involve access to or manipulation of the original training data. Rather, they are the result of modifications to model weights or a process of fine tuning which can also serve to adjust model performance.
Well yes, because they have no access to anything more. With training source code and data they might do something different. If you don’t have all the things used to produce the final result, it’s not open source.
Data isn't fungible in the same way: I can't just replace one dataset with another for research where the data generation and curation are the primary novel contribution and expect to replicate the results.
There's also a larger accountability picture: just like scientific papers that don't publish data are inherently harder to check for statistical errors or outright fraud, there's a lot of uncomfortable trust required for open-weight closed-data models. How much contamination is there for the major AI benchmarks? How much copyrighted data was used? How can we be sure that the training process was conducted as the authors say, and that nothing went wrong through malfeasance or simple mistakes?
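To make the contamination point concrete, here is a minimal sketch of the kind of check that only becomes possible when the training data is published. It flags benchmark items that share a long word n-gram with any training document. The function names, the whitespace tokenization, and the choice of n=8 are illustrative assumptions, not a description of how any lab actually audits its models.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the training docs.

    Hypothetical helper: a real audit would need smarter tokenization,
    normalization, and an index that scales to a full training corpus.
    """
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    if not benchmark_items:
        return 0.0
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items)


# Made-up stand-ins for a training corpus and a benchmark.
training_docs = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
]
benchmark_items = [
    "quick brown fox jumps over the lazy dog near",
    "completely unrelated question about prime numbers and modular arithmetic here ok",
]
rate = contamination_rate(benchmark_items, training_docs, n=8)
```

With open weights alone, there is no `training_docs` to pass in, so an outsider cannot run even a check this crude.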
Yes, if hardware is developed against standards shared by multiple manufacturers, like amd64.
That's not open source!
I'm going to call this "Wizard of Ozzing". You give away the spectacle of magic tricks, but none of the science and machinery to do it yourself. You're still hiding it all behind fake virtue signalling.
Open source in AI is open weights, open training scripts, open inference scripts, open training datasets, and associated helper utilities. Without the science lab, you cannot replicate the science.
Weights in a vacuum are not open. It's a trick.
If you actually go look at the open source or free software definitions, that's what they're about - being able to make modifications.
Just like an open source software project doesn't need a public record of the rationale for all architectural decisions in order to qualify.
The prevailing strategy for most companies is to publish some bullshit (like a CLI or a model downloader) and call it open source. The bar is waaay low and we are just trying to raise it a couple of inches; it's not helpful to go for the throat and demand that it be raised to the sky!
The classical definition of open source (cited in this release as Stallman's GPL definition of the preferred way in which to modify code) kind of breaks for ML programs.
This is a good update on the definition of open source from a quite reputable and influential FOSS source.
Your presumption that this article isn't sufficiently pure, because a company that does Free Software must also do so without anything that resembles an ad or in any way ensures profitability, is pedantic. If we had it your way the only open source software we would have would be mom's basement projects with shoestring budgets.
(1) Model weights are something like a bytecode blob. You can run it in a conformant interpreter, and be able to do inference.
(2) Things like llama.cpp are the "bytecode interpreter" part, something that can load the weights and run inference.
(3) The training setup is like a custom "compiler" which turns training data to the "bytecode" of the model weights.
(4) The actual training data is like the "source code" for the model, the input of the training "compiler".
Currently (2) is well-served by a number of open-source offerings. (1) is what is usually released when a new model is released. (1) + (2) give the ability to run inference independently.
AFAICT, Red Hat suggests that an "open-source ML model" must include (1), (2), and (3), so that the way the model has been trained is also open and reusable. I would say that it's great for scientific / applied progress, but I don't think it's "open source" proper. You get a binary blob and a compiler that can produce it and patch it, but you can't reproduce it the way the authors did.
Releasing the training set, the (4), to my mind, would be crucial for the model to be actually "open source" in the way an open-source C program is.
I understand that the training set is massive, may contain a lot of data that can't be easily released publicly but that were licensed for the training purposes, and that training from scratch may cost millions, so releasing the (4) is very often infeasible.
I still think that (1) + (2) + (3) should not be called "open-source", because the source is not open. We need a different term, like "open structure" or something. It's definitely more open than something that's only available via an API, or as just weights, but not completely open.
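The four-part analogy above can be sketched with a toy model. This is only an illustration under loud assumptions: a two-parameter linear fit stands in for a real model, and `train`, `infer`, and both datasets are made up for the example. The point is which parts you need in hand to run, to retrain, and to reproduce.

```python
def train(data, epochs=200, lr=0.05):
    """(3) The 'compiler': turns training data into weights (toy SGD on y = w*x + b)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return {"w": w, "b": b}


def infer(weights, x):
    """(2) The 'interpreter': loads weights and runs inference."""
    return weights["w"] * x + weights["b"]


# (4) The 'source code': the original training data (here, points on y = 2x + 1).
original_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

# (1) The 'bytecode blob': what usually gets released.
released_weights = train(original_data)

# With (1) + (2) you can run inference independently.
prediction = infer(released_weights, 3.0)

# With (3) but without (4), you can train *a* model, but not *this* model.
substitute_data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
retrained = train(substitute_data)

# Only with (3) + (4) can you rebuild the released artifact yourself.
reproduced = train(original_data)
```

In this toy, `reproduced` matches `released_weights` exactly because the training loop is deterministic; real training runs add nondeterminism on top, which is a separate obstacle from missing data.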
Can you update a bytecode blob as easily as finetuning and prompting models? It only takes a few input-output pairs and a few dollars' worth of compute. They are more like an operating system, and fine-tuning/prompting is like scripting on top. As with Linux, you can download an LLM and run it locally.
Consider the recent Mistral-Small release. The model training is almost totally closed-source. You can't replicate it. However, the model inference is fully open source: the code and weights are Apache licensed. Not only that, but Mistral released both the base model and the instruction-tuned model, so you have a good foundation to work from (the base model) should you prefer to do your own instruction tuning. In fact, Mistral has also open-sourced code to aid in the fine-tuning process. So you really have everything you need* to use and customize this inference system. And for most practical purposes, even if you had the original training data, it would be of no use to you.
It's also worth considering the inverse scenario. Suppose Meta were to release a big blob of pre-training data and scripts for Llama 405B, but no weights. This clearly qualifies as open source, but it is basically useless unless you have many millions of dollars to do something with it. It would do very little to democratize access to AI.
* Asterisk: There is one situation where having access to the original training data would be really, really useful -- model distillation. Nobody can match Meta's ability to distill Llama 405B into an 8B size, because that process works best when you can do it on identically distributed data.
Where the concept is the exploitation of thousands of volunteers while repackaging their work. (I know that Red Hat sponsors some people, sometimes to the detriment of projects, but a lot of it is not sponsored, especially when Red Hat established itself.)
Glass houses.
Good to see they are slowly closing some blockers every year or so, but fundamentally today they do builds and signing centrally. There is no way to readily get the same hash of a centrally supplied Fedora RPM locally.
Supply chain integrity is simply not a priority. They just trust that the central build farm, the compilers it uses, and everyone with access to it will never be compromised.
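For anyone unfamiliar with what "get the same hash locally" means in practice, here is a minimal sketch of the check a reproducible build enables: rebuild the artifact yourself, then compare digests with the distributed one. The file names and contents are stand-ins, and this deliberately ignores anything a real package comparison would have to account for (for example, I'm assuming embedded signatures would need to be excluded before hashing).

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproduced(local_build, distributed_build):
    """True if a locally built artifact is bit-for-bit identical to the distributed one."""
    return sha256_of(local_build) == sha256_of(distributed_build)


# Demo with stand-in files (hypothetical contents, not real packages).
with tempfile.TemporaryDirectory() as tmp:
    official = Path(tmp) / "official.bin"
    rebuilt = Path(tmp) / "rebuilt.bin"
    drifted = Path(tmp) / "drifted.bin"
    official.write_bytes(b"payload built at the vendor")
    rebuilt.write_bytes(b"payload built at the vendor")
    # A single embedded timestamp is enough to break bit-for-bit equality.
    drifted.write_bytes(b"payload built at the vendor, timestamp 12:01")
    match = is_reproduced(rebuilt, official)
    mismatch = is_reproduced(drifted, official)
```

If the digests match, you no longer have to take the build farm's word for it; if they can never match, trust is the only option left.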
This seems to conflate Red Hat and Linux, as well as try to equate Red Hat with open-source. Red Hat is Linux, but Linux is not Red Hat, especially now that Red Hat has decided to restrict access to the RHEL source (https://www.itworldcanada.com/article/red-hat-decision-turns...).
And a pet grammatical peeve of mine:
> ... in some respects they serve a similar function to code.
I see this everywhere now -- IMHO it should be "... serve a function similar to code." Doesn't the original grate on your ear?
Also this is a Turing-test bot detector -- bots don't use this weird grammatical construction, only humans do.
The fact that people continue to spread this trope is amazing.
I pay for RHEL and I have a developer subscription for personal usage and the SRPMs are right there on their download portal.
Just because CIQ err Rocky has to take extra steps and violate Red Hat’s EULA doesn’t mean they restricted access.