I quite like that comment that was left on the article. I know that with some models you can tweak the weights without the source data, but it does seem like you're more restricted without the actual dataset.
Personally, the data seems to be part of the source in this case. I mean, the "code" is derived from the data itself, and the weights are the artifact of training. If anything, they should provide the data, the training methodology, the model architecture, and the code to train and infer; the weights could be optional. The weights are basically equivalent to a built artifact, like compiled software.
And that means commercially, people would pay for the cost of training. I might not have the resources to "compile" it myself, aka, run the training, so maybe I pay a subscription to a service that did.
When we're dealing with source code, the cost of getting from source -> binary is minimal. The entire Linux kernel builds in two hours on one modest machine. Since it's cheap to compile and the source code is itself legible, the source code is the preferred form for making modifications.
This doesn't work when we try to apply the same reasoning to `training data -> weights`. "Compilation" in this world costs hundreds of millions of dollars per compilation run. The cost of "compilation" alone means that the preferred form for making modifications can't possibly be the training data, even for the company that built the thing in the first place. As for the data itself, it's a far cry from source code—we're talking tens of terabytes of data at a minimum, which is likewise infeasible to work with on a regular basis. The weights must be the preferred form for making modifications for simple logistics reasons.
Importantly, the weights are the preferred form for modifications even for the companies that built them.
I think a far more reasonable analogy, to the extent that any are reasonable, is that the training data is all the stuff that the developers of the FOSS software ever learned, and the thousands of computer-hours spent on training are the thousands of man-hours spent coding. The entire point of FOSS is for a few experts to do all that work once and then we all can share and modify the output of those years of work and millions of dollars invested as we see fit, without having to waste all that time and money doing it over again.
We don't expect the authors of the Linux kernel to document their every waking thought so we could recreate the process they used to produce the kernel code... we just thank them for the kernel code and contribute to it as best we can.
No, that's not why weights are object code. Binary vs. text is irrelevant.
Weights are object code because training data is declarative source code defining the desired behavior of the system and training code is a compiler which takes that source code and produces a system with the desired behavior.
Now, the behavior produced is less exactly known from the source code than is the case with traditional programming, but the function is the same.
You could have a system where the training and inference code were open source and the model specified by the weights itself was not — that would be like having a system where the software was not open source, but the compiler used to build it and the runtime library it relies on were. But one shouldn't confuse that with an open source model.
That doesn't mean you can't allow the modification of your weights, but a model is not open source because it lets you modify its weights.
Take the JDK for Java: it is open source, but the actual built JDKs are not all free to use or modify. It's quite annoying to build those, patch source from newer ones into builds of older ones, cherry-pick and all that.
So it enables an economy of vendors that do just that, and of people willing to pay for the simple act of building the JDK from its open source code.
I'm not even sure it makes sense to license weights; they're not part of any creative process, and I don't think weights should even be copyrightable. Weights are part of the product you sell, so maybe a EULA applies, or terms of use. It's like with video games: you're not always allowed to modify the binary (cheating), and if you do, you break the EULA and lose the right to play the game.
Trying to draw an equivalency between code and weights is [edited for temperament, I guess] not right. They are built from the source material supplied to an algorithm. Weights are data, not code.
Otherwise, everyone on the internet would be an author, and would have a say in the licensing of the weights.
Despite the fact that people keep insisting on the buzzword "AI" to describe these large neural networks, they are more succinctly defined as approximate computer programs. The means by which we create them is a relatively standardized family of statistical modeling algorithms paired with a dataset they are meant to emulate in their output.
A computer program that's specified in logic is already a usable representation that can be used to understand every aspect of the functioning code in its entirety, albeit some of it may be hard to understand. You don't need to consult the original programmer at all, let alone read their mind.
In contrast, a function that is approximated in this manner needs the training data to replicate or make sense of it; the data is in fact even necessary to assess whether the model is cheating at the benchmarks its creators evaluate it against. The weights themselves are a functional approximation, not a functional description.
For the purposes of the ethos of free and open source software, it is obvious that training data must be included. However, this argument is also deployed in various other places, like intellectual property disputes, and is equally stupid there. Just because we use the term "learning" to describe these systems doesn't mean it makes sense for the law to treat them as people. It is both nonsensical and harmful to say that no human can be held responsible for what an "AI" model does, but that somehow they are "just like people learning from experience" when it benefits tech companies to believe that
The entire point of FOSS is to preserve user freedom. Avoiding pointless waste of repeated work is a side effect of applying that freedom.
It would feel entirely on point for things that require ungodly amounts of money and resources before you can even start considering exercising your freedoms not to be considered FOSS, even if that aspect isn't covered by currently accepted definitions.
I think this is a decent point. Is your FOSS project actually open source if your 3D assets were made in Fusion or Adobe?
Similarly, how open is a hardware project if you post only finalized STLs? What about with and without Fusion source files?
You can still replicate the project. You can even do relatively minor edits to the STL. Is that open or not?
Really? Hmm, yeah, maybe you're right, but said that way it somehow starts to seem a little disappointing and depressing. Maybe I'm reading it differently than you intended. I always considered the point of FOSS to be the freedoms to use, study, customize, and share software: to become an expert, not to avoid becoming one. But if the ultimate goal of all that is just a big global application of DRY, so that most people rely on the corpus without having to learn as much, that feels in a way antithetical to open source, and it could have a big downside or even end up a net negative. But I dunno…
I think it is better to compare with something really big and fast evolving, e.g. Chromium. It will take a day to compile it. (~80000 seconds vs. ~8 seconds for a convenient/old Pascal program.)
There's a much simpler analogy: a photo
You can't have an "Open source photo" because that would require shipping everything (but the camera) that shows up in the photo so that someone could "recreate" the photo
It doesn't make sense.
A public domain photo is enough
And by today's standards, a PDP-11 is quite comparable in cost to the server farms used in training.
And yet Emacs was released under GPL.
So the economic argument is pretty myopic.
You could have a programming language whose compiler is a superoptimizer that's very slow and is also stochastic, and it would amount to the same thing in practice.
The community is starting to regroup at https://discuss.opensourcedefinition.org because the OSI's own forums are now heavily censored.
I encourage you to join the discussion about the future of Open Source, the first option being to keep everything as is.
Didn't personally know they even had one. ;)
Now we're seeing that maybe putting all that trust and responsibility in one entity wasn't such a great idea.
The legal system in the US doesn't provide them any other options but to act.
Well, another org is getting directors' salaries while open source writers get nothing.
I wonder who has legal liability for the closed-data generated weights and some of the rubbish they spew out, since users will be unable to change the source-data inputs and will only be able to tweak these compiled-model outputs.
Is such tweaking analogous to having a car resprayed, after which the manufacturer washes its hands of any liability over design safety?
You have some sort of engine that runs the model. That's like the JVM, and the JIT.
And you have the program that takes the training data and trains the model. That's your compiler, your javac, your Makefile and your make.
And you have the training data itself, that's your source code.
Each of the above pieces has its own source code. And the training set is also source code.
All those pieces have to be open to have a fully open system.
If only the training data is open, that's like having the source, but the compiler is proprietary.
If everything but the training set is open, well, that's like giving me gcc and calling it Microsoft Word.
If I can't reproduce the model, I'm beholden to whoever trained it.
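The analogy above can be sketched as a toy pipeline. This is only an illustration: a trivial character-frequency "model" stands in for real training, and all names here are made up.

```python
from collections import Counter


def train(corpus: str) -> dict:
    """'Compile' the training data into weights: here, just character frequencies."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def infer(weights: dict, ch: str) -> float:
    """The 'runtime engine' that executes the trained artifact."""
    return weights.get(ch, 0.0)


# training data -> weights (the "build" step)
weights = train("open source")
# running the built artifact (the "JVM" part)
print(infer(weights, "o"))
```

Opening up `train` and `infer` without the corpus is like shipping the compiler and runtime while keeping the source private: you can run and even patch the artifact, but you can't rebuild it.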
>"If you're explaining, you're losing."
That is an interesting point, but isn't this the same organization that makes "open source" vs. "source available" a topic? e.g. why Winamp wouldn't be open source?
I don't think you can even call a trained AI model "source available." To me the "source" is the training data. The model is as much of a binary as machine code. It doesn't even feel right to have it GPL licensed like code. I think it should get the same license you would give to a fractal art released to the public, e.g. CC.
I think dictionaries are copyrightable, however?
Heck, a regular old binary is much less opaque than “open” weights. You can at least run it through a disassembler and slowly, dreadfully, figure out how it works. Just look at the game emulator community.
For open weight AI models, is there anything close to that?
I wonder how could anyone be an open source enthusiast, distrusting source code they can't verify, and yet a LLM enthusiast, trusting a huge configuration file that can't be debugged.
Granted, I don't have a lot of knowledge about LLMs. From what I know, there are some tools that can tell you the confidence/stickiness of certain parts of the generated output, e.g. "for a prompt like this, this word WILL appear almost every time, while this other word will almost never appear." I think there was something similar for image generation that could tell what areas of an image stemmed from what terms in the prompt. I have no idea how this information is derived, but it doesn't feel like there are many end-user tools for it. Maybe the AI researchers have access to more powerful tooling.
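A rough sketch of where those per-token confidences come from: a language model emits logits over candidate next tokens, and a softmax turns them into probabilities. The logit values below are invented for illustration.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max for numerical stability before exponentiating.
    z = np.exp(logits - logits.max())
    return z / z.sum()


# Hypothetical logits the model assigns to four candidate next tokens.
logits = np.array([8.1, 3.2, 1.0, 0.5])
probs = softmax(logits)

# The gap between the top probability and the rest is one crude
# "confidence" signal: here the first candidate dominates.
print(probs)
```

Inspecting these distributions is about as deep as common end-user tooling goes; it tells you how "sticky" a token is, not why.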
For source code I can just open a file in notepad.exe to inspect it. I think that's the standard.
If, for example, a computer program were written in an esoteric language that used image files instead of text files as source code, I don't think you could consider that program "open source" unless the image format it used was also open, e.g. PNG. If it were some proprietary format, people couldn't create tools for it, so they couldn't actually do anything with the image blob, which restricts their freedoms.
On the other hand, if the data isn't open, you should probably use the term "open weights", not "open source". The two are otherwise so close.
We risk giving AI the same opportunity to grow in an open direction, and by our own hand. Massive own goal.
I thought it was thanks to a lot of software developers’ uncompensated labor. Silly me.
The word "people" is so striking here... teams and companies, corporations and governments. How can the cast of characters be so completely missed? An extreme opposite of a far earlier era, when one person could only be a member of their group. Vocabulary has to evolve in these deliberations.
Tangential, but I wonder how well an AI performs when trained on genuine human data, versus a synthetic data set of AI-generated texts.
If performance when trained on the synthetic data set is close to that when trained on the original human dataset – this could be a good way to "launder" the original training data and reduce any potential legal issues with it.
It makes sense as any bias in the model generated synthetic data will just get magnified as models are continuously trained on that biased data.
> We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer.
(I didn't read the paper. The ordinary version of this point is already compelling imo, given the current state of the art of reverse-engineering large models.)
> I conclude that there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies.
My impression is that LLMs are very much the latter case, with respect to unwanted behaviors. You can't audit them, you can't secure them against malicious inputs, and whatever limited steering we have over the LSD-trip-generator involves a lot of arbitrary trial and error and hoping our luck holds.
Huh, then this will be a useful definition.
The FSF position is untenable. Sure, it’s philosophically pure. But given a choice between a practical definition and a pedantically-correct but useless one, people will use the former. Irrespective of what some organisation claims.
> would have been better, he said, if the OSI had not tried to "bend and reshape a decades old definition" and instead had tried to craft something from a clean slate
Not how language works.
Natural languages are parsimonious; they reuse related words. In this case, the closest practical analogy to open-source software has the lower barrier to entry. Hence, it will win.
There is no place for defining open source as "data available". In software, too, this problem is solved by using "free software" for the stricter definition. The practical competition is between Facebook's "model available with restrictions" definition and this one.
previously on: https://news.ycombinator.com/item?id=41791426
It's really interesting to contrast this "outsider" definition of open AI with people with real money at stake: https://news.ycombinator.com/item?id=41046773
I guess this is a question of what we want out of "open source". Companies want to make money. Their asset is data, access to customers, hardware and integration. They want to "open source" models, so that other people improve their models for free, and then they can take them back, and sell them, or build something profitable using them.
The idea is that, like with other software, eventually, the open source version becomes the best, or just as good as the commercial ones, and companies that build on top no longer have to pay for those, and can use the open source ones.
But if what you want out of "open source" is open knowledge, peeking at how something is built, and being able to take that and fork it for your own. Well, you kind of need the data. And your goal in this case is more freedom, using things that you have full access to inspect, alter, repair, modify, etc.
To me, both are valid, we just need a name for one and a name for the other, and then we can clearly filter for what we are looking for.
I don’t need a board to tell me what’s open.
And in the case of AI, if I can’t train the model from source materials with public source code and end up with the same weights, then it’s not open.
I don’t need people to tell me that.
OSI approved this and that has turned into a Ministry of Magic approved thinking situation that feels gross to me.
We'll end up with like 5 versions of the same "open source" model, all performing differently because they're all built with their own dataset. And yet, none of those will be considered a fork lol?
I don't know what the obsession is either. If you don't want to give others permission to use and modify everything that was used to build the program, why do you want to trick me into thinking you are, while still calling it open source?
Because there is an exemption clause in the EU AI Act for free and open source AI.
Making training exactly reproducible locks off a lot of optimizations; you are practically not going to get bit-for-bit reproducibility for nontrivial models.
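A tiny illustration of why bit-for-bit reproducibility is so hard: floating-point addition isn't associative, so any change in the order a parallel reduction accumulates values (different GPU kernels, different worker counts) perturbs the low-order bits, and those perturbations compound over billions of training steps.

```python
# Floating-point addition is not associative: grouping the same three
# terms differently yields different IEEE-754 doubles.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False: the two sums differ in the last bit
```

Deterministic reductions exist, but they typically forfeit exactly the parallel scheduling freedom that makes large-scale training fast.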
Similarly, if you run the scripts and it produces the model then it's Open Source that happens to be AI.
To quote Bruce Perens (definition author): the training data IS the source code. Not a perfect analogy but better than a recipe calling for unicorn horns (e.g., FB/IG social graphs) and other toxic candy (e.g., NYT articles that will get users sued).
This is the new cracker/hacker, GIF-pronunciation, crypto(currency)/crypto(graphy) molehill. Sure, nobody forces you to recognize any word. But the common usage already precludes open training data, and that will only get more ensconced as more contracts and jurisdictions embrace it.
In marketing terms, a simple market communication, consistently and diligently applied in varied contexts and over time, can and usually will take hold, despite the untold number of individuals who shake their fists at the sky or cut with clever and cruel words that few hear, IMHO.
OSI branding and market communications seem very likely to me to be effective in the future, even if the content is exactly what is being objected to here so vehemently.
I understand the analogy to other types of critical data often not included in open-source distros (e.g. Quake III's source is GPL but its resources, like textures, are not, as mentioned in the article). The distinction is that in those cases the data does not clarify anything about the functioning of the engine, nor does its absence obscure anything. So by my earlier smell test it makes sense to say Quake III is open source.
But open-sourcing a transformer ANN without the training data tells us almost nothing about the internal functioning of the software. The exact same source code might be a medical diagnosis machine, or a simple translator. It does not pass my smell test to say this counts as "open source." It makes more sense to say that ANNs are data-as-code programming paradigms, glued together by a bit of Python. An analogy would be if id released its build scripts and announced Quake III was open-source, but claimed the .cpp and .h files were proprietary data. The batch scripts tell you a lot of useful info - maybe even that Q3 has a client-server architecture - but they don't tell you that the game is an FPS, let alone the tricks and foibles in its renderer.
Training data simply does not help you here. Our existing architectures are not explainable or auditable in any meaningful way, training data or no training data.
I don't necessarily agree and suggest the Open Source Definition could be extended to cover data in general (media, databases, and yes, models) with a single sentence, but the lowest risk option is to not touch something that has worked well for a quarter century.
The community is starting to regroup and discuss possible next steps over at https://discuss.opensourcedefinition.org
To the extent the problem is intractable, I think it mostly reflects that LLMs have an enormous amount of training data and do an enormous number of things. But for a given specific problem the training data can tell you a lot:
- whether there is test contamination with respect to LLM benchmarks or other assessments of performance
- whether there's any CSAM, racist rants, or other things you don't want
- whether LLM weakness in a certain domain is due to an absence of data or if there's a more serious issue
- whether LLM strength in a domain is due to unusually large amounts of synthetic training data and hence might not generalize very reliably in production (this is distinct from test contamination - it is issues like "the LLM is great at multiplication until you get to 8 digits, and after 12 digits it's useless")
- investigating oddness like that LeetMagikarp (or whatever) glitch in ChatGPT
If you want to talk about the openness and accessibility of these systems I'd just ditch the "source" part and create some new criteria for what makes an AI model open.
They're really a kind of database. Perhaps a better way to think about it is in terms of "commons". Consider how creative commons licenses are explicit about requirements like attribution, noncommercial, share-alike, etc.; that feels like a useful model for talking about weights.
I don't really know what you're referring to....
This makes sense. What the OSI gets right here is that the artifact that is made open source is the weights. Making modifications to the weights is called fine tuning and does not require the original training data any more than making modifications to a piece of source code requires the brain of the original developer.
Tens of thousands of people have fine-tuned these models for as long as they've been around. Years ago I trained GPT-2 to produce text resembling Shakespeare. For that I needed Shakespeare, not GPT-2's training data.
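A toy sketch of the point, with a one-parameter model standing in for GPT-2 (everything here is deliberately simplified): fine-tuning starts from the existing weights and needs only the new data, never the original dataset.

```python
def sgd(w: float, data: list, lr: float = 0.1, epochs: int = 200) -> float:
    """Fit a one-parameter linear model y = w * x by gradient descent."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w


# "Pretraining" on one dataset produces the weights...
w = sgd(0.0, [(1.0, 2.0), (2.0, 4.0)])  # learns roughly y = 2x
# ...after which the original data can be thrown away entirely:
w = sgd(w, [(1.0, 3.0), (2.0, 6.0)])    # fine-tune toward y = 3x
print(round(w, 2))  # → 3.0
```

The second call touches only the new examples plus the weights, which is exactly the shape of fine-tuning Llama or GPT-2 on Shakespeare.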
The training data is properly part of the development process of the open source artifact, not part of the artifact itself. Some open source companies (GitLab) make their development process fully open. Most don't, but we don't call IntelliJ Community closed source on the grounds that they don't record their meetings and stream them for everyone to watch their planning process.
Edit: Downvotes are fine, but please at least deign to respond and engage. I realize that I'm expressing a controversial opinion here, but in all the times I've posted similar no one's yet given me a good reason why I'm wrong.
In short, your argument doesn’t work because source code is to binaries as training data is to LLMs. Source code is the closest comparison we have with training data, and the useless OSI claims that’s only a “benefit”, not a “requirement”. This isn’t a stance meant for long-term growth but for maintaining a moat of training data for “AI” companies.
Because the binaries were not licensed under a FOSS license?
Also, as I note in another comment [0], source code is the preferred form of a piece of software for making modifications to it. The same cannot be said about the training data, because getting from that to weights costs hundreds of millions of dollars in compute. Even the original companies prefer to fine-tune their existing foundation models for as long as possible, rather than starting over from training data alone.
> In short, your argument doesn’t work because source code is to binaries as training data is to LLMs.
I disagree. Training data does not allow me to recreate an LLM. It might allow Jeff Bezos to recreate an LLM, but not me. But weights allow me to modify it, embed it, and fine tune it.
The weights are all that really matters for practical modification in the real world, because in the real world people don't want to spend hundreds of millions to "recompile" Llama when someone already did that, any more than people want to rewrite the Linux kernel from scratch based on whiteboard sketches and mailing list discussions.
So a URL to the data? To download the data? Or what? Someone says "Just scrape the data from the web yourself." And a skilled person doesn't need a URL to the source data? No source? Source: The entire WWW?
"Open source" should include the training code and the data. Anything you need to train from scratch or fine tune. Otherwise it's just a binary artifact.
Just create a couple more for AI, one with training data, one without.
Holy grail thinking, finding "the one and only open" license instead of "an open" license, is in a sense anti-open.
And since what humans say is more horrible than good, the whole thing is a garbage mine.
Go talk to the crews who have been maintaining the Concise Oxford for the last couple of centuries, or the French government department in charge of regulating the French language, remembering that the French all but worship their language.
There you will find, perhaps, insight into, or terror of, the idea of creating a standard, consistent, concise, and usable LLM.
Anything else is closed source. It's as simple as that.
At the end of the day, what threatens OpenAI is falling apart before they hit the runway. They can't lose the Microsoft deal, they can't lose more founders (almost literally at this point) and they can't afford to let their big-ticket partnerships collapse. They are financially unstable even by Valley standards - one year in a down market could decimate them.