Open source AI: Red Hat's point-of-view

(redhat.com)

70 points · alexrustic · 1y ago · 64 comments

blackeyeblitzar · 1y ago · 10 in thread

Disappointing that Red Hat is basically validating open weights as open source, and excusing it by saying this:

> The majority of improvements and enhancements to AI models now taking place in the community do not involve access to or manipulation of the original training data. Rather, they are the result of modifications to model weights or a process of fine tuning which can also serve to adjust model performance.

Well yes, because they have no access to anything more. With training source code and data they might do something different. If you don’t have all the things used to produce the final result, it’s not open source.

bberenberg · 1y ago

Do you believe that open source can exist on top of closed hardware? I ask because you can't produce the final result without having someone give you the firmware blob. To me, this seems like an analogue to building on top of open weight models.

PollardsRho · 1y ago

The math underpinning an AI model exists independent of the hardware it's realized on. I can train a model on one GPU and someone else can replicate my results with a different GPU running different drivers, down to small numerical differences that should hopefully not have major effects.
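The "small numerical differences" mentioned here mostly come from floating-point addition not being associative: different hardware or drivers may sum the same numbers in a different order. A minimal Python sketch of the effect, using made-up values rather than real model weights (the helper `sum_in_order` is purely illustrative):

```python
import random


def sum_in_order(values):
    """Accumulate floats strictly in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, two groupings, two different float results.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# Same effect at scale: summing identical values in a different order.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
forward = sum_in_order(values)
backward = sum_in_order(reversed(values))
print(abs(forward - backward))  # tiny, but not guaranteed to be zero
```

The discrepancy is many orders of magnitude smaller than the values themselves, which is why swapping one GPU for another usually does not change the outcome in a meaningful way.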

Data isn't fungible in the same way: I can't just replace one dataset with another for research where the data generation and curation is the primary novel contribution and expect to replicate the results.

There's also a larger accountability picture: just like scientific papers that don't publish data are inherently harder to check for statistical errors or outright fraud, there's a lot of uncomfortable trust required for open-weight closed-data models. How much contamination is there for the major AI benchmarks? How much copyrighted data was used? How can we be sure that the training process was conducted as the authors say, whether from malfeasance or simple mistakes?

twelve40 · 1y ago

i have very little knowledge of any of this, but i had an impression that OpenAI was trained on commodity cloud hardware that's available for purchase/rent to anyone, including off-the-shelf GPUs from Nvidia and AMD? are those what you are referring to as "the firmware blob", or was there some other, more specialized and custom-built closed hardware involved?

jlouis · 1y ago

Turing completeness makes it a different problem.

TZubiri · 1y ago

"Do you believe that open source can exist on top of closed hardware?"

Yes, if the hardware is developed against standards shared by multiple manufacturers, like amd64.

dralley · 1y ago

It's not exactly practical to hand out the training material given the sheer quantity of data we're talking about.

andrewf · 1y ago

GPL v2 and earlier let you charge distribution costs (v3's language is more complicated). In the late 80s you could order an Emacs tape from the FSF for $150, which is about $430 today!

philipkglass · 1y ago

But they could provide training code and let people provide their own Common Crawl (or whatever other pile of training data), couldn't they?

stonogo · 1y ago

Yeah, no. We can move an arbitrary amount of data around the world at breakneck speed. Netflix does this for a living. It's not practical to hand out the training material because of the massive rampant copyright violations.

3 more replies

TZubiri · 1y ago

I understood that it meant source code in addition to weights, as in "publishing programming language code does not suffice as open source if you do not publish weights"

jmclnx · 1y ago · 6 in thread

No real information, just a marketing spiel.

echelon · 1y ago

> TL;DR - Red Hat views the minimum criteria for open source AI as open source-licensed model weights combined with open source software components.

That's not open source!

I'm going to call this "Wizard of Ozzing". You give away the spectacle of magic tricks, but none of the science and machinery to do it yourself. You're still hiding it all behind fake virtue signalling.

Open source in AI is open weights, open training scripts, open inference scripts, open training datasets, and associated helper utilities. Without the science lab, you cannot replicate the science.

Weights in a vacuum are not open. It's a trick.

tbrownaw · 1y ago

You have the preferred form for making modifications, and the relevant permissions to not get in trouble for it.

If you actually go look at the open source or free software definitions, that's what they're about - being able to make modifications.

Just like an open source software project doesn't need a public record of the rationale for all architectural decisions in order to qualify.

3 more replies

TZubiri · 1y ago

Red Hat is on your team and you are criticizing them for not doing enough. Why not side with Red Hat against companies that don't even publish shit and still consider their models to be open?

The prevailing strategy for most companies is to publish some bullshit (like a cli or a model downloader) and call it open source, the bar is waaay low and we are just trying to raise it a couple of inches, it's not helpful to go for the throat and demand that it be raised to the sky!

1 more reply

ysofunny · 1y ago

i mean... it's IBM so what did we really expect?

globalnode · 1y ago

I was thinking of trying Fedora (currently using Debian) and this comment made me look up who owns Red Hat. IBM now owns Red Hat, and apparently Vanguard owns a huge chunk of IBM. I wonder how much influence any of the sponsors have over what goes into the OS and what direction it takes.

5 more replies

TZubiri · 1y ago

Or you don't understand the matter being discussed.

The classical definition of open source (cited in this release as Stallman's GPL definition of the preferred way in which to modify code) kind of breaks for ML programs.

This is a good update on the definition of open source from a quite reputable and influential FOSS source.

Your presumption that this article isn't sufficiently pure, because a company that does Free Software must also avoid anything that resembles an ad or in any way ensures profitability, is pedantic. If we had it your way, the only open source software we would have would be mom's-basement projects with shoestring budgets.

nine_k · 1y ago · 5 in thread

To me, the ML situation looks roughly like this.

(1) Model weights are something like a bytecode blob. You can run it in a conformant interpreter, and be able to do inference.

(2) Things like llama.cpp are the "bytecode interpreter" part, something that can load the weights and run inference.

(3) The training setup is like a custom "compiler" which turns training data to the "bytecode" of the model weights.

(4) The actual training data is like the "source code" for the model, the input of the training "compiler".

Currently (2) is well-served by a number of open-source offerings. (1) is what is usually released when a new model is released. (1) + (2) give the ability to run inference independently.

AFAICT, Red Hat suggests that an "open-source ML model" must include (1), (2), and (3), so that the way the model has been trained is also open and reusable. I would say that it's great for scientific / applied progress, but I don't think it's "open source" proper. You get a binary blob and a compiler that can produce it and patch it, but you can't reproduce it the way the authors did.

Releasing the training set, the (4), to my mind, would be crucial for the model to be actually "open source" in the way an open-source C program is.

I understand that the training set is massive, may contain a lot of data that can't be easily released publicly but that were licensed for the training purposes, and that training from scratch may cost millions, so releasing the (4) is very often infeasible.

I still think that (1) + (2) + (3) should not be called "open-source", because the source is not open. We need a different term, like "open structure" or something. It's definitely more open than something that's only available via an API, or as just weights, but not completely open.
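The four-part analogy above can be made concrete with a toy model. This is only an illustrative sketch: the "model" is a fitted straight line, and the names `training_data`, `train`, and `infer` are invented for the example, not taken from any real ML stack.

```python
# (4) The "source code": training data, here four points on y = 2x + 1.
training_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]


def train(data):
    """(3) The "compiler": ordinary least squares fit of a line."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in data)
    var = sum((x - mean_x) ** 2 for x, _ in data)
    slope = cov / var
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def infer(weights, x):
    """(2) The "interpreter": runs inference from the weights alone."""
    return weights["slope"] * x + weights["intercept"]


# (1) The "bytecode blob": what typically gets published.
weights = train(training_data)
print(weights)               # {'slope': 2.0, 'intercept': 1.0}
print(infer(weights, 10.0))  # 21.0

# With only (1) and (2) you can still run and patch the model...
patched = dict(weights, intercept=weights["intercept"] + 0.5)
print(infer(patched, 10.0))  # 21.5
# ...but without (4) you cannot re-run train() to check how it was produced.
```

In this toy case the data is four tuples; the argument in the thread is about what happens when it is terabytes of text with unclear licensing.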

pavelstoev · 1y ago

It is really just “open use” with details defined by the license type (MIT, etc)

nine_k · 1y ago

It's more than just use (inference), it does open some otherwise secret sauce of the training. It looks like there's no existing word / notion to exactly pinpoint this level of openness.

visarga · 1y ago

> Model weights are something like a bytecode blob

Can you update a bytecode blob as easily as finetuning and prompting models? It only takes a few input-output pairs and a few dollars' worth of compute. They are more like an operating system, and fine-tuning/prompting is like scripting on top. As with Linux, you can download an LLM and run it locally.

anon373839 · 1y ago

I think these endless debates about whether open-weights models qualify for a particular piece of terminology are... tiring. That said, I think the debates would benefit from discussing model training and model inference as two separate systems, because that's what they are. It's possible for model training to be closed-source while model inference is open-source, and vice versa.

Consider the recent Mistral-Small release. The model training is almost totally closed-source. You can't replicate it. However, the model inference is fully open source: the code and weights are Apache licensed. Not only that, but Mistral released both the base model and the instruction-tuned model, so you have a good foundation to work from (the base model) should you prefer to do your own instruction tuning. In fact, Mistral has also open-sourced code to aid in the fine-tuning process. So you really have everything you need* to use and customize this inference system. And for most practical purposes, even if you had the original training data, it would be of no use to you.

It's also worth considering the inverse scenario. Suppose Meta were to release a big blob of pre-training data and scripts for Llama 405B, but no weights. This clearly qualifies as open source, but it is basically useless unless you have many millions of dollars to do something with it. It would do very little to democratize access to AI.

* Asterisk: There is one situation where having access to the original training data would be really, really useful -- model distillation. Nobody can match Meta's ability to distill Llama 405B into an 8B size, because that process works best when you can do it on identically distributed data.
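For readers unfamiliar with distillation, here is a rough sketch of the usual objective: the student is trained to match the teacher's softened output distribution on each input. This is the generic textbook formulation, not a description of how Meta distills Llama; the function names and the temperature value are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical logits for one input over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 4.0, 1.0]

print(distillation_loss(teacher, teacher))        # 0.0
print(distillation_loss(teacher, close_student))  # small positive value
print(distillation_loss(teacher, far_student))    # much larger value
```

The loss is averaged over whatever inputs you feed both models, which is where the point about identically distributed data comes in: the teacher's outputs are most informative on inputs resembling what it was trained on.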

pabs3 · 1y ago

For me, the attacks on ML that are possible by poisoning the training data preclude considering models without freely distributable and modifiable training data as open-source or libre models.

thrqka · 1y ago · 3 in thread

> We believe that these concepts can have the same impact on artificial intelligence

Where the concept is the exploitation of thousands of volunteers while repackaging their work. (I know that Red Hat sponsors some people, sometimes to the detriment of projects, but a lot of it is not sponsored, especially when Red Hat established itself.)

carlwgeorge · 1y ago

Red Hat pays more people to work on open source than any other company I'm aware of. I am one of these people. I challenge you to find a single open source project included in a Red Hat product that doesn't contain contributions from Red Hat employees. Maybe a few exist, but the vast majority include Red Hat contributions, because we contribute all over the open source ecosystem.

dralley · 1y ago

So, what, is HN against open source now?

7qW24A · 1y ago

More accurate to say that VC and the “startup ideology” has always been at the core of HN - it just so happened that aligned with OSS ideology during the ZIRP era.

lrvick · 1y ago · 2 in thread

I largely agree with these points; however, it is an awkward position coming from Red Hat, which is the best funded Linux distribution there is, and -still- not part of the reproducible builds project or investing in full source bootstrapping, which means no one can exactly reproduce their published artifacts from source or prove they were not tampered with. (Same with Fedora)

Glass houses.

dralley · 1y ago

> (Same with Fedora)

https://docs.fedoraproject.org/en-US/reproducible-builds/

https://pagure.io/fedora-reproducible-builds/project/issues

https://fedoraproject.org/wiki/Releases/41/ChangeSet#Reprodu...

lrvick · 1y ago

From that first link: "In the Fedora ecosystem, we cannot achieve reproducibility by the reproducible-builds.org definition"

Good to see they are slowly closing some blockers every year or so, but fundamentally today they do builds and signing centrally. There is no way to readily get the same hash of a central fedora supplied rpm locally.

Supply chain integrity is simply not a priority. They just trust that the central build farm, the compilers it uses, and everyone with access to it will never be compromised.
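"Getting the same hash locally" boils down to a bit-for-bit comparison between the published artifact and your own rebuild. A minimal sketch using Python's standard `hashlib`; the byte strings stand in for package files and are made up for illustration, as is the helper naming:

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def is_bit_for_bit_identical(published: bytes, rebuilt: bytes) -> bool:
    """A rebuild only counts as reproduced if the digests match exactly."""
    return sha256_hex(published) == sha256_hex(rebuilt)


# Hypothetical artifacts: same payload, but one embeds a different build time.
published = b"payload-v1\nbuilt-at: 2024-01-01T00:00:00Z\n"
rebuilt_same = b"payload-v1\nbuilt-at: 2024-01-01T00:00:00Z\n"
rebuilt_later = b"payload-v1\nbuilt-at: 2024-06-01T12:34:56Z\n"

print(is_bit_for_bit_identical(published, rebuilt_same))   # True
print(is_bit_for_bit_identical(published, rebuilt_later))  # False
```

Embedded timestamps, signatures, and build paths are typical reasons two builds of identical source end up with different digests, which is why a central sign-and-build pipeline makes independent verification hard.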

1 more reply

lutusp · 1y ago · 1 in thread

> More than three decades ago, Red Hat saw the potential of how open source development and licenses can create better software to fuel IT innovation. Thirty-million lines of code later, Linux not only developed to become the most successful open source software but the most successful software to date.

This seems to conflate Red Hat and Linux, as well as try to equate Red Hat with open-source. Red Hat is Linux, but Linux is not Red Hat, especially now that Red Hat has decided to restrict access to the RHEL source (https://www.itworldcanada.com/article/red-hat-decision-turns...).

And a pet grammatical peeve of mine:

> ... in some respects they serve a similar function to code.

I see this everywhere now -- IMHO it should be "... serve a function similar to code." Doesn't the original grate on your ear?

Also this is a Turing-test bot detector -- bots don't use this weird grammatical construction, only humans do.

mogwire · 1y ago

Restrict access to paying customers? Restrict access to companies violating an EULA to not redistribute packages?

The fact that people continue to spread this trope is amazing.

I pay for RHEL and I have a developer subscription for personal usage and the SRPMs are right there on their download portal.

Just because CIQ err Rocky has to take extra steps and violate Red Hat’s EULA doesn’t mean they restricted access.

davydm · 1y ago · 1 in thread

Red Hat opining about what is and isn't open is absolutely hilarious. Sorry, my dudes, you launched that ship in the wrong direction ages ago.

worthless-trash · 1y ago

What does Red Hat ship that isn't open source under the licenses, my dude?

1 more reply

pabs3 · 1y ago

I prefer the ML policy of the Debian Deep Learning Team.

https://salsa.debian.org/deeplearning-team/ml-policy/

nickandbro · 1y ago

Congrats NeuralMagic team on being acquired! I don't know if you know this, but I worked with you on discord a few times. Your team's always willing to go above and beyond with pushing out popular models in specific quant formats compatible with vLLM. And one of the few huggingface orgs that my boss can actually trust. Well deserved!

sciencesama · 1y ago

Look who's talking!! The whole fiasco with Fedora, and now they come to talk about open source!!
