Zuck knows this very well, and it does him no honour to speak like that; from his position, this amounts to an attempt at changing the present semantics of open source. Of course, others do that too - using the notion of open source to describe something very far from open.
What Meta is doing under his command can better be described as releasing the resulting...build, so that it can be freely poked around and even put to work. But the result cannot be effectively reverse engineered.
What's more ridiculous is that it is precisely because the result is not the source in its whole form that these graphical structures can be made available. Only thanks to the fact that it is not traceable to the source, which makes the whole game not only closed, but like... sealed forever. An unfair retelling of humanity's knowledge, tossed around in a very obscure container that nobody can reverse engineer.
How's that even remotely similar to open source?
If someone gives me an executable that I can run for free, and then says "eh why do you want the source, it would take you a long time to compile", that doesn't make it open source, it just makes it gratis.
With LLMs, weights are the binary code: they're how you run the model. But to be able to train the model from scratch, or to collaborate on new approaches, you have to operate at the level of architecture, methods, and training data sets. They are the source code.
Weights aren't really a binary in the same sense that a compiler produces: they lack instructions and are just a bunch of floating-point values. Nor can you run model weights without separate code to interpret them correctly. In this sense, they are more like a JPEG or a 3D model.
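To make the point above concrete, here's a toy sketch (all names hypothetical, pure stdlib): a "checkpoint" is just named arrays of floats with no instructions in it, and a separate piece of code has to give those numbers meaning, much like a JPEG needs a decoder.

```python
import json

# A toy "checkpoint": just named arrays of floats, no instructions.
checkpoint = {"w": [0.5, -1.0], "b": [0.1]}
blob = json.dumps(checkpoint)  # this is all a weights file fundamentally is: data

# Separate "inference code" is needed to interpret the numbers:
def forward(params, x):
    """Apply a tiny linear model: w . x + b."""
    w, b = params["w"], params["b"]
    return sum(wi * xi for wi, xi in zip(w, x)) + b[0]

params = json.loads(blob)
y = forward(params, [2.0, 3.0])  # 0.5*2.0 + (-1.0)*3.0 + 0.1 = -1.9
```

Shipping `blob` alone is like shipping the JPEG; `forward` is the decoder, and neither is the "source" of how the numbers came to be.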
There are a bunch of independent, fully open source foundation models from companies that share everything (including all data). AMBER and MAP-NEO for example. But we have yet to see one in the 100B+ parameter category.
Using open data and dclm: https://github.com/mlfoundations/dclm
Edit: typo.
Linux doesn't ship you the compiler you need to build the binaries either, that doesn't mean it's closed source.
LLMs are fundamentally different to software and using terms from software just muddies the waters.
Linux sources :: dataset that goes into training
Linux sources' build confs and scripts :: training code + hyperparameters
GCC :: Python + PyTorch or whatever they use in training
Compiled Linux kernel binary :: model weights
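The mapping above can be sketched in miniature (a hypothetical pure-Python stand-in for "Python + PyTorch", not anyone's actual training code): dataset, training code, and hyperparameters go in, and a weights blob comes out - the analogue of the compiled binary.

```python
# dataset            :: "Linux sources"
dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # pairs (x, y) with y = 2x

# hyperparameters    :: "build confs and scripts"
lr, epochs = 0.05, 200

# training code      :: the build scripts + toolchain
def train(data, lr, epochs):
    """Fit y = w*x by gradient descent on squared error."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

# model weights      :: "compiled Linux kernel binary"
weights = train(dataset, lr, epochs)  # converges to ~2.0
```

Releasing only `weights` is releasing the build artifact; releasing `dataset`, `train`, and the hyperparameters is what would correspond to releasing the sources.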
LLMs are not software any more than photographs are.
They're still software, they just don't have source code (yet).
If I self host a project that is open sourced rather than paying for a hosted version, like Sentry.io for example, I don't expect data to come along with the code. Licensing rights are always up for debate in open source, but I wouldn't expect more than the code to be available and reviewable for anything needed to build and run the project.
In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available. I'm not actually sure if Meta does share all that, but training data is separate from open source IMO.
This is in contrast to a compiled binary or obfuscated source image, where alteration may be possible with extraordinary skill and effort but is not expected and possibly even specifically discouraged.
In this sense, weights are entirely like those compiler binaries or obfuscated sources rather than the source code usually associated with "open source"
To be "open source" we would want LLMs where one might be able to manipulate the original training data or training algorithm to produce a set of weights more suited to one's own desires and needs.
Facebook isn't giving us that yet, and very probably can't. They're just trading on the weird boundary state of the term "open source" -- it still carries prestige and garners good will from its original techno-populist ideals, but is so diluted by twenty years of naive consumers who just take it to mean "I don't have to pay to use this" that the prestige and good will is now misplaced.
The open source movement was a cash grab to make the free software movement more palatable to big corp by moving away from copy left licenses. The MIT license is perfectly open source and means that you can buy software without ever seeing its code.
They only give you a blob of data you can run.
So, the only bit that's actually open-sourced in these models is the inference code. But that's a trivial part that people can procure equivalents of elsewhere or reproduce from published papers. In this sense, even if you think calling the models "open source" is correct, it doesn't really mean much, because the only parts that matter are not open sourced.
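As a sketch of how little that open part amounts to, here's a toy greedy decoding loop (hypothetical names throughout; the bigram table stands in for released weights). Everything other than the table is the kind of inference code anyone could rewrite from published papers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Stand-in "weights": a bigram table mapping token -> logits over the next token.
table = {0: [0.0, 2.0, -1.0], 1: [1.0, 0.0, 3.0], 2: [0.0, 0.0, 0.0]}

def generate(start, steps):
    """Greedy decoding: repeatedly pick the argmax next token."""
    out = [start]
    for _ in range(steps):
        probs = softmax(table[out[-1]])
        out.append(probs.index(max(probs)))  # argmax (first index on ties)
    return out

tokens = generate(0, 3)  # [0, 1, 2, 0]
```

The `generate` loop is the "open sourced" part; the `table` - the part that actually matters - is the opaque artifact.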
DOOM-the-engine is open source (https://github.com/id-Software/DOOM), even though DOOM-the-asset-and-scenario-data is not. While you need a copy of DOOM-the-asset-and-scenario-data to "use DOOM to run DOOM", you are free to build other games using DOOM-the-engine.
It sounds like Meta doesn't share source for the training logic. That would be necessary for it to really be open source, you need to be able to recreate and modify the codebase but that has nothing to do with the training data or the trained model.
Meta shares the code for inference but not for training, so even if we say it can be open-source without the training data, Meta's models are not open-source.
I can appreciate Zuck's enthusiasm for open-source but not his willingness to mislead the larger public about how open they actually are.
"The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed."
> In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available
The M in LLM is for "Model".
The code you describe is for an LLM harness, not for an LLM. The code for the LLM is whatever is needed to enable a developer to modify the inputs and then build a modified output LLM (minus standard generally available tools not custom-created for that product).
Training data is one way to provide this. Another way is some sort of semantic model editor for an interpretable model.
> Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
This definition actually makes it impossible for any LLM to be considered open source until the interpretability problem is solved. The trained model is functionally obfuscated code, it can't be read or interpreted by a human.
We may be saying the same thing here, I'm not quite sure if you're saying the model must be available or if what is missing is the code to train your own model.
I did not mean that LLM training data needs to be released for the model to be open source. It would be a good thing if creators of models did release their training data, and I wouldn't even be opposed to regulation which encourages or even requires that training data be released when models meet certain specifications. I don't even think the bar needs to be high there. We could require or encourage smaller creators to release their training data too, and the result would be a net positive when it comes to public understanding of ML models, control over outputs, safety, and probably even capabilities.
Sure, it's possible that training data is being used illegally, but I don't think the solution to that is to just have everyone hide it and treat it as an open secret. We should either change the law, or apply it equally.
But that being said, I don't think it has anything to do with whether the model is "open source". Training data simply isn't source code.
I also don't mean that the license that these models are released under is too restrictive to be open source. Though that is also true, and if these models had source code, that would also prevent them from being open source. (Rather, they would be "source available" models)
What I mean is "The trained model is functionally obfuscated code, it can't be read or interpreted by a human." As you point out, it is definitionally impossible for any contemporary LLM to be considered open source. (Except for maybe some very, very small research models?) There's no source code (yet) so there is no source to open.
I think it is okay to acknowledge when something is technically infeasible, and then proceed to not claim to have done that technically infeasible thing. I don't think the best response to that situation is to, instead, use it as justification for muddying the language to such a degree that it's no longer useful. And I don't think the distinction is trivial or purely semantic. Using the language of open source in this way is dangerous for two reasons.
The first is that it could conceivably make it more challenging for copyleft licenses such as the GPL to protect the works licensed with them. If the "public" no longer treats software with public binaries and without public source code as closed source, then who's to say you can't fork the Linux kernel, release the binary, and keep the code behind closed doors? Wouldn't that also be open source?
The second is that I think convincing a significant portion of the open source community that releasing a model's weights is sufficient to open source a model will cause the community to put more focus on distributing and tuning weights, and less time actually figuring out how to construct source code for these models. I suspect that solving interpretability and generating something resembling source code may be necessary to get these models to actually do what we want them to do. As ML models become increasingly integrated into our lives and production processes, and become increasingly sophisticated, the danger created by having models optimized towards something other than what we would actually like them optimized towards increases.
The trained model doesn't need to be open source though, and frankly I'm not sure what the value there is, specifically with regards to OSS. I'm not aware of a solution to the interpretability problem; even if the model is shared, we can't understand what's in it.
Microsoft ships obfuscated code with Windows builds, but that doesn't make it open source.
IMO a pre-trained model given with the source code used to train/run it is analogous to a company shipping a compiler and a compiled binary without any of the source, which is why I don't think it's "open source" without the training data.
It's the computation that is costly.