You're applying reproducibility unevenly, though.
The Linux kernel source code cannot feasibly be reproduced, but it can be copied and modified. The Mistral weights cannot feasibly be reproduced, but they can be copied and modified. Why is the kernel code open source while the Mistral weights are not?
Reproducibility is clearly not the deciding factor.
source code -> compile -> kernel binary. That binary is what can be reproduced, given the source code.
We don't have the equivalent for Mistral:
source code (+ training data) -> training -> weights
Training is too expensive for the training data to be the preferred form for making modifications to the work. Given that, the weights themselves are the closest thing these things have to "source code".
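That asymmetry can be sketched in toy Python (my illustration, not anyone's actual pipeline): given only the released weights, you can still modify the model directly; what you cannot do is re-derive those weights without the unreleased training run.

```python
# Toy sketch: the "weights" are just a list of floats standing in for
# the real released artifact. All numbers here are made up.
weights = [0.12, -0.40, 0.88]

# "Fine tuning": modify the released artifact directly.
# Note that no training data is needed for this step.
fine_tuned = [w + 0.01 for w in weights]

# Re-deriving the weights, by contrast, would require the full
# (unreleased, prohibitively expensive) pipeline:
#   training data -> training run -> weights
assert fine_tuned != weights
```

The point of the sketch is only that the weights, like source code, are an artifact you can copy and modify as-is, independently of whatever process produced them.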
And this is where the reproducibility argument falls apart: on what basis can we insist that the preferred form for modifying an LLM (the weights) must be reproducible to be open source but the preferred form for modifying a piece of regular software (the code) can be open sourced as is, with none of the processes used to produce the code?
For the weights to embed all the training data in the model, some information must, by definition, be lost. That data can't be recovered, no matter how much you fine tune the model. And because we can't recover it, we don't know how alignment gets set, or the extent of it.
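A toy analogy for that lossiness (mine, not the actual training math): training compresses many examples into far fewer parameters, like reducing a dataset to a summary statistic. The summary can be used and adjusted, but the originals cannot be recovered from it.

```python
# Hypothetical illustration: compressing data into a summary loses information.
training_data = [3.0, 5.0, 4.0, 8.0]

# The "model": a single summary statistic standing in for the weights.
model = sum(training_data) / len(training_data)

# Many different datasets produce the same "model"...
other_data = [5.0, 5.0, 5.0, 5.0]
assert sum(other_data) / len(other_data) == model

# ...so the original data cannot be recovered from the model alone.
```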
The closest thing these things have to source code is the source code and training data used to create the model, because that's what's used to create the model. How big a system is necessary to train it doesn't factor in. It used to take many days to compile the Linux kernel, and at the time many people didn't have access to systems that could even compile it.
First, licenses matter. Photoshop.exe is closed source first and foremost because the license says so.
Second and more importantly for this discussion, Adobe doesn't prefer to work with hexedit, they prefer to work with the source code.
OpenAI prefers to fine tune their existing models rather than train new ones. They fine tune regularly, and have only trained from scratch four times total, with each of those being a completely new model, not a modification.
That means the weights of an LLM are the preferred form for modification, which meets the GPL's definition of 'source code':
> The “source code” for a work means the preferred form of the work for making modifications to it.
Now I get that “Identical” is a bit more nebulous when it comes to LLMs due to their inherent nondeterminism, but let’s take it to mean the executable itself, not the results produced by the executable.
No, I'm using the strict definition "capable of being reproduced", where reproduce means "to cause to exist again or anew". In and of itself the word doesn't comment on whether you're starting from source code or anything else, it just says that something must be able to be created again.
Yes, in the context of compilation this tends to refer to reproducible builds (which is a whole rabbit hole of its own), but here we're not dealing with two instances of compilation, so it's not appropriate to use the specialized meaning. We're dealing with two artifacts (a set of C files and a set of weights) that were each produced in different ways, and we're asking whether each one can be reproduced exclusively from data that was released alongside the artifact. The answer is that no, neither the source files nor the weights can be reproduced from the data that was released alongside them.
So my question remains: on what basis can we say that the weights are not open source but the C files are? Neither can be reproduced from data released alongside them, and both are the preferred form which the original authors would choose to make modifications to. What distinguishes them?
Now I know it seems like I'm taking the opposite side of my original take here, but come on - you can't genuinely believe that being unable to immediately produce a byte-for-byte copy of the Linux kernel (even one that behaves 99.999% the same) is even remotely the same as not being able to reproduce an "open" LLM.
It's open source because they licensed the preferred form of the work for making modifications under a FOSS license. That's it. Reproducibility of that preferred form from scratch doesn't factor into it.
I think it goes to show how hard it is to make analogies between the two fields.
Maybe it is just not source at all. Open or closed.
It is data. Like a CSV of addresses and coordinates that were collated from different sources that, say, are no longer available.
It is a very philosophical topic. What if machines got faster and you could train Llama in 5 minutes, and an SSD could hold all the training data? Then it would feel more like a compiled artifact than like data. Not releasing the training data would then feel like hiding something.