Better HN

0 points · CharlesW 2y ago · 0 comments

It's not a dumb question, and the answer is "yes".

simonw 2y ago

A big catch here is that you can't slap an open source license on a bunch of copyrighted training data, and to date no one has created a truly convincing LLM trained exclusively on public domain data. It might happen soon though - there are some convincing efforts in progress.

CharlesW (OP) 2y ago

Absolutely, because it’s trained mostly on unlicensed, copyrighted content, they basically can’t release source.

gfodor 2y ago

Many people think these companies are training on unlicensed data, but I think OpenAI licenses their data; they just “license” it the way one would need to in order to read it.

logicchains 2y ago

https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama... this one claims to have been trained only on permissively licensed data.

nabakin 2y ago

Agreed. It's ridiculous that people have to resort to calling their own question dumb to avoid being attacked by toxic commenters.

dudus 2y ago

If you release that instead of the binary weights you can be both more open and less useful for users. Fun

zeroCalories 2y ago

Come on, that's not reasonable to expect from a company, or useful for indie hackers. Having weights that can be used however you like is enough for most people, even large companies.

schoen 2y ago

Maybe it should be called something else? "Openly-licensed"?

Just because the model weights are not really "source" (either as a matter of intuition or for example following the OSI "preferred form in which a programmer would modify the program" definition).

zeroCalories 2y ago

Sure, but I don't want to train anyone's model from scratch. Realistically, I can't download all the training data, or run the pipeline, or train the model. Making all of that available to me would be a massive burden on the company too, so they simply won't do it. If I'm able to fine-tune it, that's enough for me, and imo, that fits with the spirit of open/free software. We have to understand that this is fundamentally a different thing than something like the Linux kernel, and closer to something like an industrial project. The output is just a bunch of numbers instead of something physical.
