But agreed that we're waiting for a court case to confirm that. Although really, the main questions for any court cases are not going to be around the principle of fair use itself or whether training is transformative enough (it obviously is), but rather on the specifics:
1) Was any copyrighted material acquired legally (not applicable here), and
2) Is the LLM always providing a unique expression (e.g. not regurgitating books or libraries verbatim)
And in this particular case, they confirmed that the new implementation is 98.7% unique.
This is just an assertion that you're making. There's no argument here. I'm aware that this is also an assertion that some judges have made.
My claim is that LLMs are not human, therefore when you apply words like "training" to them, you're only doing it metaphorically. It's no more "training" than copying code to a different hard drive is training that hard drive. And it's no more "transformative" than rar'ing or zipping the code, then unzipping it. I can't sell jpg conversions of pngs I downloaded from Getty.
I have no idea how LLMs can be considered transformative work that immunizes me from owing the least bit of respect to the source material, but if I sample 2-6 second snatches from 10 different songs, put them through over 9000 filters and blend them into a new work, I owe money to everyone involved. I might even owe money to the people who wrote the filters, depending on the licensing.
> 98.7% unique.
This doesn't mean anything. This is a meaningless arrangement of words. The way we figure out whether something is piracy is through provenance, not bizarre ad hoc measurements. If I read a book in Spanish and rewrite it in English, it doesn't suddenly become mine even though it's 96.6492387% unique. Not even if I drop a few chapters, add in a couple of my own, and change the ending.
...OK? Was somebody asking me for an "argument"? I'm just stating how things are currently understood.
> And it's no more "transformative" than rar'ing or zipping the code, then unzipping it.
That's obviously false, so I'm not sure what to tell you.
> but if I sample 2-6 second snatches from 10 different songs, put them through over 9000 filters and blend them into a new work, I owe money to everyone involved
You don't, actually, if they're no longer recognizable -- which they wouldn't be after "9000 filters". I don't know where you got the idea that you'd still owe money. And I've certainly never heard of an audio filter license that was contingent on commercial distribution.
> This doesn't mean anything. This is a meaningless arrangement of words.
Statistics are meaningful. Obviously you need to look at the actual identical lines. But if they're a bunch of trivial things like initializing variables with obvious names, then they don't count for much. And if you're adhering to the same API, you would expect to have some small percentage of lines happen to match. So the fact that this is <2%, as opposed to 40%, is hugely significant as a first step of analysis.
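To make concrete what that kind of first-step analysis could look like, here is a purely hypothetical sketch of a line-overlap metric. The function name, the example strings, and the triviality threshold are all my assumptions, not the methodology actually used to produce the "98.7%" figure:

```python
# Hypothetical sketch of a line-overlap metric of the kind that could
# produce a "% unique" figure. The triviality filter (dropping blank and
# very short lines) is an assumption; it keeps boilerplate like lone
# braces from inflating the match rate between any two codebases.

def shared_line_fraction(original: str, candidate: str) -> float:
    """Fraction of the candidate's non-trivial lines that appear
    verbatim (after whitespace stripping) in the original."""
    def non_trivial(lines):
        # Drop blanks and very short lines ('}', 'else', etc.) that are
        # bound to collide in unrelated code.
        return {ln.strip() for ln in lines if len(ln.strip()) > 10}

    orig = non_trivial(original.splitlines())
    cand = non_trivial(candidate.splitlines())
    if not cand:
        return 0.0
    return len(orig & cand) / len(cand)

a = "int count = 0;\nfor (item in items) {\n    process(item, config);\n}\n"
b = "int count = 0;\nwhile (queue.pop(&item)) {\n    process(item, config);\n}\n"
print(f"{(1 - shared_line_fraction(a, b)) * 100:.1f}% unique")  # 33.3% unique
```

The point of the filter is exactly the caveat above: matches on trivial lines (variable initializations, shared API calls) are expected, so the interesting question is what the residual matches look like once those are set aside.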
I suggest you might find conversations here on HN more productive if you soften your tone a bit. Saying things like "this is just an assertion that you're making" or "this is a meaningless arrangement of words" is not generally going to make people want to respond to you.
If you’ve used copyrighted books and turned them into a free write-a-book machine, you are suddenly using the authors' own works against them, in a way that a judge might rule is not very fair.
“Effect of the use upon the potential market for or value of the copyrighted work: Here, courts review whether, and to what extent, the unlicensed use harms the existing or future market for the copyright owner’s original work. In assessing this factor, courts consider whether the use is hurting the current market for the original work (for example, by displacing sales of the original) and/or whether the use could cause substantial harm if it were to become widespread.”
This is for the same reason that search results or search snippets aren't deemed to harm creators under copyright. Yes, some percentage of sales might be lost. And truly, people may be buying fewer JavaScript tutorial books now that LLMs can teach you JavaScript or write it for you. But the relation is so indirect that there's very little chance a court would accept the argument.
Because what the LLM is doing is reading tons of JavaScript and JavaScript tutorials and resources online, and producing its own transformed JavaScript. And the effect of any single JavaScript tutorial book in its training set is so marginal to the final result, there's no direct effect.
And the reason this makes sense is that it's no different from a teacher reading 20 books on JavaScript and then writing their own that turns out to be a best-seller. Yes, it takes away from the previous best-sellers. But that's fine, because they're not copying any of the previous works directly. They're transforming the facts they learned into a new synthesis.
Some might hold that we've granted persons certain exemptions, on account of them being persons. We do not have to grant machines the same.
> In copyright terms, it's such an extreme transformative use that copyright no longer applies.
Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim? Sure, it can also produce extremely transformed versions, but is that really relevant if it holds within it enough information for a (near-)verbatim reproduction?
I feel as though, from an information-theoretic standpoint, it can't be possible that an LLM (which is almost certainly <1 TB big) can contain any substantial verbatim portion of its training corpus, which includes audio, images, and videos.
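As a back-of-the-envelope version of that information-theoretic point (both sizes below are rough assumptions for illustration, not measured figures for any particular model or corpus):

```python
# Rough upper bound on verbatim storage. Both figures are assumed
# round numbers for illustration only.
corpus_bytes = 10e12   # assume a ~10 TB training corpus
model_bytes = 0.5e12   # assume a ~0.5 TB model (weights on disk)

# Even if every byte of the weights stored training data with perfect
# efficiency (which it doesn't -- the weights also have to encode
# everything the model can do), at most this fraction of the corpus
# could be held verbatim:
max_verbatim_fraction = model_bytes / corpus_bytes
print(f"at most {max_verbatim_fraction:.0%} of the corpus")  # at most 5% of the corpus
```

The real bound is far lower, since the weights are not a dedicated archive; the calculation only shows that wholesale verbatim retention of the corpus is arithmetically impossible, not that no individual work can be memorized.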
It doesn't need to for my argument to make sense. It's a problem if it reproduces a single copyrighted work (near)-verbatim. Which we have plenty of examples of.
No, we don't have to, but so far we do, because that's the most legally consistent approach. If you want to change that, you're going to need to pass new laws that may wind up radically redefining intellectual property.
> Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim?
Of course it has, if the transformation is extreme, as it appears to be here. If I memorize the lyrics to a bunch of love songs, and then write my own love song where every line is new, nobody's going to successfully sue me just because I can sing a bunch of other songs from memory.
Also, it's not even remotely clear that the LLM can produce the training data near-verbatim. Generally it can't, unless it's something that it's been trained on with high levels of repetition.
> you're going to need to pass new laws that may wind up radically redefining intellectual property
You're correct that this is one route to resolving the situation, but I think it's reasonable to lean more strongly into the original intent of intellectual property law: to defend creative work as a way for creators to sustain themselves. That intent draws a pretty clear distinction between human creativity and reuse on the one hand and LLMs on the other.
So do 10 000 chimpanzees on typewriters.
Training an LLM inherently requires making a copy of the work. Even the initial act of downloading it from the internet and copying it into memory to train the LLM is a copy that can be governed by its license and by copyright law.
> The court held that making RAM copies as an essential step in utilizing software was permissible under §117 of the Copyright Act even if they are used for a purpose that the copyright holder did not intend.
https://en.wikipedia.org/wiki/Vault_Corp._v._Quaid_Software_....
IIRC this exact argument was made in the Blizzard vs bnetd case, wasn't it? Though I can't find confirmation on whether that argument was rejected or not...
But that's not relevant here. Because the copyleft license does not prohibit that (and it's not even clear that any license can prohibit it, as courts may confirm it's fair use, as most people are currently assuming). That's why I noted under (1) that it's not applicable here.
LLM training involves ingesting works (in a potentially transformative process) and partially reproducing them, which is generally a restricted action when it comes to licensing.
BTW, in 2023 I watched ChatGPT spit out hundreds of lines of F# verbatim from my own GitHub. A lot of people had this experience with GitHub Copilot. "98.7% unique" still leaves room for a lot of infringement.
That's not relevant, because you can still sue the person using the LLM and publishing the repository. Legal liability is completely unchanged.
It's changed completely, from your own example.
If you commission art from an artist who paints a modified copy of Warhol's work, the artist is liable (even if you keep that work private, for personal use).
If you commission it from OpenAI (by sending a query to their ChatGPT API), by your argument, you are the person liable, and OpenAI is off the hook even if that work is distributed further.
I'm not going to argue about the merits of creativity here, or that someone putting a prompt into ChatGPT considers themselves an artist.
That's irrelevant. The work is created on OpenAI servers, by the LLMs hosted there, and is then distributed to whoever wrote the prompt.
Models run locally are distributed by whoever trained them.
If you train a model on whatever data you legally have access to, and produce something for yourself, it's one thing.
Distribution is where things start to get different.