If I produce a terrible shakycam recording of a film while sitting in a movie theater, it's not a verbatim copy, nor is it even necessarily representative of the original work -- muddied audio, audience sounds, cropped screen, backs of heads -- and yet it would be considered copyright infringement?
How many times does one need to compress the JPEG before it's fair use? I'm legitimately curious what the test is here.
That is why so-called derivative works are allowed (and even encouraged). If copyrighted material is ingested, modified or enhanced to add value, and then regurgitated, that is legal, whereas copying it without adding value is not.
If derivative works weren't deemed acceptable, copyright would have the opposite of its intended effect and become an impediment to progress.
Derivative works are not given a free pass from the normal constraints of copyright. You cannot legally publish books in the universe of A Song of Ice and Fire without permission from the author (and often publisher), calling them “derivative works.”
It’s why fan fiction is such a gray area for copyright and why some publishers have historically squashed it hard.
The main exception here is fair use, which requires a multi-factor analysis by the judiciary and is typically decided on a case-by-case basis.
Derivative works are not "allowed (and even encouraged)" without a license from the copyright holder. Creating a derivative work is an exclusive right of the copyright holder just like making verbatim copies and requires a license for anyone else, unless an exception to copyright protection (like fair use) applies.
Derivative works are tolerated in some cases, like some manga or fan fiction, but it is a gray area, and whenever the author or publisher wants to pursue it, it is fully within their rights to do so. Many do pursue it.
(You can be inspired by something, and this is where arguments can happen if the inspiration gets a bit too literal, but no one will say with a straight face that inspiration is a thing that happens to software.)
So… it’s complicated. This is one of the weird areas where music copyright and other copyright seem to differ in the US.
In the US the situation is complex and there are a lot of weird special interests [0], but generally the composer/author of a song has the right to decide who first records and releases it; after the first recording, covers require a mechanical license, which is compulsory (i.e., the author cannot object).
In music there are _a lot_ of special cases and different rights are decided with different kinds of licenses, some of which are compulsory. I think it’s an area that doesn’t make for good analogies with copyright in other media.
Which is compulsory for the performer too.
A derivative work like a cover is sort of acceptable when it's performed live by a person for some audience (a gray area, but Twitch sort of allows it, with a bunch of rules). As soon as you want to publish it, you MUST have a license. And a chatbot is a derivative work that is definitely not performed live by a person for some audience.
I've seen great tracks taken down from all legal channels because they featured a sample from another song. Sometimes they stayed up, but mostly they were taken down. It is entirely at the original publisher's discretion...
If the work is "derivative" in the legal sense, it is covered by copyright, and you may not create derivative works without the copyright holder's permission.
What I should have said is that simply being inspired by a work or copying unprotectable elements (like facts or ideas) does not create a derivative work.
For example, if ChatGPT were to generate Star Wars, except with Dookies instead of Wookies, that might be illegal. If it were to learn what a spaceship is from Star Wars and then create something substantially new, it would not. The key is that it must not be substantially similar to the original. You must add enough value that it becomes something new, not just rehash the original.
That seems to go against the notion that copyright can last beyond the author's lifetime: an author's contribution to progress in the arts and sciences tends to decline sharply after death.
When model training reads the text and creates weights internally, is that a substantial transformation? I think there’s a pretty strong argument that it is.
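For intuition, here is a minimal sketch of the simplest possible version of that process: a character-level bigram model trained by gradient descent. This is a toy stand-in for LLM training, not anyone's actual pipeline, and all names are made up. The text goes in; only a weight matrix comes out:

    import numpy as np

    text = "the cat sat on the mat"
    chars = sorted(set(text))
    idx = {c: i for i, c in enumerate(chars)}
    V = len(chars)

    W = np.zeros((V, V))                   # the only artifact that persists
    for _ in range(200):                   # gradient descent on cross-entropy
        for a, b in zip(text, text[1:]):
            logits = W[idx[a]]
            p = np.exp(logits) / np.exp(logits).sum()  # softmax over next chars
            grad = p.copy()
            grad[idx[b]] -= 1.0            # dLoss/dlogits = p - onehot(target)
            W[idx[a]] -= 0.5 * grad        # update weights; text is not stored

    print(W.shape)    # a (V, V) matrix of floats, no verbatim copy of the text

Whether that transformation is "substantial" in the legal sense is of course the question, but mechanically what persists is statistics about the text, not the text itself.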
The point here is that book files have to be copied before they can be used for training. Copyright notices typically say something like "No unauthorised copying or transmission in any form (physical, electronic, etc.)"
Individuals who torrented music and video files have been bankrupted for doing exactly this.
The same laws should apply when a corporation downloads torrent files. What happens to them after they're downloaded is irrelevant to the argument.
If this is enforced (still to be seen...), it would be financially catastrophic for Meta, because there are statutory damages for works that have been registered for copyright protection, which most trad-pubbed books, and many self-pubbed books, are.
Only if they seeded the data and some other entity downloaded it, i.e. they hosted the data. In a previous article I believe it was called out that Meta was being a leecher (not seeding back what they downloaded).
It's the hosting that gets you, not the act of downloading it.
However, people have been prosecuted for not even hosting a torrent, but merely providing a link to where people can find it.
e.g. https://torrentfreak.com/operator-of-popcorn-time-info-site-...
The lender owns the book, and it is within his rights to loan it to whoever he wants. That is legal. Making this illegal would end libraries.
The borrower is well within his rights to accept the book, and as its current possessor he is even allowed to make a copy of the book (see the famous TiVo case). Making this illegal would end backups and format/time shifting.
When the borrower returns the book, he keeps the copy. Oh no! Surely he must now become a criminal? Nope. Possessing an unauthorized copy is also not illegal, despite what many copyright holders would like you to believe. Making this illegal would also criminalize a lot of legitimate format/time shifting; again, see the famous TiVo case.
If the borrower were to loan his homemade copy to someone else THEN it would finally become illegal.
Nothing about AI changes any of this.
Seems like a big gap there.
How is this done? Are bits not written into RAM or disk? Are they not sent between machines in a training cluster? That's copying.
> it is seemingly not far removed from how humans consume content
Except that humans don't make full copies to RAM, or disk or paper.
> Using software almost always involves creating copies, even though many of these copies only exist for a very short time. For example, executing a program means copying it from the hard disk into RAM so that the CPU can interpret the instructions. Because of this, the right to run a program is considered to fall under the copyright of the author.
For comparison, when a human looks at the letters, there is no copying.
Also, models can reproduce text verbatim, which proves that they store it.
So it is unfair that ordinary folks got sued for this while Zuckerberg gets away with a violation a million times larger. He must go directly to jail.
The computer model works differently, of course, but functionally it's the same idea.
It seems like it is very much a matter of fidelity.
As mentioned in another comment, LLMs (and most popular machine learning algorithms) can be viewed, correctly, as compression algorithms which leverage lossy encoding + interpolation to force a kind of generalization.
Your argument is that a video wouldn't count as pirated if the compression used for the pirated copy was lossy (or at least sufficiently lossy). The closest real-world example would be the cases where someone records the screening of a movie on their phone and then uploads it. Such a copy is lossy enough that you can't reproduce anything really like the original, but by most definitions it is still considered copyright infringement.
You would never use a human to backup your financial reports, but the human might be able to give a good overview. You would never use an LLM to backup your financial reports, but they might be able to give a good overview.
AI training data is disposable. There is nothing that could be called a compression algorithm that discards all of the data you put into it. AI uses training data as examples of what the next token in a token sequence is. The examples are disposable reference points, not the model itself. That's how you get image models that are 20GB in size despite training on 20PB of data. It's 20PB of examples used to form the shape of a 20GB model. You could show it 5GB of training data or 500EB of training data and it would still be 20GB, because it is not a compression algo; it's a 20GB shape formed by external data.
You can compress 20PB of text to 20GB or even less if the input is super repetitive. The same goes for images: if 50% of the images are cats, then you learn how to represent the cat pixels with a few vectors, and then you could represent all the cats in the world doing all possible cat actions.
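A quick sketch of that point, using only the standard library (exact ratios will of course vary with the input):

    import zlib

    # ~20 MB of maximally repetitive "text": the same token over and over.
    repetitive = b"cat " * 5_000_000
    compressed = zlib.compress(repetitive, level=9)
    print(f"{len(repetitive):,} bytes -> {len(compressed):,} bytes")
    # Repetition compresses by orders of magnitude; diverse data would not.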
But please have the courage to respond to this: when the AI is caught regurgitating the exact text from a popular book, the exact verses from a poem, or the exact function from some codebase, how can you defend that it is not memorizing things? If a human used my poem (after they read it) and signed their name under it, would you defend them?
I'm sorry, but this is a fundamentally incorrect view of machine learning (including, but not limited to, transformers).
From an information-theoretic perspective the two are essentially identical, with the exception that standard compression algorithms have no proper "loss" function beyond minimizing reconstruction error together with the resulting compressed size.
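Concretely, the identity is that any predictive model defines a code: a token the model assigns probability p costs about -log2(p) bits under an entropy coder, so better prediction is literally better compression. A toy sketch (the uniform model and names here are made up for illustration):

    import math

    def ideal_bits(tokens, prob):
        """Shannon code length implied by a next-token model prob(prev, cur)."""
        return sum(-math.log2(prob(prev, cur))
                   for prev, cur in zip(tokens, tokens[1:]))

    uniform = lambda prev, cur: 1 / 26       # knows nothing: log2(26) bits/char
    print(ideal_bits("thecatsat", uniform))  # 8 transitions * ~4.7 bits
    # A better predictor assigns higher p, hence fewer bits: the training
    # objective (cross-entropy) is exactly the expected code length.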
Here's a link to the relevant section on Wikipedia for more information if you'd like [0]. MacKay's Information Theory, Inference and Learning Algorithms is the standard full-text treatment of this topic [1]. Ted Chiang's article "ChatGPT Is a Blurry JPEG of the Web" is a pretty good "pop sci" exploration of this topic if you don't want to get too into the mathematics [2].
0. https://en.wikipedia.org/wiki/Data_compression#Machine_learn...
1. https://www.inference.org.uk/itprnn/book.pdf
2. https://www.newyorker.com/tech/annals-of-technology/chatgpt-...
The test is whether a judge says it is fair use, nothing else.
The judge will take into account the human factor in this matter, e.g. things like who did the actual work and who just used an algorithm (which is not the hard part anymore; the code can be obtained on the internet for free). And we all know that DL is nowhere without huge amounts of data.
Needing the original material isn't enough for claiming copyright infringement, as we have existing counterexamples.
Nobody is going to try to extract a book page by page from ChatGPT, let's be realistic. (And you can't anyway.)
The model isn’t storing the book.
I think that is the center of the conversation. What does it mean for a computer to "understand"? If I wrote some code that somehow transformed the text of the book and never returned the verbatim text but somehow modified the output, I would likely not be spared, because the ruling would probably be that my transformation is "trivial".
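As a toy illustration of a transformation a court would likely call trivial (the excerpt below is just a placeholder string): ROT13 never returns the verbatim text, yet it plainly preserves the entire work.

    import codecs

    book_text = "It was a bright cold day in April..."  # placeholder excerpt
    scrambled = codecs.encode(book_text, "rot13")       # reversible scramble
    print(scrambled)                          # never matches the source verbatim
    print(codecs.decode(scrambled, "rot13") == book_text)  # but it's all there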
Personally, I think we have several fixes we need to make:
1. Abolish the CFAA.
2. Limit copyright to a maximum of 5 years from date of production with no extension possible for any reason.
3. Allow explicit carveout in copyright for transformational work. Explicitly allow format shifting, time shifting, yada yada.
4. Prohibit authors and publishers from including the now obviously false statements like "No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording" bla bla bla in their works.
5. I am sure I am missing some stuff here.
For brand protection, we already have trademark law. Most readers here already know this, but we really should sever the artificial ties we have created between patents, trademarks, and copyright.
In early computing, everything was closed source. Quoting the Wikipedia page:
> To develop a legal BIOS, Phoenix used a clean room design. Engineers read the BIOS source listings in the IBM PC Technical Reference Manual. They wrote technical specifications for the BIOS APIs for a single, separate engineer—one with experience programming the Texas Instruments TMS9900, not the Intel 8088 or 8086—who had not been exposed to IBM BIOS source code.
The legal team at Phoenix deemed it inappropriate to "recall source in their own words" for legal reasons.
My non-legal intuition is that these companies training their models are violating copyright. But the stakes are too high; it's too big to fail, if you will. If we don't do it, then our competitors will destroy us. How do you reconcile that?
But any such arrangement needs to be hammered out by the legislature. As laws are, I think it's pretty clear that infringement is happening.
I’m sure all these ‘clever’ questions would be useful if this trial were about humans, but it’s not.