If I produce a terrible shakycam recording of a film while sitting in a movie theater, it's not a verbatim copy, nor is it even necessarily representative of the original work -- muddied audio, audience sounds, cropped screen, backs of heads -- and yet it would be considered copyright infringement?
How many times does one need to compress the JPEG before it's fair use? I'm legitimately curious what the test is here.
That is why so-called derivative works are allowed (and even encouraged). If copyrighted material is ingested, modified or enhanced to add value, and then regurgitated, that is legal, whereas copying it without adding value is not.
If derivative works weren't deemed acceptable, copyright would have the opposite of its intended effect and become an impediment to progress.
Derivative works are not given a free pass from the normal constraints of copyright. You cannot legally publish books in the universe of A Song of Ice and Fire without permission from the author (and often publisher), calling them “derivative works.”
It’s why fan fiction is such a gray area for copyright and why some publishers have historically squashed it hard.
The main exception here is fair use, which requires a multi-factor analysis by the judiciary and is typically decided on a case-by-case basis.
Derivative works are not "allowed (and even encouraged)" without a license from the copyright holder. Creating a derivative work is an exclusive right of the copyright holder just like making verbatim copies and requires a license for anyone else, unless an exception to copyright protection (like fair use) applies.
Derivative works are tolerated in some cases, like some manga or fan fiction, but it is a gray area, and whenever the author or publisher wants to pursue it, it is fully within their rights to do so. Many do pursue it.
(You can be inspired by something, and this is where arguments can happen if the inspiration gets a bit too literal, but no one will say with a straight face that inspiration is a thing that happens to software.)
So… it’s complicated. This is one of the weird areas where music copyright and other copyright seem to differ in the US.
In the US the situation is complex and there are a lot of weird special interests [0], but generally the composer/author of a song has the right to decide who first records and releases it; after the first recording, covers require a mechanical license, which is compulsory (i.e., the author cannot object).
In music there are _a lot_ of special cases and different rights are decided with different kinds of licenses, some of which are compulsory. I think it’s an area that doesn’t make for good analogies with copyright in other media.
Which is compulsory for the performer too.
A derivative work like a cover is sort of acceptable when it's performed live by a person for some audience (a gray area, but Twitch sort of allows it, with a bunch of rules). As soon as you want to publish it, you MUST have a license. And a chatbot is a derivative work that is definitely not performed live by a person for some audience.
I've seen great tracks taken down from all legal channels because they featured a sample from another song. Sometimes they stayed up, but mostly they were taken down. It is entirely at the original publisher's discretion...
If the work is "derivative" in the legal sense, it is covered by copyright, and you may not create derivative works without the copyright holder's permission.
What I should have said is that simply being inspired by a work or copying unprotectable elements (like facts or ideas) does not create a derivative work.
For example, if ChatGPT were to generate Star Wars, except with Dookies instead of Wookies, that might be illegal. If it were to learn what a spaceship is from Star Wars and then create something substantially new, it would not. The key is that it must not be substantially similar to the original. You must add enough value that it becomes something new, not just rehash the original.
That seems to go against the notion that copyright can last beyond the author's lifetime: an author's contribution to progress in the arts and sciences tends to decline sharply after death.
When model training reads the text and creates weights internally, is that a substantial transformation? I think there’s a pretty strong argument that it is.
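For intuition, here is a minimal sketch of the simplest possible version of that process: a character-level bigram model trained by gradient descent. This is a toy stand-in for LLM training, not anyone's actual pipeline, and all names are made up. The text goes in; only a weight matrix comes out:

    import numpy as np

    text = "the cat sat on the mat"
    chars = sorted(set(text))
    idx = {c: i for i, c in enumerate(chars)}
    V = len(chars)

    W = np.zeros((V, V))                   # the only artifact that persists
    for _ in range(200):                   # gradient descent on cross-entropy
        for a, b in zip(text, text[1:]):
            logits = W[idx[a]]
            p = np.exp(logits) / np.exp(logits).sum()  # softmax over next chars
            grad = p.copy()
            grad[idx[b]] -= 1.0            # dLoss/dlogits = p - onehot(target)
            W[idx[a]] -= 0.5 * grad        # update weights; text is not stored

    print(W.shape)    # a (V, V) matrix of floats, no verbatim copy of the text

Whether that transformation is "substantial" in the legal sense is of course the question, but mechanically what persists is statistics about the text, not the text itself.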
The point here is that book files have to be copied before they can be used for training. Copyright notices typically say something like "No unauthorised copying or transmission in any form (physical, electronic, etc.)"
Individuals who torrented music and video files have been bankrupted for doing exactly this.
The same laws should apply when a corporation downloads torrent files. What happens to them after they're downloaded is irrelevant to the argument.
If this is enforced (still to be seen...), it would be financially catastrophic for Meta, because there are statutory damages for works that have been registered for copyright protection, which most trad-pubbed books, and many self-pubbed books, are.
Only if they seeded the data and some other entity downloaded it, i.e. they hosted the data. In a previous article I believe it was called out that Meta was being a leecher (not seeding back what they downloaded).
It's the hosting that gets you, not the act of downloading it.
However, people have been prosecuted for not even hosting a torrent, but merely providing a link to where people can find it.
e.g. https://torrentfreak.com/operator-of-popcorn-time-info-site-...
The lender owns the book, and it is within his rights to loan it to whoever he wants. That is legal. Making this illegal would end libraries.
The borrower is well within his rights to accept the book, and as its current possessor he is even allowed to make a copy of the book (see the famous TiVo case). Making this illegal would end backups and format/time shifting.
When the borrower returns the book, he keeps the copy. Oh no! Surely he must now become a criminal? Nope. Possessing an unauthorized copy is also not illegal, despite what many copyright holders would like you to believe. Making this illegal would also criminalize a lot of legitimate format/time shifting; again, see the famous TiVo case.
If the borrower were to loan his homemade copy to someone else THEN it would finally become illegal.
Nothing about AI changes any of this.
Seems like a big gap there.
How is this done? Are bits not written into RAM or disk? Are they not sent between machines in a training cluster? That's copying.
> it is seemingly not far removed from how humans consume content
Except that humans don't make full copies to RAM, or disk or paper.
> Using software almost always involves creating copies, even though many of these copies only exist for a very short time. For example, executing a program means copying it from the hard disk into RAM so that the CPU can interpret the instructions. Because of this, the right to run a program is considered to fall under the copyright of the author.
For comparison, when a human looks at the letters, there is no copying.
Also, models can reproduce text verbatim, which proves that they store it.
So it is unfair that ordinary folks got sued for this while Zuckerberg gets away with a violation a million times larger. He must go directly to jail.
The computer model works differently, of course, but functionally it's the same idea.
It seems like it is very much a matter of fidelity.
As mentioned in another comment, LLMs (and most popular machine learning algorithms) can be viewed, correctly, as compression algorithms which leverage lossy encoding + interpolation to force a kind of generalization.
Your argument is that a video wouldn't count as pirated if the compression used for the pirated copy was lossy (or at least sufficiently lossy). The closest real-world example would be the cases where someone records the screening of a movie on their phone and then uploads it. Such a copy is lossy enough that you can't reproduce anything really like the original, but by most definitions it is still considered copyright infringement.
You would never use a human to backup your financial reports, but the human might be able to give a good overview. You would never use an LLM to backup your financial reports, but they might be able to give a good overview.
AI training data is disposable. There is nothing that could be called a compression algorithm that discards all of the data you put into it. AI uses training data as examples of what the next token in a token sequence is. The examples are disposable reference points, not the model itself. That's how you get image models that are 20GB in size despite training on 20PB of data. It's 20PB of examples used to form the shape of a 20GB model. You could show it 5GB of training data or 500EB of training data and it would still be 20GB, because it is not a compression algo; it's a 20GB shape formed by external data.
You can compress 20PB of text to 20GB or even less if the input is super repetitive. The same goes for images: if 50% of the images are cats, then you learn how to represent the cat pixels with a few vectors, and then you could represent all the cats in the world doing all possible cat actions.
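A quick sketch of that point, using only the standard library (exact ratios will of course vary with the input):

    import zlib

    # ~20 MB of maximally repetitive "text": the same token over and over.
    repetitive = b"cat " * 5_000_000
    compressed = zlib.compress(repetitive, level=9)
    print(f"{len(repetitive):,} bytes -> {len(compressed):,} bytes")
    # Repetition compresses by orders of magnitude; diverse data would not.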
But please have the courage to respond to this: when the AI is caught regurgitating the exact text from a popular book, the exact verses from a poem, or the exact function from some codebase, how can you defend that it is not memorizing things? If a human used my poem (after they read it) and signed their name under it, would you defend them?
I'm sorry, but this is a fundamentally incorrect view of machine learning (including, but not limited to, transformers).
From an information-theoretic perspective the two are essentially identical, with the exception that standard compression algorithms have no proper "loss" function beyond minimizing reconstruction error together with the resulting compressed size.
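Concretely, the identity is that any predictive model defines a code: a token the model assigns probability p costs about -log2(p) bits under an entropy coder, so better prediction is literally better compression. A toy sketch (the uniform model and names here are made up for illustration):

    import math

    def ideal_bits(tokens, prob):
        """Shannon code length implied by a next-token model prob(prev, cur)."""
        return sum(-math.log2(prob(prev, cur))
                   for prev, cur in zip(tokens, tokens[1:]))

    uniform = lambda prev, cur: 1 / 26       # knows nothing: log2(26) bits/char
    print(ideal_bits("thecatsat", uniform))  # 8 transitions * ~4.7 bits
    # A better predictor assigns higher p, hence fewer bits: the training
    # objective (cross-entropy) is exactly the expected code length.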
Here's a link to the relevant section on Wikipedia for more information if you'd like [0]. MacKay's Information Theory, Inference and Learning Algorithms is the standard full-text treatment of this topic [1]. Ted Chiang's article "ChatGPT Is a Blurry JPEG of the Web" is a pretty good "pop sci" exploration of this topic if you don't want to get too into the mathematics [2].
0. https://en.wikipedia.org/wiki/Data_compression#Machine_learn...
1. https://www.inference.org.uk/itprnn/book.pdf
2. https://www.newyorker.com/tech/annals-of-technology/chatgpt-...
The test is whether a judge says it is fair use, nothing else.
The judge will take into account the human factor in this matter, e.g. things like who did the actual work and who just used an algorithm (which is not the hard part anymore; the code can be obtained on the internet for free). And we all know that DL is nowhere without huge amounts of data.
Needing the original material isn't enough for claiming copyright infringement, as we have existing counterexamples.
Nobody is going to try to extract a book page by page from ChatGPT, let's be realistic. (And you can't anyway.)
The model isn’t storing the book.
I think that is the center of the conversation. What does it mean for a computer to "understand"? If I wrote some code that somehow transformed the text of the book and never returned the verbatim text but somehow modified the output, I would likely not be spared, because the ruling would probably be that my transformation is "trivial".
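As a toy illustration of a transformation a court would likely call trivial (the excerpt below is just a placeholder string): ROT13 never returns the verbatim text, yet it plainly preserves the entire work.

    import codecs

    book_text = "It was a bright cold day in April..."  # placeholder excerpt
    scrambled = codecs.encode(book_text, "rot13")       # reversible scramble
    print(scrambled)                          # never matches the source verbatim
    print(codecs.decode(scrambled, "rot13") == book_text)  # but it's all there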
Personally, I think we have several fixes we need to make:
1. Abolish the CFAA.
2. Limit copyright to a maximum of 5 years from date of production with no extension possible for any reason.
3. Allow explicit carveout in copyright for transformational work. Explicitly allow format shifting, time shifting, yada yada.
4. Prohibit authors and publishers from including the now obviously false statements like "No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording" bla bla bla in their works.
5. I am sure I am missing some stuff here.
For brand protection, we already have trademark law. Most readers here already know this, but we really should sever the artificial ties we have created between patents, trademarks, and copyright.
In early computing, everything was closed source. Quoting the Wikipedia page:
> To develop a legal BIOS, Phoenix used a clean room design. Engineers read the BIOS source listings in the IBM PC Technical Reference Manual. They wrote technical specifications for the BIOS APIs for a single, separate engineer—one with experience programming the Texas Instruments TMS9900, not the Intel 8088 or 8086—who had not been exposed to IBM BIOS source code.
The legal team at Phoenix deemed it inappropriate to "recall source in their own words" for legal reasons.
My non-legal intuition is that these companies training their models are violating copyright. But the stakes are too high; it's too big to fail, if you will. If we don't do it, then our competitors will destroy us. How do you reconcile that?
But any such arrangement needs to be hammered out by the legislature. As laws are, I think it's pretty clear that infringement is happening.
I’m sure all these ‘clever’ questions would be useful if this trial were about humans, but it’s not.