But agreed that we're waiting for a court case to confirm that. Although really, the main questions for any court cases are not going to be around the principle of fair use itself or whether training is transformative enough (it obviously is), but rather on the specifics:
1) Was any copyrighted material acquired legally (not applicable here), and
2) Is the LLM always providing a unique expression (e.g. not regurgitating books or libraries verbatim)
And in this particular case, they confirmed that the new implementation is 98.7% unique.
This is just an assertion that you're making. There's no argument here. I'm aware that this is also an assertion that some judges have made.
My claim is that LLMs are not human, therefore when you apply words like "training" to them, you're only doing it metaphorically. It's no more "training" than copying code to a different hard drive is training that hard drive. And it's no more "transformative" than rar'ing or zipping the code, then unzipping it. I can't sell jpg conversions of pngs I downloaded from Getty.
I have no idea how LLMs can be considered transformative work that immunizes me from owing the least bit of respect to the source material, but if I sample 2-6 second snatches from 10 different songs, put them through over 9000 filters and blend them into a new work, I owe money to everyone involved. I might even owe money to the people who wrote the filters, depending on the licensing.
> 98.7% unique.
This doesn't mean anything. This is a meaningless arrangement of words. The way we figure out whether something is piracy is through provenance, not bizarre ad hoc measurements. If I read a book in Spanish and rewrite it in English, it doesn't suddenly become mine even though it's 96.6492387% unique. Not even if I drop a few chapters, add in a couple of my own, and change the ending.
...OK? Was somebody asking me for an "argument"? I'm just stating how things are currently understood.
> And it's no more "transformative" than rar'ing or zipping the code, then unzipping it.
That's obviously false, so I'm not sure what to tell you.
> but if I sample 2-6 second snatches from 10 different songs, put them through over 9000 filters and blend them into a new work, I owe money to everyone involved
You don't, actually, if they're no longer recognizable -- which they wouldn't be after "9000 filters". I don't know where you got the idea that you'd still owe money. And I've certainly never heard of an audio filter license that was contingent on commercial distribution.
> This doesn't mean anything. This is a meaningless arrangement of words.
Statistics are meaningful. Obviously you need to look at the actual identical lines. But if they're a bunch of trivial things like initializing variables with obvious names, then they don't count for much. And if you're adhering to the same API, you would expect to have some small percentage of lines happen to match. So the fact that this is <2%, as opposed to 40%, is hugely significant as a first step of analysis.
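To make concrete what that kind of first-step analysis could look like, here is a purely hypothetical sketch of a line-overlap metric. The function name, the example strings, and the triviality threshold are all my assumptions, not the methodology actually used to produce the "98.7%" figure:

```python
# Hypothetical sketch of a line-overlap metric of the kind that could
# produce a "% unique" figure. The triviality filter (dropping blank and
# very short lines) is an assumption; it keeps boilerplate like lone
# braces from inflating the match rate between any two codebases.

def shared_line_fraction(original: str, candidate: str) -> float:
    """Fraction of the candidate's non-trivial lines that appear
    verbatim (after whitespace stripping) in the original."""
    def non_trivial(lines):
        # Drop blanks and very short lines ('}', 'else', etc.) that are
        # bound to collide in unrelated code.
        return {ln.strip() for ln in lines if len(ln.strip()) > 10}

    orig = non_trivial(original.splitlines())
    cand = non_trivial(candidate.splitlines())
    if not cand:
        return 0.0
    return len(orig & cand) / len(cand)

a = "int count = 0;\nfor (item in items) {\n    process(item, config);\n}\n"
b = "int count = 0;\nwhile (queue.pop(&item)) {\n    process(item, config);\n}\n"
print(f"{(1 - shared_line_fraction(a, b)) * 100:.1f}% unique")  # 33.3% unique
```

The point of the filter is exactly the caveat above: matches on trivial lines (variable initializations, shared API calls) are expected, so the interesting question is what the residual matches look like once those are set aside.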
I suggest you might find conversations here on HN more productive if you soften your tone a bit. Saying things like "this is just an assertion that you're making" or "this is a meaningless arrangement of words" is not generally going to make people want to respond to you.
If you’ve used copyrighted books and turned them into a free write-a-book machine, you are suddenly using the authors' own works against them, in a way that a judge might rule is not very fair.
“Effect of the use upon the potential market for or value of the copyrighted work: Here, courts review whether, and to what extent, the unlicensed use harms the existing or future market for the copyright owner’s original work. In assessing this factor, courts consider whether the use is hurting the current market for the original work (for example, by displacing sales of the original) and/or whether the use could cause substantial harm if it were to become widespread.”
This is for the same reason that search results or search snippets aren't deemed to harm creators under copyright. Yes, some percentage of sales might be lost. And truly, people may be buying fewer JavaScript tutorial books now that LLMs can teach you JavaScript or write it for you. But the relation is so indirect that there's very little chance a court would accept the argument.
Because what the LLM is doing is reading tons of JavaScript and JavaScript tutorials and resources online, and producing its own transformed JavaScript. And the effect of any single JavaScript tutorial book in its training set is so marginal to the final result, there's no direct effect.
And the reason this makes sense is that it's no different from a teacher reading 20 books on JavaScript and then writing their own that turns out to be a best-seller. Yes, it takes away from the previous best-sellers. But that's fine, because they're not copying any of the previous works directly. They're transforming the facts they learned into a new synthesis.
Some might hold that we've granted persons certain exemptions, on account of them being persons. We do not have to grant machines the same.
> In copyright terms, it's such an extreme transformative use that copyright no longer applies.
Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim? Sure, it can also produce extremely transformed versions, but is that really relevant if it holds within it enough information for a (near-)verbatim reproduction?
I feel as though, from an information-theoretic standpoint, it can't be possible that an LLM (which is almost certainly <1 TB big) can contain any substantial verbatim portion of its training corpus, which includes audio, images, and videos.
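As a back-of-the-envelope version of that information-theoretic point (both sizes below are rough assumptions for illustration, not measured figures for any particular model or corpus):

```python
# Rough upper bound on verbatim storage. Both figures are assumed
# round numbers for illustration only.
corpus_bytes = 10e12   # assume a ~10 TB training corpus
model_bytes = 0.5e12   # assume a ~0.5 TB model (weights on disk)

# Even if every byte of the weights stored training data with perfect
# efficiency (which it doesn't -- the weights also have to encode
# everything the model can do), at most this fraction of the corpus
# could be held verbatim:
max_verbatim_fraction = model_bytes / corpus_bytes
print(f"at most {max_verbatim_fraction:.0%} of the corpus")  # at most 5% of the corpus
```

The real bound is far lower, since the weights are not a dedicated archive; the calculation only shows that wholesale verbatim retention of the corpus is arithmetically impossible, not that no individual work can be memorized.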
It doesn't need to for my argument to make sense. It's a problem if it reproduces a single copyrighted work (near)-verbatim. Which we have plenty of examples of.
No, we don't have to, but so far we do, because that's the most legally consistent approach. If you want to change that, you're going to need to pass new laws that may wind up radically redefining intellectual property.
> Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim?
Of course it has, if the transformation is extreme, as it appears to be here. If I memorize the lyrics to a bunch of love songs, and then write my own love song where every line is new, nobody's going to successfully sue me just because I can sing a bunch of other songs from memory.
Also, it's not even remotely clear that the LLM can produce the training data near-verbatim. Generally it can't, unless it's something that it's been trained on with high levels of repetition.
> you're going to need to pass new laws that may wind up radically redefining intellectual property
You're correct that this is one route to resolving the situation, but I think it's reasonable to lean more strongly into the original intent of intellectual property law: to defend creative work as a way for creators to sustain themselves. That intent draws a pretty clear distinction between human creativity and reuse on the one hand and LLMs on the other.
So do 10 000 chimpanzees on typewriters.
Training an LLM inherently requires making a copy of the work. Even the initial act of downloading it from the internet and copying it into memory to train the LLM is a copy that can be governed by its license and by copyright law.
> The court held that making RAM copies as an essential step in utilizing software was permissible under §117 of the Copyright Act even if they are used for a purpose that the copyright holder did not intend.
https://en.wikipedia.org/wiki/Vault_Corp._v._Quaid_Software_....
IIRC this exact argument was made in the Blizzard vs bnetd case, wasn't it? Though I can't find confirmation on whether that argument was rejected or not...
But that's not relevant here. Because the copyleft license does not prohibit that (and it's not even clear that any license can prohibit it, as courts may confirm it's fair use, as most people are currently assuming). That's why I noted under (1) that it's not applicable here.
LLM training involves ingesting works (in a potentially transformative process) and partially reproducing them, which is generally a restricted action when it comes to licensing.
BTW, in 2023 I watched ChatGPT spit out hundreds of lines of F# verbatim from my own GitHub. A lot of people had this experience with GitHub Copilot. "98.7% unique" still leaves room for a lot of infringement.
That's not relevant, because you can still sue the person using the LLM and publishing the repository. Legal liability is completely unchanged.
It's changed completely, from your own example.
If you commission art from an artist who paints a modified copy of Warhol's work, the artist is liable (even if you keep that work private, for personal use).
If you commission it from OpenAI (by sending a query to their ChatGPT API), by your argument, you are the person liable, and OpenAI is off the hook even if that work is distributed further.
I'm not going to argue about the merits of creativity here, or that someone putting a prompt into ChatGPT considers themselves an artist.
That's irrelevant. The work is created on OpenAI servers, by the LLMs hosted there, and is then distributed to whoever wrote the prompt.
Models run locally are distributed by whoever trained them.
If you train a model on whatever data you legally have access to, and produce something for yourself, it's one thing.
Distribution is where things start to get different.