Does this imply that distributing open-weights models such as Llama is copyright infringement, since users can trivially run the model without output filtering to extract the memorized text?
[1]: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...
> the court dismissed “nonsensical” claims that Meta’s LLaMA models are themselves infringing derivative works.
See: https://www.eff.org/deeplinks/2025/02/copyright-and-ai-cases...
In this case, the plaintiffs alleged that Anthropic's LLMs had memorized the works so completely that "if each completed LLM had been asked to recite works it had trained upon, it could have done so", "almost verbatim". The judge assumed for the sake of argument that the allegation was true, and ruled that the conduct was fair use anyway due to the existence of an effective filter. Therefore there was no need to determine whether the allegation was actually true.
So - yes, in the sense that the ruling suggests that distributing an open-weight LLM that memorized copyrighted works to that extent would not be fair use.
But no, in the sense that it's not clear whether any LLMs, especially open-weight LLMs, actually memorize book-length works to that extent. Even the recent study about Llama memorizing a Harry Potter book [1] only said that Llama could reproduce 50-token snippets a decent percentage of the time when given the preceding 50 tokens. That's different from actually being able to recite any substantial portion of the book. If you asked Llama for that, the output would quickly diverge from the original text, and it likely wouldn't be able to get back on track without being re-prompted from the ground truth as the study did.
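To make that distinction concrete, here is a toy sketch. It is not the study's methodology; `snippet_recall`, `free_run_match`, `toy_generate`, and the integer "book" are all hypothetical names made up for illustration. It contrasts re-prompting from the ground truth at every position with prompting once and letting the model run on its own output.

```python
from typing import Callable, List, Sequence

Generate = Callable[[Sequence[int], int], List[int]]

def snippet_recall(tokens: Sequence[int], generate: Generate,
                   window: int = 50, stride: int = 10) -> float:
    """Fraction of positions where the model reproduces the next `window`
    tokens verbatim when prompted with the preceding `window` TRUE tokens."""
    hits = trials = 0
    for start in range(0, len(tokens) - 2 * window + 1, stride):
        prompt = tokens[start:start + window]
        target = list(tokens[start + window:start + 2 * window])
        trials += 1
        if generate(prompt, window) == target:
            hits += 1
    return hits / trials if trials else 0.0

def free_run_match(tokens: Sequence[int], generate: Generate,
                   window: int = 50) -> int:
    """Prompt once with the first `window` tokens, then feed the model its
    own output. Returns how many tokens match before the first divergence."""
    context = list(tokens[:window])
    matched = 0
    for pos in range(window, len(tokens)):
        out = generate(context[-window:], 1)
        if not out or out[0] != tokens[pos]:
            break
        context.append(out[0])
        matched += 1
    return matched

# Hypothetical stand-in for an LLM: it "memorized" a book whose tokens are
# 0, 1, 2, ... but gets token 120 wrong and then continues from its mistake.
def toy_generate(prompt: Sequence[int], n: int) -> List[int]:
    out, cur = [], prompt[-1]
    for _ in range(n):
        cur = -1 if cur + 1 == 120 else cur + 1
        out.append(cur)
    return out

book = list(range(300))
recall = snippet_recall(book, toy_generate)   # re-prompted from ground truth
recited = free_run_match(book, toy_generate)  # prompted only once
print(f"snippet recall: {recall:.2f}, tokens recited before diverging: {recited}")
```

In this toy setup a single wrong token barely dents the snippet score, yet it ends the free-running recitation for good, which is the gap between the two measurements described above.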
On the other hand, in the case where the New York Times is suing OpenAI, the NYT has alleged that ChatGPT was able to recite extensive portions of NYT articles verbatim. If true, this might be more dangerous, since news articles are not as long as books but they're equally eligible for copyright protection. So we'll see how that shakes out.
Also note:
- Nothing in the opinion sets formal precedent because it's a district court. But the opinion might still influence later judges.
- See also riskable's sibling comment for another case where a judge addressed the issue more head-on (but wasn't facing the same kind of detailed allegations, I don't think; haven't checked).
Additionally, if you download a model file that contains enough of the source material to be considered infringing (even without using the LLM; assume you can extract the contents directly out of the weights), then it might as well be a .zip with a PDF in it: the model file itself becomes an infringing object, whereas closed models can be held accountable not for what they store but for what they produce.
I'm not so sure about this one. In particular, presuming that it is found that models which can produce infringing material are themselves infringing material, the ability to distill models from older models seems to suggest that the older models can actually produce the new, infringing model. It seems like that should mean that all output from the older model is infringing because any and all of it can be used to make infringing material (the new model, distilled from the old).
I don't think it's really tenable for courts to treat any model as though it is, in itself, copyright-infringing material without treating every generative model like that and, thus, killing the GPT/diffusion generation business (that could happen but it seems very unlikely). They will probably stick to being critical of what people generate with them and/or how they distribute what they generate.
The amount of the source material encoded does not, alone, determine if it is infringing, so this noun phrase doesn't actually mean anything. I know there are some popular myths that contradict this (the commonly-believed "30-second rule" for music, for instance), but they are just that, myths.
This is still a weird language shift that actively promotes misunderstandings.
The weights are the LLM. When you say "model", that means the weights.
If you can successfully demonstrate that, then yes, it is copyright infringement, and successfully doing that would be worthy of a NeurIPS or ACL paper.
This will have the effect of empowering countries (and other entities) that don't respect copyright law, of course.
The copyright cartel cannot be allowed to yank the handbrake on AI. If they insist on a fight, they must lose.
The model itself does not constitute a copy. Its intention is clearly not to reproduce verbatim texts. There would be far cheaper and infinitely more accurate ways to do that if that was the goal.
Apart from the legalities, it would be horrifying if copyright reached into the AI realm to completely stifle progress for, let's be honest, mainly the profits of a few major IP corporations.
I do however understand some creatives are worried about revenue, just like the rest of us. But just like the rest of us, they too live in a world that can only exist because 99.99% of what it took to build that world was automated or tool enhanced, impacting someone's previous employment or business.
We are in a world of unprecedented change, only to be immediately surpassed by the next day's rate of change. This both scares and fascinates me.
But that change and its benefits being held only in the bowels of corporate/government symbiotic entities would scare me a hell of a lot more. Open Source/weights is the only way to have a small chance to keep this at bay.
No, it doesn't. The order assumes that because it is an order on summary judgement, and the legal standard for such an order is that it must assume the least favorable position for the party for whom summary judgement is granted on every material contested issue of fact. Since it is a ruling for the defendant (Anthropic), it must be what the judge finds law demands when assuming all contested issues of fact are resolved in favor of the claims of the plaintiffs (the authors).
> but rules that this doesn't violate the author's copyright because Anthropic has server-side filtering to avoid reproducing memorized text.
No, it doesn't do that, either. It simply notes for clarity that the plaintiffs do not allege that an infringement is created by the outputs for the reason you describe; the ruling does not in any way suggest that has any bearing on its findings as regards whether training the model infringes. It simply points out that that separate potential source of infringement is not at issue.
> Does this imply that distributing open-weights models such as Llama is copyright infringement
No, it does not. At most, it implies, given the reason that the plaintiffs have not done so in this case, that the same plaintiffs might have alleged (without commenting at all as to whether they would prevail) that providing a hosted online service without filtering would constitute contributory infringement if that was what Anthropic did (which it isn’t) and if there was actual infringement committed by the users of the service.
The goal of copyright is to make sure people can get fair compensation for the amount of work they put in. LLMs automate plagiarism on a previously unfathomable scale.
If humans spend a trillion hours writing books, articles, blog posts and code, then somebody (a small group of people) comes and spends a million hours building a machine that ingests all the previous work and produces output based on it, who should get the reward for the work put in?
The original authors together spent a million times more effort (normalized for skill) and should therefore get a million times bigger reward than those who build the machine.
In other words, if the small group sells access to the product of the combined effort, they only deserve a millionth of the income.
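For anyone who wants to check the arithmetic, here is a minimal sketch that takes the hour figures above at face value and assumes, as the argument does, that reward should be split strictly in proportion to hours. The function name `effort_shares` is made up for illustration.

```python
def effort_shares(author_hours: float, builder_hours: float) -> tuple[float, float]:
    """Split income in proportion to hours worked (the comment's premise).
    Returns (authors' share, builders' share) as fractions of total income."""
    total = author_hours + builder_hours
    return author_hours / total, builder_hours / total

author_hours = 1e12   # "a trillion hours" of writing
builder_hours = 1e6   # "a million hours" building the machine

ratio = author_hours / builder_hours
authors_share, builders_share = effort_shares(author_hours, builder_hours)

print(f"authors put in {ratio:,.0f}x the hours")
print(f"builders' share of income: {builders_share:.3e}")
```

Strictly, the builders' share comes out to one part in a million and one, so "a millionth" is a fair rounding.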
---
If "AI" is as transformative as they claim, they will have no trouble making so much money that they can fairly compensate the original authors while still earning a decent profit. But if it's not, then it's just an overpriced plagiarism automator and their reluctance to acknowledge they are making money on top of everyone else's work is indicative.
This is a bit distorted. This is a better summary: The primary purpose of copyright is to induce and reward authors to create new works and to make those works available to the public to enjoy.
The ultimate purpose is to foster the creation of new works that the public can read and written culture can thrive. The means to achieve this is by ensuring that the authors of said works can get financial incentives for writing.
The two are not in opposition but it's good to be clear about it. The main beneficiary is intended to be the public, not the writers' guild.
Therefore when some new factor enters the picture such as LLMs, we have to step back and see how the intent to benefit the reading public can be pursued in the new situation. That certainly has to take into account who will produce new written works and how, but that is not the main target; it is at most an instrumental subgoal.
LLMs are models of languages, which are models of reality. If anyone deserves compensation, it's humanity as a whole, for example by nationalizing, or whatever the global equivalent is, LLMs.
Approximately none of the value of LLMs, for any user, is in recreating the text written by an author. Authors have only ever been entitled to (limited) ownership of their expression; copyright has never given them ownership of facts.
Purposes which are fair use are very often not at all personal.
(Also, "personal use" that involves copying, creating a derivative work, or using any of the other exclusive rights of a copyright holder without a license or falling into either fair use or another explicit copyright exception are not, generally, allowed, they are just hard to detect and unlikely to be worth the copyright holder's time to litigate even if they somehow were detected.)
So it totally isn't a warez streaming media server but AI?
I'm guessing since my net worth isn't a billion plus, the answer is no
This is OK and fair use: Training LLMs on copyrighted work, since it's transformative.
This is not OK and not fair use: pirating data, or creating a big repository of pirated data that isn't necessarily for AI training.
Overall seems like a pretty reasonable ruling?
I tend to think copyright should be extremely limited compared to what it is now, but to me the logic of this ruling is illogical other than "it's ok for a corporation to use lots of works without permission but not for an individual to use a single work without permission." Maybe if they suddenly loosened copyright enforcement for everyone I might feel differently.
"Kill one man, and you are a murderer. Kill millions of men, and you are a conqueror." (An admittedly hyperbolic comparison, but similar idea.)
I think that's the conclusion of the judge. If Anthropic were to buy the books and train on them, without extra permission from the authors, it would be fair use, much like if you were to be inspired by it (though in that case, it may not even count as a derivative work at all, if the relationship is sufficiently loose). But that doesn't mean they are free to pirate it either, so they are likely to be liable for that (exactly how that interpretation works with copyright law I'm not entirely sure: I know in some places that downloading stuff is less of a problem than distributing it to others because the latter is the main thing that copyright is concerned with. And AFAIK most companies doing large model training are maintaining that fair use also extends to them gathering the data in the first place).
(Fair use isn't just for discussion. It covers a broad range of potential use cases, and they're not enumerated precisely in copyright law AFAIK, there's a complicated range of case law that forms the guidelines for it)
That's not what the ruling says.
It says that training a generative AI system not designed primarily as a direct replacement for a work on one or more works is fair use, and that print-to-digital destructive scanning for storage and searchability is fair use.
These are both independent of whether one person or a giant company or something in between is doing it, and independent of the number of works involved (there's maybe a weak practical relationship to the number of works involved, since a gen AI tool that is trained on exactly one work is probably somewhat less likely to have a real use beyond a replacement for that work.)
> This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use. There is no decision holding or requiring that pirating a book that could have been bought at a bookstore was reasonably necessary to writing a book review, conducting research on facts in the book, or creating an LLM. Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.
(But the judge continued that "this order need not decide this case on that rule": instead he made a more targeted ruling that Anthropic's specific conduct with respect to pirated copies wasn't fair use.)
I'm allowed to hear a copyrighted tune, and even whistle it later for my own enjoyment, but I can't perform it for others without license.
If I buy a book, and as long as the product the book teaches me to build isn't a competing book, the original author should have no avenue for complaint.
People are really getting hung up on the computer reading the data and computing other data with it. It shouldn't even need to get to fair use. It's so obviously none of the author's business well before fair use.
Worse, they’re using it for massive commercial gain, without paying a dime upstream to the supply chain that made it possible. If there is any purpose of copyright at all, it’s to prevent making money from someone else’s intellectual work. The entire thing is based on economic pragmatism, because just copying obviously does not deprive the creator of the work itself, so the only justification in the first place is to protect those who seek to sell immaterial goods, by allowing them to decide how it can be used.
Coming to the conclusion that you can “fair use” yourself out of paying for the most critical part of your supply makes me upset for the victims of the biggest heist of the century. But in the long term it can have devastating chilling effects, where information silos will become the norm, and various forms of DRM will be even more draconian.
Plus, fair use bypasses any licensing, no? Meaning even if today you clearly specify in the license that your work cannot be used in training commercial AI, it isn’t legally enforceable?
Personally I like to frame most AI problems by substituting a human (or humans) for the AI. Works pretty well most of the time.
In this case if you hired a bunch of artists/writers that somehow had never seen a Disney movie and to train them to make crappy Disney clones you made them watch all the movies it certainly would be legal to do so but only if they had legit copies in the training room. Pirating the movies would be illegal.
Though the downside is it does create a training moat. If you want to create the super-brain AI that's conversant on the corpus of copyrighted human literature you're going to need a training library worth millions
Human time is inherently valuable, computer time is not.
The issue with LLMs is that they allow doing things at a massive scale which would previously be prohibitively time consuming. (You could argue, but then how much electricity is one human life worth?)
If I "write" a book by taking another and replacing every word with a synonym, that's obviously plagiarism and obviously copyright infringement. How about also changing the word order? How about rewording individual paragraphs while keeping the general structure? It's all still derivative work but as you make it less detectable, the time and effort required is growing to become uneconomical. An LLM can do it cheaply. It can mix and match parts of many works but it's all still a derivative of those works combined. After all, if it wasn't, it would produce equally good output with a tiny fraction of the training data.
The outcome is that a small group of people (those making LLMs and selling access to their output) get to make huge amounts of money off of the work of a group that is several orders of magnitude larger (essentially everyone who has written something on the internet) without compensating the larger group.
That is fundamentally exploitative, whether the current laws accounted for that situation or not.
I see elements of that here. Buying copyrighted works not to be exposed and be inspired, nor to utilize the author's talents, but to fuel a commercialization of sound-a-likes.
How many copies? They're not serving a single client.
Libraries need to have multiple e-book licenses, after all.
https://en.wikipedia.org/wiki/Mickey_Mouse#Walt_Disney_Produ...
I'm on the Air Pirates side for the case linked, by the way.
However, AI is not a parody. It's not adding to the cultural expression like a parody would.
Let's forget all the law stuff and these silly hypotheticals. Let's think of humanity instead:
Is AI contributing to education and/or culture _right now_, or is it trying to make money? I think they're trying to make money.
Humans, animals, hardware and software are treated differently by law because they have different constraints and capabilities.
And who gets the money? Not the original author.
LLMs may sometimes reproduce exact copies of chunks of text, but I would say it also matters that this is an irrelevant use case that is not the main value proposition that drives LLM company revenues, it's not the use case that's marketed and it's not the use case that people in real life use it for.
If you train a LLM on harry potter and ask it to generate a story that isn't harry potter then it's not a replacement.
However, if you train a model on stock imagery and use it to generate stock imagery then I think you'll run into an issue from the Warhol case.
This ruling doesn't say anything about the enforceability of a "don't train AI on this" contract, so even if the logic of this ruling became binding precedent (trial court rulings aren't), such clauses would be as valid after as they are today. But contracts only affect people who are parties to the contract.
Also, the damages calculations for breach of contract are different than for copyright infringement; infringement allows actual damages and infringer's profits (or statutory damages, if greater than the provable amount of the others), but breach of contract would usually be limited to actual damages ("disgorgement" is possible, but unlike with infringer's profits in copyright, requires showing special circumstances.)
First, I don't think publishers of physical books in the US get the right to establish a contract. The book can be resold, for instance, and that right cannot be diminished. But secondly, adding more cruft to the distribution of something that the end user has a right to transform isn't going to diminish that right.
Meta at least just downloaded ENGLISH_LANGUAGUE_BOOKS_ALL_MEGATORRENT.torrent and trained on that.
https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
Maybe there's another big Google Books lawsuit that Google ultimately lost, but I don't know which one you mean in that case.
It’s also proof that an individual scientist can still change the world, in some small way. Believe in yourself and just focus on your work, even if the work is controversial.
(I’m late to the thread, so ~nobody will see this. But it’s the culmination of about five years of work for me, so I wanted to post a small celebratory comment anyway. Thank you to everyone who was supportive, and who kept an open mind. Lots of people chose to throw verbal harassment my way, even offline, but the HN community has always been nice.)
Cassette Tapes and Private Copying Levy.
https://en.wikipedia.org/wiki/Private_copying_levy
Governments didn't ban tapes but taxed them and fed the proceeds back into the royalty system. An equivalent for books might be an LLM tax funding a negative tax rate for sold books e.g. earn $5 and the gov tops it up. Can't imagine how to ensure it was fair though.
Alternatively, might be an interesting math problem to calculate royalties for the training data used in each user request!
Such imperfect measures offer a compromise between "big tech can steal everything" and "LLMs trained on unpurchased books are illegal".
It's not just books but any tragedy-of-the-commons situation where a "feeder industry" for training can be fatally undermined by the very LLM that desires future training data from that industry.
But for non-fiction and technical books, they sure as hell will. For example, if I ask Gemini "What does Bob Nystrom say about the Singleton pattern in Game Programming Patterns?", it says:
---
In "Game Programming Patterns," Bob Nystrom dedicates an entire chapter to the Singleton pattern, which is unusual because, as he states, "Every other chapter in this book shows you how to use a design pattern. This chapter shows you how not to use one."
Here's a summary of his key points regarding the Singleton pattern:
* More Harm Than Good: Nystrom argues that despite its noble intentions, the Singleton pattern often does more harm than good, especially in game development. He notes that the Gang of Four (GoF) stressed its sparing use, a message that was often lost in translation to the game industry.
* Global State in Disguise: He highlights that the Singleton is essentially a global variable in disguise. While it ensures a single instance, it often leads to problems associated with global state, such as tightly coupled code, difficulty in testing, and reduced flexibility.
* Encourages Tight Coupling: By providing a global access point, Singletons encourage various parts of the codebase to directly depend on them. This makes the code harder to change and refactor because modifications to the Singleton can have widespread, unpredictable effects.
* Hinders Testability: Testing code that relies on Singletons can be challenging. Because Singletons are globally accessible and often difficult to mock or replace, it can be hard to isolate units of code for testing, leading to brittle tests or a lack of testing altogether.
* Limits Flexibility: Nystrom points out that while a class might currently only need one instance (e.g., a FileSystem or AudioSystem), hardcoding this limitation with a Singleton makes it difficult to change in the future. If a game later needs multiple file systems (e.g., for different save slots) or different audio contexts, a Singleton makes this a non-trivial refactoring.
* Alternatives Often Better: He suggests that often, what developers think they need a Singleton for can be better achieved with other patterns or simpler solutions. For instance, if the only goal is to have a single instance, a simple static class might suffice if global access is truly desired and initialization order isn't a complex concern. For managing services, he often advocates for the Service Locator pattern as a less problematic alternative that still provides global access without enforcing a single instance.
* Misconceptions about "Singular Resources": While the GoF suggested Singletons for "physical devices," Nystrom argues that few physical devices are truly singular in a way that necessitates a Singleton. For example, while you might only have one keyboard, a game could theoretically support multiple input devices. Using a Singleton prematurely prevents this flexibility.
* Lazy Initialization: He acknowledges that a well-implemented Singleton using lazy initialization (creating the instance only when it's first accessed) can save memory and CPU cycles and resolve some issues with static initialization order. However, he emphasizes that these benefits don't outweigh the pattern's fundamental architectural drawbacks.
* In essence, Nystrom's strong stance is that while the Singleton pattern can ensure a single instance and provide global access, its downsides (coupling, testability, flexibility) often make it an anti-pattern in game development, and developers should carefully consider alternatives before reaching for it.
---
Is that summary as good as actually reading the book? Probably not. Will people rely on that and skip buying my book? Almost certainly so.
IMHO, a better example would be the AI generated summaries provided by Google. Often these summaries have sufficient information and detail that people do not read the source article. The authors aren't getting paid (perhaps through on-page ads, which are not viewed) and then go out of business.
This strikes me as a good fit for the tax-on-cassette metaphor.
The last thing the world needs is more nonsensical copyright law and hand wavy regulation funded by entrenched interests.
I do concede that the book does contain a distillation of material that is also available from other sources, but it also contained a lot of personal experience. That aspect does seem to be lost in this new representation.
I am not saying that letting AI models read the material is wrong, but the hubris in the way models answer questions is annoying.
I doubt the exact replica stuff will stand, as technically it was only achievable via advanced prompt engineering (hacking), not simply asking for a replica. So their 2 other arguments boil down to scraping a news database = infringement and LLM output = derivative works.
It’s not as simple as it sounds, since I’m sure scraping is against Reddit’s terms and conditions, but if those posts are made publicly available without the scraper actually agreeing to anything, is that a valid breach of contract?
Will be interesting to see how that plays out.
Interesting excerpt:
> “We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages,” Judge Alsup wrote in the decision. “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for theft but it may affect the extent of statutory damages.”
Language of “pirated” and “theft” are from the article. If they did realize a mistake and purchased copies after the fact, why should that be insufficient?
I don't think that's exactly the case. A lot of the HN crowd is very much against the current iterations of copyright law, but is much more against rules that they see as being unfairly applied. For most of us, we want copyright reform, but short of that, we want it to at least pretend to be used for what it is usually claimed to be for: protecting small artists from large, predatory companies.
They aren't sides of the same coin, so neither? They have as much in common as a balloon full of helium and an opossum.
Folks try to create a false equivalency between landlords and creatives, but they aren't remotely the same. I generally consider this to be a bad faith argument by people who just want free things. (The argument against landlords isn't free housing, even though the argument against copyright is piracy)
Landlords have something with a limited supply and rent it to other people for their use. Access to the particular something is necessary on the residential side and generally important on the commercial side.
Copyrighted works haven't had a limited supply since around 1440 and are a couple rungs higher on Maslow's hierarchy of needs. Copyright laws are, by their nature, intended to simulate the market effects of a limited supply as to incentivize people to create those works.
Have laws and vultures created perverse incentives in both markets? Absolutely. Are there both good and bad landlords and copyright holders? Absolutely.
But we could address the flaws in one without even thinking to talk about the other.
As just a matter of society, I don't think you want people say stealing a car and then coming back a month later with the money.
Copyright infringement does not deprive the copyright owner of its property and is not criminal. So in this case only the lawsuit part applies. The owner is only entitled to the monetary damages, which is the lost sale. But in this case the sale price was paid to the owner 1 month later, so the only real damages will be the interest the publisher could have earned if they had got their money one month earlier.
1. You're assuming this was some good faith "they didn't know they were stealing" factor. They use someone else's products for commercial use. I'm not so charitable in my interpretation.
2. I'm not absolved of theft just because I go back and put money on the register. I still stole, intentionally or not.
Google literally scrapes pirated content all day every day. When they do that they have no idea if the content was legally placed on that website. Yet, they scan and index it anyway because there's actually no way to know (at all!). There's no great big database of all copyrighted works they can reference.
I'm not saying Meta and Anthropic didn't know they were pirating content. I'm saying that it should be moot since they never distributed it. You can't claim a violation of copyright for content that was never actually "copied" (aka distributed). The site/seeders that uploaded the content to Meta/Anthropic are the violators since copyright is all about distribution rights.
Choosing someone's bitstrings is like choosing to harvest someone's fields in a world where there's infinite space of fertile fields. You picked his, instead of finding a space in the infinite expanse to farm on your own.
If you start writing something you'll never generate a copyrighted work at random. When the work isn't available nothing is taken away from you even if you were strictly forbidden from reproducing the work.
Choosing someone's particular bitstring is only done because there's someone who has expended effort in preparing it.
If the US makes it illegal to train LLMs on copyrighted data, the US will find a solution and not just give up and wait half a decade to see what China does in the meantime.
So what is he going to do about the initial copyright infringement? Will the perpetrators get the Aaron Swartz treatment?
It's like saying it should be legal for me to have this Judge's nudes obtained 100% illegally as long as I pixelate all the naughty bits.
I'm not sure why this alone is considered a separate issue from training the AI with books. Buying a copy of a copyrighted work doesn't inherently convey 'fair use rights' to the purchaser. If I buy a work, read it, sell it, and then publish a review or parody of it, I don't infringe copyright. Why does mere possession of an unauthorized copy create a separate triable matter before the court?
Keep in mind, you can legally engineer EULAs in such a way that merely purchasing the work surrenders all of your fair use rights. So this could wind up being effectively: "AI training is fair use for works purchased before June 24th, 2025, everything after is forbidden, here's your brand new moat OpenAI"
Which suggests that, at least in the judge's opinion, 'fair use rights' do exist in a sense, but it's about when you read the book, not when you publish.
But that's not settled precedent. Meta is currently arguing the opposite in Kadrey v. Meta: they're claiming that they can get away with torrenting training material as long as they only leech (download) and don't seed (upload), because, although the act of downloading (copying) is generally infringement under a Ninth Circuit precedent, they were making a fair use.
As for EULAs, that might be true for e-books, but publishers can't really do anything about Anthropic's new strategy of scanning physical books, because physical books generally don't come with shrinkwrap license agreements. Perhaps publishers could start adding them, but I think that would sit poorly with the public and the courts.
(That's assuming the ruling isn't overturned on appeal, which it easily might be.)
That has yet to be determined in a court of law. Just like: You can write a contract to kill but that won't make it legal.
The Supreme Court ruled that Fair Use is an essential component that makes copyright law compatible with the First Amendment. I highly suspect that if it ever comes up before SCOTUS they will rule that only signed contracts can override Fair Use. Meaning: Clickwrap agreements or broad contracts required by ebook publishers (e.g. when you use their apps) don't count.
Also, if you violate a contract by posting an excerpt of an ebook you purchased online, the publisher would have to sue you in court (or at least force arbitration) over that contract violation. They could not use tools like the DMCA in such an instance to enforce a takedown request.
There's no, "Hey! They're violating our contract, I swear!" takedown feature in contract law like there is with copyright law (the DMCA).
You have to call it "Starcrash" (https://www.imdb.com/title/tt0079946/?ref_=ls_t_8). Then it's legal.