Why do you think "fair use" doesn't apply in this case? The prior Bartz vs Anthropic ruling laid out pretty clearly how training an AI model falls within the realm of fair use. Authors Guild vs Google and Authors Guild vs HathiTrust were both decided much earlier and both found that digitizing copyrighted works for the sake of making them searchable is sufficiently transformative to meet the standards of fair use. So what is it about GPL licensed software that you feel would make AI training on it not subject to the same copyright and fair use considerations that apply to books?
The poster doesn't like it, so it's different. Most of the "legal analysis" and "foregone conclusions" in these types of discussions are vibes dressed up as objective declarations.
Whether data acquired from a licence that specifically forbids building a derivative work without also releasing that derivative under the same licence counts as a legitimate data gathering operation is anyone's guess, as those specific circumstances are about as far from that prior case as they can be.
> This License acknowledges your rights of fair use or other equivalent, as provided by copyright law.
It is legitimate to acquire GPL software. The requirements of the license only apply if you're distributing the work AND fair use does not apply.
Training certainly doesn't count as distribution, so the buck passes to inference, which leaves us dealing with the substantial similarity test and, still, fair use.
“To summarize the analysis that now follows, the use of the books at issue to train Claude and its precursors was exceedingly transformative and was a fair use under Section 107 of the Copyright Act. And, the digitization of the books purchased in print form by Anthropic was also a fair use but not for the same reason as applies to the training copies. Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies. However, Anthropic had no entitlement to use pirated copies for its central library. Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic’s piracy.”
Or in the final judgement, “This order grants summary judgment for Anthropic that the training use was a fair use. And, it grants that the print-to-digital format change was a fair use for a different reason.”
The first:
> it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library
It is only fair use where Anthropic had already purchased a license to the work. Which has zero to do with scraping - a purchase was made, an exchange of value, and that comes with rights.
The second, which involves a section of the judgement a little before your quote:
> And, as for any copies made from central library copies but not used for training, this order does not grant summary judgment for Anthropic.
This is where the court refused to make any ruling. There was no exchange of value here, which is also the situation with scraping. The court made no ruling.
On the real measures of "fair use", at least in the US: https://fairuse.stanford.edu/overview/fair-use/four-factors/ I would contend that it absolutely face plants on all four measures. The purpose is absolutely in the form of a "replacement" for the original, the nature is something that has been abundantly proved many times over in court as being something copyrightable as a creative expression (with limited exceptions for particular bits of code that are informational), the "amount and substantiality" of the portions used is "all of it", and the effect of use is devastating to the market value of the original.
You may disagree. A long comment thread may ensue. However, all I really need for my point here is simply that it is far, far from obvious that waving the term "FAIR USE!" around is a sufficient defense. It would be a lengthy court case, not a slam-dunk "well duh it's obvious this is fair use". The real "fair use" and not the internet's "FAIR USE!" bear little resemblance to each other.
A sibling comment mentions Bartz v. Anthropic. Looking more at the details of the case, I don't think it's obvious how to apply it, other than as proof that an AI company having acquired some material in "some manner" doesn't mean they can just do whatever with it. The case ruled they still had to buy a copy. I can easily make a case that "buying a copy" in the case of a GPL-2 codebase is "agreeing to the license" and that such an agreement could easily say "anything trained on this must also be released as GPL-2". It's a somewhat lengthy road to travel, where each step could result in a failure, but the same can be said for the road to "just because I can lay my hands on it means I can feed it to my AI and 100% own the result" and that has already had a step fail.
I'm referring to the long history of carefully threaded fair use rulings and settlements, many of which we as an industry have benefitted greatly from. These include determinations that cloning a BIOS can be fair use (see IBM PC BIOS cloning, but also Sony v. Connectix), that cloning an entire API for the purposes of creating a parallel competitive product can be fair use (Google v. Oracle), that digitizing books for the purposes of making those books searchable, and even displaying portions of those books to users, can be fair use (Authors Guild v. Google), and even that your cable company can offer you "remote DVR" copying of broadcast TV (20th Century Fox v. Cablevision). Time and again the courts have found that copyright, and especially copyright with respect to digital transformations, is far more limited than large corporations would prefer. Further, they have found in plenty of cases that even a direct 1:1 copy of source can be fair use, let alone copies which are "transformative" as LLM training was found to be in Bartz.
Realistically, I don't see how anyone can have watched the various copyright cases that have been decided in the digital age, and seen the battles that the EFF (and a good part of the tech industry) have waged to reduce the strength of copyright and not also see how AI training can very easily fit within that same framework.
Not to cast aspersions on my fellow geeks and nerds, but it has been very interesting to me to watch the "hacker" world move from "information wants to be free" to "copyright maximalists" once it was their works that were being copied in ways they didn't like. For an industry that has brought about (and heavily promoted and supported) things like DeCSS, BitTorrent, Handbrake, Jellyfin/Plex, numerous emulators, WINE, BIOS and hardware cloning, ad blockers, web scrapers and many other things that copyright owners have been very unhappy about, it's very strange to see this newfound respect for the sanctity of copyright.
> I can easily make a case that "buying a copy" in the case of a GPL-2 codebase is "agreeing to the license" and that such an agreement could easily say "anything trained on this must also be released as GPL-2".
And I would argue that obtaining a legal copy of the GPL source to a program requires no such agreement. By downloading a copy of a GPLed program I am entitled, by the terms under which that software was distributed, to obtain a copy of the source code. I do not have to agree to any other terms in order to obtain that source code; downloading from someone authorized to distribute that code is in and of itself sufficient to entitle me to that source code. You cannot, by the very terms of the GPL itself, deny me a copy of the source code for GPL software you have distributed to me, even if you believe I intend to make distributions that are not GPL compliant. You can decline to distribute the software to me in the first place, but once you have distributed it to me, I am legally entitled to a copy of the source code. From there, now that I have a legal copy, the question becomes: is making additional copies for the purposes of training an AI model fair use? So far, the most definitive case we have on the matter (Bartz) says yes it is.
So either we have to make the case that the original copy was somehow acquired from a source not authorized to make that copy, or we have to argue that the output of the AI model, or the AI model itself, is infringing. Given that copies made for training an AI model were ruled "exceedingly transformative and was a fair use under Section 107 of the Copyright Act"[1], it seems unlikely that the AI model itself is going to be found to be infringing. That leaves the output of the model, which Bartz does not rule on, as the authors never alleged the output of the model was infringing. GPL software authors might be able to prevail on that point, but I think they would have a pretty uphill battle in demonstrating that the model generated infringing output and not simply functionally necessary code that isn't covered by copyright. The ability of code to be subject to copyright has long been a sort of careful balance between protecting a larger creative idea and not simply walling off whole avenues of purely functional decisions from all competitors.
[1]: https://admin.bakerlaw.com/wp-content/uploads/2025/07/ECF-23...
The current Supreme Court may think that machine learning is some sort of magic exception, but they also seem to believe whatever oligarchs will bribe them to believe. Again, I doubt the law will be enforced as written, but that has more to do with corruption than any meaningful legal theory. Arguments against this claim seem to ignore that courts have already ruled these systems to not have intellectual property rights of their own, and the argument for fair use seems to rely pretty heavily on some handwavey anthropomorphization of the models.
Broadly speaking, yes it does. The whole point of fair use is that you don’t need a license.
Here’s the four-part test from 17 U.S.C. § 107:
1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
Fail. The use is to make trillions of dollars and be maximally disruptive.
2. the nature of the copyrighted work;
Fail. In many cases at least, the copyrighted code is commercial or otherwise supports livelihoods, and is the result of much high-skill labor with the express stipulation of reciprocity.
3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
Fail. They use all of it.
4. the effect of the use upon the potential market for or value of the copyrighted work.
Fail to the extreme. There is already measurable decline in these markets. The leaders explicitly state that they want to put knowledge workers out of business.
- - -
Hell, LLMs don’t even pass the sniff test.
The only reason this stuff is being entertained is some combination of the prisoner’s dilemma and more classic greed.