Thomson Reuters wins first major AI copyright case in the US (opens in new tab)

(wired.com)

392 pointsjohnneville1y ago177 comments

177 comments

Here's the full decision, which (like most decisions!) is largely written to be legible to non-lawyers: https://storage.courtlistener.com/recap/gov.uscourts.ded.721...

The core story seems to be: Westlaw writes and owns headnotes that help lawyers find legal cases about a particular topic. Ross paid people to translate those headnotes into new text, trained an AI on the translations, and used those to make a model that helps lawyers find legal cases about a particular topic. In that specific instance the court says this plan isn't fair use. If it was fair use, one could presumably just pay people to translate headnotes directly and make a Westlaw competitor, since translating headnotes is cheaper than writing new ones. And conversely if it isn't fair use where's the harm (the court notes no copyright violation was necessary for interoperability for example) -- one can still pay people to write fresh headnotes from caselaw and create the same training set.

The court emphasizes "Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today." But I'm not sure "generative" is that meaningful a distinction here.

You can definitely see how AI companies will be hustling to distinguish this from "we trained on copyrighted documents, and made a general purpose AI, and then people paid to use our AI to compete with the people who owned the documents." It's not quite the same, the connection is less direct, but it's not totally different.

anon3738391y ago

This is an interesting opinion, but there are aspects of it that I doubt will stand the test of time.

One aspect is the court’s ruling that West’s headnotes are copyrightable even when they merely quote a court opinion verbatim, because the editorial decision to quote the material itself shows a “creative spark”. It really isn’t workable — in law specifically - for copyright to attach to the mere selection of a quote from a case to represent that case’s holding on an issue. After all, we would expect many lawyers analyzing the case independently to converge on the same quotes!

The key fact underlying all of this, I think, is that when Ross paid human annotators to write their own versions of the headnotes, they really did crib from West’s wholesale rather than doing their own independent analysis. Source text was paraphrased using curiously similar language to West’s paraphrasing. That, plus the fact that Ross was a directly competing product, is what I see as really driving this decision.

The case has very little to say about the more commonly posed question of whether copyright is infringed in large-scale language modeling.

AnthonyMouse1y ago

> That, plus the fact that Ross was a directly competing product, is what I see as really driving this decision.

The "competing product" thing is probably the most extreme part of this opinion.

The most important fair use factor is if the use competes with the original work, but this is generally implied to be directly competes, i.e. if you translate someone else's book from English to French and want to sell the translation, the translation is going to be in direct competition for sales to people who speak both English and French. The customer is going to use the copy claiming fair use as a direct substitute for the original work, instead of buying it.

This court is trying to extend that to anything downstream from it, which seems crazy. For example, "multiple copies for classroom use" is one of the explicit examples of fair use from the copyright statute, but schools are obviously teaching people intending to go into competition with the original author, and in general the idea that you can't read something if you ever intend to write something to sell in competition with it seems absurd and in contradiction to the common practices in reverse engineering.

But this is also a district court opinion that isn't even binding on other courts, so we'll see what happens if it gets appealed.

mountainb1y ago

No that is not an extreme interpretation of the fair use factors. This is a routinely emphasized factor in fair use analyses for both copyright and trademark. School fair use is different because that defense is written into the statute directly in 17 U.S.C. § 107. Also, § 108 provides extensive protections for libraries and archives that go beyond fair use doctrines.

The idea that the schools are encouraging the students to compete with the original authors of works taught in the classroom is fanciful by the meaning that courts usually apply to competition. Your example is different from this case in which Ross wanted to compete in the same market against West offering a similar service at a lower price. Another reason that the schools get a carveout is because it would make most education impractical without each school obtaining special licenses for public performance for every work referenced in the classroom.

But maybe that also provokes the question as to if schools really deserve that kind of sweetheart treatment (a massive indirect subsidy), or does it over-privileges formal schools relative to the commons at large?

2 more replies

DrScientist1y ago

The case looks pretty straightforward to me - they copied the notes ( human or machine doesn't really matter ) to directly compete with the author of the notes.

If you wrote a program that automatically rephrased an original text - something like the Encyclopaedia Britannica - to preserve the meaning but not have identical phrasing - and then sold access to that information on in a way that undercut the original - then in my view that's clearly ripping off the original creators of the Encyclopedia and would likely stop people writing new versions of the encyclopedia in the future if such activity was allowed.

These laws are there to make sure that valuable activities continue to happen and are not stopped because of theft. We need textbooks, we need journalistic articles - to get these requires people to be paid to work on them.

I think it's entirely reasonable to say that an LLM is such a program - and if used on sources which are sustained by having paid people work on them, and then the reformatted content is sold on in a way to under cut the original activity then that's a theft that's clearly damaging society.

I see LLM's as simply a different way to access the underlying content - the rules of the underlying content should still apply - ChatGPTs revenues are predicted to be in the billions this year - sending some of that to content creators, so that content continues to be produced, is not just right - it's in their interest.

1 more reply

bryanrasmussen1y ago

I think that someone taking Biology 101 and ending up writing textbooks, as opposed to all the other people who just forgot what they learned once the elective was over or ended up working biologists with labs or teachers of biology and so forth, is quite different than someone saying hey I want to make a competing product to this successful company, let's take their content, re-write and use AI to make a competitor, and then actually going into direct competition with that company a couple years later

pigbearpig1y ago

" court’s ruling that West’s headnotes are copyrightable even when they merely quote a court opinion verbatim"

That is the opposite of the ruling. The judge said the ones that summarize and pick out the important parts are copyrightable and specifically excludes the headnotes that quote court opinion verbatim.

The judge:

"But I am still not granting summary judgment on any headnotes that are verbatim copies of the case opinion (for reasons that I explain below)"

anon3738391y ago

You're right as far as the MSJ is concerned, and I should've been more precise. I was focusing on the dictum in the preceding paragraph (because we're discussing the broader implications of the order rather than the nuts-and-bolts of the instant motion). In that paragraph, the judge wrote:

> More than that, each headnote is an individual, copyrightable work. That became clear to me once I analogized the lawyer’s editorial judgment to that of a sculptor. A block of raw marble, like a judicial opinion, is not copyrightable. Yet a sculptor creates a sculpture by choosing what to cut away and what to leave in place. That sculpture is copyrightable. 17 U.S.C. §102(a)(5). So too, even a headnote taken verbatim from an opinion is a carefully chosen fraction of the whole. Identifying which words matter and chiseling away the surrounding mass expresses the editor’s idea about what the important point of law from the opinion is. That editorial expression has enough “creative spark” to be original. ... So all headnotes, even any that quote judicial opinions verbatim, have original value as individual works.

I personally don't think this sculpture metaphor works for verbatim quotes from judicial opinions.

1 more reply

bee_rider1y ago

> It really isn’t workable — in law specifically - for copyright to attach to the mere selection of a quote from a case to represent that case’s holding on an issue. After all, we would expect many lawyers analyzing the case independently to converge on the same quotes!

I guess it depends on how long the source is, and how long the collection of quotes is, if we’d expect multiple lawyers to converge on the same solution. I don’t think it is totally obvious, though…

I’m also not sure if that’s a generally good test. It seems great for, like, painting. But I wouldn’t be surprised if we could come up with a photography scene where most professionals would converge on the same shot…

zozbot2341y ago

If close paraphrase can be detected, this ought to be proof enough that some non-trivial element of creativity was involved in the original text. Because purely functional and necessary elements are not protected by copyright, even when they would otherwise be creative (this is technically known as the 'scenes à faire' case) - and surely a "quote" which is unavoidable because it factually and unquestionably is the core of the ruling would have to fall under that.

DrScientist1y ago

Isn't the argument that the act of selecting the right quote is the real work - and the work the copier avoided in the act of copying?

You could argue that all the words are already in the dictionary - so none of them are new, you are just quoting from the dictionary in a particular order......

The reason you have people, rather than computers interpreting the law, is you can make judgements that make sense. Fundamentally these laws are there to protect work being unfairly ripped off.

What was clearly done in this case was a rip-off which damaged the original creator - everything else is dancing on the head of a pin.

1 more reply

fncypants1y ago

I think this is the best takeaway. This case and its outcome is restricted to its facts. Most of the LLM activity today is very different than what happened here.

singleshot_1y ago

My experience using Westlaw Keycites at work is that they’re not primarily created by fishing a quote out of a holding, but instead by synthesizing a rule. If I want a summary, I read the Keycite; if I want a money quote, I root around in the case linked to the Keycite.

Have you seen different? I’m curious what area of law you practice and in what state, for comparison’s sake.

anon3738391y ago

Yeah, I'd agree that most are synthesized. But I do frequently see headnotes that are verbatim or nearly verbatim slices from the text. Just grabbing a case at random: Kearney v. Salomon Smith Barney, Inc., 39 Cal.4th 95 (2006). The 4th headnote reads:

> The federal system contemplates that individual states may adopt distinct policies to protect their own residents and generally may apply those policies to businesses that choose to conduct business within that state.

And the opinion reads:

> [T]he federal system contemplates that individual states may adopt distinct policies to protect their own residents and generally may apply those policies to businesses that choose to conduct business within that state.

6stringmerc1y ago

The crux is Fair Use and until lobbyists change the four factor test, AI training has an uphill battle in court. It’s a very disliked observation in this forum, but I stand by my principles on this one because the courts see it my way. Derivative works, especially by artificial means, simply fail the test miserably and that’s the truth.

greggyb1y ago

Collections of essays or poems are considered copyrightable. This seems analogous enough to me.

fsckboy1y ago

>the court’s ruling that West’s headnotes are copyrightable even when they merely quote a court opinion verbatim, because the editorial decision to quote the material itself shows a “creative spark” ... when Ross paid human annotators to write their own versions of the headnotes, they really did crib from West’s wholesale rather than doing their own independent analysis

... so it follows that it was then Ross's annotators showing the creative spark

reissbaker1y ago

I'll quote a longer portion of the transcript about generative AI, because I think it makes the opposite of your point:

Ross’s use is not transformative. Transformativeness is about the purpose of the use. “If an original work and a secondary use share the same or highly similar purposes, and the second use is of a commercial nature, the first factor is likely to weigh against fair use, absent some other justification for copying.” Warhol, 598 U.S. at 532–33. It weighs against fair use here. Ross’s use is not transformative because it does not have a “further purpose or different character” from Thomson Reuters’s. Id. at 529.

Ross was using Thomson Reuters’s headnotes as AI data to create a legal research tool to compete with Westlaw. It is undisputed that Ross’s AI is not generative AI (AI that writes new content itself). Rather, when a user enters a legal question, Ross spits back relevant judicial opinions that have already been written. D.I. 723 at 5. That process resembles how Westlaw uses headnotes and key numbers to return a list of cases with fitting headnotes.

I think it's quite relevant that this was not generative AI: the reason that mattered is that "transformative" use biases towards Fair Use exemptions from copyright. However, this wasn't creating new content or giving people a new way to understand the data: it was just used in a search engine, much like Westlaw provided a legal search engine. The judge is pointing out that the exact implementation details of a search engine don't grant Fair Use.

This doesn't make a ruling about generative AI, but I think it's a pretty meaningful distinction: writing new content seems much more "transformative" (in a literal sense: the old content is being used to create new content) than simply writing a similar search engine, albeit one with a better search algorithm.

BoorishBears1y ago

I came here to point this out, and it's especially clear if you contextualize this with the original decision from September: https://www.ded.uscourts.gov/sites/ded/files/opinions/20-613...

They were doing semantic search using embeddings/rerankers.

The point that reading both decisions together compounds is that if they had trained a model on the Bulk Memos and generated novel text instead of doing direct searches, there likely would have been enough indirection introduced to prevent a summary judgement and this would have gone to a jury as the September decision states.

In other words, from their comment:

> But I'm not sure "generative" is that meaningful a distinction here.

The judge would not seem to agree at all.

qingcharles1y ago

Westlaw's headnotes are primarily just snippets of the case with tags attached. They are really crappy. I hate them. Some lawyers love them.

Westlaw protects them because they are the "value add." Otherwise their business model is "take published decisions the court is legally bound to provide for free and sell it to you."

An LLM today could easily recreate the headnotes in a far superior manner from scratch with the right prompt. I don't even think hallucinations would factor in on such a small task that was well regulated, but you can always just asterisk the headnotes and put a disclaimer on them.

Tteriffic1y ago

Exactly. Why use the headnotes at all?

I always thought they were obviously were copyrightable. Plus they’re not close to perfect either.

AlexCoventry1y ago

> You can definitely see how AI companies will be hustling to distinguish this from "we trained on copyrighted documents, and made a general purpose AI, and then people paid to use our AI to compete with the people who owned the documents." It's not quite the same, the connection is less direct, but it's not totally different.

Surely creating a general-purpose AI is transformative, though? Are you anticipating that AI companies will be sued for contributory infringement, because customers are using a general-purpose AI to compete with companies which created parts of the training data?

llamaimperative1y ago

IMO yes. The entire purpose of copyright law is to protect the incentive to create new material. A huge portion of the value prop of AI is that it captures the incentive normally bound for the creators of the training material (i.e. the whole point is you can ask the AI and not even see, never mind pay, the originator).

zozbot2341y ago

Ask the AI for what exactly? Factual information? That gets very low protection from a copyright point of view, especially when separate random answers by the AI will routinely show completely different rephrasings of the AI's response - implying that it can generalize well beyond the "expression" contained in any single answer, and effectively reference the underlying facts.

AlexCoventry1y ago

I'm not a lawyer, but I think the bar for contributory infringement is much higher than that. I think you'd have to find representatives of the defendants actually indicating somehow that people should use it that way. It seems to me that Grokster, etc.'s encouragement of their users to infringe copyright was an important factor in them losing this case, for instance.

https://supreme.justia.com/cases/federal/us/545/913/

1 more reply

Ajedi321y ago

Interestingly, almost the entirety of the judge's opinion seems to be focused on the question of whether the translated notes are subject to copyright. It seems to completely ignore the question of whether training an AI on copyrighted material constitutes making a copy of that work in the first place. Am I missing something?

The judge does note that no copyrighted material was distributed to users, because the AI doesn't output that information:

> There is no factual dispute: Ross’s output to an end user does not include a West headnote. What matters is not “the amount and substantiality of the portion used in making a copy, but rather the amount and substantiality of what is thereby made accessible to a public for which it may serve as a competing substitute.” Authors Guild, 804 F.3d at 222 (internal quotation marks omitted). Because Ross did not make West headnotes available to the public, Ross benefits from factor three.

But he only does so as part of an analysis of whether there's a valid fair use defense for Ross's copying of the head notes, ignoring the obvious (to me) point that if no copyrighted material was distributed to end users, how can this even be a violation of copyright in the first place?

unyttigfjelltol1y ago

Ross evidently copied and used the text himself. It's like Ross creating an unauthorized volume of West's books, perhaps with a twist.

Obscurity ≠ legal compliance.

Ajedi321y ago

So the use of AI actually has nothing to do with the ruling here? This is just about the fact that Ross made one local copy of the notes and never distributed it?

brookst1y ago

How would training on copyrighted material be infringement in a way that merely producing the training material (but not iterating through training) would not be?

kevin_thibedeau1y ago

There were data brokers who literally paid people to transcribe phone books before OCR was a viable option. That was protected, as data isn't copyrightable. It isn't hard to argue that case law metadata is no different even though it includes textual descriptions (themselves taken from public documents).

musicale1y ago

> "we trained on copyrighted documents, and made a general purpose AI, and then people paid to use our AI to compete with the people who owned the documents"

This is a good distillation. A bit like "we trained our system on various works of art and music, and now it is being sold as a service that competes with the original artists and musicians."

bsder1y ago

AI has yet to demonstrate that it can do anything different from what a group of people could sit down and do. Sure, the AI may be able to do it faster, but there hasn't yet been anything demonstrated that exceeds what humans can do.

If it would be illegal for a group of people to do something, it is also going to be illegal for an AI do so.

Why is that so surprising?

echelon1y ago

If the copyright holders win, the model giants will just license.

This effectively kills open source, which can't afford to license and won't be able to sublicense training data.

This is very bad for democratized access to and development of AI.

The giants will probably want this. The giants were already purchasing legacy media content enterprises (Amazon and MGM, etc.), so this will probably further consolidation and create extreme barriers to entry.

If I were OpenAI, I'd probably be very happy right now. If I were a recent batch YC AI company, I'd be mortified.

dkjaudyeqooe1y ago

License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

To the contrary, this just means companies can't make money from these models.

Those using models for research and personal use wouldn't be infringing under the fair use tests.

marcus0x621y ago

> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

Maybe the strategy is something like this:

1) Survive long enough/get enough users that killing the generative AI industry is politically infeasible.

2) Negotiate a compromise similar to the compulsory mechanical royalty system used in the music business to “compensate” the rights holders whose content is used to train the models

The biggest AI companies could even run the enforcement cartels ala BMI/ASCAP to compute and collect royalties owed.

If you take this to its logical conclusion, the AI companies wouldn’t have to pre-license anything, and would just pay out all the royalties to the biggest rights holders (more or less what happens in the music industry) on the basis that figuring out what IP went into what model output is just too hard, so instead they just agree to distribute it to whomever is on the New York Times best seller list at any given moment.

2 more replies

AnthonyMouse1y ago

> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

They don't need every copyrighted work and getting a fraction is entirely practical. They would go to some large conglomerate like Getty Images or large publishers or social media whose terms give the site a license to what you post and then the middle men would get a vig and the original authors would get peanuts if anything at all.

But in aggregate it would price out the little guy from creating a competing model, because each creator getting $3 is nothing to the creator but is real money to a small entity when there are a billion creators.

1 more reply

echoangle1y ago

They didn’t train it on every available copyrighted work though, but on a specific set of legal questions and answers. And they did try to license them, and only did the workaround after not getting a license.

1 more reply

mvdtnz1y ago

> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

Oh no. Anyway.

mvdtnz1y ago

Open source model builders are no more entitled to rip off content owners than anyone else. I couldn't possibly care any less if this impacts "democratized access" to bullshit generators. At least if the big boys license the content then the rightful owners get paid (and have the option to opt out).

brookst1y ago

The copyright lobby has really done a number on public policy. Copyright was never meant to be perpetual.

I’m good with your proposal if we also revert to the original 14 year + 14 year extension model. As it stands the 120 year copyright is so ridiculously tilted that we should not allow it to extend to veto power over technical advancements.

1 more reply

vkou1y ago

I don't have either a data center, or every single copyrighted work in history to import as training data to train my open source model.

Whether or not OpenAI is found to be breaking the law will be utterly irrelevant to actual open AI efforts.

JoshTriplett1y ago

> If the copyright holders win, the model giants will just license.

No, they won't. The biggest models want to train on literally every piece of human-written text ever written. You can pay to license small subsets of that at a time. You can't pay to license all of it. And some of it won't be available to license at all, at any price.

If the copyright holders win, model trainers will have to pay attention to what they train on, rather than blithely ignoring licenses.

simonw1y ago

"The biggest models want to train on literally every piece of human-written text ever written"

They genuinely don't. There is a LOT of garbage text out there that they don't want. They want to train on every high quality piece of human-written text they can get their hands on (where the definition of "high quality" is a major piece of the secret sauce that makes some LLMs better than others), but that doesn't mean every piece of human-written text.

1 more reply

pabs31y ago

Open source models can crowdsource open source training data. This was done for RNNoise for example.

alberto-m1y ago

> Here's the full decision, which (like most decisions!) is largely written to be legible to non-lawyers

For me (Italian) this is amazing! Most Italian judges and lawyers write in a purposely obscure fashion, as if they wanted to keep the plebs away from their holy secrets. This document instead begs to be read; some parts are more in the style of a novel than of a technical document.

dkjaudyeqooe1y ago

> The court emphasizes "Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today." But I'm not sure "generative" is that meaningful a distinction here.

Also the judge makes that statement, it looks like he misunderstands the nature of the AI system and the inherent generative elements it includes.

echoangle1y ago

How is the system inherently generative?

currymj1y ago

Generative is a technical term, meaning that a system models a full joint probability distribution.

For example, a classifier is a generative model if it models p(example, label) -- which is sufficient to also calculate p(label | example) if you want -- rather than just modeling p(label | example) alone.

Similar example in translation: a generative translation model would model p(french sentence, english sentence) -- implicitly including a language model of p(french) and p(english) in addition to allowing translation p(english | french) and p(french | english). A non-generative translation model would, for instance, only model p(french | english).

I don't exactly understand what this judge meant by "generative", it's presumably not the technical term.

2 more replies

Tanjreeve1y ago

Does it really matter what the judge calls it when the ruling is about its end effects and outcomes?

veggieroll1y ago

> Thomson Reuters prevailed on two of the four factors, but Bibas described the fourth as the most important, and ruled that Ross “meant to compete with Westlaw by developing a market substitute.”

Yep. That's what people have been saying all along. If the intent is to substitute the original, then copying is not fair use.

But the problem is that the current method for training requires this volume of data. So the models are legitimately not viable without massive copyright infringement.

It'll be interesting to see how a defendant with a larger wallet will fare. But this doesn't look good.

Though big-picture, it seems to me that the money-ed interests will ensure that even if the current legal landscape doesn't allow LLM's to exist, then they will lobby HARD until it is allowed. This is inevitable now that it's at least partially framed in national security terms.

But I'd hope that this means there is a chance that if models have to train on all of human content, the weights will be available for free to all humans. If it requires massive copyright infringement on our content, we should all have an ownership stake in the resulting models.

johnnyanmac1y ago

>But the problem is that the current method for training requires this volume of data. So the models are legitimately not viable without massive copyright infringement.

Sure it is. It just requires what every other copyright'd work needs: permission and stipulations from the copyright holder. These aren't small time bloggers on the internet, these are large scale businesses.

>Though big-picture, it seems to me that the money-ed interests will ensure that even if the current legal landscape doesn't allow LLM's to exist, then they will lobby HARD until it is allowed.

The only solace I take is that these conglomerates are paying a lot to take down the rules they made 30 years ago when they weren't the ones profiting from stealing. But yes, I'm still frustrated by the hypocrisy.

Ukv1y ago

> Sure it is. It just requires what every other copyright'd work needs: permission and stipulations from the copyright holder.

Most other scenarios don't use millions/billions of works - that's the part which puts viability in question.

> these are large scale businesses.

I'd like training models to also remain accessible to open-source developers, academic researchers, and smaller businesses. Large-scale pretraining is common even for models that are not cutting-edge LLMs.

> The only solace I take is that these conglomerates are paying a lot to take down the rules they made 30 years ago when they weren't the ones profiting from stealing

As far as I'm aware, most of the lobbying in favor of stricter copyright has been done by Disney, Universal, Time Warner, RIAA, etc.

Not to say that tech companies have a consistent moral stance beyond whatever's currently in their financial self-interest, but I think that self-interest has put them in a position of supporting fair use and copyright safe harbors, opposing link tax, etc. more often than the the other way around - with cases like Authors Guild v. Google being a significant win for fair use.

johnnyanmac1y ago

>Most other scenarios don't use millions/billions of works - that's the part which puts viability in question.

Yes, they do. We have acquisitions in the billions these days and exclusivity deals in the hundreds of millions. Let's not pretend these companies can't do this through normal channels. They just wanna steal because they think they can get away from it.

>I'd like training models to also remain accessible to open-source developers, academic researchers, and smaller businesses.

Same. But such models still need to be ethically sourced. Maybe there's not enough royalty free content to compete with OpenAI, but it's pretty clear from Deepseek that you don't need 82 TB of data to be effective. If we need that much data, there are clearly optimizations to be made.

>I think that self-interest has put them in a position of supporting fair use and copyright safe harbors,

Yet they will sue anytime their data is scraped or otherwise not making the money. Maybe they didn't put trillions into lobbying like others, but they definitely have their fair share od using copyright. Microsoft won a lawsuit against web scraping via LinkedIn less than a year before OpenAI fell into legal troubles over scraping the entire internet.

1 more reply

toyg1y ago

> the current method for training requires this volume of data

This is one of those things that signal how dumb this technology still is - or maybe how smart humans are when compared to machines. A human brain doesn't need anywhere close to this volume of data, in order to be able to produce good output.

I remember talking with friends 30 years ago about how it was inevitable that the brain would eventually be fully implemented as machine, once calculation power gets big enough; but it looks like we're still very far from that.

gregschlom1y ago

> A human brain doesn't need anywhere close to this volume of data, in order to be able to produce good output.

Maybe not directly, but consider that our brains are the product of million of years of evolution and aren't a blank slate when we're born. Even though babies can't speak a language at birth, they already have all the neural connections in place in order to acquire and manipulate language, and require just a few years of "supervised fine tuning" to learn the actual language.

LLMs, on the other hand, start with their weights at random values and need to catch up with those million years of evolution first.

skeledrew1y ago

Add to this, the brain is constantly processing raw sensory data from the moment it became viable, even when the body is "sleeping". It's using orders of magnitude more data than any model in existence every moment, but isn't generally deemed "intelligent" enough until it's around 18 years old.

1 more reply

nitwit0051y ago

There can't be that much pre-built into the brain. There isn't that much dna, and only a portion of it can be going to the design of the brain.

A lot of what we're able to do has to be from some sort of generic capability.

johnnyanmac1y ago

sadly, those weights will not be inherited like they would to a baby. They'll be cooped up until the company dies, and that data probably dies with them. No wonder LLM has allegedly hit some stalls already.

dkjaudyeqooe1y ago

> A human brain doesn't need anywhere close to this volume of data, in order to be able to produce good output.

> I remember talking with friends 30 years ago

I'd say you're pretty old. How many years of training did it take for you to start producing good output?

The leason here is we're kind of meta-trained: our minds are primed to pick up new things quickly by abstracting them and relating them to things we already know. We work in concepts and mental models rather than text. LLMs are incredibly weak by comparison. They only understand token sequences.

layer81y ago

That's the point I think. It should be possible to require orders of magnitude less data to create an intelligence, and we are far from achieving that (including achieving AGI in the first place even with those huge amounts of data).

1 more reply

CobrastanJorji1y ago

We are unbelievably far from that. Everyone who tells you that we're within 20 years of emulating brains and says stuff like "the human brain only runs at 100 hertz!" has either been conned by a futurist or is in denial of their own mortality.

veggieroll1y ago

Absolutely! But the question is whether the next step-change in intelligence is just around the corner (in which case, this legal speedbump might spur innovation). Or, will the next revolution take a while.

There's enough money in the market to fund a lot of research into totally novel underlying methods. But if it takes too long, investors and lawmakers will just move to make what already works legal, because it is useful.

Intralexical1y ago

> I remember talking with friends 30 years ago about how it was inevitable that the brain would eventually be fully implemented as machine, once calculation power gets big enough; but it looks like we're still very far from that.

Why would it be?

"It's inevitable that the Burj Khalifa gets built, once steel production gets high enough."

"It's inevitable that Pegasuses will be bred from horses, as soon as somebody collects enough oats."

Reducing intelligence to the bulk aggregate of brute "calculation power" is... Ironically missing the point of intelligence.

saulpw1y ago

> So the models are legitimately not viable without massive copyright infringement.

Copyright is not about acquisition, it is about publication and/or distribution. If I get a copy of Harry Potter from a dumpster, I can read it. If a company gets a copy of *all books from a torrent, they can use it to train their AI. The torrent providers may be in violation of copyright, and if the AI can be used to reproduce substantive portions of the original text, the AI companies then may be in violation of copyright, but simply training a model on illegally distributed text should not be copyright infringement.

dkjaudyeqooe1y ago

> simply training a model on illegally distributed text should not be copyright infringement

You can train a model on copyrighted text, you just can't distribute the output in any way without violating copyright. (edit: depending on the other fair use factors).

One of the big problems is that training is a mechanical process, so there is a direct line between the copyrighted works and the model's output, regardless of the form of the output. Just on those terms it is very likely to be a copyright violation. Even if they don't reproduce substantive portions, what they do reproduce is a derived work.

saulpw1y ago

If that mechanical process is not reversible, then it's not a copyright violation. For instance, I can compute the SHA256 hashes for every book in existence and distribute the resulting table of (ISBN, SHA256) and that is not a copyright violation.

1 more reply

gruez1y ago

>One of the big problems is that training is a mechanical process, so there is a direct line between the copyrighted works and the model's output, regardless of the form of the output. Just on those terms it is very likely to be a copyright violation. Even if they don't reproduce substantive portions, what they do reproduce is a derived work.

Google making thumbnails or scanning books are both arguably "mechanical". Both have been ruled as fair use.

aoanevdus1y ago

What’s a “mechanical process”? If I read The Lord of the Rings and it teaches me to write Star Wars, is that a mechanical process? My brain is governed by the laws of physics, right?

What if I’m a simulated brain running on a chip? What if I’m just a super-smart human and instead of reading and writing in the conventional way, I work out the LLM math in my head to generate the output?

1 more reply

cycomanic1y ago

That's an interesting take, but false in a lot of juristictions. Even if we ignore question of if the model can distribute work, in many places even downloading content is illegal. Otherwise the person torrenting a movie would be totally in the clear, or thing about what MS would say if a company "just" downloads copies of Windows to use on their computers without ever distributing them.

gruez1y ago

>Otherwise the person torrenting a movie would be totally in the clear

Any examples of people being sued for merely downloading? "Torrenting" basically always involves uploading, even if you stop immediately after completion. A better test would be if someone was sued for using an illegal streaming site, which to my knowledge has never happened.

veggieroll1y ago

I mean, you're right in the abstract. If you train an LLM in a void and never do anything with the model, sure.

But that's not what anyone is doing. People train models so that someone can actually use them. So I'm not sure how your comment is helpful other than to point out that distinction (which doesn't make much difference in this case specifically or how copyright applies for LLM's in general)

_DeadFred_1y ago

As long as someone give me the software software to run my business, that person might be in violation of copyright but I'm in the clear.

Simply running my business on illegally distributed copyrighted text/software/movie should not be copyright infringement.

layer81y ago

If you buy a machine that prints copies of copyrighted books (built into the machine), and you use that machine and then distribute the resulting copies, and the machine didn't come with a license allowing you to do so, I'm pretty sure that you are liable as well.

At least some current AI providers, however, come with terms of service that promise that they will cover any such legal disputes for you.

itishappy1y ago

You might not be immediately liable, but that doesn't mean you're allowed to continue. I'd assume it's your duty to cease and desist immediately once it's pointed out that you're in violation.

tyfon1y ago

> Copyright is not about acquisition, it is about publication and/or distribution.

It would be interesting to see how this holds up in court.

"Your honor, I didn't watch the movie I downloaded, I only used it to train an AI."

I highly suspect it would not matter.

johnnyanmac1y ago

well I think that will be the final judgement. We'll treat training data more as distribution than as consumption. Things always get more complicated when you put stuff up for sale. I also can't necessarily get away with Making "Garry Botter" who got accepted into an Enchanter school and goes on adventures with Jon and Germione. Unless it's parody, you can only cut so close before you're just infringinng anyway despite making it legally distinct.

blibble1y ago

> Copyright is not about acquisition, it is about publication and/or distribution. If I get a copy of Harry Potter from a dumpster, I can read it. If a company gets a copy of *all books from a torrent, they can use it to train their AI.

"a person reading" and "computer processing of data" (training) are not the same thing

MDY Industries, LLC v. Blizzard Entertainment, Inc. rendered the verdict that loading unlicensed copyrighted material from disk was "copying", and hence copyright infringement

Animats1y ago

This isn't really about "AI". It's about copying summaries. Google was fined for this in France for copying news headlines into their search results, and now has to pay royalties in the EU. Westlaw is a summarizing and indexing service for court case results. It's been publishing that info in book form since 1872.

Ross was trying to compete with Westlaw, but used Westlaw as an input. West's "Key Numbers" are, after a century and a half, a de-facto standard.[2] So Ross had to match that proprietary indexing system to compete. Their output had to match Westlaw's rather closely. That's the underlying problem. The court ruled that the objective was to directly compete with Westlaw, and using Westlaw's output to do that was intentional copyright infringement.

This looks like a narrow holding, not one that generally covers feeding content into AI training systems.

[1] https://apnews.com/article/google-france-news-publishers-cop...

[2] https://guides.law.stanford.edu/cases/keynumbersystem

zozbot2341y ago

The case involves headnotes, not just key numbers. Your links provide examples of such headnotes, which make it very clear that a lot of human creativity and judgment is involved in authoring them - they're not a matter of purely factual information, such as a phonebook. Thus, the headnotes are copywritten, and translating them to a different language doesn't negate that copyright. This looks like a slam dunk case, but it has very little to do with AI training as such - the AI was only used to create a kind of rough indexing over the translated text.

If this was only about key numbers, it might have gone the other way because the fact-like element there is considerably greater.

chefandy1y ago

Case law is public domain. You can publish digitized copies of Westlaw books with the headnotes, keys, and a couple of other property bits redacted. Any of their proprietary elements though, definitely including the key cites, are clearly a no-go. The headnotes not only require creativity and expertise to make, many lawyers consider them indispensable (though many other lawyers apparently throw shade at lawyers that rely on them.) And since most of the rest of the book is public domain, it’s one of the biggest, if not the biggest selling point for their texts. They famously vigorously defend their copyrights— the defendant surely knew what they were signing up for when they started doing this.

BoorishBears1y ago

> which make it very clear that a lot of human creativity and judgment is involved in authoring them

What's funny is that any SOTA LLM today could definitely author them, and even LexisNexis advertises the fact: https://www.lexisnexis.com/community/insights/legal/b/produc...

mmooss1y ago

TR may have intentionally chosen an easy battle to begin their legal war.

ascorbic1y ago

They began this case in 2020, before any of the most important models existed

preinheimer1y ago

Great. The stated goal of a lot of these companies seems to be “train the model on the output of humans, then hire us instead of the humans”.

It’s been interesting that media where watermarking has been feasible (like photography) have seen creators get access to some compensation, while text based creators get nothing.

rolph1y ago

rotate similar [but different] fonts [or character pages] over each character. the sequence represents data thus watermark.

WillAdams1y ago

but the font changes won't be expressed in the (plain text) output of the LLM.

yifanl1y ago

Presumably the font will represent letters to look like a different letter, making it not useful to LLMs scraping the site but useful for visual readers.

This would have detrimental effects to people who use screen readers or have their own stylesheets of course.

2 more replies

rolph1y ago

yes thats right a plain text will be distinctive from a watermark version thus outed as an automated forgery. vs incorrect watermark sugesting human attempt to forge, this introduces complications for the generation of output, namely conserving the cypher as well as making sense

simonw1y ago

Interesting to note from this 2020 story (when ROSS shut down) that the company was founded in 2014 and went out of business in 2020: https://www.lawnext.com/2020/12/legal-research-company-ross-...

The fact that it took until 2024 for the case to resolve shows how long the wheels of justice can take to turn!

qingcharles1y ago

Litigation takes forever. Especially when you factor COVID in. I'm still litigating cases from over a decade ago that are probably several years from resolution, just in the district court. Then you can spin through appeals courts for another five years. And that's civil.

Criminal, especially a death row case, can take 20+ years to exhaust every level of appellate review. In Illinois there are at least nine levels of review available to you without going through second rounds of review, state habeas, and collateral attacks like applications for clemency, pardons etc. If you're not paying for lawyers, expect each level to take around two years or more.

freeAgent1y ago

My father practiced corporate tax law and regularly had cases at trial that resolved issues from 20-30 years prior.

simonw1y ago

That's wild, I had no idea. I have trouble imagining a case where it's worth spending 30 years coming to a conclusion, but I guess that's one many reasons I'm not a corporate tax lawyer!

freeAgent1y ago

One such repeating case ended up settling for over $10B, so it was definitely worth it!

To clarify, they spent decades litigating the same fundamental issue for each year’s tax filings, with each filing year taking multiple years to get to court. The plaintiffs won every single case until the government finally settled all the remaining tax years for that amount. Each year prior was worth hundreds of millions.

jll291y ago

Note this case is explicitly NOT about large language model type AI - Ross' product is just a traditional search engine (information retrieval system), not a neural transformer a la ChatGPT.

About judge Bibas: https://en.wikipedia.org/wiki/Stephanos_Bibas

dkjaudyeqooe1y ago

The fair use aspect of the ruling should send a chill down the spines of all generative AI vendors. It's just one ruling but it's still bad.

palmotea1y ago

> The fair use aspect of the ruling should send a chill down the spines of all generative AI vendors. It's just one ruling but it's still bad.

So, in other words, it's good.

aurizon1y ago

At the heart of this is a very greedy racket:- court reporters who 'own' the copyright to every word spoken by anyone in court that they transcribe to a transcript that they do not own the source to (judges/witnesses/lawyers/defendants in truth own it) They then milk huge fees for these transcripts and limit use/access/derivative works with huge fees. An AI verbatim transcriber would up end them, so that will be prevented, as will anything that shakes the tree.

habinero1y ago

No, their work is valuable and they deserve to make money off of it.

The reason why it's valuable is it's transcribed live (usually with video) and is accurate and verifiable. Words and names are spelled correctly and speakers are correctly identified. Court reporters will stop speakers and ask for spelling or to repeat words.

AI transcriptions can't do that.

aurizon1y ago

It is valuable, but it should not be extortionable

jug1y ago

I spontaneously feel like this is bad news for open AI, while playing in the hands of corporate behemoths able to strike expensive deals with major publishers and top it off with the public domain.

I’m not sure this signals the end of AI and a victory for the human, but rather who gets to train the models?

varsketiz1y ago

Great decision for humans.

Is this type of risk the reason why OpenAI masquerades as a non-profit?

1 more reply

oidar1y ago

Ross intelligence was creating a product that would directly compete against Thomson Reuters. Pretty clearly not fair use.

ars1y ago

It would be quite an interesting result if we could have true General AI, but we don't simply because of copyright.

I'm aware this isn't a concern yet, but imagine if the future played out this way....

Or worse: Only those with really deep pockets can pay to get AI, and no one else can, simply because they can't afford the copyright fees.

gradientsrneat1y ago

Westlaw is to the legal profession what ResearchGate and others are to science research. They profit from information from the commons, and charge as much as the market will bear.

Only one of the many reasons the legal profession is so expensive.

nickpsecurity1y ago

Almost every article I read on fair use talked like I could only use small amounts while not competing with them. AI people focus on a tiny number of precedents that they stretch very far. A reasonable person wouldn’t come up with their interpretation of fair use after looking at how most examples play out in court.

It shouldn’t surprise the writer that the AI companies’ versions of fair use didn’t hold much weight. They should assume that would be true. Then, be surprised any time a pro-AI ruling goes against common examples in case law. The AI companies are hoping to achieve that by throwing enough money at the legal system.

MonkeyClub1y ago

From p. 6:

"But a headnote can introduce creativity by distilling, synthesizing, or explaining part of an opinion, and thus be copyrightable."

Does this set a precedent, whereby AI-generated summaries are copyrightable by the LLM owners?

teruakohatu1y ago

Ross Intelligence was more a search interface with natural language and, probably, vector based similarity. So I suspect they were hosting and using the corpus in production, not just training a model on it.

NewsaHackO1y ago

How does this affect LLM systems that already have their corpus integrated?

trod12341y ago

The judge ruled this as a violation of copyright. Its the same as hosting any copyright material absent a valid license, criminal copyright piracy.

They would need to figure out a way to prune the respective weights so that such material is not available, or risk legal fury.

Workaccount21y ago

They would just censor output.

Youtube doesn't need to figure out how to stop copyright material from being uploaded, they need to stop it from being shared.

anticensor1y ago

> They would need to figure out a way to prune the respective weights so that such material is not available, or risk legal fury.

You want to reliably train it away from outputting the undesired outputs, not keeping it ignorant about them.

mmooss1y ago

Thomson Reuters chose to sue Ross Intelligence, not a company like Google or even OpenAI. I wonder how deeper pockets would have affected the outcome.

I wonder how the politics played out. The big AI companies could have funded Ross Intelligence, who could have threatened to sabotage their legal strategies by tanking and settling their own case in TR's favor.

mmooss1y ago

I missed this line from the article:

Even before this ruling, Ross Intelligence had already felt the impact of the court battle: the startup shut down in 2021, citing the cost of litigation.

rudedogg1y ago

It being in the legal realm probably had some impact. These tools can be seen as an attack on your profession, and I’m sure that affects the decision, whether conscious or not.

mmooss1y ago

The judge is a user of Westlaw and not a shareholder, as is the judge's office. They would like Westlaw to be cheaper and easier to use. Westlaw takes the judge's work product and profits from it. Arguably, Westlaw should demand all judges recuse themselves!

r00fus1y ago

What's to say that's not the next step? First step, stop your competitors who are copying your IP.

Lawyers are gonna be happy is my thought.

vaadu1y ago

Does anyone think Deepseek or other non-western AIs will respect copyright?

This is going to make Deepseek and its kin much more valuable.

afarviral1y ago

If those 4 aspects are used to judge whether "fair use", I'd say that's the nail in the coffin, because of course it isn't fair use and that's totally fair. Here I was thinking "transformative" was somehow a sticking point in all this.

biohcacker841y ago

If copyright forces a diversity of AIs. That would be good.

Every AI company using its own created training, resulting in AIs that are similar but not identical, is in my opinion much better than one or very few AIs.

iandanforth1y ago

Establishing precedent by defeating an already dead company in court is neither impressive nor likely to hold up for other companies.

asadotzler1y ago

incorrect

rvz1y ago

See. The fair-use excuses that the AI proponents here were trying to hang on to for dear life have fallen flat on this ruling.

This is going to be one of many cases in which there will be licensing deals being made out of this to stop AI grifters claiming 'fair use' to try to side-step copyright laws because they are using a gen AI system.

OpenAI ended up paying up for the data with Shutterstock and other news sources. This will be no different.

Salgat1y ago

My biggest concern is, what happens when countries like China, who aren't restricted by this, far outpace western countries in this technology? Do we just shrug and accept our far inferior models? LLMs are a productivity multiplier (similar to a search engine), so it'll have a large impact on the economy if licensing costs prohibit large scale training.

esafak1y ago

"Our" models are not inferior. There is plenty of data, and the next frontier is prediction-time compute and data synthesis. Shouldn't the Chinese worry that they are depressing the publication of commercial IP?

Salgat1y ago

Our models are not inferior because we're not constrained by copyrights, which is the entire concern we're discussing.

blibble1y ago

once the case law is set, I look forward to suing everyone that's ever trained a model for $300,000 PER WORK each time they ingested my code from GitHub

whoever wrote those indemnity policies is going to regret it

anticensor1y ago

> once the case law is set, I look forward to suing everyone that's ever trained a model for $300,000 PER WORK each time they ingested my code from GitHub

Didn't you already share it on GitHub royalty-free?

blibble1y ago

no, it was under a very specific license that required attribution

xyzal1y ago

I can't understand how some commenters frame such a result as not good. The big players will have no problem licensing large corpora to train their models, while my tiny site won't be vacuumed (legally at least) by scrapers if I won't agree.

My willingness to upload my projects anywhere is in the historical lows given the current state, honestly.

2OEH8eoCRo01y ago

Fantastic news!

lazycog5121y ago

seems like delaware can't scare tech companies out of re-incorporating any faster

YesBox1y ago

Thanks. The article wasn't loading for me, just the headline and image and footer. I was about to leave thinking that's all there is.

dang1y ago

We detached this comment from https://news.ycombinator.com/item?id=43018356.

Your post is totally fine; I just want to save space at the top of the thread (where the parent is now pinned).

j / k navigate · click thread line to collapse

177 comments

EnderWT1y ago

https://archive.is/mu49I

JackC1y ago

Here's the full decision, which (like most decisions!) is largely written to be legible to non-lawyers: https://storage.courtlistener.com/recap/gov.uscourts.ded.721...

anon3738391y ago

This is an interesting opinion, but there are aspects of it that I doubt will stand the test of time.

The case has very little to say about the more commonly posed question of whether copyright is infringed in large-scale language modeling.

AnthonyMouse1y ago

> That, plus the fact that Ross was a directly competing product, is what I see as really driving this decision.

The "competing product" thing is probably the most extreme part of this opinion.

But this is also a district court opinion that isn't even binding on other courts, so we'll see what happens if it gets appealed.

mountainb1y ago

2 more replies

DrScientist1y ago

The case looks pretty straightforward to me - they copied the notes ( human or machine doesn't really matter ) to directly compete with the author of the notes.

1 more reply

bryanrasmussen1y ago

pigbearpig1y ago

" court’s ruling that West’s headnotes are copyrightable even when they merely quote a court opinion verbatim"

The judge:

"But I am still not granting summary judgment on any headnotes that are verbatim copies of the case opinion (for reasons that I explain below)"

anon3738391y ago

I personally don't think this sculpture metaphor works for verbatim quotes from judicial opinions.

1 more reply

bee_rider1y ago

zozbot2341y ago

DrScientist1y ago

Isn't the argument that the act of selecting the right quote is the real work - and the work the copier avoided in the act of copying?

You could argue that all the words are already in the dictionary - so none of them are new, you are just quoting from the dictionary in a particular order......

The reason you have people, rather than computers interpreting the law, is you can make judgements that make sense. Fundamentally these laws are there to protect work being unfairly ripped off.

What was clearly done in this case was a rip-off which damaged the original creator - everything else is dancing on the head of a pin.

1 more reply

fncypants1y ago

I think this is the best takeaway. This case and its outcome is restricted to its facts. Most of the LLM activity today is very different than what happened here.

singleshot_1y ago

Have you seen different? I’m curious what area of law you practice and in what state, for comparison’s sake.

anon3738391y ago

And the opinion reads:

6stringmerc1y ago

greggyb1y ago

Collections of essays or poems are considered copyrightable. This seems analogous enough to me.

fsckboy1y ago

... so it follows that it was then Ross's annotators showing the creative spark

reissbaker1y ago

I'll quote a longer portion of the transcript about generative AI, because I think it makes the opposite of your point:

BoorishBears1y ago

I came here to point this out, and it's especially clear if you contextualize this with the original decision from September: https://www.ded.uscourts.gov/sites/ded/files/opinions/20-613...

They were doing semantic search using embeddings/rerankers.

In other words, from their comment:

> But I'm not sure "generative" is that meaningful a distinction here.

The judge would not seem to agree at all.

qingcharles1y ago

Westlaw's headnotes are primarily just snippets of the case with tags attached. They are really crappy. I hate them. Some lawyers love them.

Westlaw protects them because they are the "value add." Otherwise their business model is "take published decisions the court is legally bound to provide for free and sell it to you."

Tteriffic1y ago

Exactly. Why use the headnotes at all?

I always thought they were obviously were copyrightable. Plus they’re not close to perfect either.

AlexCoventry1y ago

llamaimperative1y ago

zozbot2341y ago

AlexCoventry1y ago

https://supreme.justia.com/cases/federal/us/545/913/

1 more reply

Ajedi321y ago

The judge does note that no copyrighted material was distributed to users, because the AI doesn't output that information:

unyttigfjelltol1y ago

Ross evidently copied and used the text himself. It's like Ross creating an unauthorized volume of West's books, perhaps with a twist.

Obscurity ≠ legal compliance.

Ajedi321y ago

So the use of AI actually has nothing to do with the ruling here? This is just about the fact that Ross made one local copy of the notes and never distributed it?

brookst1y ago

How would training on copyrighted material be infringement in a way that merely producing the training material (but not iterating through training) would not be?

kevin_thibedeau1y ago

musicale1y ago

> "we trained on copyrighted documents, and made a general purpose AI, and then people paid to use our AI to compete with the people who owned the documents"

This is a good distillation. A bit like "we trained our system on various works of art and music, and now it is being sold as a service that competes with the original artists and musicians."

bsder1y ago

If it would be illegal for a group of people to do something, it is also going to be illegal for an AI do so.

Why is that so surprising?

echelon1y ago

If the copyright holders win, the model giants will just license.

This effectively kills open source, which can't afford to license and won't be able to sublicense training data.

This is very bad for democratized access to and development of AI.

If I were OpenAI, I'd probably be very happy right now. If I were a recent batch YC AI company, I'd be mortified.

dkjaudyeqooe1y ago

License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

To the contrary, this just means companies can't make money from these models.

Those using models for research and personal use wouldn't be infringing under the fair use tests.

marcus0x621y ago

> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

Maybe the strategy is something like this:

1) Survive long enough/get enough users that killing the generative AI industry is politically infeasible.

2) Negotiate a compromise similar to the compulsory mechanical royalty system used in the music business to “compensate” the rights holders whose content is used to train the models

The biggest AI companies could even run the enforcement cartels ala BMI/ASCAP to compute and collect royalties owed.

2 more replies

AnthonyMouse1y ago

> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

1 more reply

echoangle1y ago

1 more reply

mvdtnz1y ago

> License what? Every available copyrighted work? Even getting a tiny fraction is not practical.

Oh no. Anyway.

mvdtnz1y ago

brookst1y ago

The copyright lobby has really done a number on public policy. Copyright was never meant to be perpetual.

1 more reply

vkou1y ago

I don't have either a data center, or every single copyrighted work in history to import as training data to train my open source model.

Whether or not OpenAI is found to be breaking the law will be utterly irrelevant to actual open AI efforts.

JoshTriplett1y ago

> If the copyright holders win, the model giants will just license.

If the copyright holders win, model trainers will have to pay attention to what they train on, rather than blithely ignoring licenses.

simonw1y ago

"The biggest models want to train on literally every piece of human-written text ever written"

1 more reply

pabs31y ago

Open source models can crowdsource open source training data. This was done for RNNoise for example.

alberto-m1y ago

> Here's the full decision, which (like most decisions!) is largely written to be legible to non-lawyers

dkjaudyeqooe1y ago

Also the judge makes that statement, it looks like he misunderstands the nature of the AI system and the inherent generative elements it includes.

echoangle1y ago

How is the system inherently generative?

currymj1y ago

Generative is a technical term, meaning that a system models a full joint probability distribution.

I don't exactly understand what this judge meant by "generative", it's presumably not the technical term.

2 more replies

Tanjreeve1y ago

Does it really matter what the judge calls it when the ruling is about its end effects and outcomes?

veggieroll1y ago

Yep. That's what people have been saying all along. If the intent is to substitute the original, then copying is not fair use.

But the problem is that the current method for training requires this volume of data. So the models are legitimately not viable without massive copyright infringement.

It'll be interesting to see how a defendant with a larger wallet will fare. But this doesn't look good.

johnnyanmac1y ago

>But the problem is that the current method for training requires this volume of data. So the models are legitimately not viable without massive copyright infringement.

>Though big-picture, it seems to me that the money-ed interests will ensure that even if the current legal landscape doesn't allow LLM's to exist, then they will lobby HARD until it is allowed.

Ukv1y ago

> Sure it is. It just requires what every other copyright'd work needs: permission and stipulations from the copyright holder.

Most other scenarios don't use millions/billions of works - that's the part which puts viability in question.

> these are large scale businesses.

> The only solace I take is that these conglomerates are paying a lot to take down the rules they made 30 years ago when they weren't the ones profiting from stealing

As far as I'm aware, most of the lobbying in favor of stricter copyright has been done by Disney, Universal, Time Warner, RIAA, etc.

johnnyanmac1y ago

>Most other scenarios don't use millions/billions of works - that's the part which puts viability in question.

>I'd like training models to also remain accessible to open-source developers, academic researchers, and smaller businesses.

>I think that self-interest has put them in a position of supporting fair use and copyright safe harbors,

1 more reply

toyg1y ago

> the current method for training requires this volume of data

gregschlom1y ago

> A human brain doesn't need anywhere close to this volume of data, in order to be able to produce good output.

LLMs, on the other hand, start with their weights at random values and need to catch up with those million years of evolution first.

skeledrew1y ago

1 more reply

nitwit0051y ago

There can't be that much pre-built into the brain. There isn't that much dna, and only a portion of it can be going to the design of the brain.

A lot of what we're able to do has to be from some sort of generic capability.

johnnyanmac1y ago

dkjaudyeqooe1y ago

> A human brain doesn't need anywhere close to this volume of data, in order to be able to produce good output.

> I remember talking with friends 30 years ago

I'd say you're pretty old. How many years of training did it take for you to start producing good output?

layer81y ago

1 more reply

CobrastanJorji1y ago

veggieroll1y ago

Intralexical1y ago

Why would it be?

"It's inevitable that the Burj Khalifa gets built, once steel production gets high enough."

"It's inevitable that Pegasuses will be bred from horses, as soon as somebody collects enough oats."

Reducing intelligence to the bulk aggregate of brute "calculation power" is... Ironically missing the point of intelligence.

saulpw1y ago

> So the models are legitimately not viable without massive copyright infringement.

dkjaudyeqooe1y ago

> simply training a model on illegally distributed text should not be copyright infringement

You can train a model on copyrighted text, you just can't distribute the output in any way without violating copyright. (edit: depending on the other fair use factors).

saulpw1y ago

1 more reply

gruez1y ago

Google making thumbnails or scanning books are both arguably "mechanical". Both have been ruled as fair use.

aoanevdus1y ago

What’s a “mechanical process”? If I read The Lord of the Rings and it teaches me to write Star Wars, is that a mechanical process? My brain is governed by the laws of physics, right?

1 more reply

cycomanic1y ago

gruez1y ago

>Otherwise the person torrenting a movie would be totally in the clear

veggieroll1y ago

I mean, you're right in the abstract. If you train an LLM in a void and never do anything with the model, sure.

_DeadFred_1y ago

As long as someone give me the software software to run my business, that person might be in violation of copyright but I'm in the clear.

Simply running my business on illegally distributed copyrighted text/software/movie should not be copyright infringement.

layer81y ago

At least some current AI providers, however, come with terms of service that promise that they will cover any such legal disputes for you.

itishappy1y ago

You might not be immediately liable, but that doesn't mean you're allowed to continue. I'd assume it's your duty to cease and desist immediately once it's pointed out that you're in violation.

tyfon1y ago

> Copyright is not about acquisition, it is about publication and/or distribution.

It would be interesting to see how this holds up in court.

"Your honor, I didn't watch the movie I downloaded, I only used it to train an AI."

I highly suspect it would not matter.

johnnyanmac1y ago

blibble1y ago

"a person reading" and "computer processing of data" (training) are not the same thing

MDY Industries, LLC v. Blizzard Entertainment, Inc. rendered the verdict that loading unlicensed copyrighted material from disk was "copying", and hence copyright infringement

Animats1y ago

This looks like a narrow holding, not one that generally covers feeding content into AI training systems.

[1] https://apnews.com/article/google-france-news-publishers-cop...

[2] https://guides.law.stanford.edu/cases/keynumbersystem

zozbot2341y ago

If this was only about key numbers, it might have gone the other way because the fact-like element there is considerably greater.

chefandy1y ago

BoorishBears1y ago

> which make it very clear that a lot of human creativity and judgment is involved in authoring them

What's funny is that any SOTA LLM today could definitely author them, and even LexisNexis advertises the fact: https://www.lexisnexis.com/community/insights/legal/b/produc...

mmooss1y ago

TR may have intentionally chosen an easy battle to begin their legal war.

ascorbic1y ago

They began this case in 2020, before any of the most important models existed

preinheimer1y ago

Great. The stated goal of a lot of these companies seems to be “train the model on the output of humans, then hire us instead of the humans”.

It’s been interesting that media where watermarking has been feasible (like photography) have seen creators get access to some compensation, while text based creators get nothing.

rolph1y ago

rotate similar [but different] fonts [or character pages] over each character. the sequence represents data thus watermark.

WillAdams1y ago

but the font changes won't be expressed in the (plain text) output of the LLM.

yifanl1y ago

Presumably the font will represent letters to look like a different letter, making it not useful to LLMs scraping the site but useful for visual readers.

This would have detrimental effects to people who use screen readers or have their own stylesheets of course.

2 more replies

rolph1y ago

simonw1y ago

Interesting to note from this 2020 story (when ROSS shut down) that the company was founded in 2014 and went out of business in 2020: https://www.lawnext.com/2020/12/legal-research-company-ross-...

The fact that it took until 2024 for the case to resolve shows how long the wheels of justice can take to turn!

qingcharles1y ago

freeAgent1y ago

My father practiced corporate tax law and regularly had cases at trial that resolved issues from 20-30 years prior.

simonw1y ago

That's wild, I had no idea. I have trouble imagining a case where it's worth spending 30 years coming to a conclusion, but I guess that's one many reasons I'm not a corporate tax lawyer!

freeAgent1y ago

One such repeating case ended up settling for over $10B, so it was definitely worth it!

jll291y ago

Note this case is explicitly NOT about large language model type AI - Ross' product is just a traditional search engine (information retrieval system), not a neural transformer a la ChatGPT.

About judge Bibas: https://en.wikipedia.org/wiki/Stephanos_Bibas

dkjaudyeqooe1y ago

The fair use aspect of the ruling should send a chill down the spines of all generative AI vendors. It's just one ruling but it's still bad.

palmotea1y ago

> The fair use aspect of the ruling should send a chill down the spines of all generative AI vendors. It's just one ruling but it's still bad.

So, in other words, it's good.

aurizon1y ago

habinero1y ago

No, their work is valuable and they deserve to make money off of it.

AI transcriptions can't do that.

aurizon1y ago

It is valuable, but it should not be extortionable

jug1y ago

I spontaneously feel like this is bad news for open AI, while playing in the hands of corporate behemoths able to strike expensive deals with major publishers and top it off with the public domain.

I’m not sure this signals the end of AI and a victory for the human, but rather who gets to train the models?

varsketiz1y ago

Great decision for humans.

Is this type of risk the reason why OpenAI masquerades as a non-profit?

1 more reply

oidar1y ago

Ross intelligence was creating a product that would directly compete against Thomson Reuters. Pretty clearly not fair use.

ars1y ago

It would be quite an interesting result if we could have true General AI, but we don't simply because of copyright.

I'm aware this isn't a concern yet, but imagine if the future played out this way....

Or worse: Only those with really deep pockets can pay to get AI, and no one else can, simply because they can't afford the copyright fees.

gradientsrneat1y ago

Westlaw is to the legal profession what ResearchGate and others are to science research. They profit from information from the commons, and charge as much as the market will bear.

Only one of the many reasons the legal profession is so expensive.

nickpsecurity1y ago

MonkeyClub1y ago

From p. 6:

"But a headnote can introduce creativity by distilling, synthesizing, or explaining part of an opinion, and thus be copyrightable."

Does this set a precedent, whereby AI-generated summaries are copyrightable by the LLM owners?

teruakohatu1y ago

NewsaHackO1y ago

How does this affect LLM systems that already have their corpus integrated?

trod12341y ago

The judge ruled this as a violation of copyright. Its the same as hosting any copyright material absent a valid license, criminal copyright piracy.

They would need to figure out a way to prune the respective weights so that such material is not available, or risk legal fury.

Workaccount21y ago

They would just censor output.

Youtube doesn't need to figure out how to stop copyright material from being uploaded, they need to stop it from being shared.

anticensor1y ago

> They would need to figure out a way to prune the respective weights so that such material is not available, or risk legal fury.

You want to reliably train it away from outputting the undesired outputs, not keeping it ignorant about them.

mmooss1y ago

Thomson Reuters chose to sue Ross Intelligence, not a company like Google or even OpenAI. I wonder how deeper pockets would have affected the outcome.

mmooss1y ago

I missed this line from the article:

Even before this ruling, Ross Intelligence had already felt the impact of the court battle: the startup shut down in 2021, citing the cost of litigation.

rudedogg1y ago

It being in the legal realm probably had some impact. These tools can be seen as an attack on your profession, and I’m sure that affects the decision, whether conscious or not.

mmooss1y ago

r00fus1y ago

What's to say that's not the next step? First step, stop your competitors who are copying your IP.

Lawyers are gonna be happy is my thought.

vaadu1y ago

Does anyone think Deepseek or other non-western AIs will respect copyright?

This is going to make Deepseek and its kin much more valuable.

afarviral1y ago

biohcacker841y ago

If copyright forces a diversity of AIs. That would be good.

Every AI company using its own created training, resulting in AIs that are similar but not identical, is in my opinion much better than one or very few AIs.

iandanforth1y ago

Establishing precedent by defeating an already dead company in court is neither impressive nor likely to hold up for other companies.

asadotzler1y ago

incorrect

rvz1y ago

See. The fair-use excuses that the AI proponents here were trying to hang on to for dear life have fallen flat on this ruling.

OpenAI ended up paying up for the data with Shutterstock and other news sources. This will be no different.

Salgat1y ago

esafak1y ago

Salgat1y ago

Our models are not inferior because we're not constrained by copyrights, which is the entire concern we're discussing.

blibble1y ago

once the case law is set, I look forward to suing everyone that's ever trained a model for $300,000 PER WORK each time they ingested my code from GitHub

whoever wrote those indemnity policies is going to regret it

anticensor1y ago

> once the case law is set, I look forward to suing everyone that's ever trained a model for $300,000 PER WORK each time they ingested my code from GitHub

Didn't you already share it on GitHub royalty-free?

blibble1y ago

no, it was under a very specific license that required attribution

xyzal1y ago

My willingness to upload my projects anywhere is in the historical lows given the current state, honestly.

2OEH8eoCRo01y ago

Fantastic news!

lazycog5121y ago

seems like delaware can't scare tech companies out of re-incorporating any faster

YesBox1y ago

Thanks. The article wasn't loading for me, just the headline and image and footer. I was about to leave thinking that's all there is.

dang1y ago

We detached this comment from https://news.ycombinator.com/item?id=43018356.

Your post is totally fine; I just want to save space at the top of the thread (where the parent is now pinned).

j / k navigate · click thread line to collapse