NY Times copyright suit wants OpenAI to delete all GPT instances

(arstechnica.com)

529 points · justinc8687 · 2y ago · 891 comments

207 comments · 82 top-level

rich_sasha · 2y ago · 25 in thread

If you forget about the LLM aspect, and simply build a product out of (legally) scraped NYT articles, is that fair use?

Let's say I host these, offer some indexing on it, and rewrite articles. Something like, summarise all articles on US-UK relationships over past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee. To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, maybe quite short snippets.

Is that fair use? IANAL, but doesn't sound like it. Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say like VS Code.

A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all. Rather it points me at it, and offers me the opportunity to pay for the article. Whereas either this or an LLM uses NYT content as an alternative to actually paying for an NYT subscription.

But then what do I know...

carlosdp · 2y ago

What you described is entirely fair use, actually.

Not only that, look at a few news articles from Tier 2 and down publications, and you'll realize that almost all of them are directly sourced from NYT and others. They'll say "so and so happened, according to The Times" (and usually link the article there)

heavyset_go · 2y ago

Another factor to consider is that neural nets can function as lossy compression, which becomes extremely evident when using models that are overfit.

Sometimes they're so overfit that the compression isn't even lossy, and the data is encoded verbatim in the NN.

modeless · 2y ago

> To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, maybe quite short snippets.

You just described Google. When you think about it, it's surprising that Google is legal. However, it is well established that what Google does is perfectly legal. Remember that internally Google keeps and uses complete verbatim copies of every web page they index.

Yes, Google offers a link to the source. If OpenAI did the same, even if only 0.1% of people clicked on the links and NYTimes hardly got any revenue from it, would that make it legal in your eyes? What if they implemented a system that detected when it was outputting a verbatim copy of something and simply paraphrased it? NYTimes clearly doesn't have copyright on paraphrased versions of their articles. I think it would be pretty silly if the government forced them to do that as it wouldn't make any practical difference to anyone.

sagarm · 2y ago

Isn't it totally normal to write articles / blog posts that effectively summarize, and often quote from, news articles?

amelius · 2y ago

> If you forget about the LLM aspect, and simply build a product out of (legally) scraped NYT articles, is that fair use?

That's not a good question.

If I look out of my window and see my neighbor go to the shop, that's fine. If I use cameras and track everybody I see on the street and put them in a database, then that's problematic and illegal in many places.

Logic does not necessarily apply when scaling is involved.

oefrha · 2y ago

The real answer is it totally depends on whether your product grows to $10,000,000,000, and whether you pay part of it back. Search engines pay with referral traffic.

1vuio0pswjnm7 · 2y ago

If there is payment then usually there is an agreement. An agreement can limit fair use. Can the NYT, via an agreement, e.g., "Terms of Use", limit what the subscriber does with the articles? There is not much precedent that suggests otherwise.

Consider the analogy from libraries that want to do data mining.

"Unfortunately, in licenses for digital scholarly content (the majority of content acquired by research libraries), publishers often include terms that prohibit certain uses that would otherwise be allowable under the Copyright Act. For instance, licenses may require libraries or individual researchers to negotiate for otherwise lawful activities, such as text and data mining, and to pay exorbitant fees on top of the cost of the content itself. While new regulations allow researchers to circumvent technological protection measures to access copyrighted materials, licenses for that content may include terms that explicitly prohibit this circumvention. In many cases, these activities might actually increase the value of published material; for instance, if a data-mining project yields new knowledge about a topic covered in a journal, it may very well spark new interest in that journal's content. Libraries and publishers have often assumed that license terms that restrict copyright exceptions are enforceable under state contract law. There is, however, surprisingly little case law on this point."

https://www.arl.org/wp-content/uploads/2022/07/Copyright-and...

Putting some string in a robots.txt to try to stop data collection is an amusing "solution". Should copyright owners have "Terms of Use" that limit usage for commercial "AI" purposes?

tauntz · 2y ago

> Is that fair use?

As always, the answer is.. "it depends". I guess it depends mostly on the jurisdiction that applies to you. "Fair use" can have rather different legal meaning (or not exist at all) in different countries.

px43 · 2y ago

From what I can tell, this has nothing to do with LLMs at all. In the example in the article, the user is asking Bing to go fetch the contents of an article directly from the website, and print it out, which it dutifully does.

Seems like the "problem" is that NYT etc. give privileged access to search engines for indexing their content, but then get upset when snippets of the indexed content are shown to users without the users having to fight the paywall or whatever.

This article also claims that the screenshot is coming from ChatGPT when it clearly is not.

canttosme · 2y ago

https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn :

> Implications: The Ninth Circuit's declaration that selectively banning potential competitors from accessing and using data that is publicly available can be considered unfair competition under California law may have large implication for antitrust law. [citation needed]

> Other countries with laws to prevent monopolistic practices or anti-trust laws may also see similar disputes and prospectively judgements hailing commercial use of publicly accessible information. While there is global precedence by virtue of large companies such as Thomson Reuters, Bloomberg or Google [or LexisNexis or Westlaw] effectively using web-scraping or crawling to aggregate information from disparate sources across the web, fundamentally the judgement by Ninth Circuit fortifies the lack of enforceability of browse-wrap agreements over conduct of trade using publicly available information.

paulddraper · 2y ago

> Something like, summarize all articles on US-UK relationships over past 5 years.

So like... Wikipedia, CliffsNotes, encyclopedias, etc.?

None of these pay royalties to the original.

TeMPOraL · 2y ago

> Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say like VS Code.

Can't you, though? I'd thought that in general it's very important for the market to be able to do just that, otherwise everything gets gummed up in webs of exclusive contractual dependencies between established companies.

qeternity · 2y ago

> Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it.

I think you’re confusing terms of service and copyright. IANAL but what you describe sounds exactly like fair use to me, irrespective of how much you are paying NYT.

BlueTemplar · 2y ago

There's nothing wrong with scraping openly available data (including data openly available by mistake, as long as you are not aware of it, see the Bluetouff affair).

So the demand to destroy those databases seems very dubious to me.

Of course later violating fair use is another issue.

ks2048 · 2y ago

It would be nice to have a nice principled answer to this, but unfortunately, in our world, the answer is probably: if you start making LOTS of money doing this, they will come after you.

bnralt · 2y ago

As someone pointed out, plenty of blogs made money off of doing just that. Many people go to Reddit to read news article summaries (and often a comment just pastes the whole article verbatim), instead of paying a site like the New York Times. Twitter and other social media sites are full of people summarizing articles from the New York Times. Any late breaking news article from Wikipedia is going to be mostly summarizing information from reporters.

I think people severely underestimate how much they've grown accustomed to this information being freely available. It's easy to say "Well it shouldn't be available with ChatGPT," but if we actually put everything back behind a paywall and stopped people from doing things like writing blogs or newsletters that summarize the news, people here would get angry very fast.

oh_sigh · 2y ago

I agree with your IANAL take, but what about a situation with an extra level of indirection? So the service never reads actual NYT articles, but only reads blog/forum posts about NYT articles, and derives what is in the article from conversations about the article by people who have read it. Is that legal now?

charcircuit · 2y ago

This analogy fails to capture the transformative nature of these models. Hosting a derivative work that is also a news article is not transformative. Hosting a next word completer is very different than a news article and can't be used as a substitute.

jojobas · 2y ago

Can you read all of NYT and other things, and answer others' questions based on your knowledge? I'd imagine you can. I'm afraid you can't sidestep the question whether an LLM is more like a person who's read a lot or an archive/index.

superb-owl · 2y ago

> A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all.

Google has been accused for years of replacing sources with their "One Box"--the big answers at the top of the page, which are usually pulled from or corroborated by search results. They don't want you to leave the search results page (where the ads are).

brookst · 2y ago

How about if you read the paper every day and write opinion pieces about world events? Fair use?

urbandw311er · 2y ago

I believe that part of the lawsuit contends that the content wasn’t able to be scraped “legally” as you put it. Instead they show that ChatGPT will regurgitate verbatim excerpts from articles that are behind the paywall.

JCharante · 2y ago

But is it legal for me to read the NY Times about a war, and then charge people to interview me as an "expert"?

papruapap · 2y ago

Using similar logic NYT should pay all actors involved in their articles.

logicchains · 2y ago

> Something like, summarise all articles on US-UK relationships over past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee.

>Is that fair use? IANAL, but doesn't sound like it.

If you pay someone to do the summarisation for you, then you publish the content and charge a fee for it, you're the one liable, not the person you paid to summarise it for you. Similarly if you ask GPT to do it for you, then publish it, you're liable for what you publish; GPT is just a summarisation tool.

groceryheist · 2y ago · 13 in thread

The suit demonstrates instances where ChatGPT / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use". However, OAI/MS should be able to fix this within the current paradigm: Just learn to recognize and punish plagiarism via RLHF.

However, the suit goes far beyond claiming that such copying violates their copyright: "Unauthorized copying of Times Works without payment to train LLMs is a substitutive use that is not justified by any transformative purpose."

This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring. Hopefully the judge(s) will notice and direct focus on the interesting, high-stakes, and murky legal issues raised when we ask: What about a model can (or can't) be "transformative"?

visarga · 2y ago

> Just learn to recognize and punish plagiarism via RLHF.

This is not a RLHF problem. What I was expecting them to do is to keep a bloom filter of ngrams for known copyrighted content, such as enumerating all sets of n=7 consecutive words in an article, and validate against it. The model would only output at maximum n-1 words that look verbatim from the source.

But this will blow up in their face. Let's see:

- AI companies will start investing much more in content attribution

- The new content attribution tools will be applied on all human written articles as well, because anyone could be using GPT in secret

- Then people will start seeing a chilling effect on creativity

- We must also check NYT against all the other sources, since not everything they write is original
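
For anyone who wants to see the n-gram filter idea from the top of this comment in concrete form, here is a minimal sketch. It assumes a hand-rolled Bloom filter and whitespace tokenization; the class and function names (`BloomFilter`, `index_protected`, `first_flagged_ngram`) are made up for illustration and do not describe anything OpenAI actually runs.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def ngrams(text: str, n: int = 7):
    """Yield every run of n consecutive lowercased words."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def index_protected(bloom: BloomFilter, article: str, n: int = 7) -> None:
    """Record every n-gram of a protected article (hypothetical helper)."""
    for gram in ngrams(article, n):
        bloom.add(gram)


def first_flagged_ngram(bloom: BloomFilter, output: str, n: int = 7):
    """Return the first n-gram of `output` found in the filter, else None."""
    for gram in ngrams(output, n):
        if gram in bloom:
            return gram
    return None
```

A generation loop could call `first_flagged_ngram` on the running output and back off when it returns a match. Because Bloom filters give false positives, some original text would occasionally be flagged too, and light paraphrasing would slip through.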

colechristensen · 2y ago

I think NYT is going to win.

LLMs are arguably compressed data archives with weird algorithms. The fact that they will regularly regurgitate verbatim quotes of training data is evidence of this, as are the guardrails that try to prevent this.

The second piece of evidence is this paper explained here https://www.hendrik-erz.de/post/why-gzip-just-beat-a-large-l... where instead of an LLM researchers used gzip compressed data as a model and it even beat trained LLMs.

AI is a bit of a black box, but that doesn’t protect the operators of black boxes from rights violation suits. You can’t make a database of scraped copyrighted data and pretend that querying that data is fair use.

There needs to be law made here and the law just isn’t going to be “everybody can copy everything for free as long as it’s for model training”.

Licensing will have to be worked out, and actual laws, not just case law, need to be written. I have a lot of sympathy for lots of leeway for the open source researchers and hackers doing things… but not so much for Microsoft and Microsoft-sponsored OpenAI.
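
For readers unfamiliar with the gzip result mentioned above: as I understand it, the method pairs gzip with a nearest-neighbour vote, using normalized compression distance (NCD) as the similarity measure. The sketch below assumes that description and is not the paper's code; `clen`, `ncd`, and `classify` are names chosen here for illustration.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: smaller means more shared structure."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(query: str, labeled: list[tuple[str, str]], k: int = 3) -> str:
    """Label `query` by majority vote among its k nearest labeled texts."""
    nearest = sorted(labeled, key=lambda pair: ncd(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Nothing here is trained: the "model" is just the labeled texts plus a general-purpose compressor, which is why it gets cited in arguments about models as compression.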

jahewson · 2y ago

Many instances of fair use involve verbatim copying. The important questions surround the situation in which that happens - not so much the copying. NYT is in uncharted territory here.

furyofantares · 2y ago

> Just learn to recognize and punish plagiarism via RLHF.

I'm not sure how your proposal would actually work. To recognize plagiarism during inference it needs to memorize harder.

Kinda funny if it works though. We'd first train them to copy their training data verbatim, then train them not to.

That is how it works, right? They're trained to copy their training data verbatim because that's the loss function. It's just that they're given so much data that we don't expect this to be possible for most of the training data given the parameter count.
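
For reference, the usual next-token objective can be written as below. This is the textbook cross-entropy formulation, not anything specific to OpenAI's training setup, and the symbols ($\theta$ for the parameters, $x_t$ for the token at position $t$, $T$ for the sequence length) are notation introduced here.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

On any single training sequence the loss is lowest when the model puts probability 1 on each actual next token, which is the sense in which exact reproduction is what the objective rewards.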

peyton · 2y ago

Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.

Suppose I’m selling subscriptions to the New Jersey Times, a site which simply downloads New York Times articles and passes them through an autoencoder with some random noise. It serves the exact same purpose as the New York Times website, except I make the money. Is that fair use?

hn_acker · 2y ago

> This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring.

It's the other way around. There is no infringement if the model output is not substantially similar to a work in the training set [1]:

> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.

The questions are, which parties should bear liability when the model creates infringing outputs, and how should that liability be split among the parties? Given that getting an infringing output likely requires the prompt to reference an existing work (which is what's happening in the article), an author of a work, an element in an existing work, or a characteristic/style strongly associated with certain works/authors, I believe that the user who makes the prompt should bear most of the liability should the user choose to publish an infringing output in a way that doesn't fall under fair use. (AI companies should not be publishing model outputs by default.)

[1] https://en.wikipedia.org/wiki/Substantial_similarity#Substan...

kromem · 2y ago

This isn't an issue with training, it's an issue with usage.

Production open access LLMs do probably need a front-end filter with a fine tuned RAG model that identifies and prevents spitting out copyrighted material. I fully support this.

But we shouldn't be preventing the development of a technology that in 99.99% of use cases isn't doing that and can be used for everything from diagnosing medical issues to letting coma patients communicate with an EEG to improving self-driving car algorithms because some random content producer's works were a drop in the ocean of content used to learn relationships between words and concepts.

The edge cases where a model is rarely capable of reproducing training data don't reflect infringement of training but of use. If a writer learns to write well from a source is that infringement? Or is it when they then write exactly what was in the source that it becomes infringement?

Additionally, now that we can use LLMs to read brain scans and have been moving towards biological computing, should we start to consider copying of material to the hippocampus a violation of the DMCA?

bertil · 2y ago

Adding an extra constraint of no copying verbatim from a very large and relevant corpus will be hard to guarantee without enormous databases of copyrighted content (which might not be legal to hold) and add an extra objective to a system with many often contradictory goals. I don’t think that’s the technology-sound solution or one in the interest of anyone involved. It’s much more relevant to license content from as many newspapers as possible, recognize when references are relevant, and quote them either explicitly verbatim if that’s the best answer or adapt (translate, simplify, add context) when appropriate.

I feel like the NYTimes is asking for deletion as a negotiation tactic to force OpenAI to give them enough money to pay for their journalism (I am not sure who would subscribe to NYTimes if you can get as much through OpenAI, but I am open to registering extra to pay for their work).

joe_the_user · 2y ago

Any lawsuit makes all the claims it can and demands every sort of relief it might plausibly have. That's not to say that's how it should be (it can have awful results), just to say that's what to expect (and hope the court considers only the reasonable claim, "stop freely sharing our data", and avoids the ridiculous, anti-fair-use claim, "you can't even store our data").

The thing about your claim, "Just learn to recognize and punish plagiarism via RLHF", is that we've had an endless series of prompt exploits as well as unprompted leakage, and these demonstrate that an LLM just doesn't have a fixed border between its training data and its output. This will make it basically impossible for OpenAI to say "we can logically guarantee ChatGPT won't serve your data freely to anyone".

intrasight · 2y ago

Yeah, no - that proposal is no good. The correct solution is to have machine learning be more like human intelligence. You can't ask me to plagiarize a New York Times article. Not because of prompt rule violation but because I just can't. It's not how humans train (at least most).

JumpCrisscross · 2y ago

> Just learn to recognize and punish plagiarism via RLHF

OpenAI has created a $100bn company on this transfer. The Times may have an interest in a material fraction of that wealth.

amadeuspagel · 2y ago

> The suit demonstrates instances where ChatGPT / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use". However, OAI/MS should be able to fix this within the current paradigm: Just learn to recognize and punish plagiarism via RLHF.

Isn't that in tension with the basic idea of an LLM of predicting the next token? How do you achieve that while never getting close enough to plagiarism?

spacecadet · 2y ago

Transformations are happening. Maybe if the output is verbatim afterwards, then that says something about the output's originality all along... or am I a troll?

munchinator · 2y ago · 13 in thread

It's interesting to me the ambiguous attitude people have to reproducing news content. Whenever there is a story from NYT on HN (or any other large media outlet), the top comment is almost always a link to an archived version which reproduces the text verbatim.

And this seems to be tolerated as the norm. And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.

I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP.

And the reason I bring this up, is that it seems like Open AI has the same attitude: scraping news articles is OK, or at worst a gray area, but what if they were also scraping, for example, Netflix content to use as part of their training set?

tzs · 2y ago

> I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP

As you noted it is not the norm to post pirate links here for IP other than news articles, but that doesn't mean that a lot of people think it is not OK to pirate those other forms of IP.

In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").

The impression I've gotten is that piracy of nearly everything is widely felt to be OK here. Information wants to be free, yada yada.

About the only piracy that is consistently frowned upon here is piracy of open source software. When some company sells an embedded device that uses GPL code without releasing the corresponding source that's viewed as just a little short of a crime against humanity.

unyttigfjelltol · 2y ago

Historically newspapers leaned more on competition law than copyright, because their pages are supposed to be filled with non-copyrightable facts.[1] Copying part, but not all, of a factual article, significantly after the relevant event, was considered to be a promotion (not unfair competition) and a nice thing to do for the journalists. Things change, people lose sight of the original principles.

[1] https://en.m.wikipedia.org/wiki/International_News_Service_v...

cesarb · 2y ago

> Whenever there is a story from NYT on HN (or any other large media outlet), the top comment is almost always a link to an archived version which reproduces the text verbatim. [...] And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.

If the story was linking directly to the "book, TV show, movie, video game, album, comic book, etc", and the link only worked for some people while others randomly got a login request or similar, you'd also see the top comment being a link to an archived version which avoids the login screen. That is: the main difference is that the archive link has the exact same content as the link submitted in the story, only bypassing the login screen that some people see. And the only reason the archive site has the content is that it didn't get the login screen; if everyone always got the login screen, what you would see on the archive site would be the same login screen.

Erratic6576 · 2y ago

I find “4nn4’$ 4rch1v3 dot ORG” actually way better than pirate bay for pirating knowledge.

It’s amazing the amount of books that copyright laws prevent us from finding

https://www.theatlantic.com/technology/archive/2012/03/the-m...

lexicality · 2y ago

Funny, I don't see it as a moral thing but more a "what can you get away with" thing.

I fully assume that if I was to post a magnet link to a torrent for whatever the link was about, I would be banned.

Morally speaking, I think it's perfectly reasonable to download a copy of something and either read the relevant info for my current task or to sample it to decide if I want to buy it. I see it no different to using the library or browsing at a book store.

Perhaps once news organisations can work out how to effectively wield the DMCA hammer against archive links we'll see the practice of posting them stop.

bnralt · 2y ago

This tendency at Hacker News is also much more of a threat to The New York Times than what Open AI is doing. So are places like blogs/Reddit/social media submissions that summarize the article and post the relevant quotes. Unlike the summary of a movie, summarizing all of the relevant parts of a news article is extracting almost all the value from it, and giving it away for free.

And the vast majority of people read news for its breaking content, not for its archived content from years before (and I say this as someone who has often recommended the latter, but has gotten very few people to do so). So giving people that free breaking content (either in its entirety like on Hacker News, or summaries like you see all over social media) is actually a direct competition to the news business in a way that training an LLM on an article from months/years back isn't.

Germont · 2y ago

To me, there is a sense that the news, which is real information about the society that we currently live in, should be available to all participants of that society. The notion of being a good citizen requires that one stays informed. Books, movies, videogames etc. don't have that role and are more consumption goods.

Yizahi · 2y ago

Good comment, it was very funny to see how people desperately try to find moral justification for pirating media A but not B. "It's apples to oranges, you see, there are less letters in the NYT article than in the book and they are rendered differently, so it is ok to pirate their work. I did nothing wrong!" :)

jalapenos · 2y ago

If ChatGPT is based on neural networks, with no actual save-and-replicate facsimile behaviour, it no more "copies" original work than I do when I tell you about the news article I read today.

I'd say the only real reason the Piratebay links thing you mentioned is not the norm is purely because those media sources have done a better job of striking fear into people doing that, so it's gone more underground. I.e. they're better terrorists.

There's no fundamental, moral reason why Piratebay links being posted and raised to the top would be wrong.

perihelions · 2y ago

If it takes 120 seconds to read a newspaper article, the archive.is workflow is a significant overhead over that, a significant friction. Those links are a courtesy to other HN readers. This is very different from the economics of buying and reading a book.

"Piracy is almost always a service problem and not a pricing problem."

edit: It didn't even occur to me to compare the time-cost of "just pay for the article", but: last I read, it's half an hour of work to cancel a New York Times subscription [0]. So, that option's not even on the table.

[0] https://news.ycombinator.com/item?id=26174269 ("Before buying a NYT subscription, here's what it'll take to cancel it", 812 comments)

iinnPP · 2y ago

The archive link doesn't threaten their jobs and helps them avoid paying for NYT. It's NIMBY, or rather its true form, NIIIM (Not If It Impacts Me).

Hypocrites are EVERYWHERE and are the majority.

DennisP · 2y ago

I wouldn't say OpenAI has exactly the same attitude, since they also pulled in thousands of books. Their position has been that it's not piracy, since they don't republish the books; effectively the AI just reads them and learns from them. If GPT can be made to reproduce the original articles, that's a more difficult argument to make.

octacat · 2y ago

At least people do not obscure who is the original author of the content (so, if people like NYT articles - they could go and subscribe for more). Kinda "free advertising" (which still hurts the publisher in many cases, though). Same with search engines - as long as engine brings clicks - people are happy. If search engine just grabs the info and never redirects the user to the site - what is the point for the site to exist to begin with?

a_wild_dandan · 2y ago · 10 in thread

The NYT is preparing for a tsunami by building a sandcastle. Big picture, this suit won’t matter, for so many reasons. To enumerate a few:

1. Next gen LLMs will be trained exclusively on “synthetic”/public data. GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct (say reworded by 40%, authors/sources stripped, etc). Ergo there will be no copyrighted material for GPT-5 to regurgitate.

2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.

3. Models can share weights, merge together, cooperate, ablate, evolve over many generations (releases), etc. Copyright law is woefully ill equipped to handle chasing down violators in this AI lineage soup, annealed with data of dubious/unknown provenance.

I could go on, but the point is that, for better or worse, we live in a new intellectual era. The NYT et al are coming along for the ride, whether they like it or not.

mat0 · 2y ago

I'm sorry but this is such a bad take. Nice appeal to consequences. In my view, the New York Times is entirely justified in pursuing legal action. They invested time and effort in creating content, only to have it used without permission for monetary gain. A clear violation.

Analyzing the factors involved for a "fair use" consideration:

Purpose and Character of the Use: While the argument for transformation might hold in the future as you point out, the current dispute revolves around verbatim use. So clearly not transformative. Also, commercial use is less likely to be ruled fair use.

Nature of the Copyrighted Work: Using works that are more factual may be more likely to be considered fair use, but I would argue that NYT articles are as creative as factual.

Amount and Substantiality of the Portion Used: In this case, the entirety of the articles was used, leaving no room for a claim of using an insignificant portion.

Effect on the Market Value: NYT isn't getting any money from this, and it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.

IANAL, but in my opinion NYT is well within its rights to pursue legal action. Progress is inevitable, but as humans, we must actively shape and guide it. Otherwise it cannot be called progress. In this context, legal action serves as a necessary means for individuals and organizations to assert their rights and influence its course.

ciabattabread2y ago

> rent seeking media companies

Rent seeking? Media companies that actually create content are rent seeking? Versus the garbage hallucinations AI creates?

nozzlegear2y ago

> 2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.

Sorry, is this the same China that has already introduced their own sweeping regulations on AI? Which in at least one instance forced a Chinese startup to shut down their newly launched chatbot because it said things that didn't align with the party's official stance on the war in Ukraine?

https://finance.yahoo.com/news/beijing-tries-regulate-china-...

https://nitter.unixfox.eu/CDT/status/1625936306814717952?337...

I don't disagree that research/hosting/progress will continue, but I'm not so sure that it's China who stands to benefit from the US adding some guardrails to this rollercoaster.

truculent2y ago

Are media really rent-seeking? They create new content and analysis, for which they want to be compensated. It seems quite different to hoarding natural resources or land, for example.

Vicinity96352y ago

Rule 1 of the Internet: If you put it on the Internet, it's not yours anymore.

You don't have to agree with it. You don't have to like it. But if you accept it and live by it, it's much harder to get burned.

DalasNoin2y ago

About your first point: you can't possibly know that future models will be trained exclusively on synthetic data without any hit to performance. It is also not easy to reword the entire copyrighted training corpus without introducing errors or hallucinations. And you assume that this is just a fact?

Your second point reminds me a bit of 'War with the Newts' where humanity arms a race of sentient salamanders until they overthrow humanity. How could we not arm our newts if Germany might be arming theirs?

I also think basically everything else you wrote is wrong.

notahacker2y ago

If Microsoft doesn't get royalty free rights to resell access to everyone's content on demand, China will become the powerhouse of interference-free media? Rrrrrright....

yieldcrv2y ago

I think it can be simultaneously true that NYT is accurate in their complaint, that there is no legal remedy for it, and that there shouldn’t be one.

There are plenty of large companies in other sectors that acknowledge there are limited legal remedies for them if someone copies some aspect of their business or name.

maxlin2y ago

This is the actual truth. Where it falls short is in citing the data, but GPT-4 doesn't do that to start with unless it's directly from a web result and not the weights.

bonzini2y ago

> GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct

Is that just by increasing the temperature, tweaking the prompt, etc.? If you can operate on the raw weights and recreate the original text, copyright infringement still applies.

fasterik2y ago· 5 in thread

I've been arguing since ChatGPT came out that LLMs should fall under fair use as a "transformative work". I'm not a lawyer and this is just my non-expert opinion, but it will be interesting to see what the legal system has to say about this.

mynegation2y ago

Suit claims that GPT reproduced passages from NYT almost verbatim.

dahart2y ago

This seems like a reasonable opinion when you think about the training data size and imagine that any given output is some kind of interpolation of some unknown large number of training examples all from different people. If it’s borrowing snippets from tens or hundreds or thousands of sources, then whose copyrights are being violated? Remixing in music seems to be withstanding some amount of legal scrutiny, as long as the remix is borrowing from multiple sources and the music is clearly different and original.

It gets harder to stand behind a blanket claim that LLMs or any AI we’ve got falls under fair use when they keep repeatedly reproducing complete and identifiable individual works and clearly violating copyright laws in specific instances. The models might be remixing and/or transformative most of the time, but we have proof that they don’t do that every time nor all the time… yet. Maybe the lawsuits will be the impetus we need to fix the AIs so they don’t reproduce specific works, and thus make the fair use claim solid and actually defensible?

ramesh312y ago

It's inevitable that this question ends up at the Supreme Court. And the sooner the better IMO. It's clearly fair use. Generative agents will be seen legally as no different than a human artist leveraging the summation of their influences to create a new work.

agentgumshoe2y ago

What if I ask ChatGPT to print the article verbatim as sourced, from its own dataset?

neop1x2y ago

Paywalled content as well?

andy992y ago· 5 in thread

This, or a lawsuit like it is going to be the SCO vs IBM of the 2020's, to wit: a copyright troll trying to extract rent, with various special interests cheering it on to try and promote their own agenda (ironically it was Microsoft that played that role with SCO). It's funny how times have changed and at least now a louder group seem to be on the troll's side. I hope to see some better analysis on the frivolity of this come out. There may be some commercial subtlety in specific cases that doesn't depend on scraping and training, but fundamentally using public internet data for training is not copying, is fair use, and is better for society as a whole than whatever ridiculous alternative might be proposed.

edit: I'm speaking about training broadly capable foundation models like GPTn. It would of course be possible to build a model that only parrots copyrighted content and it would be hard to argue that is fair use.

jamiek882y ago

Using the words troll and frivolous undermines your otherwise decent point and in fact goes against your point.

The law isn’t settled, it’s a genuine legal question mark.

It ain’t frivolous or trolling or ridiculous.

phkahler2y ago

>> There may be some commercial subtlety in specific cases that doesn't depend on scraping and training

The key is to stop calling it "training" and use "learning" or just "reading".

The argument from NYT will probably be that LLMs are just a fancy way to compress or abstract information and spit it back out. In which case "training" seems to support their case?

logicchains2y ago

>It's funny how times have changed and at least now a louder group seem to be on the troll's side

Because for many people, their views on current events are whatever the "thought leaders" working for the NYT and similar publications tell them to think.

xbar2y ago

That is an irrelevant comparison.

This is theft and monstrous profit from theft. For actual justice this should be a class action suit of the world vs. OpenAI/Microsoft and the financial consequences should be company-ending for OpenAI. Otherwise, you have incented everyone in the AI industry to steal as much as they can for as long as they can.

dannyr2y ago

NYTimes has a paywall. Is that public internet and therefore fair use?

habosa2y ago· 4 in thread

People who think the examples in the lawsuit are “fair use” need to consider what that would mean. We’re basically going to let a few companies consolidate all the value on the Internet into their black boxes with basically no rules … that seems very dangerous to me.

I hope a court establishes some rules of engagement here, even if it’s not this case.

serjester2y ago

I see the exact opposite - any open source model is going to become prohibitively expensive to train if quality data costs billions of dollars. We’re going to be left with the OpenAIs and Googles of the world as the only players in the space until someone solves synthetic data.

w4ffl352y ago

Scraping is legal, and this seems like a transformative work to me.

bart_spoon2y ago

The opposite is also concerning. IP law has always been convoluted, messy, contradictory, and morally ambiguous. The complaints of IP violation by LLMs are simply taking these inherent flaws and making them immediately obvious, forcing decisions that ultimately will set precedents on the legality of human thought that I don’t think anyone will be comfortable with. People understandably see OpenAI and Microsoft as potentially dangerous to be given so much leeway, but fail to consider on the flip side companies like Disney who have already more or less dictated the majority of copyright law for decades now. They must be chomping at the bit at the legal precedents potentially coming down the pipeline that call into question the ability to interact with any kind of media or information at any level without potentially being on the hook monetarily.

I think all this is doing is making us realize that we have built a massive economic system on a fundamentally flawed idea of ownership over ideas, and the only two solutions will be to tear up the rule book, which will be extremely painful, or double down, which will be fatal.

stainablesteel2y ago

a court has established this already

in japan, where they said anything goes for ai

so it's best not to lose a competitive edge with things that people openly publish on the internet, if you put it out there for everyone to see then expect other people to use it

logicchains2y ago· 4 in thread

NYT's perspective is going to look so stupid in future when we put LLMs into mechanical bodies with the ability to interact with the physical world, and to learn/update their weights live. It would make it completely illegal for such a robot to read/watch/listen to any copyrighted material; no watching TV, no reading library books, no browsing the internet, because in doing so it could memorise some copyrighted content.

type_Ben_struct2y ago

I disagree. The verbatim part is the problem. You’re drawing a comparison to how humans operate except we’re not allowed to operate like that.

While harder to do as a human, if I memorised a copyrighted book and then did a live reading on TV, or produced replicas from memory and sold them (the most comparable example), I’d be sued.

Humans produce derivative work all the time, and it’s fine for LLMs to do that, but you can’t do it verbatim.

makeitdouble2y ago

Memorising isn't the issue. It's providing it back verbatim and/or cutting access to the source.

You'd get the same problem with someone with a photographic memory who a group of people would turn to recite them the news instead of buying the newspaper.

As of now public performance of copyrighted material is infringement.

CJefferson2y ago

Are those LLMs independent citizens we are going to give rights to? Then I'm fine with that.

Are they all owned by one mega-corporation, which is going to do as capitalism does, and use them to squeeze money out of all of us? Then I'm happy to ban them.

ramraj072y ago

Will it? If the LLM in the body is allowed to read nytimes on a tablet I'm sure they wouldn't care.

jrockway2y ago· 4 in thread

I read about this in the Times today (and am surprised that it wasn't on HN already).

My guess is that the court will likely find in the Times' favor, because the legal system won't be able to understand how training works and because people are "scared" of AI. To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is. I might say something like "I, for one, welcome our new LLM overlords". Am I infringing the copyright of The Simpsons? No.

I am guessing some technicality like a terms-of-use violation of the website (avoidable if you go to the library and type in back issues of the Times), or storing the text between training sessions is what will do OpenAI in here. The legal system has never been particularly comfortable with how computers work; for example, the only reason EULAs work is because you "copy" software when your OS reads the program off of disk into memory (and from memory into cache, and from cache into registers). That would be copyright infringement according to courts, so you have to agree to a license to get that permission.

I think the precedent on copyright law is way off base, granting too much power to authors and too little to users. But because it's so favorable towards "rightsholders", I expect the Times to prevail here.

SilverBirch2y ago

It's extremely speculative to claim that LLMs are basically doing what humans do. There is very clearly something that isn't right about that, because in order to learn to speak and converse a human doesn't need to imbibe the entire corpus of all written text in human history, which is basically what we're doing with these LLMs. What we're giving them is vast amounts of data, which is totally unlike how humans work. There's very clearly some gap here between what an LLM is doing and what a human is doing. So you can't use that as a basis to justify why it's ok for OpenAI to operate like this.

To put it another way, let's say I turn the dial all the way the other way: I train the world's crappest LLM on NYT material, it massively, massively overfits, and all it will ever return is verbatim snippets of the NYT. Is that copyright infringement?
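The "massively overfit" extreme can be sketched with a toy model. This is not how GPT-style models are trained; it is a word-level n-gram table, and the `ARTICLE` text, `train` and `generate` are hypothetical names invented for illustration. Trained on a single document, such a model has nothing to generalise from, so a short prompt makes it regurgitate the source word for word:

```python
from collections import defaultdict

def train(text, order=3):
    """Map each run of `order` words to the words seen right after it."""
    words = text.split()
    table = defaultdict(list)
    for i in range(len(words) - order):
        table[tuple(words[i:i + order])].append(words[i + order])
    return table

def generate(table, prompt, max_words=50):
    """Greedily continue `prompt`, always taking the first recorded next word."""
    out = prompt.split()
    order = len(next(iter(table)))  # key length used at training time
    while len(out) < max_words:
        followers = table.get(tuple(out[-order:]))
        if not followers:
            break  # context never seen in training: nothing left to recall
        out.append(followers[0])
    return " ".join(out)

# Hypothetical stand-in for one article; not real NYT text.
ARTICLE = ("the council voted on tuesday to approve the new transit plan "
           "after a long and heated public hearing")

model = train(ARTICLE)
completion = generate(model, "the council voted")
print(completion == ARTICLE)
```

Adding more, and more varied, training text gives each context several possible followers, which is roughly where output stops being a copy and starts being a blend.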

The core part of the argument here is actually just that OpenAI doesn't want to adhere to what the current standard is for using copyrighted material: if you want to use it and create something new with it, you need to license the material. Since OpenAI's LLM isn't actually like a human, it would need to license such a vast dataset that running the business would be uneconomical without stealing all the content.

hsbauauvhabzb2y ago

My hard drive can - bit for bit - recall video files. If I serve them to other people on the internet without permission of the copyright holder, that’s called piracy.

tsimionescu2y ago

In general, if you perform copyrighted works you are doing copyright infringement. There are certain exceptions (personal use, education, very small fragments with proper attribution, maybe a few others) but whether you are reading it aloud from a book or performing it from memory makes no difference.

So, if you set up a service like ChatGPT but powered by humans responding in real time to queries, and these humans occasionally reproduced large chunks of NYT articles, they and the service itself would be liable for copyright infringement. Even if they were all reproducing these from memory.

Now, this is somewhat different from the discussion of whether training the model on the copyrighted data, even if it had effective protections from returning copies of it, constitutes copyright infringement in itself. I believe this is a somewhat novel legal question and I can think of no direct corollaries.

I certainly don't think we can just handwave and say "at some level, when a human reads a copyrighted work, they are doing the same thing", because we really don't know if that is true. Artificial neural networks certainly have no direct similarity with the neural networks in the brain as far as we can tell. And, even if they did, there is no reason to give a machine the same rights that a human has - certainly not until that machine can prove sentience.

pests2y ago

I don't agree that an LLM is doing what we are doing.

"It's what we do all the time" is a major assumption.

throwaway4good2y ago· 3 in thread

The lawsuit itself (which arstechnica links to):

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

From page 30 onwards, it has some fairly clear examples of how ChatGPT has an (internal) copy of copyrighted material which it will recite verbatim.

Essentially if you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it. How destructive would that compression have to be for the copyright no longer to hold? My guess is it would have to be a lot.

As I see it, the closedness of OpenAI may be what saves it. OpenAI could filter and block copyrighted material from the LLM from leaving the web interface using some straightforward matching mechanism against the copyrighted part of the data set ChatGPT has been trained on. Whereas open source projects trained on the same data set would be left with the much harder task of removing the copyrighted material from the LLM itself.
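One way such a "straightforward matching mechanism" might look is a word n-gram overlap check on the model's output before it leaves the service. This is only a sketch under assumptions: nothing here describes what OpenAI actually does, the 8-word window is an arbitrary choice, and `build_index`, `should_block` and the sample texts are hypothetical.

```python
def ngrams(text, n):
    """Set of all runs of n consecutive lowercased words in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(protected_docs, n=8):
    """Collect every n-gram that appears in any protected document."""
    index = set()
    for doc in protected_docs:
        index |= ngrams(doc, n)
    return index

def should_block(output, index, n=8):
    """True if the output shares at least one n-gram with the protected set."""
    return not ngrams(output, n).isdisjoint(index)

# Hypothetical protected text, not a real article.
PROTECTED = ["the quick brown fox jumps over the lazy dog "
             "near the quiet river bank at dawn"]
index = build_index(PROTECTED)

verbatim = "As reported: the quick brown fox jumps over the lazy dog near the quiet river"
paraphrase = "a fast fox leapt over a sleepy dog by the river this morning"

print(should_block(verbatim, index))    # long verbatim run is caught
print(should_block(paraphrase, index))  # reworded text passes
```

A filter like this only works at the serving boundary, which is the asymmetry the comment points at: anyone holding the raw weights can skip it.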

vanchor32y ago

> Essentially if you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it. How destructive would that compression have to be for the copyright no longer to hold? My guess is it would have to be a lot.

I imagine the goal is closer to "enough that no one notices we stole it", either in a way that it's not easily discoverable or even when directly analyzed there's enough plausible deniability to scrape by.

Jackson__2y ago

I wonder how they got these results, seeing as they are not showing any of the usual UIs (i.e. ChatGPT/Copilot).

It makes it difficult for me to ascertain whether it is repeating from its training data, or whether they committed the same mistake as the OP article of using Copilot, which ends up googling (binging?) the article first, before replying.

jprete2y ago

The answer to the "closedness" is externally controlled audits.

wouldbecouldbe2y ago· 3 in thread

We developers like to pretend that LLMs are akin to humans and that they've been using things like NYTimes like humans as educational material.

But they are not. It's much simpler, proprietary writing is now integrated into the source code of OpenAI, it would be as if I would copy parts of other proprietary code and copy paste it into my own codebase, while claiming that copy paste is a natural process evolved over millions of years.

The fact that LLMs are so complicated and we don't know where it is, doesn't make it less so.

logicchains2y ago

> it would be as if I would copy parts of other proprietary code and copy paste it into my own codebase.

It's not copy-pasted; it's compressed in a lossy manner. Even GPT4 has nowhere near enough memory to store the entirety of its training data in a non-lossy compression format. Just like how humans compress the information we read.

gumballindie2y ago

> We developers pretend that LLM's are akin to humans and that they've been educational material.

Developers thinking LLMs are akin to humans aren't the brightest crop, and are usually a topic of ridicule.

mikewarot2y ago

>It's much simpler, proprietary writing is now integrated into the source code of OpenAI

The source code of the LLM is likely a few hundred lines of text describing the shape of the neural networks involved in the model.

None of the NYTimes content will be in the source code. NYTimes doesn't publish Python source code, it publishes human language news.

LLMs are conceptually simple, mostly matrix multiplications and some non-linear operations connecting each layer, in some loops based on attention, etc. It's the staggering amount of training data and compute that makes them complex.
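To make "matrix multiplications and some non-linear operations ... based on attention" concrete, here is a single attention head in plain Python. It is a sketch, not any real model's code: it leaves out causal masking, multiple heads, layer normalisation and the feed-forward layers, and the identity weight matrices are made-up placeholders where a trained model would have learned values.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """The non-linear step: turn raw scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, wq, wk, wv):
    """One unmasked attention head: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

# Three made-up 2-dimensional token vectors and placeholder weights.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

out, weights = attention(x, identity, identity, identity)
print(out)
```

The program text is the same whatever the model was trained on; everything it "knows" lives in the numbers passed as `wq`, `wk` and `wv`.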

aurareturn2y ago· 3 in thread

Companies that have content all see dollar signs.

NYT won't mind if you use their content to train LLMs - as long as they get a commission. Reddit will shut down their free API and make you pay to get training content. Discord is going to be selling content for AI training too - if they haven't already done so. Twitter is doing it.

They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.

mvdtnz2y ago

NYT do not "have" content, they create content. It's their raison d'etre.

up2isomorphism2y ago

"They" also include the people working there. Why should someone who works full time writing articles give that work away for free, just to let someone else train on it and make money from it?

MuffinFlavored2y ago

> They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.

Can you make the argument this was their fault for not having forward vision/being asleep at the wheel and "accidentally, in hindsight" letting OpenAI/others have free, open, unlimited access to their content?

tarruda2y ago· 3 in thread

Would be funny if NY Times won this and all commercial LLMs were shut down.

Then LLMs would be distributed only via torrents, like most copyright infringing media.

realusername2y ago

They would still thrive but in other countries with other legal frameworks. The concept is way too valuable to disappear.

sgt1012y ago

What will happen in this case is that large content providers will get paid directly and smaller content providers will get rolled up into a licensing bag and get small indirect payouts. For example, we might see a model where people whose books have been used will get a payout proportionate to the sales of the book (perhaps), so if your book sells just a few thousand copies expect $20, but if you sell millions expect $20k.

LLMs will become more expensive and less attractive as money printers. This will screw with the business models of the direct provision folks like OpenAI, MS and Google. MS and Google will only shed tears for money spent, while OpenAI will just not have as good an income stream until they think of something new.

__loam2y ago

Making these things anathema to commercial interests and making training them at scale legally perilous would be a huge win.

munchinator2y ago· 3 in thread

Why hasn't the Times also sued the Internet Archive? They've tried to block both the Internet Archive [1] and Open AI [2] from archiving their site, but why have they only sued OAI and not IA? The fact that they haven't sued IA which has comparatively little money would seem to indicate that this is not about fair use per se, but simply about profit-seeking and the NYT is selecting targets with deep pockets like OAI/MS.

[1] https://theintercept.com/2023/09/17/new-york-times-website-i...

[2] https://fortune.com/2023/08/25/major-media-organizations-are...

Shrezzing2y ago

Copyright doesn't stop the collection of content, it stops the copying, processing, & redistribution of content. Internet Archive acts as a library, so it's widely accepted as fair use when it makes collections of webpages available.

OpenAI's distribution is materially different to that of a library, so it's not a like-for-like comparison.

One of the main tests of copyright law (at least in the US) is if the entity distributing is _selling_ the copied/derivative work. It's unambiguous that OpenAI is selling something akin to derivative works, which is why NYT feels they can go after this claim. Meanwhile IA's operations don't create sales or generate profits, so while NYT's legal team may be able to establish that copies have been distributed, without the _sale_ aspect of the infringement, judges aren't guaranteed to side with NYT in a legally expensive PR nightmare.

killingtime742y ago

What's wrong with that? If I were the NY Times' lawyers, that's what I would advise. What would it serve to bankrupt the IA, when they can't pay anyway? These are corporations enforcing their rights against one another.

There is nothing wrong with profit seeking from your copyright. That's literally their entire business model...they publish copyrighted content which they sell for a subscription.

OpenAI and others could easily have negotiated a licence instead of just using the data. They bet that it would be cheaper to be sued, let's find out if they bet correctly.

Tangentially that's what Apple did with the sensor in their watch, it doesn't always pay off.

sgt1012y ago

I think that the moment you start making big money from someone else's business is the moment that they get riled. That and when you really hurt their business. I suspect that the NYtimes thinks that IA is damaging them in the order of (possibly) $100k pa, and that it thinks that OpenAI is making in the order of $10M's from their content (and possibly doing some damage as well). It's an easy commercial decision to ignore one and go after the other - especially as going after IA is going to create some backlash as well.

bdd8f1df777b2y ago· 2 in thread

I see few people here bring this up, so let me:

The US constitution says, The Congress shall have Power

> To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;

So the Congress's power to make copyright and patent laws is predicated on promotion of science and useful arts (I believe this actually means technology). In a sense, OpenAI being at the forefront of our AI technology advancement is crucial to the equation. To hinder the progress by copyright is, in my mind, unconstitutional.

madeofpalk2y ago

Wishful thinking. Just as equally, NYT’s right to copyright over its material, in order to have a functional press, is enshrined in the constitution. Anyone threatening that copyright could be unconstitutional.

I think we all agree that no one is entitled to “progress of science” at any cost - as a straw man, killing hundreds of newborn babies for scientific research is not great - so we use ethics and the legal system to find the line of what’s acceptable.

I don’t know exactly what NYT is asking for here, but the two options aren’t unconsented training vs nothing at all. NYT could license, for a fee, its content to OpenAI. It’s pretty common for scientists to have to pay for materials!

globular-toast2y ago

Current AI is useless without people writing the articles in the first place.

JackFr2y ago· 2 in thread

I think LLMs may really change the IP landscape.

Culturally we’re taught that there is a moral component to copyright and patent law - that stealing is stealing. But the idea that words or thoughts or images can be owned (and that the might of the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era. Copyright and patent laws exist for practical, pragmatic reasons - and seemingly they have served us well, but it’s not unreasonable to re-examine them from first principles.

flanked-evergl2y ago

> But the idea that words or thoughts or images can be owned (and that the might of the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era.

Is there any research into how people from earlier eras thought about it? And should all laws that seemed ludicrous to someone from an earlier era be discarded? If not, how exactly do we determine the relevance of what someone from an earlier era would think about our laws?

narenkeshav2y ago

I remember a case where the court did not allow ID to patent "First person shooters"

This rings similar.

strangus2y ago· 2 in thread

Next up, Microsoft acquires the New York Times forming MSNYT

playingalong2y ago

... New Roman

unsupp0rted2y ago

This is not impossible, and perhaps not even unlikely

biglyburrito2y ago· 2 in thread

TLDR:

"The suit seeks nothing less than the erasure of both any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets that were used for the training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity.""

chongli2y ago

This is what lawyers are paid for. They ask for the max because there’s no harm in doing so. Everyone knows there’s little meaning to that.

downWidOutaFite2y ago

Wow they want to kill it. I wonder if we've just lived through the golden Napster era of LLMs.

bigmattystyles2y ago· 2 in thread

Not that it would solve this, but how hard would it be for ChatGPT or other programs to cite the sources used in a response? Is that difficult to capture and tag to 'knowledge' within an LLM? It could be a best of both worlds type situation if LLMs cited sources and linked to the source itself. Isn't that what happened with Google News's home page? I seem to recall that when Google took it away in some markets, at the behest of the news orgs, they quickly reversed course as their traffic plummeted.

qznc2y ago

This is not possible. There is no database of sources inside an LLM. Just like the knowledge in your brain does not have sources attached.

For an example, you referenced "what happened with Google News's home page". Could you give me your source? You could probably search for some suitable article for a reference, but you don't know a source from your memory.

jejeyyy772y ago

not likely with the way these models have been trained - it's basically broken down into sub-words that are all mashed together into probabilities.

Kon-Peki2y ago· 1 in thread

> Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use.

This is interesting. The NYT is specifically saying that the way you use an LLM impacts what you can legally use for training the LLM. They're firing shots at the big guys trying to sell access to an LLM, but not at the little guy self-hosting for fun or academics doing research.

lesuorac2y ago

They're probably saying that because it's what the Supreme Court said, except about a human copying a work created by another human.

https://www.npr.org/2023/05/18/1176881182/supreme-court-side...

ssijak2y ago· 1 in thread

If I create a news website where I write articles in the following way:

- Read 20 different news websites and their story on the same event/topic

- Wait an hour, grab a cup of coffee

- Sit down to write my article; from this point on I never open any of the 20 news websites, I write the story from my head

- I don't consult any other source, just write from my memory, and my memory is, let's say, not the best one, so I will never write more than 10 words exactly as they appear on any of the 20 websites.

- I will probably also write something that is not correct or add something new because, as I said, my memory is not the best.

Is that fair use? Am I infringing on copyright?

gumballindie2y ago

If you are a piece of software then yes.

elif2y ago· 1 in thread

At some point the burden of carrying 100-year-old copyright/patent law will become so onerous for the pace of progress that its enforcement will be antihuman.

ryukoposting2y ago

It already is, but I don't think this is a good example. NYT has a legitimate case here. They own the material they publish, and GPT-4 is shown to be able to recall entire articles verbatim. That's a violation, clear as day.

The thing about lawsuits is that you make dozens of claims, and the court can rule in favor of some of them, and against others. The question of "is LLM training fair use?" hasn't made it to a high court yet. The court could very easily rule against everything else in the suit.

chris_wot2y ago· 1 in thread

Fair use is something Wikipedians dance around a fair amount. It also meant I did a lot of reading about it.

It’s a four part test. Let’s examine it thusly:

1. Transformative. Is it? It spits out informative text and opinion. The only “transformation” is that it’s generative text. IMO that’s a fail.

2. Nature of the work - it’s being used commercially. Given it’s being trained partially on editorial, that’s creative enough that I think any judge would find it problematic. Fail on this criterion.

3. Amount. It looks like they trained the model on all of the NYT articles. Oops, definite fail.

4. Effect on the market. Almost certainly negative for the NYT.

IMO, OpenAI cannot successfully claim fair use.

EMIRELADERO2y ago

You're getting mixed up. When applying the four factors, you need to individually separate all the uses. So you would need to repeat the fair use test for every alleged type of infringement. This means that the scraping from the public internet to OpenAI's dataset storage cluster is one instance where the full analysis of the 4 must take place, then the training itself, so another full analysis, then the distribution of model outputs, another one, etc.

dang2y ago· 1 in thread

Related. Others?

NYT sues OpenAI, Microsoft over 'millions of articles' used to train ChatGPT - https://news.ycombinator.com/item?id=38784194 - Dec 2023 (80 comments)

The New York Times is suing OpenAI and Microsoft for copyright infringement - https://news.ycombinator.com/item?id=38781941 - Dec 2023 (837 comments)

The Times Sues OpenAI and Microsoft Over A.I.’s Use of Copyrighted Work - https://news.ycombinator.com/item?id=38781863 - Dec 2023 (11 comments)

paulddraper2y ago

New York Times Sues Microsoft and OpenAI for 'Billions' - https://news.ycombinator.com/item?id=38791368 - Dec 2023 (1 comment)

New York Times Sues Microsoft and OpenAI, Alleging Copyright Infringement - https://news.ycombinator.com/item?id=38781718 - Dec 2023 (1 comment)

New York Times sues Microsoft and OpenAI over copyright infringement - https://news.ycombinator.com/item?id=38781908 - Dec 2023 (2 comments)

New York Times sues OpenAI, Microsoft for using articles to train AI - https://news.ycombinator.com/item?id=38782510 - Dec 2023 (1 comment)

New York Times sues OpenAI, Microsoft for allegedly infringing copyrighted work - https://news.ycombinator.com/item?id=38783699 - Dec 2023 (1 comment)

New York Times sues OpenAI, Microsoft over use of its stories to train chatbots - https://news.ycombinator.com/item?id=38784914 - Dec 2023 (1 comment)

NY Times sues OpenAI, Microsoft for infringing copyrighted works - https://news.ycombinator.com/item?id=38786330 - Dec 2023 (1 comment)

NYTimes sues OpenAI, Microsoft, for copyright infringement - https://news.ycombinator.com/item?id=38790845 - Dec 2023 (1 comment)

globular-toast2y ago· 1 in thread

> To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is.

If the AI can recall the text verbatim then it's not at all the same. When we read we are not able to reproduce the book from our memory. Even if a human could memorise an entire book it's not at all practical to reproduce the book from that. The current AIs are not learning "ideas", they are learning orders of words.

contravariant2y ago

Yeah the comparison to humans is silly anthropomorphising at this point.

However I am inclined to agree with them for the simple fact that putting a file into a device and letting that device reproduce parts of the file should be allowed. I mean we're already at the point where this simple right is under pressure from DRM, but people should be allowed to do whatever they want with the files they own.

Whether you can publish this output and share it with the world is a whole different issue.

chmod6002y ago· 1 in thread

Isn't copyright tethered somehow to a notion of "expression"? That is, the same ideas and facts expressed differently are a different work?

Sure, when something is clearly derived, or just expressed in a new medium, then I'm sure it's still covered. But if it goes through an LLM and the result bears little resemblance, how can that still fall under copyright?

visarga2y ago

As you said, AI can rewrite articles, obtaining a clean-cut separation between ideas and expression. Keep the ideas, write a new text. And if you have multiple sources, the more of them you use the better, since that makes the output even more different. This approach could also check consistency and bias between sources.

fsckboy2y ago· 1 in thread

in my head I like to think of web crawler search engines/search engine databases and LLMs as being somewhat similar. Search engines are ok if they just provide snippets with citations (urls), and they would be unacceptable if they provided large block quotes that removed the need to go to the original source to read the original expression of more complex ideas.

A web-crawled LLM that lived within the same constraints would be a search engine under another name, with a slightly different presentation style. If it starts spitting out entire articles without citation, that's not acceptable.

aurareturn2y ago

I think it's different. LLMs can solve problems. Part of that problem-solving ability comes from training on completely unrelated content such as NYT articles. GPT-4 doesn't have to spit out NYT articles verbatim to have benefited from NYT articles. It uses NYT articles for every query.

mark_l_watson2y ago· 1 in thread

I think Apple has really got ahead of this game: early deals to pay for AI training data/content. I need to do some research but I think Anthropic also does this.

After a year of largely using OpenAI APIs, I am now much more into smaller “open” models, and I hope the major contributors like Meta/Facebook are following Apple’s lead. Off topic, but: even though I find the smaller “open” models much less capable, they capture my imagination and my personal research time.

efields2y ago

Not sure if they’re ahead, but I think it was smart to not ship anything LLM-like until the regulations get made and first movers test the waters.

Casey Newton has been saying all year that these things will be awesome once we can unleash them on our own corpus of data safely. “Siri” already does a great job digging through my photos and picking the good memories. I can let my camera roll become a visual junk drawer now.

Do the same for my email. Make “Find” the tool we always wanted it to be. I don’t care if I’m conflating LLMs/AI with other smart tech.

4death42y ago· 1 in thread

I think there is a national security aspect to ML models trained on copyrighted data. Countries that allow it will gain a superior technological advantage and outcompete those who disallow training on copyrighted material. I personally believe training LLMs on copyrighted data is copyright infringement if the models are deployed in a way that competes with the copyright holder. But that doesn’t necessarily mean it’s something we should disallow.

kjkjadksj2y ago

You can say the same for any legal enforcement like respecting patent or copyright law or making Champagne outside France. Yet the sky isn’t falling given this reality with so many legally protected industries. Maybe the markets where such an industry might offshore to are too small and insular to be very significant, and are probably language-bound enough to make English models less relevant compared to native-language models.

sensanaty2y ago· 1 in thread

I love seeing all the AI sycophants squirm at this news.

Here's to hoping NYT wins this one and gets everything they ask for, and more!

ugjka2y ago

I don't know if winning this will improve their business model

I don't use chat gpt to get the news but also i don't buy paywalls

frakrx2y ago· 1 in thread

Under existing conditions an AI news site seems like a good investment idea. Its AI could read all relevant news sources, retell them, and republish them in its own articles. It could even have its own AI editors and contributors. Cannot see how human news companies could compete.

logicchains2y ago

>Cannot see how human news companies could compete.

News ultimately comes from physical sources on the ground, which AI currently has no way of reaching.

ChrisArchitect2y ago· 1 in thread

[dupe]

Discussion here: https://news.ycombinator.com/item?id=38781941

kristianp2y ago

True, the Verge article was posted here earlier.

fuzzfactor2y ago· 1 in thread

What if you were one of the people who reads the Times from cover to cover every day and seriously tries to remember as much as possible because you consider it a trustworthy reference source?

And if you were called upon to solve a problem based on knowledge you consider trustworthy, what would you come up with?

What if you were even specifically directed to utilize only findings gleaned from the Times exclusively?

And what if that was your only lifetime source of information whatsoever for some reason?

jjeaff2y ago

That would of course be fine.

But then imagine that because human memory is not able to keep all that information straight, you made copies of all those newspapers.

And then you started charging people for your knowledge.

And then imagine that as part of your knowledge service, you would copy snippets from the times word for word and give that to your clients without citation and pass it off as your own.

Kim_Bruning2y ago· 1 in thread

Huh, is this a big misunderstanding?

The Copilot screenshot they gave in the Ars Technica article, as well as many of the screenshots in the NYT article, seem to be displaying correct behavior for browsing the web.

In these cases the system is more or less acting as a user agent (browser). AFAICT the NYT server actually gave that data to the user agent when it asked politely (200 OK, presumably). The user agent then displayed it to the user, which the user agent may do in any way it deems fit or appropriate.

There's only one or two cases where this has gone against the user or user agent, in very specific circumstances. The server can eg say 403 Forbidden whenever it likes, so if it returns a 200 OK, what's a user agent to do other than believe it at its word?

The only twist is that this user agent is now Imbued With AI (tm)(r)(c) . I don't think that really makes a difference here. If that's all this is, then it's more related to legal fights over certain ad-blockers or readability, which have similar functionality.

* https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20... , eg. page 45; I mean it says "Model: Web Browsing" at the top, and "Finished browsing" right on the page. That particular subsystem is now integrated, so the UI/UX is different now, but IIRC the link was in the pulldown?

Kim_Bruning2y ago

Hum, on page 33 of the same complaint they show that they can get GPT to quote parts of an article from memory. Giving people single paragraphs need not be illegal per se. Very specifically, if you ask it for more than a paragraph at a time:

> ' I'm unable to display the entire text of "Snow Fall: The Avalanche at Tunnel Creek" by John Branch, as it is a copyrighted work. However, you can easily access the full story online. It was published by The New York Times and is available on their website. The story is notable for its engaging multimedia format, including text, images, and interactive elements.'

Specifically, they go out of their way to lead GPT on, asking for several paragraphs in a row.

It's pretty clear that GPT is an avid reader of the NYT, so in that particular case we're going to have to see if OpenAI's fair use defense for training holds.

(ps. in the current GPT-4, it's actually somewhat tricky to even get to the point above at all. They have probably been improving AI instructions)

ctoth2y ago· 1 in thread

Isn't the fundamental issue here that the NYT was available in Common Crawl?

If they didn't want to share their content, why did they allow it to be scraped?

If they did want to share their content, why do they care (hint: $88 billion)?

Or is it that they wanted to share their content with Google and other search engines in order to bring in readers but now that an AI was trained on it they are angry?

What wrong thing did OpenAI do specific to using Common Crawl?

Didn't most companies use Common Crawl? Excepting Google, who had already scraped the whole damn Internet anyway and just used their search index?

Is it legal or not to scrape the web?

If I scrape the web, is it legal to train a transformer on it? Why or why not?

To me, this is an incredibly open-and-shut case. You put something on the web, people will read that something. If that is illegal, Google is illegal.

Oh, and do you see the part in the article where they are butthurt that it can reproduce the NYT style?

> "Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.

Mimics its expressive style. Oh golly the robots can write like they're smug NYT reporters now--better sue!

It appears that the NYT changed their terms of service in August to disallow their content in Common Crawl[0]. Wasn't GPT-4 trained far before August?

[0]: https://www.adweek.com/media/the-new-york-times-updates-term...

rfw3002y ago

If you read the complaint, it explains this pretty well. The use of copyrighted content by search engines is fundamentally different from the way LLMs use that same content. The former directs traffic (and therefore $$) to the publisher, the latter keeps the traffic for itself.

The legal misconception I want to flag in your logic is the notion that all uses of the Common Crawl are equally infringing/non-infringing. If you use the Common Crawl to create a list of how often every word in English appears on the internet, that’s unquestionably transformative use. But if you use it to host a mirror of the NYT website with free articles, that’s definitely infringement. The legality of scraping is one matter, and the legality of what you do with the scraped content is quite another.

starchild30012y ago· 1 in thread

I asked an LLM to summarize the 69 page lawsuit. It does a decent job. Didn't infringe on any copyrights in the process :)

Here is a summary of the key points from the legal complaint filed by The New York Times against Microsoft and OpenAI:

The New York Times filed a copyright infringement lawsuit against Microsoft and OpenAI alleging that their generative AI tools like ChatGPT and Bing Chat infringe on The Times's intellectual property rights by copying and reproducing Times content without permission to train their AI models.

The Times invests enormous resources into producing high-quality, original journalism and has over 3 million registered copyrighted works. Its business models rely on subscriptions, advertising, licensing fees, and affiliate referrals, all of which require direct traffic to NYTimes.com.

The complaint alleges Microsoft and OpenAI copied millions of Times articles, investigations, reviews, and other content on a massive scale without permission to train their AI models. The models encode and "memorize" copies of Times works which can be retrieved verbatim. Defendants' tools like ChatGPT and Bing then display this protected content publicly.

OpenAI promised to freely share its AI research when founded in 2015 but pivoted to a for-profit model in 2019. Microsoft invested billions into OpenAI and provides all its cloud computing. Their partnership built special systems to scrape and store training data sets with Times content emphasized.

The complaint includes many examples of the AI models reciting verbatim excerpts of Times articles, showing they were trained on this data. It also shows the models fabricating quotes and attributing them to the Times.

Microsoft's integration of the OpenAI models into Bing Chat and other products boosted its revenues and market value tremendously. OpenAI's release of ChatGPT also made it hugely valuable. But their commercial success relies significantly on unlicensed use of Times works.

The Times attempted to negotiate a deal with Microsoft and OpenAI but failed, hence this lawsuit. Generating substitute products that compete with inputs used to train models does not qualify as "fair use" exemptions to copyright. The Times seeks damages and injunctive relief.

In summary, The New York Times alleges Microsoft and OpenAI's AI products infringe Times copyrights on a massive scale to unfairly benefit at The Times's expense. The Times invested heavily in content creation and controls how its work is used commercially. Using Times content without payment or permission to build competitive tools violates its rights under copyright law.

starchild30012y ago

A second LLM's take on this lawsuit can be found below. I'd love to see OpenAI address these complaints publicly and without incurring any additional damages to NYT.

The document is a legal complaint filed by The New York Times Company against Microsoft Corporation and various OpenAI entities, alleging copyright infringement and other related claims. The New York Times Company (The Times) accuses the defendants of unlawfully using its copyrighted works to create artificial intelligence (AI) products that compete with The Times, particularly generative artificial intelligence (GenAI) tools and large language models (LLMs). These tools, such as Microsoft's Bing Chat and OpenAI's ChatGPT, allegedly copy, use, and rely heavily on The Times’s content without permission or compensation.

Nature of the Action: The Times emphasizes the importance of independent journalism to democracy and claims its ability to continue providing this service is threatened by the defendants' actions. The complaint argues that the GenAI tools are built upon unlawfully copied New York Times content, which undermines The Times's investments in journalism.

Defendants: The defendants include Microsoft Corporation and various OpenAI entities, such as OpenAI Inc., OpenAI LP, and several other related companies. The Times alleges these entities have worked together to create and profit from the GenAI tools in question.

Allegations: 1. Copyright Infringement: The Times claims the defendants copied millions of its copyrighted articles and other content to train their GenAI models. This training allegedly involves large-scale copying and use of The Times’s content, emphasizing its quality and value in building effective AI models.

2. Unlawful Competition: The Times argues that the defendants' GenAI tools compete with it by providing access to its content for free, which could potentially divert readers and revenue away from The Times.

3. Misattribution and Hallucinations: The Times asserts that the defendants' tools not only unlawfully distribute its content but also generate and attribute false information to The Times, damaging its credibility and trust with readers.

4. Trademark Dilution: The complaint includes claims that the defendants' use of The Times’s trademarks in connection with lower-quality or inaccurate AI-generated content dilutes and tarnishes its brand.

5. Digital Millennium Copyright Act Violations: The Times alleges that the defendants removed or altered copyright management information from its works, which is prohibited under the law.

Harm to The Times: The Times claims it has suffered significant harm from these actions, including loss of control over its content, damage to its reputation for accuracy and quality, and financial losses due to diminished traffic and revenue.

Demands: The Times seeks various forms of relief, including statutory damages, injunctive relief to prevent further infringement, destruction of the infringing AI models, and compensation for losses and legal fees.

Overall Summary: This legal complaint represents a significant clash between traditional media and emerging AI technology companies. It underscores the complex legal, ethical, and economic issues arising from the use of copyrighted content to train AI systems. The outcome of this case could have far-reaching implications for the AI industry, content creators, and the broader digital ecosystem.

kragen2y ago

this was predicted in the very influential epic 2014 video in 02004

https://www.youtube.com/watch?v=eUHBPuHS-7s (the original is flash and has thus been consigned to the memory hole, so we are left with this poor-quality conversion)

36": 'however, the press as you know it has ceased to exist'

40": '20th-century news organizations are an afterthought; a lonely remnant of a not-too-distant past'

2'11": 'also in 2002, google launches google news, a news portal. news organizations cry foul. google news is edited entirely by computers'

5'13": 'the news wars of 2010 are notable for the fact that no actual news organizations take part. googlezon finally checkmates microsoft with a feature the software giant cannot match: using a new algorithm, googlezon's computers construct new stories, dynamically stripping sentences and facts from all content sources, and recombining them. the computer writes a new story for every user'

5'55": 'in 2011 the slumbering fourth estate awakes to make its first and final stand. the new york times company sues googlezon, claiming that the company's fact-stripping robots are a violation of copyright law. the case goes all the way to the supreme court'

they didn't get the details exactly right, but overall the accuracy is astounding

however, that may be a hyperstition artifact in this timeline

https://en.wikipedia.org/wiki/EPIC_2014 (i thought epic 2014 might be the only flash video to have a wikipedia article about it, but then i looked and found five others)

sunpazed2y ago

> “The tragedy of the Luddites is not the fact that they failed to stop industrialization so much as the way in which they failed. Human rebellion proved inadequate against the pull of technological advancement.”

https://www.newyorker.com/books/page-turner/rethinking-the-l...

cpt1002y ago

Given that the Harvard president plagiarized her way into becoming president, how can we be sure that NYT doesn't plagiarize and take content from X and other places to quickly churn out daily news?

weikju2y ago

Probably has something to do with impending deals between NYT and major companies, e.g.

[0] https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...

[1] https://www.theverge.com/2023/12/22/24012730/apple-ai-models...

sackfield2y ago

Something I have wondered about LLMs and training data is the idea that the biggest content producers on the internet now have their world view and tone echoed disproportionately as part of the next big wave of technology. This is incredibly impactful (although admittedly I don't know how to turn that into a profit). Is there some long term impact of removing the New York Times from training data that means it won't be part of the LLMs corpus going forward that is unforeseen?

sylware2y ago

If they don't let AIs be trained on as much data as possible, those AIs will be less "good" than the ones trained without constraints, like you will have in China or elsewhere, and people will mechanically start using the latter.

Unless they engage in massive geolocation-based IP and DNS banning, forced upon all internet users and "external" users.

labrador2y ago

NYT wants to outlaw a math game created by calculating the probabilities of word groupings and words following each other in NYT articles, along with a lot of other writings NYT does not own. The players roll the dice, so to speak, by seeding an initial string of words, and whoever comes up with the most interesting paragraph wins. This paragraph may or may not look like NYT writing, which, in the larger scheme of the collected writings of humankind, isn't particularly unique. It doesn't even have to be true. Hallucinations are an expected outcome.

If a NYT article says "Henry Kissinger was known to eat ice cream on a hot day" and our game outputs the same, it is purely by chance. It cannot be proven the output was copied verbatim from the NYT because the fragments "Henry Kissinger was known to" and "eat ice cream on a hot day" are not unique to the NYT or exclusive to it.

Is the NYT claiming ownership of the weights in LLMs?
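To make the "math game" concrete, here is a toy sketch of the kind of word-following-word table I mean. It is a plain bigram model, which is a drastic simplification: a real LLM is a neural network, not a lookup table of counts, and the function names here are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def train_bigrams(text: str) -> dict:
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts: dict, seed: str, length: int, rng: random.Random) -> str:
    """Roll the dice: start from a seed word and repeatedly sample
    the next word in proportion to how often it followed the last one."""
    out = [seed]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last word was never followed by anything
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(out)
```

The table stores only counts, yet if a word sequence occurs just once in the training text, the dice have only one way to land and the output repeats the source exactly.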

djhope992y ago

This argument that the LLM is learning seems slightly flawed when you consider that other experts in the field consider it more like lossy compression. If it’s lossy compression that’s really happening here then you can understand the copyright argument. It’ll be interesting to see how this plays out, lots of new ground breaking.

Mountain_Skies2y ago

Looks like this is a case of Media vs Tech which might be solved by the courts using past paradigms but should really be addressed by legislation specific to this situation. The difficulty for the media companies, at least in the US, is that both major political parties see the media as the enemy. The left might be a bit more positive about the media but overall, they still see the media as something owned by wealthy elites suppressing knowledge of the harm the powerful inflict on the weak and powerless. Over on the Tech side of things, one party sees Tech as wholly owned by the other side of the political divide. Over on that side, things are relatively (but not completely) friendly, so my guess is Tech will end up winning simply because it has more friends in the political realm than the Media does.

lwhi2y ago

Surely there's no chance OpenAI would agree to this?

Isn't it more likely that the company buys the NYT?

rimeice2y ago

This wave is growing. Just cannot see how the big LLM players are going to get round this without paying big licence fees to content creators. Feels a bit like the torrent to Spotify moment, but for _all_ content, not just music. How they will manage the licensing model is beyond me. It’s going to be very easy for someone to sue these companies, but very difficult for the companies to calculate and attribute value and pay out individual creators that contributed a tiny fraction of the training data. Surely this will make it very difficult for them to keep a business model working to a level their VC backers need to warrant even a fraction of their valuations.

kweingar2y ago

The thing that bothers me about the whole situation is that OpenAI prohibits using its model output as training data for your own models.

It seems more than a bit hypocritical, no? When it comes to their own training data, they claim to have the right to use any/all of humanity’s intellectual output. But for your own training data, you can use everything except for their product, conveniently for them.

ml-anon2y ago

ITT: the perfect storm of people who don't know how LLMs work, don't know how the brain works and don't know how the law works.

1f60c2y ago

I believe that ChatGPT is fair use, just on a much larger scale than we're used to.

chucke19922y ago

While copyright laws are ancient and outdated - and probably should be reformed or removed altogether - this lawsuit might be entertaining.

shp0ngle2y ago

Microsoft is one of the companies that love to use copyright to get their way, and the BSA is a known software mafia, so I'm not at all sympathetic to them.

dash22y ago

There’s an awful lot of confident statements being made about the law here. I wonder if anyone who is actually a lawyer would like to chime in.

skc2y ago

Kind of ironic that the NYT will still have to host articles extolling the virtues of OpenAI as it continues to expand and upend industries

throwuwu2y ago

If they lose they should delete the NY Times

exabrial2y ago

I'm actually fine with this. Copyright holders never consented to having their work used in this manner.

amadeuspagel2y ago

Two not-so subtle paragraphs about the "partnership" between Microsoft and OpenAI:

> 15. Microsoft Corporation is a Washington corporation with a principal place of business and headquarters in Redmond, Washington. Microsoft has invested at least $13 billion in OpenAI Global LLC in exchange for which Microsoft will receive 75% of that company’s profits until its investment is repaid, after which Microsoft will own a 49% stake in that company.

> 16. Microsoft has described its relationship with the OpenAI Defendants as a “partnership.” This partnership has included contributing and operating the cloud computing services used to copy Times Works and train the OpenAI Defendants’ GenAI models. It has also included, upon information and belief, substantial technical collaboration on the creation of those models. Microsoft possesses copies of, or obtains preferential access to, the OpenAI Defendants’ latest GenAI models that have been trained on and embody unauthorized copies of the Times Works. Microsoft uses these models to provide infringing content and, at times, misinformation to users of its products and online services. During a quarterly earnings call in October 2023, Microsoft noted that “more than 18,000 organizations now use Azure OpenAI Service, including new-to-Azure customers.”

dewbrite2y ago

Summarizing the article: The most damning thing here is the "ChatGPT as a search engine" feature, which appears to run an agent which performs a search, visits pages, and returns the best results.

In doing this, it is bypassing the NY Times paywall, and you can read full articles from today by repeatedly asking for the next paragraph.

ranting-moth2y ago

Let's try the "reverse the gender" card.

Let's say OpenAI was trained on all the Windows source code (without approval from MS).

GPT could pretty much replicate the Windows code given even a not-that-clever prompt from any user: "Write an OS CreateProcess function like Windows 10 source code would have."

It would infuriate MS to put it mildly, enough to start a lawsuit.

I know the license to the MS source code and NYT articles aren't the same.

wseqyrku2y ago

Sounds like 2024 is gonna be the year of lawsuits like this.

poorman2y ago

Sad to say but I would believe a hallucination from OpenAI before I would believe anything that comes out of the NY Times. I mean the confidence interval for the NY Times is what again?

thunkshift12y ago

I think this could be a shakedown. They want money/licensing from OpenAI the way Apple was offering news companies recently. High probability this is settled out of court.

fbhabbed2y ago

This is getting a bit out of hand isn't it.

joshxyz2y ago

the only winners here are the lawyers of both parties, laughing their way to the bank.

god i love this era, so much grey area in these edge technologies.

thatgerhard2y ago

what a "surprisingly" short-sighted approach by an obsolete media giant

visarga2y ago

Wondering who tf reads old NYT articles? News becomes old really fast. ChatGPT is months or years behind.

kolinko2y ago

Worth noting that the screenshot, at least, shows an example of the browsing functionality being used to go around paywalls, not that the model itself was trained on the articles or can really reproduce them.

IIRC this was the reason why the browsing plugin was disabled for some time after its introduction - they were patching up this hole.

test65542y ago

How do I put this... Whether NYT is right or wrong, their case should be dismissed.

andrewstuart2y ago

Means nothing.

An ambit claim that Rupert is throwing out there to see what he can get.

cynicalsecurity2y ago

Nothing will come out of it. NY times will lose.

ryukoposting2y ago

> All of that costs money, and The Times earns that by limiting access to its reporting through a robust paywall.

Not to be pedantic, but NYT has the least robust paywall I've ever seen. Just turn on reader mode in your browser. Simple. I get that it's still trespassing if I walk into an unlocked house, but NYT could try installing a lock that isn't made of confetti and uncooked pasta.

nektro2y ago

oh how joyous that would be. I so hope they win

altals20232y ago

Won't hold up in court. GPT is a platform that mainly provides answers to private individuals who ask. It's like asking a professor a question and having him recite copyrighted material back to you word for word (thanks to a photographic memory). Now, if you take that answer and write a book, or publish it en masse on blogs, for example, then you are the one who should be sued by the NYT. If GPT used the exact same wording and published it to everyone visiting their page, then that would be on OpenAI.

andy992y ago

I don't think the lawsuit has any merit, but I'd still like to encourage Sam Altman et al, if they really care about the greater good, to go Keyser Söze and immediately release torrents of the weights and source code for GPT-4 under GPL.

atleastoptimal2y ago

It's obviously a frivolous suit that will net, at best, a ceremonial victory for the NYTimes: an 8-figure max payout and a promise not to use NYTimes material in the future.

The gap between OpenAI and the NYTimes in trajectory and value to society could not be greater. The NYTimes has won no favors in the court of public opinion with its frequent misinformation. It's all just a big waste of time, the last of the old guard flailing against the march of progress.

And even if, hypothetically, they managed to get OpenAI to delete ChatGPT, they'd be hated forever.

outside12342y ago

Seems reasonable - they probably broke the TOS of the site

hazmazlaz2y ago

I'd rather have GPT than the NY Times, if I had to choose between one or the other.

unstatusthequo2y ago

I’d be happy if the NYT was deleted. I find it has very little use as a source of anything, much like most mainstream media.

cycrutchfield2y ago

I read a NYT article and publish a summary of facts that I learned: totally legit.

Train a model on NYT text that outputs a summary of facts that it learned: OMG literally murder.

kazinator2y ago

Should be: "NY Times wants OpenÄI to delete all GPT instances". You wouldn't want the hapless rabble misreading it as an "aiii" diphthong.

j0hnyl2y ago

I hope the world can rally and move past these anachronistic ideas of intellectual property.

ehwhwhwhahhwh2y ago

NYT could also fix the issue by deleting NYT itself. Could be a better result for humanity as well. Thanks.
