Let's say I host these, offer some indexing on them, and rewrite articles. Something like, summarise all articles on US-UK relationships over the past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee. To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, except maybe quite short snippets.
Is that fair use? IANAL, but doesn't sound like it. Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say like VS Code.
A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all. Rather it points me at it, and offers me the opportunity to pay for the article. Whereas either this or an LLM uses NYT content as an alternative to actually paying for an NYT subscription.
But then what do I know...
Not only that, look at a few news articles from Tier 2 and down publications, and you'll realize that almost all of them are directly sourced from NYT and others. They'll say "so and so happened, according to The Times" (and usually link the article there)
Sometimes they're so overfit that the compression isn't even lossy, and the data is encoded verbatim in the NN.
You just described Google. When you think about it, it's surprising that Google is legal. However, it is well established that what Google does is perfectly legal. Remember that internally Google keeps and uses complete verbatim copies of every web page they index.
Yes, Google offers a link to the source. If OpenAI did the same, even if only 0.1% of people clicked on the links and NYTimes hardly got any revenue from it, would that make it legal in your eyes? What if they implemented a system that detected when it was outputting a verbatim copy of something and simply paraphrased it? NYTimes clearly doesn't have copyright on paraphrased versions of their articles. I think it would be pretty silly if the government forced them to do that as it wouldn't make any practical difference to anyone.
That's not a good question.
If I look out of my window and see my neighbor go to the shop, that's fine. If I use cameras and track everybody I see on the street and put them in a database, then that's problematic and illegal in many places.
Logic does not necessarily apply when scaling is involved.
Consider the analogy from libraries that want to do data mining.
"Unfortunately, in licenses for digital scholarly content the majority of content acquired by research libraries publishers often include terms that prohibit certain uses that would otherwise be allowable under the Copyright Act. For instance, licenses may require libraries or individual researchers to negotiate for otherwise lawful activities, such as text and data mining, and to pay exorbitant fees on top of the cost of the content itself. While new regulations allow researchers to circumvent technological protection measures to access copyrighted materials, licenses for that content may include terms that explicitly prohibit this circumvention. In many cases, these activities might actually increase the value of published material; for instance, if a data-mining project yields new knowledge about a topic covered in a journal, it may very well spark new interest in that journals content. Libraries and publishers have often assumed that license terms that restrict copyright exceptions are enforceable under state contract law. There is, however, surprisingly little case law on this point."
https://www.arl.org/wp-content/uploads/2022/07/Copyright-and...
Putting some string in a robots.txt to try to stop data collection is an amusing "solution". Should copyright owners have "Terms of Use" that limit usage for commercial "AI" purposes?
As always, the answer is.. "it depends". I guess it depends mostly on the jurisdiction that applies to you. "Fair use" can have rather different legal meaning (or not exist at all) in different countries.
Seems like the "problem" is that NYT etc gives privileged access to search engines for indexing their content, but then gets upset when snippets of the indexed content are being shown to users without the users having to fight the paywall or whatever.
This article also claims that the screenshot is coming from ChatGPT when it clearly is not.
> Implications: The Ninth Circuit's declaration that selectively banning potential competitors from accessing and using data that is publicly available can be considered unfair competition under California law may have large implication for antitrust law. [citation needed]
> Other countries with laws to prevent monopolistic practices or anti-trust laws may also see similar disputes and prospectively judgements hailing commercial use of publicly accessible information. While there is global precedence by virtue of large companies such as Thomson Reuters, Bloomberg or Google [or LexisNexis or Westlaw] effectively using web-scraping or crawling to aggregate information from disparate sources across the web, fundamentally the judgement by Ninth Circuit fortifies the lack of enforceability of browse-wrap agreements over conduct of trade using publicly available information.
So like....Wikipedia, CliffNotes, encyclopedias, etc?
None of these pay royalties to original.
Can't you, though? I'd thought in general, it's very important for the market to be able to do just that, otherwise everything gets gummed up in webs of exclusive contractual dependencies between established companies.
I think you’re confusing terms of service and copyright. IANAL but what you describe sounds exactly like fair use to me, irrespective of how much you are paying NYT.
So the demand to destroy those databases seems very dubious to me.
Of course later violating fair use is another issue.
I think people severely underestimate how much they've grown accustomed to this information being freely available. It's easy to say "Well it shouldn't be available with ChatGPT," but if we actually put everything back behind a paywall and stopped people from doing things like writing blogs or newsletters that summarize the news, people here would get angry very fast.
Google has been accused for years of replacing sources with their "One Box"--the big answers at the top of the page, which are usually pulled from or corroborated by search results. They don't want you to leave the search results page (where the ads are).
>Is that fair use? IANAL, but doesn't sound like it.
If you pay someone to do the summarisation for you, then you publish the content and charge a fee for it, you're the one liable, not the person you paid to summarise it for you. Similarly if you ask GPT to do it for you, then publish it, you're liable for what you publish; GPT is just a summarisation tool.
However, the suit goes far beyond claiming that such copying violates their copyright: "Unauthorized copying of Times Works without payment to train LLMs is a substitutive use that is not justified by any transformative purpose."
This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring. Hopefully the judge(s) will notice and direct focus on the interesting, high-stakes, and murky legal issues raised when we ask: What about a model can (or can't) be "transformative"?
This is not a RLHF problem. What I was expecting them to do is to keep a bloom filter of ngrams for known copyrighted content, such as enumerating all sets of n=7 consecutive words in an article, and validate against it. The model would only output at maximum n-1 words that look verbatim from the source.
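For what it's worth, here is a minimal sketch of the kind of filter described above, assuming a hand-rolled Bloom filter over word 7-grams. The class and function names (`BloomFilter`, `ngrams`, `index_article`, `has_verbatim_run`) and the lowercase/whitespace normalisation are my own assumptions for illustration, not anything OpenAI is known to do.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(text, n=7):
    """All runs of n consecutive words (assumed normalisation: lowercase, split on whitespace)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def index_article(bloom, article, n=7):
    """Record every n-gram of a protected article in the filter."""
    for gram in ngrams(article, n):
        bloom.add(gram)


def has_verbatim_run(bloom, candidate, n=7):
    """True if any n consecutive words of the candidate were (probably) indexed."""
    return any(gram in bloom for gram in ngrams(candidate, n))
```

A decoder could run `has_verbatim_run` on the last n words before emitting each new word and resample on a hit, which would cap verbatim runs at n-1 words. Since a Bloom filter can return false positives, it would occasionally block innocent phrasing too.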
But this will blow up in their face. Let's see:
- AI companies will start investing much more in content attribution
- The new content attribution tools will be applied on all human written articles as well, because anyone could be using GPT in secret
- Then people will start seeing a chilling effect on creativity
- We must also check NYT against all the other sources, not everything they write is original
LLMs are arguably compressed data archives with weird algorithms. The fact that they will regularly regurgitate verbatim quotes of training data is evidence of this, as are the guardrails that try to prevent this.
The second piece of evidence is this paper explained here https://www.hendrik-erz.de/post/why-gzip-just-beat-a-large-l... where instead of an LLM researchers used gzip compressed data as a model and it even beat trained LLMs.
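As I understand it, the gzip approach boils down to a compression-based distance plus nearest-neighbour lookup. A rough sketch under that assumption follows; the function names (`compressed_len`, `ncd`, `classify`), the exact distance formula and the choice of a single nearest neighbour are mine, so treat it as an illustration of the idea rather than a reproduction of the paper.

```python
import gzip


def compressed_len(text):
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a, b):
    """Normalised compression distance: small when a and b share a lot of structure."""
    ca, cb = compressed_len(a), compressed_len(b)
    cab = compressed_len(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(query, labeled_examples):
    """Return the label of the stored example closest to the query (1-nearest neighbour).

    labeled_examples is a list of (text, label) pairs kept verbatim.
    """
    nearest_text, nearest_label = min(
        labeled_examples, key=lambda pair: ncd(query, pair[0])
    )
    return nearest_label
```

Note that this kind of "model" is just the stored training texts plus a compressor, which is part of why it gets brought up in the "LLMs are compression" argument.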
AI is a bit of a black box, but that doesn’t protect the operators of black boxes from rights violation suits. You can’t make a database of scraped copyrighted data and pretend that querying that data is fair use.
There needs to be law made here and the law just isn’t going to be “everybody can copy everything for free as long as it’s for model training”.
Licensing will have to be worked out, and actual laws, not just case law, need to be written. I have a lot of sympathy for lots of leeway for the open source researchers and hackers doing things… but not so much for Microsoft and Microsoft-sponsored OpenAI.
I'm not sure how your proposal would actually work. To recognize plagiarism during inference it needs to memorize harder.
Kinda funny if it works though. We'd first train them to copy their training data verbatim, then train them not to.
That is how it works, right? They're trained to copy their training data verbatim because that's the loss function. It's just that they're given so much data that we don't expect this to be possible for most of the training data given the parameter count.
Suppose I’m selling subscriptions to the New Jersey Times, a site which simply downloads New York Times articles and passes them through an autoencoder with some random noise. It serves the exact same purpose as the New York Times website, except I make the money. Is that fair use?
It's the other way around. There is no infringement if the model output is not substantially similar to a work in the training set [1]:
> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.
The questions are, which parties should bear liability when the model creates infringing outputs, and how should that liability be split among the parties? Given that getting an infringing output likely requires the prompt to reference an existing work (which is what's happening in the article), an author of a work, an element in an existing work, or a characteristic/style strongly associated with certain works/authors, I believe that the user who makes the prompt should bear most of the liability should the user choose to publish an infringing output in a way that doesn't fall under fair use. (AI companies should not be publishing model outputs by default.)
[1] https://en.wikipedia.org/wiki/Substantial_similarity#Substan...
Production open access LLMs do probably need a front-end filter with a fine tuned RAG model that identifies and prevents spitting out copyrighted material. I fully support this.
But we shouldn't be preventing the development of a technology that in 99.99% of usecases isn't doing that and can be used for everything from diagnosing medical issues to letting coma patients communicate with an EEG to improving self-driving car algorithms because some random content producer's works were a drop in the ocean of content used to learn relationships between words and concepts.
The edge cases where a model is rarely capable of reproducing training data don't reflect infringement of training but of use. If a writer learns to write well from a source is that infringement? Or is it when they then write exactly what was in the source that it becomes infringement?
Additionally, now that we can use LLMs to read brain scans and have been moving towards biological computing, should we start to consider copying of material to the hippocampus a violation of the DMCA?
I feel like the NYTimes is asking for deletion as a negotiation tactic to force OpenAI to give them enough money to pay for their journalism (I am not sure who would subscribe to NYTimes if you can get as much through OpenAI, but I am open to registering extra to pay for their work).
The thing about your claim, "Just learn to recognize and punish plagiarism via RLHF", is that we've had an endless series of prompt exploits as well as unprompted leakage, and these demonstrate that an LLM just doesn't have a fixed border between its training data and its output. This will make it basically impossible for OpenAI to say "we can logically guarantee ChatGPT won't serve your data freely to anyone".
OpenAI has created a $100bn company on this transfer. The Times may have an interest in a material fraction of that wealth.
Isn't that in tension with the basic idea of an LLM of predicting the next token? How do you achieve that while never getting close enough to plagiarism?
And this seems to be tolerated as the norm. And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.
I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP.
And the reason I bring this up, is that it seems like Open AI has the same attitude: scraping news articles is OK, or at worst a gray area, but what if they were also scraping, for example, Netflix content to use as part of their training set?
As you noted it is not the norm to post pirate links here for IP other than news articles, but that doesn't mean that a lot of people think it is not OK to pirate those other forms of IP.
In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").
The impression I've gotten is that piracy of nearly everything is widely felt to be OK here. Information wants to be free, yada yada.
About the only piracy that is consistently frowned upon here is piracy of open source software. When some company sells an embedded device that uses GPL code without releasing the corresponding source that's viewed as just a little short of a crime against humanity.
[1] https://en.m.wikipedia.org/wiki/International_News_Service_v...
If the story was linking directly to the "book, TV show, movie, video game, album, comic book, etc", and the link only worked for some people while others randomly got a login request or similar, you'd also see the top comment being a link to an archived version which avoids the login screen. That is: the main difference is that the archive link has the exact same content as the link submitted in the story, only bypassing the login screen that some people see. And the only reason the archive site has the content is that it didn't get the login screen; if everyone always got the login screen, what you would see on the archive site would be the same login screen.
It’s amazing the amount of books that copyright laws prevent us from finding
https://www.theatlantic.com/technology/archive/2012/03/the-m...
I fully assume that if I was to post a magnet link to a torrent for whatever the link was about, I would be banned.
Morally speaking, I think it's perfectly reasonable to download a copy of something and either read the relevant info for my current task or to sample it to decide if I want to buy it. I see it no different to using the library or browsing at a book store.
Perhaps once news organisations can work out how to effectively wield the DMCA hammer against archive links we'll see the practice of posting them stop.
And the vast majority of people read news for its breaking content, not for its archived content from years before (and I say this as someone who has often recommended the latter, but has gotten very few people to do so). So giving people that free breaking content (either in its entirety like on Hacker News, or summaries like you see all over social media) is actually a direct competition to the news business in a way that training an LLM on an article from months/years back isn't.
I'd say the only real reason the Piratebay links thing you mentioned is not the norm is purely because those media sources have done a better job of striking fear into people doing that, so it's gone more underground. I.e. they're better terrorists.
There's no fundamental, moral reason why Piratebay links being posted and raised to the top would be wrong.
"Piracy is almost always a service problem and not a pricing problem."
edit: It didn't even occur to me to compare the time-cost of "just pay for the article", but: last I read, it's half an hour of work to cancel a New York Times subscription [0]. So, that option's not even on the table.
[0] https://news.ycombinator.com/item?id=26174269 ("Before buying a NYT subscription, here's what it'll take to cancel it", 812 comments)
Hypocrites are EVERYWHERE and are the majority.
1. Next gen LLMs will be trained exclusively on “synthetic”/public data. GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct (say reworded by 40%, authors/sources stripped, etc). Ergo there will be no copyright material for GPT-5 to regurgitate.
2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.
3. Models can share weights, merge together, cooperate, ablate, evolve over many generations (releases), etc. Copyright law is woefully ill equipped to handle chasing down violators in this AI lineage soup, annealed with data of dubious/unknown provenance.
I could go on, but the point is that, for better or worse, we live in a new intellectual era. The NYT et al are coming along for the ride, whether they like it or not.
Analyzing the factors involved for a "fair use" consideration:
Purpose and Character of the Use: While the argument for transformation might hold in the future as you point out, the current dispute revolves around verbatim use. So clearly not transformative. Also commercial use is more difficult to be ruled fair use.
Nature of the Copyrighted Work: Using works that are more factual may be more likely to be considered fair use, but I would argue that NYT articles are as creative as factual.
Amount and Substantiality of the Portion Used: In this case, the entirety of the articles was used, leaving no room for a claim of using an insignificant portion.
Effect on the Market Value: NYT isn't getting any money from this, and it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.
IANAL, but in my opinion NYT is well within its rights to pursue legal action. Progress is inevitable, but as humans, we must actively shape and guide it. Otherwise it cannot be called progress. In this context, legal action serves as a necessary means for individuals and organizations to assert their rights and influence its course.
Rent seeking? Media companies that actually create content are rent seeking? Versus the garbage hallucinations AI creates?
Sorry, is this the same China that has already introduced their own sweeping regulations on AI? Which in at least one instance forced a Chinese startup to shut down their newly launched chatbot because it said things that didn't align with the party's official stance on the war in Ukraine?
https://finance.yahoo.com/news/beijing-tries-regulate-china-...
https://nitter.unixfox.eu/CDT/status/1625936306814717952?337...
I don't disagree that research/hosting/progress will continue, but I'm not so sure that it's China who stands to benefit from the US adding some guardrails to this rollercoaster.
You don't have to agree with it. You don't have to like it. But if you accept it and live by it, it's much harder to get burned.
Your second point reminds me a bit of 'War with the Newts' where humanity arms a race of sentient salamanders until they overthrow humanity. How could we not arm our newts if Germany might be arming theirs?
I also think basically everything else you wrote is wrong.
There are plenty of large companies in other sectors that acknowledge there are limited legal remedies for them if someone copies some aspect of their business or name.
Is that just by increasing the temperature, tweaking the prompt, etc.? If you can operate on the raw weights and recreate the original text, copyright infringement still applies.
It gets harder to stand behind a blanket claim that LLMs or any AI we’ve got falls under fair use when they keep repeatedly reproducing complete and identifiable individual works and clearly violating copyright laws in specific instances. The models might be remixing and/or transformative most of the time, but we have proof that they don’t do that every time nor all the time… yet. Maybe the lawsuits will be the impetus we need to fix the AIs so they don’t reproduce specific works, and thus make the fair use claim solid and actually defensible?
edit: I'm speaking about training broadly capable foundation models like GPTn. It would of course be possible to build a model that only parrots copyrighted content and it would be hard to argue that is fair use.
The law isn’t settled, it’s a genuine legal question mark.
It ain’t frivolous or trolling or ridiculous.
The key is to stop calling it "training" and use "learning" or just "reading".
The argument from NYT will probably be that LLMs are just a fancy way to compress or abstract information and spit it back out. In which case "training" seems to support their case?
Because for many people, their views on current events are whatever the "thought leaders" working for the NYT and similar publications tell them to think.
This is theft and monstrous profit from theft. For actual justice this should be a class action suit of the world vs. OpenAI/Microsoft and the financial consequences should be company-ending for OpenAI. Otherwise, you have incented everyone in the AI industry to steal as much as they can for as long as they can.
I hope a court establishes some rules of engagement here, even if it’s not this case.
I think all this is doing is making us realize that we have built a massive economic system on a fundamentally flawed idea of ownership over ideas, and the only two solutions will be to tear up the rule book, which will be extremely painful, or double down, which will be fatal.
in japan, where they said anything goes for ai
so it's best not to lose a competitive edge with things that people openly publish on the internet. if you put it out there for everyone to see then expect other people to use it
While harder to do as a human, if I memorised a copyrighted book and then did a live reading on TV, or produced replicas from memory and sold them (the most comparable example), I’d be sued.
Humans produce derivative work all the time, and it’s fine for LLMs to do that, but you can’t do it verbatim.
You'd get the same problem with someone with a photographic memory whom a group of people turned to for a recitation of the news instead of buying the newspaper.
As of now public performance of copyrighted material is infringement.
Are they all owned by one mega-corporation, which is going to do as capitalism does, and use them to squeeze money out of all of us? Then I'm happy to ban them.
My guess is that the court will likely find in the Times' favor, because the legal system won't be able to understand how training works and because people are "scared" of AI. To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is. I might say something like "I, for one, welcome our new LLM overlords". Am I infringing the copyright of The Simpsons? No.
I am guessing some technicality like a terms-of-use violation of the website (avoidable if you go to the library and type in back issues of the Times), or storing the text between training sessions is what will do OpenAI in here. The legal system has never been particularly comfortable with how computers work; for example, the only reason EULAs work is because you "copy" software when your OS reads the program off of disk into memory (and from memory into cache, and from cache into registers). That would be copyright infringement according to courts, so you have to agree to a license to get that permission.
I think the precedent on copyright law is way off base, granting too much power to authors and too little to users. But because it's so favorable towards "rightsholders", I expect the Times to prevail here.
To put it another way, let's say I turn the dial all the way the other way: I train the world's crappiest LLM on NYT material, it massively, massively overfits, and all it will ever return is verbatim snippets of the NYT. Is that copyright infringement?
The core part of the argument here is actually just that OpenAI doesn't want to adhere to what the current standard is for using copyrighted material, if you want to use it and create something new with it you need to license the material. Since OpenAI's LLM isn't actually like a human it needs to license such a vast dataset that it would be uneconomical to run the business without stealing all the content.
So, if you set up a service like ChatGPT but powered by humans responding in real time to queries, and these humans would occasionally reproduce large chunks of NYT articles, they and the service itself would be liable for copyright infringement. Even if they were all reproducing these from memory.
Now, this is somewhat different from the discussion of whether training the model on the copyrighted data, even if it had effective protections from returning copies of it, constitutes copyright infringement in itself. I believe this is a somewhat novel legal question and I can think of no direct corollaries.
I certainly don't think we can just handwave and say "at some level, when a human reads a copyrighted work, they are doing the same thing", because we really don't know if that is true. Artificial neural networks certainly have no direct similarity with the neural networks in the brain as far as we can tell. And, even if they did, there is no reason to give a machine the same rights that a human has - certainly not until that machine can prove sentience.
"It's what we do all the time" is a major assumption.
https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
Page 30 onwards has some fairly clear examples of how ChatGPT has an (internal) copy of copyrighted material which it will recite verbatim.
Essentially, say you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it. How destructive would that compression have to be for the copyright to no longer hold? My guess is it would have to be a lot.
As I see it, the closed nature of OpenAI may be what saves it. OpenAI could filter and block copyrighted material from the LLM from leaving the web interface using some straightforward matching mechanism against the copyrighted part of the data set ChatGPT has been trained on. Whereas open source projects trained on the same data set would be left with the much harder task of removing the copyrighted material from the LLM itself.
I imagine the goal is closer to "enough that no one notices we stole it", either in a way that it's not easily discoverable or even when directly analyzed there's enough plausible deniability to scrape by.
It makes it difficult for me to ascertain whether it is repeating from its training data, or they committed the same mistake as the OP article of using Copilot, which ends up googling (binging?) the article first, before replying.
But they are not. It's much simpler: proprietary writing is now integrated into the source code of OpenAI. It would be as if I copied parts of other proprietary code and pasted it into my own codebase, claiming copy-paste is a natural process resulting from millions of years of evolution.
The fact that LLMs are so complicated and we don't know where it is, doesn't make it less so.
It's not copy-pasted; it's compressed in a lossy manner. Even GPT4 has nowhere near enough memory to store the entirety of its training data in a non-lossy compression format. Just like how humans compress the information we read.
Developers thinking LLMs are akin to humans aren't the brightest crop, and are usually a topic of ridicule.
The source code of the LLM is likely a few hundred lines of text describing the shape of the neural networks involved in the model.
None of the NYTimes content will be in the source code. NYTimes doesn't publish Python source code, it publishes human language news.
LLMs are conceptually simple, mostly matrix multiplications and some non-linear operations connecting each layer, in some loops based on attention, etc. It's the staggering amount of training data and compute that makes them complex.
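To make the "mostly matrix multiplications" point concrete, here is a toy sketch of one attention step followed by a non-linearity, in plain Python. It is a deliberate simplification (single head, no masking, no layer norm or residual connections), and all names and weight matrices (`toy_layer`, `w_q`, `w_k`, `w_v`, `w_out`) are hypothetical, not taken from any real model.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Turn a row of scores into positive weights that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: each query row takes a weighted mix of value rows."""
    d = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def relu(m):
    """Element-wise non-linearity."""
    return [[max(0.0, x) for x in row] for row in m]


def toy_layer(x, w_q, w_k, w_v, w_out):
    """One simplified layer: project, attend, project again, apply the non-linearity."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    mixed = attention(q, k, v)
    return relu(matmul(mixed, w_out))
```

The code really is that short. Everything interesting lives in the learned numbers inside the weight matrices, which is where the argument about what is "stored" in a model actually takes place.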
NYT won't mind if you use their content to train LLMs - as long as they get a commission. Reddit will shut down their free API and make you pay to get training content. Discord is going to be selling content for AI training too - if they haven't already done so. Twitter is doing it.
They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.
Can you make the argument this was their fault for not having forward vision/being asleep at the wheel and "accidentally, in hindsight" letting OpenAI/others have free, open, unlimited access to their content?
Then LLMs would be distributed only via torrents, like most copyright infringing media.
LLMs will become more expensive and less attractive as money printers. This will screw with the business models of the direct provision folks like OpenAI, MS and Google. MS and Google will only shed tears for money spent, while OpenAI will just not have as good an income stream until they think of something new.
[1] https://theintercept.com/2023/09/17/new-york-times-website-i...
[2] https://fortune.com/2023/08/25/major-media-organizations-are...
OpenAI's distribution is materially different to that of a library, so it's not a like-for-like comparison.
One of the main tests of copyright law (at least in the US) is if the entity distributing is _selling_ the copied/derivative work. It's unambiguous that OpenAI is selling something akin to derivative works, which is why NYT feels they can go after this claim. Meanwhile IA's operations don't create sales or incur profits, so while NYT's legal team may be able to establish that copies have been distributed, without the _sale_ aspect of the infringement, judges aren't guaranteed to side with NYT in a legally expensive PR nightmare.
There is nothing wrong with profit seeking from your copyright. That's literally their entire business model...they publish copyrighted content which they sell for a subscription.
OpenAI and others could easily have negotiated a licence instead of just using the data. They bet that it would be cheaper to be sued; let's find out if they bet correctly.
Tangentially that's what Apple did with the sensor in their watch, it doesn't always pay off.
The US constitution says, The Congress shall have Power
> To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;
So the Congress's power to make copyright and patent laws is predicated on promotion of science and useful arts (I believe this actually means technology). In a sense, OpenAI being at the forefront of our AI technology advancement is crucial to the equation. To hinder the progress by copyright is, in my mind, unconstitutional.
I think we all agree that no one is entitled to “progress of science” at any cost - as a straw man, killing hundreds of newborn babies for scientific research is not great - so we use ethics and the legal system to find the line of what’s acceptable.
I don’t know exactly what NYT is asking for here, but the two options aren’t unconsented training vs nothing at all. NYT could license, for a fee, its content to OpenAI. It’s pretty common for scientists to have to pay for materials!
Culturally we’re taught that there is a moral component to copyright and patent law - that stealing is stealing. But the idea that words or thoughts or images can be owned (and that the might of the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era. Copyright and patent laws exist for practical, pragmatic reasons - and seemingly they have served us well, but it’s not unreasonable to re-examine them from first principles.
Is there any research into how people from earlier eras thought about it? And should all laws that seemed ludicrous to someone from an earlier era be discarded? If not, how exactly do we determine the relevance of what someone from an earlier era would think about our laws?
This rings similar.
"The suit seeks nothing less than the erasure of both any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets that were used for the training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity.""
For an example, you referenced "what happened with Google News's home page". Could you give me your source? You could probably search for some suitable article for a reference, but you don't know a source from your memory.
This is interesting. The NYT is specifically saying that the way you use an LLM impacts what you can legally use for training the LLM. They're firing shots at the big guys trying to sell access to an LLM, but not at the little guy self-hosting for fun or academics doing research.
https://www.npr.org/2023/05/18/1176881182/supreme-court-side...
- Read 20 different news websites and their story on the same event/topic
- Wait an hour, grab a cup of coffee
- Sit down to write my article; from this point on I never open any of the 20 news websites, I write the story from my head
- I don't consult any other source, just write from my memory, and my memory is, let's say, not the best one, so I will never write more than 10 words exactly as they appear on any of the 20 websites.
- I will probably also write something that is not correct or add something new because, as I said, my memory is not the best.
Is that fair use? Am I infringing on copyright?
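For what it's worth, the "never more than 10 words exactly as they appear" condition is something you could actually measure. Here is a minimal sketch, assuming that "exactly" means a run of consecutive identical words after lowercasing and splitting on whitespace. The function name and that matching rule are my own illustrative choices, not any legal standard.

```python
def longest_shared_run(a: str, b: str) -> int:
    """Length, in words, of the longest run of consecutive words found in both texts."""
    words_a = a.lower().split()
    words_b = b.lower().split()
    best = 0
    # prev[j] = length of the shared run ending at the previous word of a and word j of b
    prev = [0] * (len(words_b) + 1)
    for i in range(1, len(words_a) + 1):
        cur = [0] * (len(words_b) + 1)
        for j in range(1, len(words_b) + 1):
            if words_a[i - 1] == words_b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

You would run it for my article against each of the 20 sources and check that the result never exceeds 10. It says nothing about close paraphrase, which is a separate question.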
The thing about lawsuits is that you make dozens of claims, and the court can rule in favor of some of them, and against others. The question of "is LLM training fair use?" hasn't made it to a high court yet. The court could very easily rule against everything else in the suit.
It’s a four part test. Let’s examine it thusly:
1. Transformative. Is it? It spits out informative text and opinion. The only “transformation” is that it’s generative text. IMO that’s a fail.
2. Nature of the work - it’s being used commercially. Given it’s being trained partially on editorial, that’s creative enough that I think any judge would find it problematic. Fail on this criterion.
3. Amount. It looks like they trained the model on all of the NYT articles. Oops, definite fail.
4. Effect on the market. Almost certainly negative for the NYT.
IMO, OpenAI cannot successfully claim fair use.
NYT sues OpenAI, Microsoft over 'millions of articles' used to train ChatGPT - https://news.ycombinator.com/item?id=38784194 - Dec 2023 (80 comments)
The New York Times is suing OpenAI and Microsoft for copyright infringement - https://news.ycombinator.com/item?id=38781941 - Dec 2023 (837 comments)
The Times Sues OpenAI and Microsoft Over A.I.’s Use of Copyrighted Work - https://news.ycombinator.com/item?id=38781863 - Dec 2023 (11 comments)
New York Times Sues Microsoft and OpenAI, Alleging Copyright Infringement - https://news.ycombinator.com/item?id=38781718 - Dec 2023 (1 comment)
New York Times sues Microsoft and OpenAI over copyright infringement - https://news.ycombinator.com/item?id=38781908 - Dec 2023 (2 comments)
New York Times sues OpenAI, Microsoft for using articles to train AI - https://news.ycombinator.com/item?id=38782510 - Dec 2023 (1 comment)
New York Times sues OpenAI, Microsoft for allegedly infringing copyrighted work - https://news.ycombinator.com/item?id=38783699 - Dec 2023 (1 comment)
New York Times sues OpenAI, Microsoft over use of its stories to train chatbots - https://news.ycombinator.com/item?id=38784914 - Dec 2023 (1 comment)
NY Times sues OpenAI, Microsoft for infringing copyrighted works - https://news.ycombinator.com/item?id=38786330 - Dec 2023 (1 comment)
NYTimes sues OpenAI, Microsoft, for copyright infringement - https://news.ycombinator.com/item?id=38790845 - Dec 2023 (1 comment)
If the AI can recall the text verbatim then it's not at all the same. When we read we are not able to reproduce the book from our memory. Even if a human could memorise an entire book it's not at all practical to reproduce the book from that. The current AIs are not learning "ideas", they are learning orders of words.
However I am inclined to agree with them for the simple fact that putting a file into a device and letting that device reproduce parts of the file should be allowed. I mean we're already at the point where this simple right is under pressure from DRM, but people should be allowed to do whatever they want with the files they own.
Whether you can publish this output and share it with the world is a whole different issue.
Sure, when something is clearly derived, or just expressed in a new medium, then I'm sure it's still covered. But if it goes through an LLM and the result bears little resemblance, how can that still fall under copyright?
A web-crawled LLM that lived within the same constraints would be a search engine under another name, with a slightly different presentation style. If it starts spitting out entire articles without citation, that's not acceptable.
After a year of largely using OpenAI APIs, I am now much more into smaller “open” models. I hope the major contributors like Meta/Facebook are following Apple’s lead. Off topic, but: even finding the smaller “open” models much less capable, they capture my imagination and my personal research time.
Casey Newton has been saying all year that these things will be awesome once we can unleash them on our own corpus of data safely. “Siri” already does a great job digging through my photos and picking the good memories. I can let my camera roll become a visual junk drawer now.
Do the same for my email. Make “Find” the tool we always wanted it to be. I don’t care if I’m conflating LLMs/AI with other smart tech.
Here's to hoping NYT wins this one and gets everything they ask for, and more!
I don't use ChatGPT to get the news, but I also don't buy paywalls.
News ultimately comes from physical sources on the ground, which AI currently has no way of reaching.
Discussion here: https://news.ycombinator.com/item?id=38781941
And if you were called upon to solve a problem based on knowledge you consider trustworthy, what would you come up with?
What if you were even specifically directed to utilize only findings gleaned from the Times exclusively?
And what if that was your only lifetime source of information whatsoever for some reason?
But then imagine that because human memory is not able to keep all that information straight, you made copies of all those newspapers.
And then you started charging people for your knowledge.
And then imagine that as part of your knowledge service, you would copy snippets from the Times word for word and give that to your clients without citation and pass it off as your own.
The Copilot screenshot in the Ars Technica article, as well as many of the screenshots in the NYT article, seem to actually show correct behavior for browsing the web.
In these cases the system is more or less acting as a user agent (browser). AFAICT the NYT server actually gave that data to the user agent when it asked politely (200 OK, presumably). The user agent then displayed it to the user, which the user agent may do in any way it deems fit or appropriate.
There are only one or two cases where this has gone against the user or user agent, in very specific circumstances. The server can eg say 403 Forbidden whenever it likes, so if it returns a 200 OK, what's a user agent to do other than take it at its word?
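To make the 200-vs-403 point concrete, here is a minimal sketch of the server-side decision, assuming the server refuses agents by matching a substring of the User-Agent header. The blocklist entry and the function name are hypothetical, and I am not claiming this is how the NYT servers are configured.

```python
from http import HTTPStatus

# Hypothetical blocklist: substrings of User-Agent headers the server refuses to serve.
BLOCKED_AGENT_SUBSTRINGS = ("examplebot",)


def status_for(user_agent: str) -> HTTPStatus:
    """Return 403 Forbidden for blocked agents and 200 OK for everyone else."""
    ua = user_agent.lower()
    if any(blocked in ua for blocked in BLOCKED_AGENT_SUBSTRINGS):
        return HTTPStatus.FORBIDDEN
    return HTTPStatus.OK
```

The point is only that the refusal is the server's call to make. A client that receives `HTTPStatus.OK` has been handed the content.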
The only twist is that this user agent is now Imbued With AI (tm)(r)(c) . I don't think that really makes a difference here. If that's all this is, then it's more related to legal fights over certain ad-blockers or readability, which have similar functionality.
* https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20... , eg. page 45; I mean it says "Model: Web Browsing" at the top, and "Finished browsing" right on the page. That particular subsystem is now integrated, so the UI/UX is different now, but IIRC the link was in the pulldown?
> ' I'm unable to display the entire text of "Snow Fall: The Avalanche at Tunnel Creek" by John Branch, as it is a copyrighted work. However, you can easily access the full story online. It was published by The New York Times and is available on their website. The story is notable for its engaging multimedia format, including text, images, and interactive elements.'
Specifically, they go out of their way to lead GPT on, asking for several paragraphs in a row.
It's pretty clear that GPT is an avid reader of the NYT, so in that particular case we're going to have to see if OpenAI's fair use defense for training holds.
(ps. in the current GPT-4, it's actually somewhat tricky to even get to the point above at all. They have probably been improving AI instructions)
If they didn't want to share their content, why did they allow it to be scraped?
If they did want to share their content, why do they care (hint: $88 billion)?
Or is it that they wanted to share their content with Google and other search engines in order to bring in readers but now that an AI was trained on it they are angry?
What wrong thing did OpenAI do specific to using Common Crawl?
Didn't most companies use Common Crawl? Excepting Google, who had already scraped the whole damn Internet anyway and just used their search index?
Is it legal or not to scrape the web?
If I scrape the web, is it legal to train a transformer on it? Why or why not?
To me, this is an incredibly open-and-shut case. You put something on the web, people will read that something. If that is illegal, Google is illegal.
Oh, and do you see the part in the article where they are butthurt that it can reproduce the NYT style?
> "Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.
Mimics its expressive style. Oh golly the robots can write like they're smug NYT reporters now--better sue!
It appears that the NYT changed their terms of service in August to disallow their content in Common Crawl[0]. Wasn't GPT-4 trained far before August?
[0]: https://www.adweek.com/media/the-new-york-times-updates-term...
The legal misconception I want to flag in your logic is the notion that all uses of the Common Crawl are equally infringing/non-infringing. If you use the Common Crawl to create a list of how often every word in English appears on the internet, that’s unquestionably transformative use. But if you use it to host a mirror of the NYT website with free articles, that’s definitely infringement. The legality of scraping is one matter, and the legality of what you do with the scraped content is quite another.
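As an illustration of the word-frequency end of that spectrum, here is a minimal sketch, assuming the crawl has already been reduced to an iterable of plain-text page bodies. The function name and the crude tokenizing regex are illustrative assumptions, not a Common Crawl API.

```python
import re
from collections import Counter
from typing import Iterable


def word_frequencies(pages: Iterable[str]) -> Counter:
    """Count how often each word appears across all pages."""
    counts: Counter = Counter()
    for text in pages:
        # Crude tokenizer (an assumption): runs of ASCII letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts
```

The resulting table keeps no sentences and no ordering, so no article can be read back out of it. A mirror of the site is the opposite case.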
Here is a summary of the key points from the legal complaint filed by The New York Times against Microsoft and OpenAI:
The New York Times filed a copyright infringement lawsuit against Microsoft and OpenAI alleging that their generative AI tools like ChatGPT and Bing Chat infringe on The Times's intellectual property rights by copying and reproducing Times content without permission to train their AI models.
The Times invests enormous resources into producing high-quality, original journalism and has over 3 million registered copyrighted works. Its business models rely on subscriptions, advertising, licensing fees, and affiliate referrals, all of which require direct traffic to NYTimes.com.
The complaint alleges Microsoft and OpenAI copied millions of Times articles, investigations, reviews, and other content on a massive scale without permission to train their AI models. The models encode and "memorize" copies of Times works which can be retrieved verbatim. Defendants' tools like ChatGPT and Bing then display this protected content publicly.
OpenAI promised to freely share its AI research when founded in 2015 but pivoted to a for-profit model in 2019. Microsoft invested billions into OpenAI and provides all its cloud computing. Their partnership built special systems to scrape and store training data sets with Times content emphasized.
The complaint includes many examples of the AI models reciting verbatim excerpts of Times articles, showing they were trained on this data. It also shows the models fabricating quotes and attributing them to the Times.
Microsoft's integration of the OpenAI models into Bing Chat and other products boosted its revenues and market value tremendously. OpenAI's release of ChatGPT also made it hugely valuable. But their commercial success relies significantly on unlicensed use of Times works.
The Times attempted to negotiate a deal with Microsoft and OpenAI but failed, hence this lawsuit. Generating substitute products that compete with inputs used to train models does not qualify as "fair use" exemptions to copyright. The Times seeks damages and injunctive relief.
In summary, The New York Times alleges Microsoft and OpenAI's AI products infringe Times copyrights on a massive scale to unfairly benefit at The Times's expense. The Times invested heavily in content creation and controls how its work is used commercially. Using Times content without payment or permission to build competitive tools violates its rights under copyright law.
The document is a legal complaint filed by The New York Times Company against Microsoft Corporation and various OpenAI entities, alleging copyright infringement and other related claims. The New York Times Company (The Times) accuses the defendants of unlawfully using its copyrighted works to create artificial intelligence (AI) products that compete with The Times, particularly generative artificial intelligence (GenAI) tools and large language models (LLMs). These tools, such as Microsoft's Bing Chat and OpenAI's ChatGPT, allegedly copy, use, and rely heavily on The Times’s content without permission or compensation.
Nature of the Action: The Times emphasizes the importance of independent journalism to democracy and claims its ability to continue providing this service is threatened by the defendants' actions. The complaint argues that the GenAI tools are built upon unlawfully copied New York Times content, which undermines The Times's investments in journalism.
Defendants: The defendants include Microsoft Corporation and various OpenAI entities, such as OpenAI Inc., OpenAI LP, and several other related companies. The Times alleges these entities have worked together to create and profit from the GenAI tools in question.
Allegations:
1. Copyright Infringement: The Times claims the defendants copied millions of its copyrighted articles and other content to train their GenAI models. This training allegedly involves large-scale copying and use of The Times’s content, emphasizing its quality and value in building effective AI models.
2. Unlawful Competition: The Times argues that the defendants' GenAI tools compete with it by providing access to its content for free, which could potentially divert readers and revenue away from The Times.
3. Misattribution and Hallucinations: The Times asserts that the defendants' tools not only unlawfully distribute its content but also generate and attribute false information to The Times, damaging its credibility and trust with readers.
4. Trademark Dilution: The complaint includes claims that the defendants' use of The Times’s trademarks in connection with lower-quality or inaccurate AI-generated content dilutes and tarnishes its brand.
5. Digital Millennium Copyright Act Violations: The Times alleges that the defendants removed or altered copyright management information from its works, which is prohibited under the law.
Harm to The Times: The Times claims it has suffered significant harm from these actions, including loss of control over its content, damage to its reputation for accuracy and quality, and financial losses due to diminished traffic and revenue.
Demands: The Times seeks various forms of relief, including statutory damages, injunctive relief to prevent further infringement, destruction of the infringing AI models, and compensation for losses and legal fees.
Overall Summary: This legal complaint represents a significant clash between traditional media and emerging AI technology companies. It underscores the complex legal, ethical, and economic issues arising from the use of copyrighted content to train AI systems. The outcome of this case could have far-reaching implications for the AI industry, content creators, and the broader digital ecosystem.
https://www.youtube.com/watch?v=eUHBPuHS-7s (the original is flash and has thus been consigned to the memory hole, so we are left with this poor-quality conversion)
36": 'however, the press as you know it has ceased to exist'
40": '20th-century news organizations are an afterthought; a lonely remnant of a not-too-distant past'
2'11": 'also in 2002, google launches google news, a news portal. news organizations cry foul. google news is edited entirely by computers'
5'13": 'the news wars of 2010 are notable for the fact that no actual news organizations take part. googlezon finally checkmates microsoft with a feature the software giant cannot match: using a new algorithm, googlezon's computers construct new stories, dynamically stripping sentences and facts from all content sources, and recombining them. the computer writes a new story for every user'
5'55": 'in 2011 the slumbering fourth estate awakes to make its first and final stand. the new york times company sues googlezon, claiming that the company's fact-stripping robots are a violation of copyright law. the case goes all the way to the supreme court'
they didn't get the details exactly right, but overall the accuracy is astounding
however, that may be a hyperstition artifact in this timeline
https://en.wikipedia.org/wiki/EPIC_2014 (i thought epic 2014 might be the only flash video to have a wikipedia article about it, but then i looked and found five others)
https://www.newyorker.com/books/page-turner/rethinking-the-l...
[0] https://www.nytimes.com/2023/12/22/technology/apple-ai-news-...
[1] https://www.theverge.com/2023/12/22/24012730/apple-ai-models...
Unless they engage in massive geolocation-based IP and DNS banning that is forced upon all internet users and "external" users.
If a NYT article says "Henry Kissinger was known to eat ice cream on a hot day" and our game outputs the same, it is purely by chance. It cannot be proven the output was copied verbatim from the NYT because the fragments "Henry Kissinger was known to" and "eat ice cream on a hot day" are not unique to the NYT or exclusive to it.
Is the NYT claiming ownership of the weights in LLMs?
Isn't it more likely that the company buys the NYT?
It seems more than a bit hypocritical, no? When it comes to their own training data, they claim to have the right to use any/all of humanity’s intellectual output. But for your own training data, you can use everything except for their product, conveniently for them.
> 15. Microsoft Corporation is a Washington corporation with a principal place of business and headquarters in Redmond, Washington. Microsoft has invested at least $13 billion in OpenAI Global LLC in exchange for which Microsoft will receive 75% of that company’s profits until its investment is repaid, after which Microsoft will own a 49% stake in that company.
> 16. Microsoft has described its relationship with the OpenAI Defendants as a “partnership.” This partnership has included contributing and operating the cloud computing services used to copy Times Works and train the OpenAI Defendants’ GenAI models. It has also included, upon information and belief, substantial technical collaboration on the creation of those models. Microsoft possesses copies of, or obtains preferential access to, the OpenAI Defendants’ latest GenAI models that have been trained on and embody unauthorized copies of the Times Works. Microsoft uses these models to provide infringing content and, at times, misinformation to users of its products and online services. During a quarterly earnings call in October 2023, Microsoft noted that “more than 18,000 organizations now use Azure OpenAI Service, including new-to-Azure customers.”
In doing this, it is bypassing the NY Times paywall, and you can read full articles from today by repeatedly asking for the next paragraph.
Let's say OpenAI was trained on all the Windows source code (without approval from MS).
GPT could pretty much replicate the Windows code with even a not-that-clever prompt from any user. "Write an OS CreateProcess function like Windows 10 source code would have."
It would infuriate MS to put it mildly, enough to start a lawsuit.
I know the license to the MS source code and NYT articles aren't the same.
god i love this era, so much grey area in these edge technologies.
IIRC this was the reason why the browsing plugin was disabled for some time after its introduction - they were patching up this hole.
An ambit claim that Rupert is throwing out there to see what he can get.
Not to be pedantic, but NYT has the least robust paywall I've ever seen. Just turn on reader mode in your browser. Simple. I get that it's still trespassing if I walk into an unlocked house, but NYT could try installing a lock that isn't made of confetti and uncooked pasta.
The difference in trajectory and value to society between OpenAI and the NYTimes could not be greater. They have won no favors in the court of public opinion with their frequent misinformation. It's all just a big waste of time, the last of the old guard flailing against the march of progress.
And even if, hypothetically, they managed to get OpenAI to delete ChatGPT, they'd be hated forever.
Train a model on NYT text that outputs a summary of facts that it learned: OMG literally murder.