No snark intended; I’m seriously asking. If the answer is “no” then where do you draw the line?
A few factors that come to mind would be:
- scale
- informed consent, of which there was none in this case
- how you are going to use that data. For example, using everybody else's work so the world's richest company can make more money from it while giving nothing back in return is a bullshit move.
So here's the question:
Does a person reading a comment destroy the incentive for the author to post it? No. In fact, it is the only thing that produces the incentive for someone to post. People post here when they want that thing to be read by someone else.
Does a model sucking up all the artistic output of the last 400 years and using that to produce an image generator destroy the incentive of producing and sharing said artistic output? Yes. At least, that is the goal of such a model -- to become so good it is competitive with human artists.
Of course you have plenty of people positioned to benefit from this incentive-destruction claiming it does no such thing. I personally put more credence in the words of people who have historically actually been incentivized by said incentives (i.e. artists), who generally seem to perceive this as destructive to their desire to create and share their work.
Copyright, at least in the US, cares about the effect of the use on the market for that specific work. It's individual ownership, not collective. And while model regurgitation happens, it's less common than you think.
The real harm of AI to artists is market replacement. That is, with everyone using image generators to pop out images like candy, human artists don't have a market to sell into. This isn't even just a matter of "oh boo hoo I can't compete with Mr. Diffusion". Generative AI is very good at creating spam, which has turned every art market and social media platform into a bunch of warring spambots whose output is statistically indistinguishable from a human's.
The problem is, no IP law in the world is going to recognize this as a problem, because IP is a fundamentally capitalist concept. Asserting that the market for new artistic works and notoriety for those works should be the collective property of artists and artists alone is not a workable legal proposal, even if it's a valid moral principle. And conversely the history of copyright has seen it be completely subverted to the point where it only serves the interests of the publishers in the middle, not the creators of the work in question. Hell, the publishers are licking their chops as to how many artists they can fire and replace with AI, as if all their whinging about Napster and KaZaA 24 years ago was just a puff piece.
It isn’t clear to me that these models destroy incentive to create. I mean, ChatGPT can generate comments in my style all day, and yet I’m still incentivized to comment.
I fancy myself a photographer. I still want to take photos even if DALL-E 4 will generate better ones.
What even is the point of creating art? I think there are two purposes: personal expression and enjoyment for others.
People will continue to express themselves even if a bot can produce better art.
And if a bot can produce enjoyment for others en masse, then that seems like a huge win for everybody.
Scale: Many companies (e.g. Google, Bing) have been scraping at scale for decades without issue. Why does scale become an issue when an LLM is thrown into the mix?
Informed consent: I’m not sure I fully understand this point, but I’d say most people posting content on the public internet are generally aware that people and bots might view it. I guess you think it’s different when the data is used for an LLM? But why?
Data usage: Same question as above.
I just don’t see how ingestion into an LLM is fundamentally different than the existing scraping processes that the internet is built on.
There is a whole genre of copyright infringement where someone will scrape a website and create a per-pixel copy of it but loaded up with ads, and blackhat SEOed to show up above the original website on searches. That's bad, and to the extent that LLMs are doing similar things, they are bad too.
Imagine I scrape your elaborate GameFAQs walkthrough of A Link to the Past. I could 1) use what I learn to direct curious people to its URL, or 2) remove your name from it, cut it into pieces, and rehost the content on my own page, mashed up with other walkthroughs of the same game. Then I sell this service as a revolutionary breakthrough that will free people from relying on carefully poring through GameFAQs walkthroughs ever again.
People will get mad about the second one, and to the extent what LLMs do is like that, will get mad at LLMs.
Not something big, not something you can enforce, but you'd feel very annoyed that I'm making good money on something you wrote while you get nothing. I think?
If a human reads it, that would be a reproduction of the work, but if you serve that page as a cache to a human, you're usually okay.
If you compile all that information in a database and use it to answer search queries that's also okay, and nothing forbids you from using machine learning on that data to better answer those search queries.
Both of the above are actually being challenged right now but for the time being they're fine.
But that database is a derivative work, in that it contains copyrighted material, and so how you use it matters if you want to avoid infringement — for example, a Google employee SSHing to a server to read NYT articles isn't kosher.
What isn't clear is whether the model is a derivative work. Does it contain the information, or is it new information created from the training data? Sure, if you're clever you could probably encode information in the weights and use it as a fancy zip file, but that's a matter of intent. If you use Rewind or Windows Recall and it captures a screenshot of a NYT article and then displays it back to you later, is that a reproduction? Surely not. And that's an autonomous system that stores copyrighted data and regurgitates it verbatim.
So if it's impractical to actually use it for piracy and it very obviously isn't anyone's intent for it to be used as such then I think it's hard to argue it shouldn't be allowed, even on data that was acquired through back channels.
But copyright is more political than logical so who knows what the legal landscape will be in 5 years, especially when AI companies have every incentive to use their lawyers to pull the ladder up behind them.
AI is a unique third case in which we have billions of creators and no idea who contributed what parts of the model or any specific outputs. So we can't pay in exposure, aside from a brutally long list of unwilling data subjects that will never be read by anyone. Some of the training data is being regurgitated unmodified and needs to be attributed in full, some of it is just informing a general understanding of grammar and is probably being used under fair use, and yet more might not even wind up having any appreciable effect on the model weights.
None of this matters because nobody actually agreed to be paid in exposure, nor was it ever in any AI company's intent - including Apple - to pay in exposure. Data is free purely because it would be extraordinarily inconvenient if anyone in this space had to pay.
And, for the record, this applies far wider than just image or text generators. Apple is almost surely not the worst offender in the space. For example: all that facial recognition tech your local law enforcement uses? That was trained on your Facebook photos.