No snark intended; I’m seriously asking. If the answer is “no” then where do you draw the line?
A few factors that come to mind would be:
- scale
- informed consent, of which there was none in this case
- how you are going to use that data. For example, using everybody else's work so the world's richest company can make more money from it while giving nothing back in return is a bullshit move.
So here's the question:
Does a person reading a comment destroy the incentive for the author to post it? No. In fact, it is the only thing that produces the incentive for someone to post. People post here when they want that thing to be read by someone else.
Does a model sucking up all the artistic output of the last 400 years and using that to produce an image generator destroy the incentive of producing and sharing said artistic output? Yes. At least, that is the goal of such a model -- to become so good it is competitive with human artists.
Of course you have plenty of people positioned to benefit from this incentive-destruction claiming it does no such thing. I personally put more credence in the words of people who have historically actually been incentivized by said incentives (i.e. artists), who generally seem to perceive this as destructive to their desire to create and share their work.
Copyright, at least in the US, cares about the effect of the use on the market for that specific work. It's individual ownership, not collective. And while model regurgitation happens, it's less common than you think.
The real harm of AI to artists is market replacement. That is, with everyone using image generators to pop out images like candy, human artists don't have a market to sell into. This isn't even just a matter of "oh boo hoo I can't compete with Mr. Diffusion". Generative AI is very good at creating spam, which has turned every art market and social media platform into a bunch of warring spambots whose output is statistically indistinguishable from a human's.
The problem is, no IP law in the world is going to recognize this as a problem, because IP is a fundamentally capitalist concept. Asserting that the market for new artistic works and notoriety for those works should be the collective property of artists and artists alone is not a workable legal proposal, even if it's a valid moral principle. And conversely the history of copyright has seen it be completely subverted to the point where it only serves the interests of the publishers in the middle, not the creators of the work in question. Hell, the publishers are licking their chops as to how many artists they can fire and replace with AI, as if all their whinging about Napster and KaZaA 24 years ago was just a puff piece.
It isn’t clear to me that these models destroy incentive to create. I mean, ChatGPT can generate comments in my style all day, and yet I’m still incentivized to comment.
I fancy myself a photographer. I still want to take photos even if DALL-E 4 will generate better ones.
What even is the point of creating art? I think there are two purposes: personal expression and enjoyment for others.
People will continue to express themselves even if a bot can produce better art.
And if a bot can produce enjoyment for others en masse, then that seems like a huge win for everybody.
Scale: Many companies (e.g. Google, Bing) have been scraping at scale for decades without issue. Why does scale become an issue when an LLM is thrown into the mix?
Informed consent: I’m not sure I fully understand this point, but I’d say most people posting content on the public internet are generally aware that people and bots might view it. I guess you think it’s different when the data is used for an LLM? But why?
Data usage: Same question as above.
I just don’t see how ingestion into an LLM is fundamentally different than the existing scraping processes that the internet is built on.
There is a whole genre of copyright infringement where someone will scrape a website and create a per-pixel copy of it but loaded up with ads, and blackhat SEOed to show up above the original website on searches. That's bad, and to the extent that LLMs are doing similar things, they are bad too.
Imagine I scrape your elaborate GameFAQs walkthrough of A Link to the Past. I could 1) use what I learn to direct curious people to its URL, or 2) remove your name from it, cut it into pieces, and rehost the content on my own page, mashed up with other walkthroughs of the same game. Then I sell this service as a revolutionary breakthrough that will free people from relying on carefully poring through GameFAQs walkthroughs ever again.
People will get mad about the second one, and to the extent what LLMs do is like that, will get mad at LLMs.
Not something big, not something you can enforce, but you'd feel very annoyed that I'm making good money on something you wrote while you get nothing. I think?
If a human reads it, that would be a reproduction of the work, but if you serve that page as a cache to a human, you're usually okay.
If you compile all that information in a database and use it to answer search queries that's also okay, and nothing forbids you from using machine learning on that data to better answer those search queries.
Both of the above are actually being challenged right now but for the time being they're fine.
But that database is a derivative work, in that it contains copyrighted material, and so how you use it matters if you want to avoid infringement — for example, a Google employee SSHing to a server to read NYT articles isn't kosher.
What isn't clear is whether the model is a derivative work. Does it contain the information, or is it new information created from the training data? Sure, if you're clever you could probably encode information in the weights and use it as a fancy zip file, but that's a matter of intent. If you use Rewind or Windows Recall and it captures a screenshot of a NYT article and then displays it back to you later, is that a reproduction? Surely not. And that's an autonomous system that stores copyrighted data and regurgitates it verbatim.
So if it's impractical to actually use it for piracy and it very obviously isn't anyone's intent for it to be used as such then I think it's hard to argue it shouldn't be allowed, even on data that was acquired through back channels.
But copyright is more political than logical so who knows what the legal landscape will be in 5 years, especially when AI companies have every incentive to use their lawyers to pull the ladder up behind them.
AI is a unique third case in which we have billions of creators and no idea who contributed what parts of the model or any specific outputs. So we can't pay in exposure, aside from a brutally long list of unwilling data subjects that will never be read by anyone. Some of the training data is being regurgitated unmodified and needs to be attributed in full, some of it is just informing a general understanding of grammar and is probably being used under fair use, and yet more might not even wind up having any appreciable effect on the model weights.
None of this matters because nobody actually agreed to be paid in exposure, nor was it ever in any AI company's intent - including Apple - to pay in exposure. Data is free purely because it would be extraordinarily inconvenient if anyone in this space had to pay.
And, for the record, this applies far wider than just image or text generators. Apple is almost surely not the worst offender in the space. For example: all that facial recognition tech your local law enforcement uses? That was trained on your Facebook photos.