Better HN

0 points · visarga · 2y ago · 0 comments

> Just learn to recognize and punish plagiarism via RLHF.

This is not an RLHF problem. What I was expecting them to do is keep a Bloom filter of n-grams for known copyrighted content, for example by enumerating every run of n=7 consecutive words in an article, and validate output against it. The model would then emit at most n-1 consecutive words verbatim from the source.
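A minimal sketch of that idea, not anything OpenAI is known to run. The `BloomFilter` class, the `index_article` and `flagged_ngrams` helpers, the lowercase/whitespace tokenization, and the filter size are all assumptions made for illustration:

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive all positions from two 64-bit slices of SHA-256.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def ngrams(text: str, n: int = 7) -> list[str]:
    # Naive tokenization (lowercase, split on whitespace); a real system would normalize more.
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def index_article(bloom: BloomFilter, article: str, n: int = 7) -> None:
    for gram in ngrams(article, n):
        bloom.add(gram)


def flagged_ngrams(bloom: BloomFilter, output: str, n: int = 7) -> list[str]:
    # Every n-gram of the model output that (probably) appears in an indexed article.
    return [gram for gram in ngrams(output, n) if gram in bloom]
```

A decoder could refuse or rephrase whenever `flagged_ngrams` is non-empty, which caps verbatim runs at n-1 words, subject to the filter's false positive rate.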

But this will blow up in their face. Let's see:

- AI companies will start investing much more in content attribution

- The new content attribution tools will be applied to all human-written articles as well, because anyone could be using GPT in secret

- Then people will start seeing a chilling effect on creativity

- We must also check NYT against all the other sources; not everything they write is original

11 comments · 3 top-level

groceryheist · 2y ago · 6 in thread

Maybe the bloom filter solution is enough, but I wonder.

- Paraphrasing n=7 words (and quite a few more) within a sentence can easily be fair use.

- As n gets big, the bloom filter has to also.

If/when attribution is solved for LLMs (and not fake attribution like from Bing or Perplexity) then creators can be compensated when their works are used in AI outputs. If compensation is high enough this can greatly incentivize creativity, perhaps to the point of realizing "free culture" visions from the late 90s.

visarga (OP) · 2y ago

As n-gram length grows, we still have about the same number of n-grams; they go through a hash function and are indexed in the Bloom filter as usual. The number of n-grams of size n in a text is text_length - ngram_length + 1.
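A quick way to check the count, as a sketch. The helper names `count_ngrams` and `ngrams` are made up for this example, and text length is measured in words:

```python
def count_ngrams(text_length: int, ngram_length: int) -> int:
    # Number of sliding windows of size ngram_length over text_length words.
    return max(text_length - ngram_length + 1, 0)


def ngrams(words: list[str], n: int) -> list[tuple[str, ...]]:
    # Enumerate the windows explicitly so the formula can be compared to a real count.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "a b c d e f g h i j".split()  # 10 words
for n in (3, 7, 11):
    assert len(ngrams(words, n)) == count_ngrams(len(words), n)
```

So for a fixed text the number of windows shrinks slightly as n grows; it is the space of possible n-grams that gets larger, not the number actually inserted.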

groceryheist · 2y ago

The number of unique values in the bloom filter will go up ~exponentially with n. So to control the false positive rate the bloom filter has to grow.
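For reference, the textbook sizing formulas for a Bloom filter depend on the number of distinct items inserted and the target false positive rate, not directly on n. The sketch below just evaluates those standard formulas; the function names are invented for this example, and it does not settle how many distinct n-grams a given corpus really contains:

```python
import math


def bloom_size_bits(num_items: int, fp_rate: float) -> int:
    # Standard result: m = -n_items * ln(p) / (ln 2)^2 bits for false positive rate p.
    return math.ceil(-num_items * math.log(fp_rate) / (math.log(2) ** 2))


def optimal_hashes(num_bits: int, num_items: int) -> int:
    # Standard result: k = (m / n_items) * ln 2 hash functions, rounded to an integer.
    return max(1, round(num_bits / num_items * math.log(2)))


bits = bloom_size_bits(1_000_000, 0.01)
print(bits / 8 / 1024 / 1024, "MiB,", optimal_hashes(bits, 1_000_000), "hashes")
```

The required size is linear in the number of distinct items, so the filter has to grow only to the extent that a larger n produces more distinct n-grams in the indexed corpus.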

geysersam · 2y ago

> if compensation is high enough

Who pays the compensation? If it's the user, why wouldn't they just buy the author's work directly? Why go through the LLM middleman?

starttoaster · 2y ago

> If it's the user, why wouldn't they just buy the author's work directly? Why go through the LLM middleman?

If it's the user, why wouldn't they just buy the DVDs directly? Why go through the Netflix middleman?

A retort to this would be that both NYT and ChatGPT are on the internet, so it's no added fuss of hopping in my car, driving to Walmart, and picking up a DVD case. My response to it would be that both the LLM and Netflix are content aggregators to the user. I can read the NYT, or I can read the NYT summary on ChatGPT and ask it for life advice with my pet hamster, or ask it how to reverse a linked list in bash.

groceryheist · 2y ago

The LLM users/middlemen pay. The user probably pays less than they would have to pay the author. The LLM provides information retrieval / discovery.

sideshowb · 2y ago

I like the idea, but it seems like there would be big problems, like detecting whether a work has been reworded. Or what if a large number of sources have each slightly influenced a small response? Isn't that pretty much considered new knowledge?

Then there's the issue that however you credit attribution, it creates a game of enshittified content creation with the aim of being attributed as often as possible, regardless of whether the content really offered anything that wasn't out there already.

mike_hearn · 2y ago · 2 in thread

I think it is an RLHF problem and that you are right - this will blow up in the faces of the NYT.

Specifically, the NYT examples all seem to be cases where they asked the AI to repeat their articles verbatim? So they ask it to violate copyright and because it's a helpful bot with a good memory, it does so.

Solution: teach the model to refuse requests to repeat articles verbatim. It's easily capable of recognizing when it's being asked to do that. And that's exactly what OpenAI have now done.

So the direct problem the NYT is complaining about - a paywall bypass - is already rectified. Now it would seem to me like the case is quite weak. They could demand OpenAI pay them damages for the time ChatGPT wasn't refusing, but wouldn't they have to prove damages actually happened? It seems unlikely many people used ChatGPT as a paywall bypass for the NYT specifically in the past year. It only knows old articles. OpenAI could be ordered to search their logs for cases where this happened, for example, and then the NYT could be ordered to show their working for the value of displaying a single old article to a non-subscriber, and from that damages could be computed. But it wouldn't be a lot.

That's presumably why the case goes further and argues that OpenAI is in violation even when it isn't repeating text verbatim. That's the only way the NYT can get any significant money out of this situation.

But this case seems much weaker to me. Beyond all the obvious human analogies, there is precedent in the case of search engines where they crawl - and the NYT let them crawl - specifically to enable the creation of a derived data structure. Search engine indexes are understood to be fair use, and they actually do repeat parts of the page verbatim in their snippets. Google once even showed cached versions of whole pages. And browser makers all allow extensions in their stores that strip ads and bypass paywalls, and the NYT hasn't sued them over that either.

cycomanic · 2y ago

This is not how copyright works, though. The articles are quoted verbatim because, when people first raised these questions, the argument was that the NN doesn't really contain the training data, or contains it only in an abstract, condensed way that does not constitute copying the content.

This demonstrates that no, the NN actually does contain the full articles, copied into the NN. Do you think any normal person would get away with copying MS Windows by, e.g., zipping it together with some other OS on the same medium? Why should we let OpenAI get away with this?

mike_hearn · 2y ago

Search indexes contain exact copies of the pages they index, and that isn't a copyright violation.

> Why should we let OpenAI get away with this?

IP rights, like other private property rights, are a compromise between creators and consumers. What "should" be the case is essentially an argument about what balance creates the best overall outcomes. LLMs, for now, require large amounts of text to train, so the question is one of whether we want LLMs to exist or not. That's really a question for Congress and not the courts, but it'll be decided in the courts first.

dyno12345 · 2y ago

https://en.wikipedia.org/wiki/W-shingling
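For anyone who doesn't want to click through, a rough sketch of the w-shingling idea: compare two documents by the overlap of their sets of w consecutive words. The function names, the whitespace tokenization, and the choice to return 0.0 when neither text has any shingles are assumptions for this example:

```python
def shingles(text: str, w: int = 4) -> set[tuple[str, ...]]:
    # The set of unique runs of w consecutive words (duplicates collapse).
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a: str, b: str, w: int = 4) -> float:
    # Jaccard similarity of the two shingle sets: |A & B| / |A | B|.
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 0.0  # convention chosen here for texts shorter than w words
    return len(sa & sb) / len(sa | sb)


print(len(shingles("a rose is a rose is a rose")))  # 3 unique 4-shingles
```

Unlike an exact n-gram filter, this gives a graded score, which may be closer to what you want for detecting near-copies and light rewording.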
