mattbee · 2y ago · 0 points

I don't know what a fair settlement would be, but I'm looking forward to a copyright holder suing OpenAI to obtain one. These companies have no value if copyright can be enforced on their training data.

3 comments · 1 top-level

visarga · 2y ago

I think there are ways around it. The simplest would be to generate replacement data, for example by paraphrasing the original, summarising it, or turning it into question-answer pairs. In this new format it can serve as training data for a clean LLM. Public domain data would of course be used directly; there's no need to go synthetic there.

Another important direction would be to train copyright attribution models, and diff-models that detect, by direct comparison, when one work infringes on another. They would be useful for filtering both the training set and the model outputs.
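To make the "direct comparison" filter concrete, here is a minimal sketch. It is not a trained diff-model, just a crude word n-gram overlap check standing in for one, and it says nothing about what would legally count as infringement. The function names, the 5-gram size, and the 0.2 threshold are all assumptions picked for illustration.

```python
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(candidate: str, reference: str, n: int = 5) -> float:
    """Fraction of the candidate's n-grams that also appear in the reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0  # candidate shorter than n words: nothing to compare
    return len(cand & ngrams(reference, n)) / len(cand)


def filter_texts(candidates, references, threshold=0.2, n=5):
    """Keep only candidates whose overlap with every reference is below `threshold`.

    The threshold is an arbitrary illustrative value, not a legal standard.
    """
    return [
        c for c in candidates
        if all(overlap(c, r, n) < threshold for r in references)
    ]


reference = "the quick brown fox jumps over the lazy dog"
verbatim = "the quick brown fox jumps over the lazy dog"
paraphrase = "a fast auburn fox leaps above a sleepy hound"

kept = filter_texts([verbatim, paraphrase], [reference])
```

The same check could be run over training samples or over model outputs. It only catches near-verbatim copying, so a paraphrase passes straight through it, and whether that paraphrase is still a derivative work is a question this kind of filter cannot answer.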

mattbee (OP) · 2y ago

Would automated paraphrasing not be a derivative work of the original?

visarga · 2y ago

So you think any paraphrase of a copyrighted phrase is a copyright violation? That's like owning the idea itself. Is any utterance similar to this one now forbidden?
