Feature extraction is literally a form of lossy compression. You can prod DALL-E into making obvious copies of some of the works it was trained on, but even seemingly novel images could contain enough similarity to training material to be problematic.
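As a toy sketch of the "lossy compression" framing (my own illustration, not a claim about how diffusion models actually work): extract a smaller "feature vector" from some pixel values by 2x average pooling, then try to reconstruct. The gist survives, the detail doesn't.

```python
# Toy illustration: "feature extraction" as lossy compression.

def extract(pixels):
    # Keep only the mean of each pair of pixels (half the data).
    return [(a + b) / 2 for a, b in zip(pixels[::2], pixels[1::2])]

def reconstruct(features):
    # Best we can do: repeat each feature; within-pair detail is gone.
    return [f for f in features for _ in range(2)]

original = [10, 12, 200, 198, 5, 9, 100, 104]
features = extract(original)      # [11.0, 199.0, 7.0, 102.0]
restored = reconstruct(features)  # gist preserved, detail lost
```

The reconstruction can never recover the original exactly, but it can land close enough to it to matter, which is the whole tension.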
The standard isn’t “I think that looks like an Andy Warhol picture”, it is “That is substantially similar to a specific Andy Warhol piece”. Copyright doesn’t protect style.
> Feature extraction is literally a form of lossy compression.
This is one way to think of neural nets; another is that they learn the topology of the space of pictures.
But these are just models of computation, and they aren’t especially relevant, in the same way that it isn’t relevant what process produces an infringing image, only that one is produced.
Which brings me back to my original point: there are two different barriers for generative AI: is the model itself transformative, and is the primary purpose of the model to generate copyright-infringing material.
With respect to the first… I have no idea how someone could argue that the model itself isn’t transformative enough. It isn’t “substantially similar” to any of the works that it is trained on. It might be able to generate things that are “substantially similar”, but the model itself isn’t… it’s just a bunch of numbers and code.
Regarding the second: I have less experience with image models, but I use ChatGPT regularly without even trying to violate copyright, and I don’t think I’m alone, so I doubt you could argue that LLMs have a primary purpose of committing copyright infringement.
That really doesn’t fly legally because any digital format is ‘just’ numbers.
But I think the greater point still stands. In order to call the model itself a copyright violation you would have to say it is "substantially similar" to thousands of original works. Then in order to make it illegal you would have to say it wasn't "transformative enough" to be considered fair use.
I can't come up with an argument for either one of those points that holds any water at all.
The music industry has been going through exactly this for the last few years, and the courts have recognized that the creative process necessarily involves copying and that a small amount of copying is not infringement.
Critically, it’s not just a question of what percentage of a work is a copy of the original but of how much of the original work was copied. I.e. copying 3 lines from a book is a tiny fraction of the book, but if you copied half of a poem it’s well past the de minimis threshold.
Similarly, only a small percentage of a giant library of MP3s comes from any one work, but that’s not relevant.
Copilot is taking things like “reverse a string” or “escape HTML tags”, which have very little originality to start with. This kind of common language is analogous to the musical motifs that have also been found to be under the threshold.
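To make that concrete, here are the two snippets named above as almost any programmer (or Copilot) would write them in Python (my choice of language for illustration); there are only so many reasonable ways to express either one.

```python
def reverse_string(s: str) -> str:
    # Slicing with a step of -1 is the idiomatic Python reversal.
    return s[::-1]

def escape_html(text: str) -> str:
    # Replace the characters with special meaning in HTML.
    # (The stdlib's html.escape does essentially the same thing.)
    return (
        text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace('"', "&quot;")
    )

print(reverse_string("abc"))        # → "cba"
print(escape_html('<a href="x">'))  # → "&lt;a href=&quot;x&quot;&gt;"
```

When the space of plausible implementations is this small, independent creation and copying are nearly indistinguishable, which is exactly why such snippets sit below the originality threshold.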
https://www.heswithjesus.com/tech/exploringai/index.html
I’ve also seen GPT spit out, word for word, proprietary content that isn’t licensed for commercial use as far as I’m aware. They probably got it from web crawling without checking licenses.
What I want more than anything in this space right now are two models: one trained on all public-domain books (e.g. Gutenberg), and one trained on permissively licensed code in at least Python, JavaScript, HTML/CSS, C, C++, and assembly. One layered on the other, but released individually. We could keep using them to generate everything from synthetic data to code to revenue-producing deliverables, all with nearly zero legal risk.
So either we carve out an explicit exception that machines aren’t allowed to do things that are remarkably similar to what humans do… which would be a massive setback for AI in the US.
Or we agree that generative models are subject to the same rules that humans are — they can’t commit copyright infringement, but are able to appropriately consume copyrighted material that a human would be able to consume.
The second option seems to me to be much simpler, nicer, and more appropriate than the first.