Feature extraction is literally a form of lossy compression. You can prod DALL-E into making obvious copies of some of the works it was trained on, but even seemingly novel images could contain enough similarity to training material to be problematic.
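As a toy sketch of the "lossy compression" framing (my own illustration, not a claim about how diffusion models actually work): extract a smaller "feature vector" from some pixel values by 2x average pooling, then try to reconstruct. The gist survives, the detail doesn't.

```python
# Toy illustration: "feature extraction" as lossy compression.

def extract(pixels):
    # Keep only the mean of each pair of pixels (half the data).
    return [(a + b) / 2 for a, b in zip(pixels[::2], pixels[1::2])]

def reconstruct(features):
    # Best we can do: repeat each feature; within-pair detail is gone.
    return [f for f in features for _ in range(2)]

original = [10, 12, 200, 198, 5, 9, 100, 104]
features = extract(original)      # [11.0, 199.0, 7.0, 102.0]
restored = reconstruct(features)  # gist preserved, detail lost
```

The reconstruction can never recover the original exactly, but it can land close enough to it to matter, which is the whole tension.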
The standard isn’t “I think that looks like an Andy Warhol picture”, it is “That is substantially similar to a specific Andy Warhol piece”. Copyright doesn’t protect style.
> Feature extraction is literally a form of lossy compression.
This is one way to think of neural nets; another is that they learn the topology of the space of pictures.
But these are just models of computation, and they aren’t especially relevant, in the same way that it isn’t relevant what process produces an infringing image, only that one is produced.
Which brings me back to my original point: there are two different barriers for generative AI: is the model itself transformative, and is the primary purpose of the model to generate copyright-infringing material.
With respect to the first… I have no idea how someone could argue that the model itself isn’t transformative enough. It isn’t “substantially similar” to any of the works that it is trained on. It might be able to generate things that are “substantially similar”, but the model itself isn’t… it’s just a bunch of numbers and code.
Regarding the second: I have less experience with image models, but I use ChatGPT regularly without even trying to violate copyright, and I don’t think I’m alone, so I doubt you could argue that LLMs have a primary purpose of committing copyright infringement.
That really doesn’t fly legally because any digital format is ‘just’ numbers.
But I think the greater point still stands. In order to call the model itself a copyright violation you would have to say it is "substantially similar" to thousands of original works. Then in order to make it illegal you would have to say it wasn't "transformative enough" to be considered fair use.
I can't come up with an argument for either one of those points that holds any water at all.
The music industry has been going through exactly this for the last few years, and the courts have recognized that the creative process necessarily involves copying and that a small amount of copying is not infringement.
Critically, it’s not just a question of what percentage of a work is a copy of the original but of how much of the original work was copied. I.e. copying 3 lines from a book is a tiny fraction of the book, but if you copied half of a poem it’s well past the de minimis threshold.
Similarly, only a small percentage of a giant library of MP3s comes from any one work, but that’s not relevant.
Copilot is taking things like “reverse a string” or “escape HTML tags”, which have very little originality to start with. This kind of common language is analogous to the musical motifs that have also been found to be under the threshold.
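To make that concrete, here are the two snippets named above as almost any programmer (or Copilot) would write them in Python (my choice of language for illustration); there are only so many reasonable ways to express either one.

```python
def reverse_string(s: str) -> str:
    # Slicing with a step of -1 is the idiomatic Python reversal.
    return s[::-1]

def escape_html(text: str) -> str:
    # Replace the characters with special meaning in HTML.
    # (The stdlib's html.escape does essentially the same thing.)
    return (
        text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace('"', "&quot;")
    )

print(reverse_string("abc"))        # → "cba"
print(escape_html('<a href="x">'))  # → "&lt;a href=&quot;x&quot;&gt;"
```

When the space of plausible implementations is this small, independent creation and copying are nearly indistinguishable, which is exactly why such snippets sit below the originality threshold.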
https://www.heswithjesus.com/tech/exploringai/index.html
I’ve also seen GPT spit out, word for word, proprietary content that isn’t licensed for commercial use as far as I’m aware. They probably got it from web crawling without checking licenses.
What I want more than anything in this space right now are two models: one trained on all public-domain books (e.g. Gutenberg), and one trained on permissively licensed code in at least Python, JavaScript, HTML/CSS, C, C++, and assembly. One layered on the other, but released individually. We could keep using them to generate everything from synthetic data to code to revenue-producing deliverables, all with nearly zero legal risk.
So either we carve out an explicit exception that machines aren’t allowed to do things that are remarkably similar to what humans do… which would be a massive setback for AI in the US.
Or we agree that generative models are subject to the same rules that humans are — they can’t commit copyright infringement, but are able to appropriately consume copyrighted material that a human would be able to consume.
The second option seems to me to be much simpler, nicer, and more appropriate than the first.