It's a bit like how GPT-3, Stable Diffusion, and other generative models are trained on extensive amounts of copyrighted material to get as good as they are.
In those cases, however, the output space is so vast that verbatim reproduction of any one training example is very unlikely.
With code, not so much.