
makk 3y ago

IANAL but I work alongside them. Here's an argument I've heard.

You can read the data to train a thing. So long as that thing doesn't literally copy the data into itself then the training hasn't violated copyright.

When that thing later generates an output, the output isn't copyrightable because it's machine generated (this is the current US position) and it isn't a copyright violation because it was generated, not copied.

You can launder copyrighted material through an LLM, basically.


8 comments · 5 top-level

contravariant 3y ago

> So long as that thing doesn't literally copy the data into itself then the training hasn't violated copyright.

Good luck proving that hasn't happened. If a language model that can reproduce the code verbatim doesn't count as a copy, then a movie re-encoded into a different format shouldn't count either.

numpad0 3y ago

I think it’s unproven either way. Courts ignore definition wars so long as both parties ignore them too. If you sued me for stealing some cheese, the justice system won’t care whether it actually was cheese, or what cheese actually is, so long as I stick to insisting that I didn’t do anything wrong in response to your accusation.

prepend 3y ago

The movie re-encoded into a different format is perfectly fine if the work is sufficiently different.

Taking a file of The Wolf of Wall Street and encoding it so all the oranges are blue, with no other changes, is bad, as that’s clearly a derived work.

Taking the same file and scrambling it so it doesn’t resemble the original is perfectly fine.

Watching the movie and then making your own version of the same exact plot points is infringement. But using plot points that are changed is perfectly fine.

There’s existing copyright law that prevents the makers of the movie Deep Impact from suing the makers of Armageddon.

belorn 3y ago

A 420p copy of the movie is very different from a 4k version. Similarly, an encrypted copy of the file will intentionally not resemble the original in any way. Both of those distinctions would likely be ignored by a court.

Similarly, a movie that copies the plot points will likely be fine, but a song that copies the notes of another song will not be. Very different cover versions will sometimes be found to be derived works, even when they are as different as Deep Impact is from Armageddon.

madeofpalk 3y ago

Which, by the way, is completely untested in courts.

Courts have decided, after a bunch of case-by-case decisions, that sampling a song constitutes creating a derivative work, and you must obtain a license from the copyright holder to do so.

It is my opinion that training a model copies, and creates a derivative work of, what you used to train it, so you must have a license to train LLMs on content. I am not a lawyer, I am no one, my opinion here is worthless.

We already know that you can create a copy of something without doing a bit-for-bit duplicate because a) copyright law existed before we had bits, and b) transcoding a movie still counts as creating a copy. Recording my own VHS of HBO and selling it is still illegal.

kmeisthax 3y ago

Most generative AI actually does have significant problems with the model copying the data into itself. Not literally - there isn't a bunch of model parameters that line up to the exact PNG bitstream of particular images. But courts wouldn't care as long as the model outputs something that looks "close enough", because the chain of provenance is clearly established from the training set, through gradient descent and the model weights, into the final output.

There's a paper from Google and Princeton about regurgitation happening in Stable Diffusion and Imagen: https://arxiv.org/pdf/2301.13188.pdf

OpenAI also had to spend a bunch of time on deduplicating an insanely large dataset to prevent this from happening in DALL-E: https://openai.com/research/dall-e-2-pre-training-mitigation...

I have no clue how they handled this in GPT-3 or -4. Given the amount of regurgitation found in Copilot, I imagine there are lots of significant code fragments floating about in nominally different projects that a deduplicator wouldn't match as identical.
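To make the deduplication point concrete, here's a rough sketch of why an exact-match deduplicator misses lightly edited copies. This is only an illustration, not how OpenAI or anyone else actually does it: the function names, the 3-token window, and the sample snippets are all made up for the example.

```python
import hashlib
import re


def exact_key(source: str) -> str:
    """Hash of the raw text: what a naive exact-match deduplicator compares."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def shingles(source: str, k: int = 3) -> set:
    """Set of k-token windows; whitespace differences disappear at this step."""
    tokens = re.findall(r"\w+|[^\w\s]", source)
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Overlap of the two shingle sets: 1.0 means identical, 0.0 means disjoint."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


# Hypothetical snippets: the second is the first with different indentation
# and a trailing comment, as you might find vendored into another project.
original = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"
reindented = "def clamp(x, lo, hi):\n  return max(lo, min(x, hi))  # clamp\n"
unrelated = "class Stack:\n    def push(self, item):\n        self.items.append(item)\n"

print(exact_key(original) == exact_key(reindented))  # hashes differ
print(jaccard(original, reindented))                 # high overlap
print(jaccard(original, unrelated))                  # little or no overlap
```

The hash comparison treats the reindented copy as a brand new document, while the shingle overlap still flags it as nearly the same code. Real pipelines would presumably need something that scales better (MinHash or similar), but the gap between "identical" and "the same thing" is the issue either way.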

shagie 3y ago

We've been laundering licenses on code on Stack Overflow and the rest of the SE network (from whatever the original license was to CC, which can then be GPL'ed) for a decade now.

Consider https://softwareengineering.stackexchange.com/questions/2695...

The source code is GPL'ed, but that page is CC BY-SA 3.0.

It's also fairly easy to assume that a fair bit of material that was copied from employers' codebases into SO (and is thus now CC) can be included in GPL code too.

tick_tock_tick 3y ago

> You can launder copyrighted material through an LLM, basically.

Good!
