That said, we used copyright traps at Malwarebytes, which is how we found out that IObit was stealing our database.
That said, let's say there's a new model that explicitly excluded closed-source and copyleft-licensed code. Well, the MIT, MPL, Apache, and BSD licenses all say you can't strip their licensing off.
Okay, to get to the spirit of your question, let's say GitHub managed to train a model using only their own code, or code that was explicitly placed in the public domain. If GitHub then reproduced code that wasn't in the training set, it can't be accused of copying it. At that point the argument could be made that it independently created it.
At the same time, algorithms can't be copyrighted, but implementations of an algorithm can be. So if GitHub was basically spitting out an algorithm that just happened to be implemented similarly to some code it wasn't trained on, then I would say there was no copyright violation.
If the comment is something like
//check fromIndex is greater than toIndex
then that is no more individualistic than the function itself. (Sadly, many people comment like this.) On the other hand, if it reproduced a comment with typos, or something more specific like
/* this hack is because Firefox's implementation of SVG z-indexing does not match how Chrome or Safari does it - please read this article ...url...*/
then yeah, you would have something.
Consider a junior dev who writes a range-check function while working for a company (so the company owns the copyright), then moves to a different company and writes the same range-check function, because that's just how he writes code.
Has copyright been infringed?
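To make the scenario concrete, here's a hypothetical sketch of the kind of range-check function in question (the names and bounds convention are invented for illustration). Two developers asked to write this would likely produce nearly identical code, simply because there are so few reasonable ways to express it:

```python
def in_range(value, from_index, to_index):
    """Return True if value lies within the half-open range [from_index, to_index)."""
    # check fromIndex is greater than toIndex
    if from_index > to_index:
        raise ValueError("fromIndex must not exceed toIndex")
    return from_index <= value < to_index
```

The near-inevitability of this shape is exactly what makes it a weak basis for an infringement claim on its own.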
The legalities can be argued either way, but an individual is in any case not remotely comparable to a service like Copilot.
Why is this? Copilot is, in some ways, an automated way to search code and Stack Overflow. There is a very annoying website that does nothing more than show relevant code samples for various Google search terms.
If the manual version of something is okay (e.g. googling for code, finding it, and adapting it to a new but similar purpose), why would an automated version of that be any different?
Practically speaking, with an LLM the programmer can focus on the creative parts (handler functions, React components, etc.) while the LLM generates the necessary boilerplate for ever-changing frameworks and infrastructure configurations. The programmer (and QA) would still review and test everything, but would save time writing boilerplate and ship features faster.
GPT-style models are literally trained to reproduce their input, token by token.
The _only_ escape clause is some fuzzy function that scores how arbitrary, or nontrivial, a code block is.
A person or AI can absolutely be violating copyright via your example.
yes
Now, if he had written a specification for what the function should do, then passed it to someone else who had never seen the function and worked only from the spec, he'd be okay.
see: IBM BIOS
It's not nearly that simple. No real copyright case is going to hinge on what a single range check function looks like.
This is human law, it's not a programming situation where you can just apply some simple rule and get a deterministic answer. Context plays a huge part, among other things.
On a more serious note, there is a question of whether algorithms and code blocks can be copyrighted, or whether it is the _software_ that is copyrighted. Let's say I use WebSockets and you crib my usage of WebSockets for your own application. My opinion is that unless you rebuild the same thing I did, "cribbing" is the long-held art of "let me google how to do that". The artistic creation is the end software product, not some measly embedded function that is boilerplate (form and function) for anything to work.
The "form and function" aspect of copyright law almost certainly means a range-check function is not a copyright infringement.