To be specific, the FAQ states: "It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub."
Some have raised concerns that Copilot violates at least the spirit of many open source licenses, laundering otherwise unusable code by sprinkling magic AI dust... most likely leaving the Copilot user responsible for copyright infringement.
It is possible to provide Copilot with a sequence of inputs that reproduces some of its copyrighted training input. Let's say you want to help people violate copyright, so you, as a third party, distribute a script that provides that sequence of inputs. Who's violating the copyright there?
Alternatively -- it is apparently legal to produce a clean-room implementation that duplicates a copyrighted implementation. Suppose you were to use a tool like Copilot, which has just been trained on that copyrighted implementation. Is your room still clean? You might even be able to get it to spit out identical functions!
Or, if you have an ML model that has been trained on leaked closed-source code, and it is sufficiently over-fitted that it will reproduce the source given just the filename or the original binary, who is violating copyright when this tool is used? If it is just the end user, then this seems like a really convenient way to launder leaked closed-source code.
If I induce you to break a contract with someone else they can come after me for damages.
For example, in this case, there are developers who have created GPL code. That code was licensed to some other developers. GitHub then encouraged people to upload git copies of the GPL code to GitHub, where it was put into the model. That model contains the copyrighted material and doesn't come with the required notices. The output of the model can be code that is a direct stand-in for the copyrighted work. Thus GitHub has become a party to breaking the license even though it never agreed to the GPL.
In addition, GitHub is encouraging other developers (by advertising the tool and making it broadly available) to copy that code and use it in their projects. Again, that's encouraging an action that breaks a contract. GitHub is well aware that this is likely happening and continues on. Thus it might be liable. You also might be liable.
All of these things can and likely will be argued before courts but it's not at all one sided.
> That's been established in court with regards to training AIs.
What are you basing the certainty of this statement on? The case law I have seen around this is pretty spotty. Cases around training on copyrighted materials have predominantly been about the input, not the output, with the final output usually being controlled by the model owner. For example, Google legally obtained the books it scanned and then used them to produce Google Books' index. There are some major differences:
- The books were purchased, meaning Google got a license to use them. There is certainly code in the model that GitHub does not legally have the right to use, and they are aware of this, making the input side shakier for GitHub.
- GitHub is making a direct profit off of this service. It's a revenue-generating enterprise. That's important, since it raises the bar of what they can be expected to do.
Nothing has gone to the Supreme Court yet; it's all per circuit and not settled case law. Also, this gets WAAAAY more complex once we start talking about jurisdictions outside the US, where it isn't decided at all.
These things are complex, and you'll likely need your lawyer to advise you on any real questions.
Been a hell of a decade, hasn't it.
The wisdom of crowds works best when:
1. participants are independent (otherwise you may get failure modes, such as "groupthink" or "information cascades")
2. participants are informed, but in different ways, with different opinions
3. there is a clear, accepted aggregation mechanism, where individual errors "cancel out" to some degree
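The error-cancellation claim in point 3 can be seen in a toy simulation (all numbers here are hypothetical, and simple averaging stands in for the aggregation mechanism): each participant guesses the true value plus independent noise, and the crowd's average gets closer to the truth as more independent participants join.

```python
import random

random.seed(42)

TRUE_VALUE = 100.0  # hypothetical quantity the crowd is estimating

def mean_abs_error(n_participants, trials=200, noise=30.0):
    """Average error of the crowd's mean guess over many simulated crowds."""
    total = 0.0
    for _ in range(trials):
        # Each participant guesses independently: truth plus private noise.
        guesses = [TRUE_VALUE + random.uniform(-noise, noise)
                   for _ in range(n_participants)]
        # Aggregation mechanism: a plain average, so individual errors cancel.
        total += abs(sum(guesses) / n_participants - TRUE_VALUE)
    return total / trials

for n in (1, 10, 100):
    print(f"{n:3d} participants -> mean |error| = {mean_abs_error(n):.2f}")
```

Note that this only works because the noise terms are independent; if participants copy each other (an information cascade), the errors correlate and averaging stops helping.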
I view the topics in James Surowiecki's book (or the Wikipedia summary of it, at least) as required thinking for everyone, preferably synthesized with a study of statistics and political economy.
In particular, the Wikipedia article's section on "Five elements required to form a wise crowd" is a slightly different slicing of the required elements that I offer above.
* If you read that section, you'll see that trust is listed. I, however, don't see trust as a necessary condition for a "wise crowd". Trust is often useful (or even necessary) when a collective decision is used for governance, decision-making, and policy.
AI is just recomposition of existing snippets of code, art, text, music, etc. Does an AI's output fall under fair use? What happens when an AI produces something too similar to an existing work or trademark? I know the computer won't get sued; the owner/user will. But still, it's a hard problem.
Even if Copilot was initialized with snippets from Open Source Software (exclusively), it doesn't mean that copyright infringement isn't a concern.
It's not random recomposition, which is worthless. It's useful recomposition, adapted to the request and context. It adds something of its own to the mix.