To be specific, the FAQ states: "It has been trained on natural language text and source code from publicly available sources, including code in public repositories on GitHub."
Some have raised concerns that Copilot violates at least the spirit of many open source licenses, laundering otherwise unusable code by sprinkling magic AI dust... most likely leaving the Copilot user responsible for copyright infringement.
It is possible to provide Copilot with a sequence of inputs that reproduces some of its copyrighted training input. Let's say you want to help people violate copyright, so you, as a third party, distribute a script that provides that sequence of inputs. Who's violating the copyright there?
Alternatively -- it is apparently legal to produce a clean-room implementation that duplicates a copyrighted implementation. Suppose you were to use a tool like Copilot, which has just been trained on that copyrighted implementation. Is your room still clean? You might even be able to get it to spit out identical functions!
Or, if you have an ML model that has been trained on leaked closed-source code, and it is sufficiently over-fitted that it will reproduce the source given just the filename or the original binary, who is violating copyright when this tool is used? If it is just the end user, then this seems like a really convenient way to launder leaked closed-source code.
If I induce you to break a contract with someone else they can come after me for damages.
For example, in this case, there are developers who have created GPL code. That code was licensed to some other developers. GitHub then encouraged people to upload git copies of the GPL code to GitHub, where it was put into the model. That model contains the copyrighted material and doesn't come with the required notices. The output of the model can be code that is a direct stand-in for the copyrighted work. Thus GitHub has become a party to breaking the license even though it never agreed to the GPL.
In addition, GitHub is encouraging other developers (by advertising the tool and making it broadly available) to copy that code and use it in their projects. Again, that's encouraging an action that breaks a contract. GitHub is well aware that this is likely happening and continues on. Thus it might be liable. You also might be liable.
All of these things can and likely will be argued before courts but it's not at all one sided.
> That's been established in court with regards to training AIs.
What are you basing the certainty of this statement on? The case law I have seen around this is pretty spotty. Cases around training on copyrighted materials have predominantly been about the input, not the output, with the final output usually being controlled by the model owner. For example, Google legally obtained the books it scanned and then used them to produce Google Books' index. There are some major differences:
- The books were purchased, meaning Google got a license to use them. There is certainly code in the model that GitHub does not legally have the right to use, and they are aware of this, making the input side shakier for GitHub.
- GitHub is making a direct profit off of this service. It's a revenue-generating enterprise. That's important, since it raises the bar of what they can be expected to do.
Nothing has gone to the Supreme Court yet; it's all per circuit and not settled case law. Also, this gets WAAAAY more complex once we start talking about jurisdictions outside the US, where it isn't decided at all.
These things are complex, and you'll likely need your lawyer to advise you on any real questions.
Been a hell of a decade, hasn't it.
The wisdom of crowds works best when:
1. participants are independent (otherwise you may get failure modes, such as "groupthink" or "information cascades")
2. participants are informed, but in different ways, with different opinions
3. there is a clear, accepted aggregation mechanism, where individual errors "cancel out" to some degree
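The error-cancellation claim in point 3 can be seen in a toy simulation (all numbers here are hypothetical, and simple averaging stands in for the aggregation mechanism): each participant guesses the true value plus independent noise, and the crowd's average gets closer to the truth as more independent participants join.

```python
import random

random.seed(42)

TRUE_VALUE = 100.0  # hypothetical quantity the crowd is estimating

def mean_abs_error(n_participants, trials=200, noise=30.0):
    """Average error of the crowd's mean guess over many simulated crowds."""
    total = 0.0
    for _ in range(trials):
        # Each participant guesses independently: truth plus private noise.
        guesses = [TRUE_VALUE + random.uniform(-noise, noise)
                   for _ in range(n_participants)]
        # Aggregation mechanism: a plain average, so individual errors cancel.
        total += abs(sum(guesses) / n_participants - TRUE_VALUE)
    return total / trials

for n in (1, 10, 100):
    print(f"{n:3d} participants -> mean |error| = {mean_abs_error(n):.2f}")
```

Note that this only works because the noise terms are independent; if participants copy each other (an information cascade), the errors correlate and averaging stops helping.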
I view the topics in James Surowiecki's book (or the Wikipedia summary of it, at least) as required thinking for everyone, preferably synthesized with a study of statistics and political economy.
In particular, the Wikipedia article's section on "Five elements required to form a wise crowd" is a slightly different slicing of the required elements that I offer above.
* If you read that section, you'll see that trust is listed. I, however, don't see trust as a necessary condition for a "wise crowd". Trust is often useful (or even necessary) when a collective decision is used for governance, decision-making, and policy.
AI is just recomposition of existing snippets of code, art, text, music, etc. Does an AI's output fall under fair use? What happens when an AI produces something too similar to an existing work or trademark? I know the computer won't get sued; the owner/user will. But still, it's a hard problem.
Even if Copilot was initialized with snippets from Open Source Software (exclusively), it doesn't mean that copyright infringement isn't a concern.
It's not random recomposition, which is worthless. It's useful recomposition, adapted to the request and context. It adds something of its own to the mix.