They probably didn't rigorously track the licensing issue, but I'm pretty sure training an LLM is a completely acceptable use of freely licensed source code. It would be somewhat amusing, though, if Copilot were forced to spit out the license for every piece of code used to develop the derivative work, along with copyright notices and whatever else the licenses may require.
There's a more interesting question about the copyright status of the code it outputs, since the language model is sort of like a compiler, but also not like a compiler since the output is based on other people's copyrighted code.
I feel a lot of people get caught up on the output code and completely ignore the fact that Copilot itself is likely a massive copyright violation.
Copyrighted content can be used without the holder’s permission under “Fair Use”.
Don’t assume all code can be copyrighted. Purely functional expressions are not copyrightable. Code is math.
There’s a lot here to unpack.
Would love to see this done on decompiled proprietary code: train a model on it and release it into the wild.
But the amount of data and computing power needed to do it might not be available to the common person.
Here's a good link explaining the Fair Use test: https://copyright.columbia.edu/basics/fair-use.html
> Held: 2 Live Crew's commercial parody may be a fair use within the meaning of § 107. Pp. 574-594.
The ruling states explicitly that commercial use can be a factor in determining whether a use is fair, but that it does not in and of itself make the use "unfair".
To me, the latter scenario is much closer to what GitHub is doing with Copilot, which is one of the things the plaintiffs are alleging violates open source licenses.
> Additionally, the district court determined that the commercial nature of Google's use weighed against its transformative nature. Although Kelly held that the commercial use of the photographer's images by Arriba's search engine was less exploitative than typical commercial use, and thus weighed only slightly against a finding of fair use, the district court here distinguished Kelly on the ground that some website owners in the AdSense program had infringing Perfect 10 images on their websites. The district court held that because Google's thumbnails "lead users to sites that directly benefit Google's bottom line," the AdSense program increased the commercial nature of Google's use of Perfect 10's images.
> In conducting our case-specific analysis of fair use in light of the purposes of copyright, we must weigh Google's superseding and commercial uses of thumbnail images against Google's significant transformative use, as well as the extent to which Google's search engine promotes the purposes of copyright and serves the interests of the public. Although the district court acknowledged the "truism that search engines such as Google Image Search provide great value to the public," the district court did not expressly consider whether this value outweighed the significance of Google's superseding use or the commercial nature of Google's use. The Supreme Court, however, has directed us to be mindful of the extent to which a use promotes the purposes of copyright and serves the interests of the public.
---
I will also draw attention to:
> The fact that Google incorporates the entire Perfect 10 image into the search engine results does not diminish the transformative nature of Google's use. As the district court correctly noted, we determined in Kelly that even making an exact copy of a work may be transformative so long as the copy serves a different function than the original work.
https://www.zdnet.com/article/linux-developer-abandons-vmwar...
https://www.zdnet.com/article/vmware-sued-for-failure-to-com...
So, if the courts find in Microsoft and OpenAI’s favor (which remains to be seen despite the many armchair lawyers here), your license would mean jack squat.
Microsoft never changes. Always looking for a dishonest buck. Does 'Embrace, Extend, and Extinguish' ring a bell for younger players? Thought not.
> OpenAI, Microsoft want court to toss lawsuit accusing them of abusing open-source code
Seems pretty obvious to me but we'll see how it goes in the court.
One extreme is that AI is allowed to spit out copyrighted code verbatim as long as it technically goes through an AI first. Of course, that defeats all open-source licenses by adding a backdoor around them.
The other extreme is that AI is not allowed to spit out a single line of copyrighted code, in which case we'll have endless lawsuits to figure out if CodeGPT used a GPL-licensed fast inverse square root or if it used the public-domain fast inverse square root.
I think we'll land somewhere in the middle: if an AI regurgitates a "substantial" number of lines of code, then its creators can be held liable (a.k.a. the "we'll know it when we see it" standard).