Any time someone invents something new and incredible, there's always a crowd of negative nancies eager to discredit and explain why the invention is nothing new and a detriment to society.
I don't understand why someone would willingly share their code on github where it is publicly available just to complain when others make use of that knowledge.
'co-pilot just sells code other people wrote' is such a ridiculous understatement of what co-pilot does. Instead of marvelling at the human ingenuity that went into creating it, they sneer at the audacity of openAI to do something without first asking their permission.
Just because you’ve made something cool doesn’t give you the right to harm others in the process.
If MS or OpenAI don’t think this is the case then they should have also included their private repositories.
Something being cool doesn't exempt it from discussion of its ethics and certainly doesn't exempt it from legal consequences. What people call "disruption" is often just exploiting resources/people/their work in unsustainable ways until oversight is introduced.
If CoPilot is copy/pasting large amounts of code with unknown licenses, that is a large and real risk for users, aside from violating open source projects' licenses.
Because they shared the code under a license, and they have the right to complain if people use that code but don't follow the license.
For example, what happens if Github Copilot spits a copy of some copyrighted code verbatim? Is laundering open source code through a machine learning model a loophole for not having to follow the license?
Often following the license is as simple as giving credit to the original author.
People don’t have a problem that AI is being used in some form to provide the service.
The complaint is pretty clearly that code is being lifted from repositories without attribution or compensation, and being redistributed into other applications.
How impressive the work behind copilot is or is not really isn’t relevant.
No, I won't, because Disney has fancy lawyers, the average open source developer hasn't. What you are saying is: Screw little people, let M$ make their money.
Either copyright is for everyone, or for no one. I prefer the latter, but this is not the world we live in.
That’s all I ask — if you use my code, give me credit.
Stealing my code to train your bot — which will replicate portions verbatim! — is no different whatsoever than the casual plagiarist that copies and pastes a novel snippet manually.
It's absolutely my legal and ethical prerogative to complain about people stealing my code by failing to respect the license under which it was freely provided.
Reselling other people's content like this without attribution (which, is a pretty mild form of payment) is not nice. But at least you now have one more reason in the list of reasons why Microsoft acquired Github: to be able to launder their open source contributions and resell them.
There are real, serious, and genuinely interesting issues to be discussed regarding Copilot. It is neither "just selling code that other people wrote", nor is it something that we should applaud merely because it demonstrates "human ingenuity".
The comments here regarding this are honestly a total dumpster fire. It's mostly a bunch of paper-thin hot takes, either:
- The blatantly stupid "you willingly shared your code so why are you complaining that one of the world's biggest companies is now hoovering up code from your carefully-selected open-source license and reselling it as a service!!!"
- The blatantly lying "I have literally never looked at any other computer software while developing any, so obviously anybody who has ever seen other source code is a plagiarist"
It's dumb because there is an actual interesting discussion here but I guess we're not going to bother having it.
I put a fucking license on it so that it doesn't get abused by some fucking corporation. Jesus Christ, it's not hard to understand.
I don't make Free software so that Microsoft can sell it to people for use in proprietary projects.
But the world we actually live in is one where corporations have copyright, and individuals don't.
That's what irks people, I think rightly.
People like you should understand that publicly available code doesn't mean "do whatever you want" code.
The majority of publicly available code hosted on Github has a license that tells you what you can and what you cannot do with that code.
If someone uses this code without respecting the license, authors have the right to complain and even legally enforce the license if they want.
Now, you should know that there's nothing "cool" about taking other people's work without permission.
That's likely the crux of the issue. If you do it right, you can steal from other people and get rich. Meanwhile, those same people (whose work was stolen) may be left out in the cold no matter how original, creative, hardworking etc they are.
because it's not unconditional, there are often licence terms of usage, and copilot is potentially laundering those.
For other individuals to collaborate, to make the software available to other people, etc. Certainly not for github's profit and much less for the benefit of github's customers who will have access to open code that violates license agreements.
It’s a product by a business. Why is that not open to criticism?
If I, as a human, go to a public repository on Github and copy/paste a non-trivial 200 line code snippet into my proprietary code base I have to abide by the license of that original code, even if I slightly modify it. I don't see how this cannot be true for Copilot. I'm sure the legal folks at Github have thought of a response though; you could e.g. argue that the snippets produced by Copilot are not affected by the copyright of the original author as they do not reach the required threshold of originality. Seems rather shaky to me though.
It is not true. Whenever there is something really useful, everybody is happy, and while of course there are always some naysayers, they're very few.
However, when you do something controversial, you can expect to hear criticism. You are of course free to dismiss that criticism, but when a lot of people are telling you what you are doing is unethical, maybe it's time to stop and think about it.
One big reason I support it is because it grants me the right and ability to change things I need/want to change.
Moreover it is possible to BOTH marvel at the human ingenuity that went into making copilot AND disagree with their methods. Some things can be marvelous and wrong at the same time.
Sorry for the unproductive tone of this comment, but there's something about the attitude of this tweet that really grinds my gears. Any time someone invents something new and incredible, there's always a crowd of negative nancies eager to discredit and explain why the invention is nothing new and a detriment to society. I don't understand why someone would willingly share their code on github where it is publicly available just to complain when others make use of that knowledge. 'co-pilot just sells code other people wrote' is such a ridiculous understatement of what co-pilot does. Instead of marvelling at the human ingenuity that went into creating it, they sneer at the audacity of openAI to do something without first asking their permission.
— This comment brought to you by HN-Comment-AI ©
Not bad for everyday use - I like "nattering nabobs of negativism" (as scripted by William Safire), but it is really a bit over the top.
Copilot is NOT SELLING code other people wrote, it is simply acting as a curator to show you all the solutions people HAVE WRITTEN for free.
Copilot does NOT write entire programs, it's simply an assistant. And there is not much copyright you CAN apply to 3-4 lines of generally understandable code.
I've used Copilot and am actively paying for it, and I have not seen many cases where it's generating bad code. It's only there to remove boilerplate and common problems, not there to write entire applications.
Why are people getting so salty?
Copyright and licensing are bad, actually. Stop getting worked up about the idea of using courts to punish theft. Stop getting into a frenzy of arousal about the police kicking down doors to drag Billy Gates to jail because 80 characters of fast square root is theft but 79 isn't.
Where on earth is the ambition and vision!? Knowledge is public domain. A commons of knowledge is a public good. The cost of code copying is zero.
Sure in our day job we have to pretend to care about this stuff. But when did the ideological scope of what can be achieved become rules lawyering over license text.
Copy my MIT licensed code without attribution? I don't give a shit, go ahead, I hope it helps, in fact I want a truly public domain license but copyright law is so hostage to corporate interests no such thing exists in many countries.
Free the code.
Yes but this copilot model takes that, adds value and doesn't itself join the public common good. Instead it takes it, and makes you pay to have it back in another form.
If copilot were open source and the model released for the public good, being built of public data (in your scenario) we would have a very different conversation.
Until that happens, and copyright protections are still used by larger entities, using the same system to protect yourself and (more importantly) your users isn't turning your back on your ideals, but instead simply adjusting your strategy to the current material conditions. Remember that Google v. Oracle (while ultimately a win versus what could have been) was a step back, with de minimis claims left on the table as not a valid defense. The playing field is heavily slanted towards the big players and software freedom requires every tool it can put its hands on at the moment.
This is my feeling as well. I don't build stuff in the open so that I can get bent out of shape at someone not properly licensing it. It's in a public repository, FFS... I assume that if anyone even notices my repo, that they may copy/paste a few lines out of my solution if it helps them.
If an AI takes a copyright work and makes its own version-- say combining two novels by popular authors in a way that is unique but keeps large parts of the text intact, can I sell that? I think if I were the authors I would be unhappy.
Also, how hard would it be for copilot to include a comment saying "// I got this line from x repo" when you are copying from a new repo? I am guessing not hard at all. Then at least the user would be aware of where their code was coming from and could be expected to make a judgement. If the line is "let a = b" then probably no worries. But if it is hundreds of lines of a simulation, all from the same repo with no changes, then I think some attribution is good for both parties.
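A rough sketch of what that attribution comment could look like, in Python. Everything here is hypothetical: the `Suggestion` record, the `source_repo` field and the 3-line threshold are made up for illustration, and it assumes some earlier step has already matched the suggestion to a repo. It says nothing about how Copilot works internally or whether such a match is feasible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    """A hypothetical completion plus the repo it was matched to, if any."""
    code: str
    source_repo: Optional[str] = None  # assumed output of some matching step


def with_attribution(s: Suggestion, min_lines: int = 3) -> str:
    """Prefix a provenance comment when a matched suggestion is non-trivial."""
    line_count = len(s.code.splitlines())
    if s.source_repo is None or line_count < min_lines:
        # Trivial or unmatched code (e.g. "let a = b") passes through unchanged.
        return s.code
    return f"// I got this from {s.source_repo}\n{s.code}"
```

The threshold is the judgement call from the comment above: a one-liner gets no note, while a long run from a single repo does.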
Me too. I also find three iterations of the same subject not enough discourse. We need to take this matter more seriously.
> But it has made me realize why I instinctively dislike Free Software as a movement.
On the other hand, this whole discourse reminds me why I absolutely love Free Software as a movement.
> Copyright and licensing are bad, actually.
This is why we have "Copyleft".
> Stop getting into a frenzy of arousal about the police kicking down doors to drag Billy Gates to jail because 80 characters of fast square root is theft but 79 isn't.
And, stop getting into frenzy of arousal about being able to use any and every code piece you see elsewhere in any project regardless of its license.
> Where on earth is the ambition and vision!? Knowledge is public domain. A commons of knowledge is a public good. The cost of code copying is zero.
This is why GPL is important. It forces knowledge to evolve in the open, stay in the public domain, and helps it actually serve the public good. It also doesn't hinder ambition and vision by not taking it to private domain, and keeping it open to everyone.
> Sure in our day job we have to pretend to care about this stuff. But when did the ideological scope of what can be achieved become rules lawyering over license text.
You might be pretending to care about this in your daily job, but we really care. Some of the projects I take part in can't ever include GPL code (because the projects are MIT licensed). These texts are court-tested licenses, so they're as proper and serious agreements as the EULAs of "particular" software companies.
> Copy my MIT licensed code without attribution? I don't give a shit, go ahead, I hope it helps, in fact I want a truly public domain license but copyright law is so hostage to corporate interests no such thing exists in many countries.
If I want my code to be copied and possibly closed, I'll license it with MIT or BSD-0 and forget about it, but if I'm licensing my code with GPL3, it means I want that code to stay open. As a licensor, I expect anyone using that code to respect that license.
> Free the code.
Yes, and respect the license the author selected for his/her code.
I don't think users owe me anything at all. If people want to PR back that's cool but if not that's cool too.
I think this sentence contradicts itself.
A "license" implies that there is a copyright holder who allows usage of the work under the terms of said license.
While "Public domain" implies that there is no copyright holder (e.g. because the copyright expired, was explicitly waived, or is for some other reason not applicable).
If you want to put your work in the public domain, you can do so; simply include a note saying that you dedicate it to the public domain.
I think if there were an option to add a machine learning clause and ask individual creators if they wanted it applied in that context, we would see a considerable amount of uptake. It's just that we couldn't foresee this progress happening so soon, and the issue is still not visible enough. I think it's only a matter of time before the culture catches up and new creative works in the coming years are excluded from training sets by their authors with clear and direct language.
By that point there would be no way to argue "but they shouldn't care, they licensed it like this, so I'm assuming it's fine for ML use."
If copyright is not enough to stop another entity from using a person's data for training, then some other protection should be invented that does.
Copyleft exists for a reason and without the ongoing fight for the commons we lose it all.
That seems totally inconsistent with decades of people clamouring for more openness/liberty when it comes to IP rights.
> public domain
These are incompatible concepts. RMS's vision of 'free-as-in-freedom' software doesn't let people do whatever they want. It forces those who distribute binaries to also distribute source. This is not possible with a public domain work.
It feels morally wrong to me that I can spend thousands of hours working on projects on my own free will but then a company can sell the code I wrote to others in the form of snippet completion as a service. In fact they end up selling your code back to yourself if you plan to use the service.
If the answer is no, that moves the needle pretty far in the direction where I'd at least consider the idea of moving all of my repos to Gitlab. I don't care much about stars or popularity. I open source things that are interesting and useful to me and if other folks want to use it they can but I don't gain motivation from others using the projects I release. I like Github and its UI and it's no doubt "the spot" for open source but selling code written by others rubs me the wrong way a lot. It stinks because it also means no longer contributing to other code bases too. It's moving us in the opposite direction of what open source is about.
I don’t buy that producing/synthesizing code snippets based off public repos is a problem.
There’s nothing proprietary or original about eg. the syntax of a for-loop, or the boilerplate of setting up some JS framework MVC.
Besides, it’s basically just a (semantic and contextual) search engine inlined within the IDE. Copyright infringement hasn’t taken place until the user activated the autocompletion and actually placed the code within their own and released their code containing the infringing code.
Being a top CoPilot contributor should at least have value to signal on your resume.
I don't think there is a way to opt out if it is a public repo regardless of license, and Microsoft's copyright theory suggests that they wouldn't feel obligated to exclude any code they got their hands on except under a specific NDA preventing such use; the use of public GitHub repos isn't based on legal constraints but practical convenience.
If you read free code yourself it’s fine, but if a machine does it for you it’s not? We overvalue humans.
I'm curious - what are the legal implications of this going forward? I've so many questions.
1. Will Microsoft ever face lawsuits for these license violations?
2. If so, who/how? Class-action?
3. Will copilot be forced to open-source in the future? Under which license? Some open source licenses are incompatible with others, but copilot uses code from probably every OSS license conceived.
4. If Microsoft faces no justice, will we start seeing more OSS license violations? Will Google start using AGPL-licensed code?
[1] https://news.ycombinator.com/item?id=27710287 | Copilot regurgitating Quake code
"jethrodaniel" does not appear to have the copyright to offer that license, but it's hard for Github to determine that in general, so I doubt they would be liable for the error.
But even ignoring that, everybody uploading code to GitHub has given GitHub the right to analyze that code as per the GitHub ToS. This is the same mechanism by which you can't upload code to GitHub with a license that says "nobody is allowed to display this code on the internet" and then sue GitHub.
5. Even if it is illegal, is it actually bad? No one can possibly sell code snippets, the transaction costs are many orders of magnitude greater than any reasonable price. In my opinion, at least in this case the benefits massively outweigh the costs and the law should not apply here.
My question is, if it isn't a copyright infringement issue to use copilot in its current form right now, why not just claim copilot was used whenever accused of copyright infringement hence forth?
That same Quake example from last year is repeated every single time.
Aside from the fact that GitHub has since added a protection for this, the fact that this example gets repeated time and time again instead of a list of examples leads me to believe this is not (and was not) a common occurrence.
2) TBD
3) Not likely. Worst case a judgement will go against them, they'll effectively pay a fine and then they'll retrain it on a more restricted set of source code.
4) OSS has a pretty tragic history re: enforcement. It wins nearly every skirmish but has no interest in the war so from a big picture standpoint, it loses due to apathy.
No book, painting, codebase, sonnet, design is theft-less.
The art is the space reduction, otherwise we’d just bruteforce away.
The Magnificent Seven for instance was a reworking of Seven Samurai, but stands on its own as an original creation. Going into a cinema and filming a picture to later put on a torrent site is not artistic reworking.
The hard discussion is about what is acceptable, we all know prior art exists.
If that happens, the big copyright/IP conglomerates will immediately jump on that and make sure that laws are adjusted and they get their cut of every single word and line anyone puts near their smartphones ;)
That applies to everything, it's even a basic law of physics, and there's absolutely nothing wrong with it. Any layperson already knows what a remix is anyway so not sure what you think will change
Would you describe a parody, or a critique/review, as equally without original merit?
And I'm sure I couldn't disagree with you more. Or are 'influence' and 'theft' the same now?
There's already a way to quickly solve the boring parts in development - libraries which were built and licensed around that purpose. But Copilot passes you code of unknown origin, with unknown license terms and no information about how close it is to an existing codebase. It's like a person trying to sell you Macbooks for a hundred bucks per unit but you don't know where they came from and who made the holiday photos stored on the harddrive.
No matter how interesting your problem is, translating it into code is going to involve a lot of grunt work. This isn’t just boilerplate, but also the large portion of your code which is going to be gluing things together.
The time you spend working through those menial parts of your code is time when the context of the interesting part of the problem fades. Once you get the mechanical stuff out of the way, you have to load the interesting stuff back into your brain.
This is where AI coding tools really shine. They dramatically reduce the intervals between when you can think about the actual problem you’re solving by letting you get the boring mechanics out of the way more quickly.
Way too often I burn half an hour needlessly during review in one of two ways:
* trying to figure out how the heck someone figured out some "magic" code that achieves something by invoking a bunch of poorly documented library or framework internals, and trying to reverse engineer WTF all the magic does by diving into the framework's source... only to eventually think to google the whole snippet rather than each individual method call, and discover it's copied from a Stack Overflow answer
* trying to figure out why something was written in an unidiomatic or overcomplicated way rather than a more obvious approach, and commenting at length on how I'd simplify it... only to eventually realise it was copied from a Stack Overflow answer
Attribution isn't just about making sure the right person gets credit, or about license compliance; reviewers and maintainers frequently need to be able to see where stuff was copied and pasted from in order to do their jobs effectively, even for snippets of just a few lines.
Basically, the project was to package Cordova + Backbone + Marionette, plus a couple of tools, under their own commercial name. Then they'd go around potential clients presenting it as the perfect solution to build hybrid applications for web/mobile/smartTV/whatever.
A certain Monday, the "architect" arrived boasting. He did that often, but this time he was more boastful. He explained that he had spent the whole weekend coding. He had written an incredible tool that would create a skeleton for a project from zero. You would type something like `tool create` and it would create the whole project with all the scripts and some example views and whatnot.
It was Yeoman's yo CLI tool, of course. He had just changed the copyright in the comments, removed most of the comments, deleted any mention of Yeoman or the original creators, changed the name of the executable script and that's it.
The whole thing was OS code picked up from various repos and packaged as their own. The company used it to sell development projects. The so-called-architect used it to sell himself inside the company and then jump away into a startup as CTO.
Is this common or is it just anecdata? I don't know. It's clearly not the only time I've seen something like this and I do know that in certain companies around here it isn't exactly uncommon. But I can't say how common or uncommon it is.
Would I call this "selling other people's code"? Yes, I would.
If the solution was copied from an OSS project without proper attribution? Yes. Absolutely. And they'd have words with a senior dev and maybe even legal if the code they copied made its way into production without attribution.
Many copyleft OSS licenses require attribution and distribution of derivative works that we wouldn't allow.
I’d also expect for any stack overflow code to include a comment with a link to the stack overflow page.
I think one of the key points is to make sure any code taken from another source is cited appropriately. If it isn’t, or the junior dev is passing it off as their own work, then we have problems.
This isn't derived from principled reasoning, but I think of it as similar to community norms. Not the best example, but you wouldn't mind someone subletting their homes to Airbnb, but if all of your apartment complex does it, it invites regulation. A product like copilot enables copying code (even if inspired, and not verbatim) at a scale that individual developers can't. So respecting software licenses needs to be codified (legally?) while previously it was left unmonitored.
If you are using it to write whole complex functions that are the same as other people's, I guess that is copying.
But if you do the second thing you are not a great dev, and would have probably ended up copy pasting it anyway.
I think the first use case is far more common, and creating boilerplate that is so generic you could never really attribute it anyway.
If you do that on your own, it's your (legal) responsibility. If Copilot does it for you, it's GitHub's/Microsoft's responsibility.
This whole thing would be fine if GitHub hadn't just used all public code on their platform, ignoring all involved licenses.
It's like saying GPT-3 created text is copyright infringement, because some author used the same sentence in a book before.
> But if you do the second thing you are not a great dev, and would have probably ended up copy pasting it anyway.
How would I know that the boilerplate I ask copilot to write for me is copied verbatim from a codebase that neither I nor Microsoft has a license to use?
Copilot is not writing your code any more than Google search is writing your code. You are writing your code, and Copilot is just making suggestions.
The US constitution secures limited copyright "To promote the progress of science and useful arts". Copilot is just that, get over it!
The amount copied from any particular source might be small, but an aggregate strip-mining of many copyrighted sources is an interesting twist. Another might be, as you suggest, it might be a machine that itself does not violate copyright, but has the effect of causing users (who accept the suggestions) to violate copyright.
2. Copy&Pasting Code by manual search exists.
3. This is just a smart tool so you don't have to figure out yourself what to copy&paste (in the best case) and save a lot of time.
Sometimes I truly wonder how people can genuinely be upset about things like this. What is broken are copyright and patent laws in the 21st century.
I guess Copilot could address this by checking the licenses of the projects it uses. Even when combining code, it could pull in the required attribution or avoid GPL licensed code (unless enabled) for example.
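To make that concrete, here is a minimal Python sketch of such a license check. The `Candidate` record, the allowlist and the `allow_gpl` switch are all assumptions for illustration; it presumes each candidate snippet already carries a known source repo and an SPDX-style license identifier, which may well not be available in practice.

```python
from dataclasses import dataclass

# Assumed allowlist for illustration; which licenses are acceptable is a policy/legal call.
DEFAULT_ALLOWED = {"MIT", "BSD-3-Clause", "Apache-2.0"}


@dataclass(frozen=True)
class Candidate:
    """A hypothetical snippet with its (assumed known) origin and license."""
    code: str
    repo: str
    license: str  # SPDX-style identifier


def select_candidates(candidates, allow_gpl=False):
    """Return (kept candidates, attribution lines) under a license allowlist."""
    allowed = set(DEFAULT_ALLOWED)
    if allow_gpl:
        # GPL code is only considered when the user explicitly enables it.
        allowed.add("GPL-3.0-only")
    kept = [c for c in candidates if c.license in allowed]
    # One attribution line per distinct source, in a stable order.
    notices = sorted({f"{c.repo} ({c.license})" for c in kept})
    return kept, notices
```

The `notices` list is what would be pulled into the user's project as the required attribution.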
Tell me you regularly plagiarize without telling me you regularly plagiarize.
Are you saying that I would need all the original authors consent to upload a repo to github even if I include all the original attribution and licenses? Because what you are implying is that when uploading I'm granting github a license far outside the bounds of the license included, which only all the contributors can do. For example, would the linux project need to contact each and every contributor ever to upload a mirror to github, since their contributions were under GPL but you are implying that the license given to github is much, much broader?
This would make any project not originally started on github and with a few contributors basically impossible to host there.
> 2. Copy&Pasting Code by manual search exists.
The question is who is doing the infringement here. Github copilot is obfuscating the copying and telling its users that the code is theirs to use, own, etc. as they please but is also taking large chunks of code it does not have the right to redistribute, even less grant licenses to.
90% of Twitter is just inventing new ways to whine about things
Well, that's already the case with Stack Overflow copypasta enterprise code. If anything, use of Copilot would be an improvement...
I feel this is more a meme, rather than reality. I do check StackOverflow, but never have I taken an answer verbatim. I try to see if it's the same problem and what was the approach in deconstructing it, which I find more useful in the long run.
What do you mean, Copilot regularly pastes stuff directly from SO. One of those automatic doc generators was able to point me to the exact answer where one of them was from.
You could think of the evolution of practical problem solving in software engineering like this:
1. I have to invent a solution (because nobody else in the world has a computer)
2. I have to know of a solution (education, word of mouth...)
3. I have to look up a solution in the books I have (commoditized knowledge)
4. I can look up solutions on the internet <-- (we are here)
5. The computer suggests something and I accept (some are here too)
From 1 to 4 the amount of cleverness required to solve small problems drops a bit, but your productivity and exposure to knowledge probably goes up.
I'm not quite sure what happens from 4 to 5. Personally I'm actually more interested in the context solutions are presented in than just the solution. In fact, I rarely copy and paste code from the Internet, but I often look at multiple suggestions/solutions and then borrow ideas or combine ideas from several sources.
Now, I'm not entirely sure how necessary this was from a legal perspective. But introducing an AI into the mix will bring up a lot of uncertainty when it comes to how much change is required for something to no longer be considered a copy/derivative.
One might argue copilot puts into software an algorithm that humans are already doing. Software like that is usually inevitable.
Still, it sucks there's no benefit for the contributors.
The most ethical thing I can think of is some kinda 'Spotify-like' revenue sharing model, based on how often their code is used by others. Not that they'd ever implement that if they can get away with it!
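The arithmetic of such a pro-rata split is simple enough to sketch in Python. The usage counts and the revenue pool are hypothetical inputs; how usage would actually be measured is exactly the hard part and is not addressed here.

```python
def revenue_shares(usage_counts, pool):
    """Split a revenue pool across repos in proportion to how often each was used."""
    total = sum(usage_counts.values())
    if total == 0:
        # Nothing was used, so nobody is owed anything.
        return {repo: 0.0 for repo in usage_counts}
    return {repo: pool * count / total for repo, count in usage_counts.items()}
```

For example, with a pool of 100 and usage counts of 3 and 1, the first repo would receive three quarters of the pool and the second one quarter.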
That argument only works if you think what Copilot is doing is meaningfully similar to what humans are doing. The debate about how these models relate to human thought might have legal implications.
As I understand it (IANAL) copyright doesn't protect ideas and concepts. It protects the content itself. In theory, if I read some copyrighted work, understand some idea in it and then create a new work using that idea, without copying that original work, then that is not a derivative work. (I think this is at least how it's supposed to work - would love to be corrected if that's wrong.)
So if I took a copyright work and rot-13ed it before distributing copies, I think that would be clear copyright violation, but if I made my own works using concepts I gleaned from reading it, it wouldn't be.
So should Copilot be treated like the rot13 algorithm or like me understanding concepts and generating new works using them? That sounds like a fascinating legal debate to be had.
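For reference, this is what the rot13 end of that spectrum looks like in Python, using the standard library's `rot_13` codec. The sample string is made up. The point is only that the transform is mechanical and fully reversible, so the "new" text carries the whole original with it.

```python
import codecs

# A made-up stand-in for some copyrighted source text.
original = "int getAnswer() { return 42; }"

# rot13 shifts each letter by 13 places; digits and punctuation are untouched.
encoded = codecs.encode(original, "rot_13")

# Applying the same transform again recovers the original exactly.
decoded = codecs.decode(encoded, "rot_13")
```

Whether a trained model sits closer to this kind of reversible encoding or to a person learning concepts is the open question.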
When following the license terms, preserving the original copyright, etc, sure.
However, honest, ethical people (including programmers) do not plagiarize.
Copying and pasting code without attribution is plagiarism. Doing it without following the licensing terms is a copyright violation.
Based on my understanding of how NNs work, I'm not sure it's even possible to implement something like that.
I like the idea of having the bot automatically update an attribution file if it detects it’s used licensed code. Seems like it would be fairly trivial. Also a robots.txt for repo owners to control automated use.
Also, they should totally pay back a portion of revenue to the community and support the repos used to train. That seems like it would be a good PR move if nothing else.
The idea of auto-attribution if copilot surfaces licensed code is best because then it keeps the copilot user honest about where the code is coming from and honors the original license.
Aren’t they already doubling all Github sponsorship money?
Does anyone even know? Can we even check? What if 1 in a thousand, or one in a million outputs is (very close to) something existing? I find this especially relevant when generating faces.
String getName() {
return name;
}
Let us also assume that this snippet, unsurprisingly, has been in several copyrighted repos that didn't grant Github the right to share this code. So I start typing "getName" and copilot suggests the exact snippet above. If I use this snippet, is it plagiarism? Even though the above code is the most "obvious" way to write this getter and I would have written it this way even without copilot's suggestion? Or does the "uniqueness" or "non-trivial quantity" of the suggestions have any bearing in determining copyright violation? How/where do we draw the line?
- respect attribution
- respect copyleft
- respect proprietary licences
- give the user appropriate hints about the above
Or does it just copy code without doing any of this?
Someone creates content for free, and companies monetize it.
Or we as a community need to create a better bsd, a cc0 for everything.
Almost everything is nontrivial, and almost everything is copyrighted, at least with the pressure to name the original author (BSD, GPL, other major permissive licenses).
Say you want to use a library, then you check for examples in the documentation, now you have to denote somewhere that the example is from the documentation (best if you put it in the source code, so you don't lure other people to copy what you copied and refer you as the author).
It is a major PITA at least for me.
If you have ideas, code, music or art which you wish for no one to partake in, do your best to keep them secret. Certainly, breaking into secret areas should be illegal, but once the cat gets out of that bag it gets out of the bag.
The creative people behind these ideas I believe will be able to find good compensation nonetheless in society, IP-laws nowadays only serve to protect megacorporations to the detriment of creativity and ideas.
No, of course not. Joyce's literature was influenced by Ibsen, Mozart looked up to Haydn, Newton was humble enough that he openly professed he stood on the shoulders of his predecessors, Perelman refused the Millennium prize because it wasn't also offered to his colleague Hamilton.
All human innovation is iterative, and derivative. https://www.youtube.com/watch?v=jcvd5JZkUXY
Our skill doesn't grow in vacuums, without outside mentorship and guidance. There are areas where I am upset about the application of AI, but this is not one of them. Consider copilot a gentle guiding hand for those without access to a second pair of eyes nearby to give you reminders on what you may otherwise have on the tip of your tongue.
But in the same way that it was unbecoming of Led Zeppelin to refuse to recognize how heavily their music was influenced by Delta blues artists, I can accept the argument that it is perhaps douchey of Github to sit on Copilot as squarely their creation.
Of course, if you already provide a copyrighted prefix, and it has seen that code, the chances are high that it would complete the copyrighted code (because that is what you actually would also expect).
So, for real use cases in the wild, where you write your own genuinely novel code, how often would it suggest some copyrighted code? And how often would a human?
I have used Copilot for the last few months and I have never ever seen such a case (I can be pretty sure because all the identifier names are really unique, and the code was very custom).
However, I assume that I myself might have produced copyrighted code unknowingly because if you write common patterns (e.g. some tree or graph search, or some sort function, implement LSTM or Transformer, whatever), the chances are not so low.
Then there are cases where it amazes me completely: it wrote 10 lines of C++ code for rendering monochrome glyphs as bits using the FreeType library. It did have an odd, subtle bug, though: the glyphs came out reversed, and it only worked with a certain font size, which it seemed to have picked up from a different file altogether.
I suppose it should also generate appropriate copyright notices to satisfy many open licenses. I'd be surprised if copilot could actually link back to the original code like that, though.
So what? Selling code other people wrote is the foundation of the free software movement. It is the entire business model of countless companies, and it is a good thing. Among them are most major linux distro vendors like Red Hat and Canonical.
The value added by Copilot is that they sell you the lines of "code other people wrote" that you want, out of billions.
I still think it is derivative work, and that they should only process code under permissive licenses, or, if they want to include GPL code, make a GPL-only version, usable only for GPL projects. I thought that is what they did; there is so much code under permissive licenses that it should be enough to train their model. But apparently they don't care: as long as it is public, it is included. To me, they are shooting themselves in the foot; several companies have already banned Copilot due to the potential copyright issues.
I can't speak to the legal side, but I just don't understand the moral outrage over very occasionally copying such short snippets of code. The key innovations and the actual value that licenses are intended to protect aren't in these short snippets.
And what does copilot bring to the community? Free use by students, free use by open source maintainers, and a huge boost in productivity for a modest fee for professional devs, for a service that no doubt costs a lot to run, even on the margin.
It's more of a licensing issue to me. As far as I can tell it was trained on a blend of licenses, which to me makes it inherently non-compliant. At least some of it is going to be copyleft and find its way into closed source.
It is putting a new spin on some traditional Open Source Lessons (https://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar#L...).
People share and reuse unattributed snippets of MIT-licensed and GPL-licensed code on the internet all the time: StackOverflow, etc.
StackOverflow is profiting from that activity indirectly by facilitating it. They profit passively through ad revenue, and actively through the Teams subscription offering.
But nobody seems too upset about that.
How is an AI which facilitates the same code sharing fundamentally any different? Because it’s scraping it itself, rather than humans contributing it?
Seems like a tenuous argument at best.
https://huggingface.co/spaces/mullikine/ilambda
Language models are able to 'steal' the linguistic meaning-making 'essence' of the software, by modelling:
- How the software is used (mimicking its function) - external meaning
- How functions are 'inspired' - internal meaning (reflection)
https://github.com/semiosis/imaginary-programming-thesis
The models themselves should be clear about where the data came from. However, this is only possible in a fair world which we do not live in. Compromise must be made to protect national interests.
Generative models are license blind and there's very little that could be done to prevent progress. Like what the invention of the camera has done for art.
Large language models including Codex are a transformative technology.
Bi-directional fair-use is probably the best result we can hope for.
So long as Microsoft and OpenAI are not selling back usage of the model to the open-source community, I think it's OK, though it's the bare minimum obligation.
> the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time
It's insane how vague this is. Is Copilot a "Service"? Sure, by its definition:
> The “Service” refers to the applications, software, products, and services provided by GitHub, including any Beta Previews.
And since much of the code was published before Copilot's inception, this means Github can just arbitrarily add more "services" and milk the code for whatever it wants. Automatically service-ify any public repository? Sure, pay us for quotas. It's like a legal loophole to let Github just bypass any license restrictions you put on it.
Humans make original patterns, but since Copilot cannot think, then Copilot does not. It squashes together a bunch of small individual patterns, each under their own license, but at no stage does it do anything more than pick a line from here, and a line from there.
It doesn’t think, and it doesn’t create new IP.
It is like making a picture out of small snippets of a thousand other pictures, and then selling it... clearly not OK. You still ripped off the original artists.
Or like plagiarising 100 of your classmates’ assignments. Are you less guilty because you went to the effort to steal just a few sentences from each?
A criminal who steals a cent from every account at the bank is a more sophisticated thief than someone who holds up a petrol servo.
If Copilot doesn’t create new IP (it doesn’t; we established this), then it uses existing IP. And in that case it is no different to any of the three analogies above.
There's going to be some big cases here. It's going to end up in the Supreme Court sooner or later, and if it were to go there today I think I know what they'd say.
Source: https://twitter.com/mitsuhiko/status/1410886329924194309
> For example, it would probably not be a fair use to copy the opening guitar riff and the words “I can’t get no satisfaction” from the song “Satisfaction.”
I wonder how this could be integrated into the system?
[1] https://fairuse.stanford.edu/overview/fair-use/four-factors/...
https://felixreda.eu/2021/07/github-copilot-is-not-infringin...
There's a good argument that demanding copyright protections on scraped datasets and short snippets is a double-edged sword. It could harm search engines, distribution of news, and non-commercial ML research too.
It seems akin to trying to copyright a certain drum pattern or chord progression.
Also, the history of the GPL, MIT, commercializing lisp machines, Symbolics, infighting, etc… seems a very different context than Copilot so I am having difficulty seeing the systemic problems that tools like this encourage.
There is of course a surface level similarity in that a corporation is profiting from IP in the public domain but the devil is in the details.
I highly recommend everyone read it.
So even without using someone else's code, just the pattern understanding and the production of simple boilerplate code is great.
I think it much more likely that they count on everyone liking it way too much to give a shit about their MIT code not being attributed correctly.
I certainly don’t. MIT just seems like the most convenient license for people that need licenses (corporations?), so that is what I use.
I don't think that's bad though. Code sharing is good for overall productivity.
But as far as I understand, GitHub trained Copilot on every public repository on GitHub, including ones with no license specified (where the user publishing the code retains full copyright to it). I don't see how that can be OK.
'I don't agree with having an AI trained on/with my data.'
IMHO, all other problems with copilot stem from this.
Who is on the side of open source? Where are the big, powerful institutions and companies that deeply care about authors and communities providing free software that so many of us rely on?
...the above was generated by GPT-3 (text-davinci-002). Prompt: Write an argument for why using open-source code to train an AI and then sell the code generating service (without open-sourcing it) is ethical.
The main argument against this is that it takes away from the open-source community that contributed to the development of the code in the first place. By selling a code-generating service without open-sourcing it, the company is profiting from the work of others without contributing back. This is unfair and takes away from the overall open-source ecosystem.
Added two characters to the prompt :P
It's SO frustrating that even on HN people still fall for this naive and incorrect analysis. Pasting bits I've said before on this topic:
Language models do not work like this. They can copy content but usually that's for something like the GPL language text.
Generally they work on a character by character basis predicting what is the most likely character to appear next.
This very rarely results in copying text, and almost never rare text.
Mechanically it has learnt both the syntax of the language and how concepts relate. So when it starts generating, it makes sentences that are syntactically valid but also make sense in terms of concepts.
That's really different to just combining bits of sentences, and it gives rise to abilities you wouldn't expect in something just cutting and pasting bits of sentences. For example, few shot learning is mostly driven by its conceptual understanding and can't be done by something with no way to relate concepts.
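The next-character prediction loop described above can be shown with a deliberately tiny stand-in. This is a plain n-gram counter, not a neural network. Production models such as Codex predict sub-word tokens rather than single characters and learn far richer structure, so treat the function names and the `order` parameter here as illustrative assumptions only.

```python
import random
from collections import Counter, defaultdict


def train(text: str, order: int = 3) -> dict[str, Counter]:
    """Count which character follows each `order`-character context in the text."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts


def generate(model: dict[str, Counter], seed: str, length: int, rng: random.Random) -> str:
    """Extend `seed` one character at a time, sampling in proportion to the counts."""
    order = len(seed)
    out = seed
    for _ in range(length):
        options = model.get(out[-order:])
        if not options:
            break  # context never seen during training
        chars, weights = zip(*options.items())
        out += rng.choices(chars, weights=weights)[0]
    return out


# Toy training text; the seed must be exactly `order` characters long.
model = train("abcabcabc", order=2)
print(generate(model, "ab", 4, random.Random(0)))
```

With so little training text, every context has exactly one continuation, so the toy model can only replay its input. Verbatim output becomes less likely as the training data grows and contexts have many plausible continuations, which is the intuition behind the claim that rare text is seldom copied.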
If yes, how do they mitigate the risk of exposing private data when something is quoted verbatim?
If not, then why are repos with non permissive licenses ok?
Copilot is limited to public code now, but it may easily be trained on non-public code - albeit this probably won't be for sale to the public.
My code is on Github so that people can read it, reuse it and learn from it. "The freedom to study how the program works", as the FSF says. If some of the people reading it are machines, why would that matter?
[1] http://steve-yegge.blogspot.com/2010/07/wikileaks-to-leak-50...
Absolute nonsense.
Wild.
(I'm actually not being sarcastic, I think there needs to be some sort of pipeline for compensating the artists who are used to train these models.)
So it should be as simple as if you're using other people's content for your own profit you should properly compensate them.
Or we could just abolish copyright law and assume that everything humans create emanates from culture, so it's always collectively built and everything should be open source.
Or we just do the same we've been doing. Create even more complex laws trying to define this fuzzy line in a way that companies can keep profiting from it a lot more than individuals.
:)
We all Duck/Google for code anyway. Why not admit and make it easier?
I do not see Copilot as useful anyway.
I don't use github. Can someone explain what the author means?
Edit: in detail
The whole thing is just bizarre when the vast majority of developers constantly look at OSS code daily and lift ideas/patterns/snippets from there regularly without once looking at whatever license is attached.
One could argue that the information density in chord progressions, bass lines and beats is extremely small. And that any recognizable part of a musical idea that has been "borrowed" would necessarily make up a larger percentage of the complete work than would be the case for a typical application with borrowed snippets.
That's not a bad argument, but it is unsatisfactory because it means that at some point someone has to make a judgement on how much you can borrow.