I don't buy the argument that the risk of a yet-to-be-litigated case against a different company, which will certainly fight it hard, is greater than the productivity gain of using Copilot.
Additionally, the security argument feels ridiculous to me. We lift code examples from gists and Stack Overflow ALL THE TIME! But a good dev doesn't just paste it in and move on; we review the snippet to make sure it's secure. Same thing with Copilot: of course it's going to write buggy or insecure code, but instead of my going to Stack Overflow for a snippet, it's suggested in my IDE, with my current context.
I posit it becomes increasingly likely, over long periods of time and across many engineers, that a severe bug or security issue will be introduced via an AI-provided suggestion.
This risk, to me, is inherently different from the accepted risk that engineers will use bad code from Stack Overflow. Stack Overflow at least has social signals (upvotes, comments) that allow even an inexperienced engineer to quickly estimate quality. And the amount of code engineers take from Stack Overflow, blogs, etc. is much smaller.
GitHub Copilot is constantly recommending things and does not give you any social signals that less experienced engineers can use to discern quality or correctness. Even worse, these suggestions are written by an AI that has no self-preserving motivations.
In IntelliJ, disabling autocomplete just requires clicking the Copilot icon at the bottom and turning it off; Alt+\ will then trigger a suggestion on demand. I know there's a way to do this in VSCode as well, but I don't know it offhand.
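For VSCode, I haven't verified this, but I believe the equivalent is clicking the Copilot status-bar icon, or something like the following in settings.json (setting names from memory, so check the docs):

```jsonc
{
  // Disable Copilot suggestions globally (per-language keys also work)
  "github.copilot.enable": { "*": false },

  // Or switch off automatic inline suggestions from all providers,
  // then invoke a suggestion manually only when you want one
  "editor.inlineSuggest.enabled": false
}
```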
I experienced this in practice. I was pairing with an inexperienced engineer who was using Copilot. He was blindly accepting every Copilot suggestion that came up.
When I expressed doubt in the generated code (incorrect logic + unnecessarily complex syntax), he didn't believe me and instead trusted that the AI was right.
> Github Copilot is constantly recommending things
It's only a momentary problem; it will be fixed or worked around. And is getting as many suggestions as you can really a bad thing? I think it's fine as long as you can control its verbosity.
> does not give you any social signals
I don't see any reason it couldn't report the number of stars and votes the code has received. It's a similarity-search problem between the generated code and the training set: find the attribution, then check the votes and even the license. All doable.
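As a rough sketch of what that similarity search could look like (my own toy illustration, not anything Copilot actually does): shingle the code into token n-grams and rank training-set entries by Jaccard similarity, then surface the license and vote count of the best matches.

```python
import re

def shingles(code, k=5):
    """Split code into tokens and collect overlapping k-token windows."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def attribute(generated, corpus, threshold=0.4):
    """Rank training-set entries by similarity to a generated snippet;
    entries above the threshold are candidates for attribution."""
    g = shingles(generated)
    scores = [(name, jaccard(g, shingles(src))) for name, src in corpus.items()]
    return sorted((s for s in scores if s[1] >= threshold),
                  key=lambda s: s[1], reverse=True)
```

Real systems would need locality-sensitive hashing to make this scale to a Copilot-sized training set, but the shape of the problem is the same: given a near-match, you have a pointer back to the repo, its license, and its popularity signals.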
> an AI that does not have any self-preserving motivations
Why bring that up? People have bodies, and AIs like Copilot have only training sets. We can explore and do new things; AIs have to watch and learn, but never make a move of their own.
AI can also do code review and documentation, helping us reduce the number of bugs. Overall it might actually help.
THIS is the article I want to read!
I'll go one further with the "Co-pilot is stupid."
It's supposed to be artificial intelligence. Why in the eff is it suggesting code with bugs or security issues? Isn't the whole point that it can use that fancy AI to analyze the code and check for those kinds of things on top of suggesting it?
Half-baked.
If this is a real issue, the solution is not banning yet another tool. It's education: teaching engineers how to properly understand code attribution and licenses.
The article suggests that he wants to know "who wrote the code" if a senior dev he trusts submits a PR. He doesn't want to be surprised that "the AI" wrote some of this code.
But it's ALL written by the senior dev. If he trusts that dev, that means the dev has thoroughly read and tested his code; that's the important bit. Remembering proper syntax/imports/nesting levels is the tiniest piece of writing good code, and Copilot can take that off our hands.
For me the bar is higher. It's not that I wouldn't understand it; it's that it's easier to miss mistakes when reviewing than when writing from scratch. In the same way, you may have ignored a typo in this comment and understood what I meant regardless of the mistake. But that doesn't work for programming: a mistake is a mistake, and it likely matters in edge cases even if it's not immediately obvious.
50? 25?
I'll bet the people spinning cotton thought that would endure forever.
(Sorry if my tone comes across as fervent. I'm excited to be displaced by this, because what follows is the stuff of dreams.)
How does your company's general counsel feel?
This article is aimed at CTOs, not engineers.
May be different for a larger, value-preserving company who would face more scrutiny.
That being said, I still find it extremely unlikely that there would be legal ramifications for using a product pushed by one of the largest software companies in the world. Why go after a user and not Microsoft themselves?
I don't find the natural-language-comment-to-buggy-algorithm part of Copilot particularly useful. I know some people asked to be able to write a "DoWhatIMean()" method, but programmers really only wanted that to auto-expand to "protected virtual void DoWhatIMean() {}" without having to wait 30 seconds for a compile error to check whether it was "protected void virtual" or "protected virtual void"...
Copilot is so much beyond regular autocomplete that it's playing a completely different game.
I've been using it today while writing a recursive descent parser for a new toy language. I built out the AST in a separate module, and implemented a few productions and tests.
For all subsequent tests, I'm able to name the test and ask Copilot to write it. It will write out a snippet in my custom language, the code to parse that snippet, and the AST that my parser should be producing, then assert that my output actually matches. It does this with about 80% accuracy. The result is that writing the tests to verify my parser takes easily 25% of the time it has taken when I've done this by hand.
In general, this is where I have found Copilot really shines: tests are important but boring and repetitive and so often don't get written. Copilot has a good enough understanding of your code to accurately produce tests based on the method name. So rather than slogging through copy paste for all the edge cases, you can just give it one example and let it extrapolate from there.
It can even fill in gaps in your test coverage: give it a @Test/#[test] as input and it will frequently invent a test case that covers something that no test above it does.
I used it for about a month. It gave me a few false positives that really burned me; it's not worth the risk. Maybe future versions will be better.
What happened to burn you so badly?
Agreed it gets things wrong very frequently. But I've found it much easier to use its suggestion as another "input" to writing code.
That's minutes of work, maybe even 10 minutes, turned into seconds. That is huge.
The risk here is extremely low. Who is going to sue consumers of Copilot? It makes no sense. They'll sue Microsoft and, in a decade, we'll see if they win or lose (IMO Microsoft will win, but that's not important).
I'm not claiming that you can't get big productivity boosts by ripping off code like a crazy person. I bet you can! But should you?
It is not a theoretical concern
Doesn't matter. A developer's speed, test completeness, and code quality matter not one whit when it comes to licensing. That 10x developer could mire the company in fines and code rewrites if they include copyrighted code, especially if it's not OSS.
Copilot could actually have a benefit here: the ability to retroactively mark a snippet as insecure if it gets flagged as such by moderators. Any user who had used it could get an automatic notification.
But why optimize a non-critical path?
That said: I generally agree with the assessment. Github should at the very least be telling users when it is generating code that they trained on. Until it does that, it's kind of dangerous to use. The security stuff is imo more of a red herring.
But the more important point is that you can just wait a year and hire a consultant to build a better product (for you) at pretty low cost. Within a year, any organization with a non-trivial number of developers will have the option of hosting their own model trained on The Stack (all permissively licensed) and fine-tuning it on their internal code or their chosen stack. That's probably the best path forward for most organizations. If you can afford $7/dev/month for Slack-integrated nannybots, you can definitely afford to pay a consultant/contractor to set up a custom model and get the best of both worlds: not giving MSFT your company's IP while also improving your devs' productivity and happiness beyond what a generic product could deliver.
But now I realize I like that a lot more than being aware that the article I'm reading is going to push me to take an action (start a discussion with my team) with a probable outcome of "enforce no Copilot on company machines".
Sneaky! Good catch. Article should have a disclaimer at the bottom
I don't use Copilot when writing languages I'm very comfortable with, because I'd rather write code that I completely understand, or at least understand to the best of my ability. I find it easier to consider edge cases and side effects when writing original code than when reading someone else's that was ripped from a project whose goals you don't even know. I don't buy that Copilot improves productivity, for this reason as well.
I also avoid using Copilot when writing in languages I am unfamiliar with because I feel like it's robbing me of a learning experience. Or robbing me of repetition that improves my memory of how to do various things in the language.
I don't know. Copilot is certainly impressive but there are too many questions - what I've mentioned and the legal ones in the OP. But perhaps that is a good thing? It is a new angle on copyright that we're going to have to answer one way or another. In programming and other fields.
This might not be that big a percentage of my actual work, but in terms of motivation it enables me to work using TDD without the friction of writing boilerplate, which in turn makes programming much more fun.
Also, and this is a big one, you can directly ask questions and it replies. I find that fascinating.
This morning I asked it, "why didn't you test for [specific thing]?" And it replied, "because I don't know how to properly mock [name of a library I was using]".
Yesterday, while bored in a meeting, I asked copilot whether a coworker’s proposal for an OKR was good and it replied “it’s ok, but keep in mind that it’s a lagging metric”. It’s scary.
Now that I didn't know!
Just because something looks similar or is even identical doesn't mean copyright applies.
That horrifying back-and-forth showed that lawyers can consider very small and obvious fragments of code absolutely copyrightable. And the fact that it went on for nearly a decade should tell you that none of this is simple.
Take care making public statements like this if your work is highly attributable / traceable.
I'm deliberately ignoring the tortured argument you're trying to make that Copilot is similarly unethical, which is just ridiculous. It deserves to be ignored.
2) It might be fair to allow authors to submit repos, along with some sort of 'proof of ownership', to Copilot in order to exclude them from the training set. There might have to be a documented (agreed-upon?) schedule for 'retraining', in order for the exclusion list to take effect in a timely manner.
3) Or just allow authors to add a robots.txt to their repos, which specifies rules for training.
Just a few thoughts...
If a permissive, biz-friendly license (Apache 2.0, maybe others) is found in a given repo, then it can be used in the training set.
Otherwise, the repo cannot be used in the training set.
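A naive version of that rule is easy to sketch (my own illustration; the marker strings and filenames are assumptions, and real license detection is far more careful than substring matching):

```python
from pathlib import Path

# Hypothetical allow-list: substrings identifying permissive licenses.
PERMISSIVE_MARKERS = (
    "Apache License",
    "MIT License",
    "BSD 2-Clause",
    "BSD 3-Clause",
)

def eligible_for_training(repo_path):
    """Include a repo in the training set only if a recognizably
    permissive LICENSE file is present; exclude it otherwise."""
    for name in ("LICENSE", "LICENSE.txt", "LICENSE.md"):
        license_file = Path(repo_path) / name
        if license_file.is_file():
            text = license_file.read_text(errors="ignore")
            return any(marker in text for marker in PERMISSIVE_MARKERS)
    return False  # no license file found: default to exclusion
```

The default-to-exclusion branch matters: an unlicensed repo is all-rights-reserved by default, so "no LICENSE file" has to mean "not in the training set".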
What I find more concerning is that Copilot is effectively an extension of automatically copying stuff from Stack Overflow, with even less understanding on the prompt writer's part of what the code does.
Don't get me wrong, I absolutely see the benefits, but the risk listed in the article seems less material than a further general decline in code quality. "Built by a human" may end up being a thing, the same way "organic" became part of the daily vocabulary.
What happens when this extends to the "specialists" that blindly copy code off Stack Overflow? What happens when this becomes part of learning to program? Will it be as useful for producing working, efficient code when used by people who don't know what they're doing?
The part that I find unsettling when using Copilot is the risk that credentials or secrets embedded in the code, or being edited in (.gitignore'd) config files, are being sent off to Microsoft for AI-munging and possible human review for improvements to the model.
Since Copilot is constantly making new suggestions, a momentary entry is all it takes.
I needed to run a comparison over a window of a numpy array, and given the sheer size of my data, I needed it to be fast and efficient, which means vectorized operations with minimal Python interaction. Copilot figured out a solution that is orders of magnitude faster than what I could conjure up in 10 minutes, most of which I'd spent searching for similar solutions on SO.
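I don't know what Copilot's exact suggestion was, but for the curious, the standard vectorized trick for this kind of windowed comparison looks something like this (using `sliding_window_view`, available in NumPy 1.20+; the windows are views, so no data is copied):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_matches(arr, pattern):
    """Compare every length-len(pattern) window of `arr` against `pattern`
    in one vectorized pass; return the start indices of exact matches."""
    windows = sliding_window_view(arr, len(pattern))  # shape: (n - k + 1, k)
    return np.nonzero((windows == pattern).all(axis=1))[0]
```

For example, `window_matches(np.array([1, 2, 3, 1, 2, 3]), np.array([1, 2]))` finds the pattern at indices 0 and 3. The comparison and reduction happen entirely in C, which is where the orders-of-magnitude speedup over a Python loop comes from.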
Here is an example of a license that attempts to directly prohibit training. The problem is that you can imagine such software can't be used in any part of a system that might be used for training or inference (in the OS, for example). Somehow you need to additionally specify that the software is used directly... But how, what does that mean? This is left as an exercise for the reader and I hope someone can write something better:
The No-AI 3-Clause License
This is the BSD 2-Clause License, unmodified except for the addition of a third clause. The intention of the third clause is to prohibit, e.g., use in the training of language models, and likewise use during language model inference. Such language models are used commercially to aggregate and interpolate intellectual property. This is performed with no acknowledgement of authorship or lineage, no attribution or citation. In effect, the intellectual property used to train such models becomes anonymous common property. The social rewards (e.g., credit, respect) that often motivate open source work are undermined.

License text:
https://bugfix-66.com/7a82559a13b39c7fa404320c14f47ce0c304fa...

How much hubris we have as a species, to think that our professions will endure until the end of the stars. To think that the software we write will be eternal.
The thing that we do now is no different than spinning cotton.
I'd be shocked if the total duration of human-authored programming lasted more than a hundred years.
I'll also wager that in thirty years, "we'll" write more software in any given year than all of history up until that point.
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Presumably, as long as GitHub Copilot either:

a) fails to respect these terms itself, or

b) fails to present them to the user who is going to use its output verbatim or produce derivative code from it, so that the user can respect them,

then GitHub Copilot is either in violation of the license or a tool assisting in such a violation by stripping the license away†.
From TFA:
> David Heinemeier Hansson, creator of Ruby on Rails, argues that the backlash against Copilot runs contrary to the whole spirit of open source. Copilot is “exactly the kind of collaborative, innovative breakthrough that I’m thrilled to see any open source code that I put into the world used to enable,” he writes. “Isn’t this partly why we share our code to begin with? To enable others to remix, reuse, and regenerate with?”
I don't mean to disrespect DHH, but the "spirit of open source" isn't to wildly share code around as if it were public domain, because it is not. An author gets to choose the framework within which their code may be used and modified††; otherwise one would have used public domain as a non-license (plus the WTFPL for those jurisdictions where one can't relinquish one's own creation into the public domain).
† Depending on whether the "AI"/Microsoft can be held liable for the automated derivative, or the end user can.
†† cue GPL vs MIT/BSD
All this "controversy" around Copilot just reeks of a kind of technological "social justice" that most people didn't sign up for but seem happy to sit, watch, and commiserate on.
People want to use it, but are extremely worried about getting in hot water for doing so. That's no idle concern.
It’s very reasonable to ban its use within a company given the legal limbo.
Exactly.
You are free to view yourself as a code production machine, where what you produce is independent of the situation before and after you make anything, but many of us would like to take ownership and action on the legal, economic, and social planes with our work.
Laws can be slow to catch up, that's a feature. The legislature and courts exist for a reason. You can argue they're moving so slow it's better to ignore them. But that introduces a very real liability, at least until you can convince a court or elect representatives.
This means the output of an AI isn't even considered a "work" in the eyes of copyright law, if I'm understanding correctly. If the output of Copilot is not a "work", then the output of Copilot cannot be a "derivative work", and cannot violate copyright.
Courts have repeatedly found that only stuff humans create can be copyrighted.
If GitHub Copilot can't produce a work, it can't violate copyright; only human operators can do that.
This makes the recent GitHub announcement about upcoming Copilot features make much more sense legally: features that show the origin of a suggestion and let a user select which code licenses to accept suggestions from. Previously these seemed like things to appease critics, but they're tools to help paying Copilot users know what code they're actually using. Nice.
IANAL.
anyway, this lawsuit is gonna fail so effin' hard. lol
[0]: https://www.theverge.com/2022/2/21/22944335/us-copyright-off...
Humans can create copyrighted works, and therefore can violate copyright.
Given these things, I fail to see how anyone at GitHub could get into any trouble, legally.
Thus, the suit against GitHub/Microsoft will fail, which was the point that I apparently failed to communicate clearly enough.
If anyone is liability-adjacent here, it is copilot users, who must always be mindful of what they write anyway.
The other, is that people might be using Github to share stuff they've come up with other developers, but having an AI parse that information means that there's a disconnect between giver and receiver. It removes a chunk of the feedback loop being possible, which makes it so rather than it being a community of developers, it becomes something more akin to content creators and lurkers. That's not necessarily a bad thing, due to it opening up the sheer number of possible usages that end up using something. But it would minimize community feedback.
Copilot suggesting it for inclusion into proprietary codebases is then effectively whitewashing that GPL code.
Copilot also does not provide attribution, which is a legal requirement of tons of permissive licenses, including MIT and BSD.
Step 1: "GitHub Copilot bad... amirite?!"
Snark aside, most of these articles miss the mark to the point where it seems like the author is tech-illiterate and just parroting soundbites from others' opinions.
Regardless, the risk needs to be really big for me to stop using it. It's such an essential tool for me now that I'm shocked at how crippled I feel when the internet stops working and I realize how much I depend on it.
Looks like Microsoft says the burden is on Copilot users to 'vet' the code.
Regarding someone else suing you because you used Copilot, their terms say:
> GitHub will defend Customer against any claim brought by an unaffiliated third party to the extent it alleges Customer’s authorized use of the Service infringes a copyright, patent, or trademark or misappropriates a trade secret of an unaffiliated third party.
Also: does Copilot work for Clojure, and is it any good for Clojure?
I sincerely hope Microsoft loses this lawsuit.
I don't see anything wrong with "stealing" code that was meant to be public
Banning it brings no value compared to what those tools offer.
Also, how is that different from Google scraping the whole internet?
Many reasons. Google Search provides sources; it links to your website. Copilot only gives you the content. Google also doesn't suggest including its search results in your product verbatim.
Even if the ruling did go against them, it would likely still be acceptable to train the models since it would be the usage that would be legally suspect.
Which means that in five years anyone will be capable of running these models entirely off-line on their personal machines and no one will ever know the difference.
And that's just for GPL code. Code not under an OSS license could get way worse.
I assume you only ever use self hosted source control then? And then where is it hosted?
It's very easy to host your own GitLab server if you need a fancy web interface, and even easier to just put Git anywhere if you don't.
In general I want to write as little code as possible as more code = more problems. The code I do write I want to put great care and craft into in order to keep it maintainable. Giving up any of my agency in this critical area seems like a terrible idea to me.
Something that will help me write more code, or write code faster is of no benefit to me.
What's odd is that almost every single report of it being useful I notice comes from someone who is against code licenses. Or rather, not that they're philosophically opposed, but that they disregard licenses altogether because it benefits them and they can't be stopped. I've seen so few reports of usefulness coupled with legal or moral skepticism.