This tool remains the equivalent of money laundering for violation of Open Source licenses (or software licenses in general).
If the model was trained entirely using code that Microsoft has copyright over (for example: the MS Windows codebase, the MS Office codebase, etc.) then they could offer legal assurances that they have usage rights to the generated code as derivative works.
Without such assurances, how do you know that the generated code is not subject to copyright and what license the generated code is under?
Are you comfortable risking your company's IP by unknowingly using AGPLv3-licensed code [1] that Copilot "generated" in your company's products?
[1] https://en.wikipedia.org/wiki/GNU_Affero_General_Public_Lice...
This would not risk your company's IP.
To me this whole thing is like Pandora's box, and it will not in any way be put back. In the long run, isn't arguing about the code it generates and how it generates it mostly tilting at windmills? I've already met new / junior programmers who have used Copilot and ChatGPT to help them see how to approach certain problems, or to get better framing for what they couldn't quite put into the most accurate words to google.
I too would prefer these tools embody the ideal: no license violations, perfect citation of where the archetypes of the code came from. I've commented here today (amongst some great FOSS software engineers) to see if a genuine, respectful conversation can be had about how, just like torrents, this one isn't going back in the box no matter how many legal precedents attempt (or succeed) at cutting off heads of the hydra. Its utility seems like it will steamroll any attempts to stop or slow it down.
Am I wrong? Is it a fool's errand to ask?
What? I don't see any utility outside of education and even there it's pretty sketchy.
For business, legal compliance is not a joke and instantly shuts it down. The only businesses willing to use ChatGPT for generating code would be naive young startups who don't realize some assembly is still required and the instructions are missing no matter how much they query the bot. That's called expertise (which they don't yet have). It's not good enough to just write the code. Someone has to comprehend it so they can tweak it as needed. At some point the tweaks will become unwieldy and require actual software engineering that the bot doesn't know how to do (transform from one design pattern to another and know which to use). More power to them if they can cobble something together and then succeed at maintaining it. By the time they're through they'll have pulled off so many miracles that they won't need the bot anymore and become experts. That's quite the trial by fire, but hey everyone has to find their way!
I'd have no objections to a tool that generated suggestions that came with attributions and license metadata, ready to insert into your project's file for third-party licenses. AI code suggestions are impressive.
I have objections to a tool that generates derived works from code without respecting the licenses of that code. For permissively licensed Open Source code, including that code without attribution deprives authors of their due credit (said credit often being how people get employment or funding). For copyleft Open Source code, including that code without using a compatible license violates the conditions upon which people made that code available for others to build upon and share. For proprietary code, including that code at all incurs legal risks.
We’ll hang back until other companies have litigated their way to some legislation around it.
No matter what is said, there are no license guarantees on the generated code, as you don’t know the exact provenance, so it seems only sensible to be on the safe side.
I would prefer to see full license attributions included in generated responses, though. And from that, it shouldn't be difficult to generate a licenses file, should it?
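To illustrate how little machinery that last step would need: assuming the tool handed back per-suggestion attribution records (the field names here are my invention, not anything Copilot actually emits), turning them into a third-party licenses file is a few lines of Python.

```python
# Hypothetical per-suggestion attribution metadata. These field names are
# assumptions for illustration, not an API any real tool provides today.
attributions = [
    {"project": "left-pad", "author": "Jane Doe", "license": "MIT",
     "url": "https://example.com/left-pad"},
    {"project": "fastjson", "author": "John Roe", "license": "Apache-2.0",
     "url": "https://example.com/fastjson"},
]

def build_licenses_file(records):
    """Render a simple third-party licenses file from attribution records."""
    lines = []
    for r in sorted(records, key=lambda r: r["project"]):
        lines.append(f"{r['project']} ({r['license']})")
        lines.append(f"  Author: {r['author']}")
        lines.append(f"  Source: {r['url']}")
        lines.append("")
    return "\n".join(lines)

print(build_licenses_file(attributions))
```

The hard part is producing the attribution records in the first place, not formatting them.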
Amazon's CodeWhisperer has a "reference tracker" that tells you the license of training data code if the generated response is within some similarity threshold, but that's still not good enough imo.
Exactly. By all means build tools like this, but build them to actually comply with Open Source licenses. Provide a list of the licenses you don't mind copying from, and get back attributions with your suggestions.
I don't think it's possible to do better than that with this technology.
The result looks like my own code and uses the already existing parts of my application. The code it writes for me solves problems that you cannot find a standard solution for anywhere and is definitely not something that could be attributed.
How Copilot is trained is an issue but answering the question "where did this code come from and what license is it under" would be impossible.
Not at all. I'm saying it's derived from large amounts of code, without respecting the licenses on that code.
> How Copilot is trained is an issue but answering the question "where did this code come from and what license is it under" would be impossible.
Then it shouldn't exist outside of demos of what could exist in the future if the showstopper legal problem gets solved. Let's get people treating that constraint as business-critical and start coming up with clever solutions, and see how long "impossible" lasts.
I understand that some folks don't believe the same thing, and use copyleft licenses so that their code can't be re-used in a closed way, and that's fair. Github shouldn't be training their product on copyleft licenses.
It's fair to call out its misuse of certain licenses, but "the equivalent of money laundering for violation of Open Source licenses" is simply inaccurate, as many licenses allow this type of re-use explicitly.
If you do, then by all means they're welcome to use it without attribution or preservation of copyright notices, per the terms of the license you used.
But for all the Open Source code, even permissive Open Source code, that does require attribution or preservation of copyright notices, that's still a license violation. People don't often think of permissive Open Source licenses as something that can be violated, but they absolutely can be.
Double-checking whether the generated part is a verbatim copy negates the speed advantage.
Possible infringement via similarity rather than verbatim copying is even harder to search for.
Was looking for a way to instruct Copilot to abide by the following rules:
- Only use Apache v2, MIT or BSD licensed work for its recommendations. (Or a specific license set)
- Only use code trained on public repositories.
- Provide code attributions of the source code where the recommendations originate from.
I'm not sure if the last point is possible given these GPT type architectures but it would really help during code reviews.
Even if you use code under these licenses, you are still supposed to credit the authors by reproducing the license. So you need to know where it came from. Do you credit all the software the model was trained on?
That's what a good chunk of people do anyway at work. No one really cares, nor will care. We were already moving in that direction anyway; this will just accelerate it.
Disclaimer: I'm from the Codeium team. But really, we will even ship you a physical box if that level of data security is important to you.
I'm sure it's probably still very useful for people who care about the privacy tradeoff, but I've had more success with ChatGPT.
We are fans of ChatGPT and think that ChatGPT is pretty complementary to tools like Copilot and Codeium. ChatGPT is helpful for longer form exploratory questions from natural language while Codeium in its current form is great to accelerate your coding.
Without that, I can't even entertain the idea of using an AI code tool for anything but private projects that I don't share with anyone.
Edit: Also, Notepad++ support would be awesome
* Simple license management
* Organization-wide policy management
* Industry-leading privacy
* Corporate proxy support
Wow. Who’s going to pay a 90% premium for these features?
Edit: OK seems like different marketing pages have different features. The list above comes from https://github.com/features/copilot/. Still seems like a very steep increase over the base. And I cannot believe there are only 400ish companies using copilot.
From here: https://github.com/customer-terms/github-copilot-product-spe...
I can’t even imagine the hilarity that would ensue if I went to my GC right now to ask permission with so much in limbo; it’d be suicide by conference call.
> Copilot for Business does not retain any telemetry or Code Snippets Data.
https://docs.github.com/en/copilot/configuring-github-copilo...
I wouldn't risk it. It is too easy to write the wrong prompts and leak PHI.
ChatGPT:
"Write me a parser for this HL7 message..."
Copilot:
"Using this example message please write a parser for it..."
Yeah... If it was compliant, people would write those in a heartbeat.
Unless sold as HIPAA compliant, and the conditions of use for that compliance are known... don't trust it, for SAAS.
This is stuff covered in your yearly HIPAA briefing folks.
1. First, you really only need to worry about the specifics of HIPAA if you are a "covered entity" under the law (primarily a hospital or other healthcare provider, or a health insurer), or if you have signed a BAA with another company (more on that below). There are all sorts of misunderstandings that you can't, for example, say something like "Jane couldn't make the meeting because she's out with the flu" at a company - that's not how it works. Unless you're a covered entity, you're under no obligation to keep PHI private under HIPAA.
2. If you do work at a HIPAA covered entity, it usually is made explicitly clear where patient data is or is not allowed. Even if GitHub Copilot were "HIPAA Compliant", unless they signed a HIPAA BAA (business associate agreement) with your company, it's still not OK to send them any PHI.
Point being, there are plenty of reasons to be worried about customer privacy and data security, but people like to bring up HIPAA rules in lots of situations where they simply don't apply.
I doubt any company would use this in their production code. Internal tools, maybe.
It's amazing at specific things, like being a context-aware boilerplate generator or doing the scaffolding of an algorithm from a comment describing it.
That's really different than using libraries.
I am using it within JS/Vue and Python and really enjoy it a lot. You can write a simple comment like "upsert the object in the store, and if the item is incomplete recursively call a refresh fn until it is complete" and it will "magically" do the rest - down to understanding where the objects I am referring to are in my store, the attribute used to decide this ('completed_at'), down to the correct syntax for updating the data in a way that plays nice with vue reactivity.
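For readers who haven't tried this workflow, here is a rough Python analogue of the behavior that comment describes. The store shape, the stand-in `refresh`, and the object fields are illustrative; only the upsert-then-recursively-refresh pattern and the `completed_at` convention come from the comment above.

```python
store = {}  # keyed by object id, standing in for the app's store

def refresh(obj):
    """Stand-in for a backend fetch that fills in missing fields; here it
    simply marks the object complete so the example terminates."""
    return {**obj, "completed_at": "2023-01-01T00:00:00Z"}

def upsert(obj):
    """Insert or update obj in the store; if it is incomplete (missing
    'completed_at'), refresh it and recurse until it is complete."""
    store[obj["id"]] = obj
    if obj.get("completed_at") is None:
        upsert(refresh(obj))

upsert({"id": 7, "completed_at": None})
```

The impressive part in practice is that Copilot infers the store location and the `completed_at` convention from the surrounding project, which no snippet this small can demonstrate.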
It's also a stellar autocompleter in Python-land. I have been using more and more type annotations in my codebases, but even without that, it will usually guess the right attribute or function name.
I also dig the way it will automatically write a docstring for a function. It can sometimes be a great debugging method since I can just have it comment an undocumented fn to quickly glean what it is doing. I'll circle back to enhance the docstring usually too so it is less cookie cutter.
For writing blog posts it is really neat too, because it will help me write functions to illustrate certain points based on what I am trying to teach in the post. Most of the time I can come up with the most succinct and relevant example, but sometimes I cannot and this tool does a good job helping there.
It's not perfect, but the fact that I can type a few words, hit tab, and then correct any mistakes is really a magic experience sometimes.
Have any big companies set policies on employees using these kind of tools? Do they allow them?
My company (big UK-based tech company) had an “all employees” sort of e-mail saying the use of Copilot, ChatGPT et al was not allowed for anything work-related or using company equipment due to unclear licensing model of the generated code.
I find many of these blanket rules silly, but in this case I find it is sensible to wait before polluting our products with this autogenerated code.
I'd say no thanks. I think programmers who use Copilot are paying for something that'll hurt them in the long run for a tiny benefit in the short run.
I don't trust Microsoft, and neither should you.