GitHub Copilot regurgitates valid secrets

(twitter.com)

201 points · petulla · 4y ago · 143 comments

iamlucaswolf4y ago· 18 in thread

What amazes me is how predictable(?) all of the recent issues were.

Don't get me wrong, the folks behind Copilot are clearly, without any doubt, smart, creative, and capable. But then... None of these issues (reproducing licensed code verbatim, non-compiling code, getting semantics wrong, and now this) are 0.01% edge cases that take specialized knowledge to see or trigger. I remember some of them being called out days ago in the initial HN thread by people who didn't even have beta access.

I really wonder what this announcement/rollout looked like on the management side of things. Because a) these shortcomings must have been known beforehand and b) backlash from people who feel their jobs are threatened or feel robbed of their open source work was (I guess) foreseeable? I've already read calls to abandon GitHub for competitors; this can hardly have been an acceptable outcome here.

Nevertheless, Copilot is still one of the most innovative and interesting products I've seen in a while.

Olreich4y ago

I’d be very surprised if management at the least didn’t have their heads in the sand about the potential failure mode. There are often “must deliver” dates at large companies because someone made a promise about a deadline and now heads will roll for missing it whether anyone actually cares or not. So long as middle management thinks the C suite is watching them, they are desperate to meet quota.

Hilariously, this results in stuff like Copilot getting released to great big legal problems. Only then does the C suite actually notice the project and get upset that it is a legal nightmare for them.

I think the real secret to winning in big tech is that your job is just to keep your head down and keep the money rolling in without causing headaches for higher ups. Increase sales, make customers happy enough to keep paying, maybe release a cool product. But more importantly, don’t cause a major outage, burn the PR team, or get caught up in a legal kerfuffle.

rorykoehler4y ago

You make a good case for innovation through acquisitions rather than in-house development. Once the derisking aspect is factored in acquisitions suddenly look a lot more attractive.

rorykoehler4y ago

I saw this type of thing coming a mile away and left GitHub as soon as they were bought by Microsoft. TBH even despite my inherent distrust of Microsoft this is way beyond the hypotheticals I had in mind when I deleted all content from my GitHub account. Now I’m worried about VSCode as another potential vulnerability vector. Has anyone done a recent independent audit of what is sent across the wire to Microsoft from VSCode?

belter4y ago

Well they have Telemetry enabled by default so you should disable it:

https://code.visualstudio.com/docs/getstarted/telemetry#:~:t....

Maybe something else still goes over the wire...

visarga4y ago

Yes, it looks like unfinished work. They could have:

- implemented plagiarism detection to attribute code to its source (where possible), then present the result together with the link. This makes Copilot the same as Googling your answer and then copy-pasting the code. You are fully responsible.

- implemented some regexes to filter out secrets, or even better, change the secrets to random values in the training data

- implemented a robots.txt like system so people have a method to ban the Copilot spider from their code

If they did these things before release it would have been so much better. But they are simple fixes so I see no technical obstacle.
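
The regex suggestion above could look roughly like the following. This is a minimal sketch, not anything Copilot is known to do: the two patterns are assumed, illustrative key shapes (a SendGrid-style `SG.` key and an AWS-style `AKIA` access key ID), and `scrub_secrets` is a hypothetical helper name. Matches are overwritten with random characters of the same length, which is the "change the secrets to random values" variant.

```python
import re
import secrets
import string

# Assumed, illustrative key shapes; real providers may use other formats.
SECRET_PATTERNS = [
    re.compile(r"\bSG\.[A-Za-z0-9_-]{22}\.[A-Za-z0-9_-]{43}"),  # SendGrid-style key
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                        # AWS-style access key ID
]

_ALPHABET = string.ascii_letters + string.digits


def _randomize(match: re.Match) -> str:
    """Swap each letter or digit for a random one; keep length and punctuation."""
    return "".join(
        secrets.choice(_ALPHABET) if ch.isalnum() else ch
        for ch in match.group(0)
    )


def scrub_secrets(source: str) -> str:
    """Return source with every pattern match replaced by random characters."""
    for pattern in SECRET_PATTERNS:
        source = pattern.sub(_randomize, source)
    return source
```

A pattern list like this only catches key formats someone thought to add; anything else passes through untouched.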

rattray4y ago

We should keep in mind that the product is still in beta / technical preview.

naniwaduni4y ago

Should we really be forgiving one of the world's richest corporations for launching a marketing campaign with expansive claims for a half-baked product because, in the fine print, they call it a technical preview?

iamlucaswolf4y ago

Absolutely. I also believe that Copilot is getting more flak than appropriate at the moment.

To rephrase my comment above: I don't want to blame the team behind Copilot for not getting everything right on the first try. Neither am I in a position to do so, nor would I want to live in a world where smart people aren't allowed to make mistakes.

What irritates me is that there are two possible scenarios here:

1) They knew about potential issues and decided to release it anyway (without at least addressing them verbally). 2) They didn't.

And frankly, I don't know which one I like less. Even though it's still a beta/preview, either option seems to signal a degree of negligence? that feels unnerving given the potential impact of such a system.

That being said, if we do live in scenario 1), then I am certain that better framing could have prevented the PR fallout that we're seeing right now (at least partially). IMHO, GitHub (the platform) is still a great product after all.

tasuki4y ago

If my product is in beta, is it ok for it to leak your secrets?

rst134y ago

Unfortunately this is something large corps like AWS have been getting away with for a while now. Releasing half-baked product clones as GA when in fact they're still clunky and are probably beta at max.

alkonaut4y ago

This is a good point. There is a lot of outrage now, but the product when finished might have every single wrinkle removed.

This one, for example, seems like it should be pretty easy to fix. You could even make a hack that replaces ALL sufficiently long and sufficiently random strings with garbage/zeroes, at the point of recall. The difference from the case of regurgitating GPL sources is that the information that it looks like an API key can be deduced from the output of Copilot, so you don't need to track it through the system like you would with a system of attribution.

skywhopper4y ago

How do you tell a “long and random string” from a base64 encoded PNG file or embedded script or…
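
To make the trade-off concrete, here is a rough sketch of the "long and random" heuristic using Shannon entropy over characters. The length and entropy thresholds are made-up values for illustration, and `looks_random` and `redact_random_strings` are hypothetical names.

```python
import base64
import math
import re
from collections import Counter

# Hypothetical thresholds chosen for illustration, not tuned on real data.
MIN_LENGTH = 32
MIN_BITS_PER_CHAR = 4.0

# Candidate tokens: long runs of characters that commonly appear in keys.
TOKEN = re.compile(r"[A-Za-z0-9+/=_-]{%d,}" % MIN_LENGTH)


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_random(token: str) -> bool:
    """True when the token is long and its characters are close to evenly spread."""
    return len(token) >= MIN_LENGTH and shannon_entropy(token) >= MIN_BITS_PER_CHAR


def redact_random_strings(output: str) -> str:
    """Zero out every candidate token that passes the looks_random check."""
    return TOKEN.sub(
        lambda m: "0" * len(m.group(0)) if looks_random(m.group(0)) else m.group(0),
        output,
    )


if __name__ == "__main__":
    # Base64-encoded binary data, standing in for an embedded image.
    blob = base64.b64encode(bytes(range(256))).decode()
    print(looks_random(blob))
```

Base64-encoded binary data scores as high as a real key under this measure, so the filter would blank out embedded images along with secrets, which is exactly the objection raised here.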

kalium-xyz4y ago

Is it more innovative than for example: tab9?

callamdelaney4y ago

There are things that already do what Copilot does, eg Kite so it's hardly innovative imo.

nicce4y ago

Kite does only a fraction of what Copilot is currently doing. It is great at suggesting function names and parameters, but it does not really suggest complete code or generate somewhat new code.

Iv4y ago

That's the norm for a Microsoft product. Sell something full of holes, deal with it only when it starts posing an existential threat to the product

syshum4y ago

I don't know if that's true. Just because you are "smart, creative, and capable" does not mean you can predict every possible outcome or are incapable of missing the obvious.

I have been on both sides of that, where I have had to point out obvious flaws in an idea to very smart people, and have had clearly obvious flaws pointed out to me in one of my ideas...

I think it is completely possible that some or even all of the issues co-pilot is facing were unknown at the time of release, even if they are obvious to some

tyingq4y ago

Though they could have proofed the small number of handpicked examples on copilot.github.com to see that they compiled/didn't blow up on first run. Or one further, that they did what they were supposed to, in a somewhat reasonable way.

the84724y ago· 8 in thread

On the other hand it also means someone checked those secrets into github somewhere, so they would also be retrievable with a classic search.

henearkr4y ago

Has Copilot been trained on private repos as well?

If so, it means that you wouldn't find them by a search, but they would still be revealed by the A.I.

the84724y ago

The question is easily answered by checking their FAQ:

> What data has GitHub Copilot been trained on?

> GitHub Copilot is powered by OpenAI Codex, a new AI system created by OpenAI. It has been trained on a selection of English language and source code from publicly available sources, including code in public repositories on GitHub.

henearkr4y ago

Thanks and sorry to be so lazy (facepalm)...

intricatedetail4y ago

How can you verify it is true?

gitgud4y ago

Hmm, well they definitely crawl private repos for security vulnerabilities, you need to opt-out if you don't want them to...

So it's not too far-fetched, that they'd train AI on private repos data too...

karmasimida4y ago

You'd be surprised how many people are using GitHub without any privacy consideration.

Any free text generation AI service has the same weakness, that you can't control what your model would spill out.

that_guy_iain4y ago

It's such an issue that I believe AWS scans every commit pushed for secrets and disables them.

I know because I accidentally pushed a secret myself. Mistakes like that can happen super easily when it's a simple project and you've not set up all your things correctly.

mrfusion4y ago

This should be the top comment. We’re just shining light on an existing issue.

MontyCarloHall4y ago· 7 in thread

Unintentional copyright violations and “leaking” of secrets people accidentally committed to public repos aside, my main issue with Copilot is that I don’t think it actually makes coding easier.

Everyone knows it’s usually far easier to write code than to read code. Writing code is a nonlinear process: you don’t start from the first character and write everything out in one single pass. Instead, the logic of the code evolves nonlinearly—add a bit here, remove a bit there, restructure a bit over there. Good code is written such that it can be mostly understood in a single pass, but this is not always possible. For example, understanding function calls requires jumping around the code to where the function is defined (and often deeper down the stack). Understanding a conditional with multiple branches requires first reading all the conditional predicates before reading the code blocks they lead to.

Reading, on the other hand, is naturally a linear process. Understanding code requires reconstructing the nonlinear flow through it, and the nonlinear thought process used to write it in the first place. This is why constant communication between partners during pair programming is essential—if too much unexplained code gets dumped on a partner, figuring out how it works takes longer than just writing it themself.

Copilot is like pair programming with a completely incommunicative partner who can’t walk you through the code they just wrote. You therefore still have to review most of it manually, which takes much longer than writing it yourself in the first place.

whydid4y ago

In theory, I agree with your point.

However, I've worked with people who struggle to write in English without introducing random punctuation, and don't "see" (or care about) the text on the screen enough to go back and fix it.

I think Copilot will be a great benefit to the lazy programmer, who understands the semantics, but just can't be bothered to get the indentation or other syntax correct.

MontyCarloHall4y ago

I 100% agree, but that’s exactly what code linters do, which have been around for decades.

That said, a more sophisticated linter might be useful in catching non-idiomatic, but syntactically/stylistically valid code that would thus be flagged as “valid” by current linters’ simple automata.

schwartzworld4y ago

Not sure what language you work in, but in JS we have tooling like Prettier so we don't have to worry about code formatting so much.

rorykoehler4y ago

My main concern is it will give non-technical managers funny ideas about what goes into writing code. Writing out what you want to do is the easy part. Figuring out how to do what you want to do in context of the broader environment is the main challenge and that requires time to think and reason.

masterphilo4y ago

Copilot, as far as I know, also does not seem to factor in the greater context of the application/code you're in when auto-completing these tasks.

To me, this is a huge part of modern-day development. It's not only about producing functionally correct code, but also code that integrates well and is semantically relevant to the broader context of the application itself.

That doesn't mean Copilot's input will have no value, but it just means that developers will generally need to refactor that code in a way consistent with the app they're building.

Olreich4y ago

I think if your code requires reconstruction of nonlinear pieces to read it, you’ve written bad code. Fundamentally, a program is a list of instructions for a computer to run. The more linear that list is, the more efficiently the computer can run it. Linear code is also much simpler to read and understand for us humans as you pointed out. Iterate on the code until you find the linear path through it, otherwise you’re going to be in a world of pain if you need to understand it in the future.

MontyCarloHall4y ago

In principle I agree, but it often can’t be avoided. Function calls, loops, and even things as trivial as Boolean short circuiting are all examples of essential, unavoidable nonlinear code.

sergiomattei4y ago· 7 in thread

Called it.

nojito4y ago

The keys would be sniffed out by secret scanner anyway

https://docs.github.com/en/code-security/secret-security/abo...

Doesn’t make sense to implement fixes on CoPilot.

atraac4y ago

The stuff you write doesn't have to be committed to GitHub (or am I missing something?), so this argument makes no sense. Copilot clearly scanned and autocompleted third-party secrets; that is in no way acceptable behaviour.

dolmen4y ago

The Copilot FAQ has mentions about its training set.

https://copilot.github.com/

nojito4y ago

It's trained on public repos.

Public repos are scanned by the secret scanner. Why is it CoPilot's responsibility to scan for the keys too?

BtM9094y ago

Can you get these keys by searching via a search engine?

karmasimida4y ago

Common Crawl contains a considerable amount of PII if you really spend time digging, for example.

erikrothoff4y ago

Yes. There are quite a few services that scan GitHub and show leaked secrets in real time. I once found someone's Gmail password that way...

s_gourichon4y ago· 6 in thread

It does not generate secrets. The Twitter conversation does not mention that word. Most certainly, it regurgitates secrets it has seen on crawled repos. Can the title be adjusted, please?

chrisseaton4y ago

I think ‘generates’ means produces this output in this context. It’s the correct term for the technology being talked about. Nobody is confused that it’s producing new cryptographic tokens.

wccrawford4y ago

Actually, reading the title, I wondered how it could possibly do just that.

But it's not generating the secrets, it's generating code that contains those secrets.

playpause4y ago

I was confused. "Copilot generates valid secrets" sounds like it can be used to generate new secrets that are valid in format or something. The headline is misleading, even if you personally weren't misled. The secrets are not being generated, they are real secrets appearing in generated code.

s_gourichon4y ago

So, Co-pilot generates text including secrets it has seen.

For a comparison point, afl fuzzer actually generates never-seen, valid JPEG files "out of thin air", see http://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thi... and matching conversation https://news.ycombinator.com/item?id=8571879

It would be interesting to combine both approaches and (interrupted by the sudden singularity ;-).

numpad04y ago

Agreed. This is just another example of the apparent reality that Copilot is an obfuscated database rather than a magical generalization of programming concepts.

dang4y ago

Sure. It's a good excuse to get the word regurgitate in an HN title.

lokl4y ago· 5 in thread

In light of this, unless there is evidence to the contrary, I'm going to assume it will also regurgitate malicious code.

qayxc4y ago

> I'm going to assume it will also regurgitate malicious code.

Well, if it wouldn't replicate or even randomly generate malicious code, that would imply that CoPilot is somehow able to solve the Halting Problem or - at the very least - understand the intent and purpose of both its output and its training material.

Keep in mind that the very definition of "malicious code" is highly subjective, plus the intent and purpose aren't necessarily encoded in the program itself. If the latter were the case, there would be no need for documentation, requirements or specs.

lokl4y ago

When I say "malicious code," what I really mean is some well-known patterns of malicious code, not all malicious code in general. Just like we are surprised about "secrets" being regurgitated when we mean "API keys."

lennoff4y ago

Oh, it does... It generates code that's vulnerable to SQL injections, XSS, etc. It was trained on such code! (hopefully this will improve)
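
For readers who have not seen the pattern, here is a small self-contained illustration of the kind of SQL injection bug being described, next to the parameterized form that avoids it. It uses Python's built-in `sqlite3` module with an in-memory table; the table, function names, and payload are invented for the example and are not taken from any Copilot output.

```python
import sqlite3

# Throwaway in-memory database with two example rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 0), ("root", 1)])


def find_user_unsafe(name: str) -> list:
    # Vulnerable: user input is spliced directly into the SQL text.
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(name: str) -> list:
    # Parameterized: the value is passed separately from the SQL text.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


# Input that closes the quote and appends an always-true condition.
payload = "nobody' OR '1'='1"
```

With the unsafe version, the payload rewrites the WHERE clause so that it matches every row; the parameterized version treats the same input as an ordinary, non-matching name.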

lokl4y ago

Training on code that unintentionally has vulnerabilities is a problem, but I'm even more worried about bad actors intentionally putting code with vulnerabilities on GitHub with the hope that it will become training data. Bad actors might learn how to disguise code to sneak it into Copilot (if disguise is even necessary) and introduce backdoors, etc. It could be especially dangerous because of the "stamp of approval" Copilot has from GitHub/Microsoft. People who would not copy/paste code from the web might feel a false sense of security using Copilot.

TeMPOraL4y ago

I totally expect this will happen. I bet this is already happening. As long as they continue to train Copilot on third-party code that wasn't thoroughly, manually vetted by them, this vector of attack will remain open - and mitigating it falls into the domain of... spam filtering.

phumberdroz4y ago· 4 in thread

The only time I would think this is a valid security issue is if those were tokens that were not previously public. But that should not be the case, right?

e12e4y ago

Sure, if someone checked in a secret to a repo that at some point was public, and got crawled by co-pilot - they should cycle that secret, so it's no longer valid - rather than only mark the repo private and/or nuke the secret from the repo history.

But there's another side to this - if you write code using co-pilot against a popular API - and co-pilot gives you a valid key - and you access data or a system you aren't supposed to - would you be liable under the various draconian anti-hacker laws?

If you pick up a key card from the street, and enter someone's home - you'd be trespassing after all..

phumberdroz4y ago

That is a good question and I think you should be. After all you are still the Person that writes and produces the code just with the help of a tool. Similar to a lockpick. (I hope that makes sense)

holstvoogd4y ago

Let's hope so... I expect that these were accidentally committed to a public repo.

However, while the keys are then already leaked, you'd have to go search for them. Copilot suggests you use them in your editor. That is not quite the same imo.

It goes from deliberately searching and using leaked keys to having them handed to you without context. I feel it is a bit like finding an unlocked bike: if you take it, it is still stealing. But here, let's say, there is a guy at the bike parking who is handing out bikes to anyone passing by. Not the best analogy, but I think it covers my point ;)

phumberdroz4y ago

I think it would be more like a friend telling you to take the bike or saying it is his bike and you can take it for a ride.

But yes I get your point but I also believe people still need to apply some sense to what co pilot suggests.

aloisdg4y ago· 4 in thread

This is why environment variables exist
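
A minimal sketch of that approach, assuming the key lives in a variable named `SENDGRID_API_KEY` (the variable name and the `load_api_key` helper are illustrative, not part of any library):

```python
import os


def load_api_key(var_name: str = "SENDGRID_API_KEY") -> str:
    """Read a secret from the environment instead of hard-coding it in source."""
    key = os.environ.get(var_name)
    if not key:
        # Fail loudly rather than fall back to a default baked into the code.
        raise RuntimeError(f"{var_name} is not set; export it before running")
    return key
```

The source file then contains only the variable's name, so committing it to a public repo exposes nothing, as long as the file that sets the variable stays out of version control.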

corobo4y ago

Definitely true but in this case it does add another point against "it generates code, not copies it"

the84724y ago

Deliberate, repeated prompt engineering to make it regurgitate something is not proof for the general case. The same way that humans being able to recite text when asked to is not proof that they're incapable of creating new content.

corobo4y ago

To clarify, my issue was not that it output Quake code when prompted -- of course that's obvious, it's clear the AI had no intention of outputting the code until the user provided function Q_

The issue is that it's impossible to tell if it's done that or not short of googling each line it comes out with

Anyway I've determined my reactions are now based on disappointment rather than thought, a bias that should be factored in

rightbyte4y ago

Which you put in a startup.sh, which you later commit. With a lesson learned you add those to git ignore. Speaking for a friend.

llimos4y ago· 4 in thread

I do feel for the people behind Copilot, even though they'll have known it was coming. They produce something absolutely frggin' amazing that can change the world and for the next few days all everyone does is pile on and pull it to pieces... yes of course these are valid issues but can we please look at the big picture and appreciate what an achievement this is?

cdrini4y ago

I agree with this; I'm very confused by what appears to be a very strong visceral reaction to this experimental feature. I don't know what impact it will have on programmers/programming, but I'm curious to see. Personally, I see something like copilot as a terrific search engine. Searching for code in Github is kind of difficult; being able to search by writing a descriptive comment is really cool!

esailija4y ago

It's only because people like you pretend this glorified markov chain is something more that is causing everyone to pile on. So stop.

treesprite824y ago

> you pretend this glorified markov chain is something more

I hate this belittling attitude that HN has towards projects in certain fields. It's like if I dismissed any progress in graphics as "just glorified triangle-drawing".

TomSwirly4y ago

> They produce something absolutely frggin' amazing that can change the world

It won't change nothing. It's madlibs for code. The 90% of programmers who are mediocre will drown out the 9% who are competent and the 1% who are talented.

beshrkayali4y ago· 3 in thread

It's really kind of comical at this point. The more this copilot bs continues to be a thing, the more it's making Github seem irresponsible/careless at best.

e3bc54b24y ago

The cynic in me is thinking that marketing folks at Microsoft/Github are giggling endlessly at all the stories giving them free and extreme publicity. This is reinforced by the recent post by Github 'analyzing' Copilot's code regurgitation, instead of retracting it and retraining it on a more defined subset of codebases.

This thing is worthless at best, annoying at everything and terrifyingly capable of destroying every programmer's productivity at worst. But it will stay, it will grow because stupid execs will keep dreaming of replacing their engineering talent with this, and Microsoft will laugh all the way to the bank.

beshrkayali4y ago

There's no way they didn't expect some backlash on this. So I think it's partly for the marketing gimmick. I'm sure there are people at Github tho who really think they're making something valuable unfortunately. Sadly, what they're not aware of is that they've become part of the experiment themselves.

It's like expecting that an AI trained on Shakespeare novels would be able to "help" a writer write Shakespeare-like novels. Sure, they might get something that might fool some people, but are they a writer? I think software is a lot more like "writing" than it is like "building".

What mostly annoys me is that this is a win-win for github regardless of the outcome. If people buy into it (even for just a while, and it currently seems like some really smart people are buying into it) they'll carve a huge piece of a new market. If it fails, they'll make it seem like an experiment into the whole ethical gray area of what should and shouldn't be used for training, and that they just wanted to draw attention to it.

WesolyKubeczek4y ago

Just until the software at the bank is rewritten using CoPilot and suddenly they can’t withdraw any money.

rsynnott4y ago· 3 in thread

I'm kind of astonished that this project got greenlit, given Microsoft's previous experiences with embarrassing AI projects (thinking particularly of Tay and Zo).

blagie4y ago

Microsoft isn't a person. Individuals at organizations make decisions, and it's very possible the person greenlighting this never heard of Tay and Zo. It's almost certain they weren't the same person.

account424y ago

You think it's reasonable for someone greenlighting projects at MS to not familiarize themselves with related previous projects?

loloquwowndueo4y ago

What about Microsoft Bob :)

cabirum4y ago· 2 in thread

Can we please stop (mis)using the term "AI"? It just does not live up to most people's expectations.

Copilot is a glorified Markov chain autocomplete sitting on a huge dump pile of data. It is not aware of constructs such as "licenses" or "secrets" most people would have expected from AI. To prevent it from spilling secrets everywhere, a developer ~~should teach the AI a concept of secrets and the meaning of licenses.~~ has to implement a filter. A regexp-based one will do, I guess.

armatav4y ago

Yeah, this. Modern AI is not AI - AI is synthetic machine life and essentially always has been, in fiction and non-fictional idealism.

Deep neural networks have stuck us in hope-fueled uncanny valley, and very smart people tend to become very confused about their technology when they're subject to it.

This technology has its place about heuristic programming, definitely, but is not AI.

TeMPOraL4y ago

Someone will definitely be quick to reply with the "AI is a moving goalpost, things stop being called AI when they work, for example..." - so I'll offer my counterpoint up front. These things were never AI except in marketing lingo and in connection to the research in machine learning. The common folk definition of AI doesn't change - it's still the same vision of computers in science fiction, with which you can converse, and which can think better than you (except in some specific ways in which they are super-dumb - this is necessary for the story to have any plot).

notimrelakatos4y ago· 2 in thread

I look forward to the bright future where I have to maintain messy code from AI Rockstars.

weird-eye-issue4y ago

This is exactly what I'm afraid of. If people use this tool early in their journey of learning programming it will do them a big disservice.

avipars4y ago

The next generation of script kiddies

WillDaSilva4y ago· 2 in thread

I don't consider this a problem. Copilot was trained on public repos, so these secrets had to be checked into public repos. They were already totally public, and should have been invalidated/replaced and redacted. Copilot might result in previously undiscovered published secrets being found, but that's not much worse than anyone finding one under normal circumstances.

intricatedetail4y ago

Have they produced evidence it was trained only on public repos? They should release the model and tooling so we can verify that.

treesprite824y ago

The easiest way to test this would probably be to try to get it to generate code/secrets that appear only in private repos.

rvz4y ago· 2 in thread

So GitHub Copilot has inherited all the bad practices of many Stack Overflow and GitHub side projects and generates them in front of you as 'assistance'.

All the API keys are still working and who knows, someone might complain about a huge fee right in here because they forgot to revoke it. Only time will tell.

I am certainly going to avoid this contraption. No thanks and most certainly no deal.

Downvoters: So are you saying GitHub Copilot DOES NOT do the following:

- Leak working API keys in the editor.

- Generate broken code AND give you the wrong implementation if you add a single typo.

- Copy and regurgitate copyrighted code verbatim.

- Guess right in only 1 out of 10 tries.

- Send parts of your code when you type in the editor.

Are you VERY sure?

dolmen4y ago

> Leak working API keys in the editor.

The content of your editor is sent by Copilot to its cloud service (the FAQ says: "The GitHub Copilot editor extension sends your comments and code to the GitHub Copilot service"). So yes any editor content is leaked, including sensitive information.

But is this content sent to other Copilot users? AFAIK, no. The FAQ mentions the OpenAI Codex as the training set.

https://openai.com/

dolmen4y ago

I just had a look at the OpenAI website.

They advertise an English-to-French translation service powered by AI. But it appears that no native French speaker has even reviewed the service presentation. When the marketing material is just a joke, what can you expect in production if you use the service as a customer?

Just this example tells me much about the internal organization of the company.

e12e4y ago· 1 in thread

> SendGrid engineer reports API keys generated by the AI are not only valid but still functional.

> GitHub CEO acknowledges the issue... still waiting for them to pull the plug

I agree this is an issue for co-pilot as well - but it's really on SendGrid to invalidate keys that are known to be leaked?

Yes, that's inconvenient for the affected customers - otoh they won't get billed for other people's usage - or dinged for someone spamming using their keys...

karmicthreat4y ago

Most SendGrid customers are in their public IP pool. They don't tolerate spam on those IPs because it's difficult to manage with the spam lists. So they are definitely proactive about killing leaked keys.

I leaked one when I accidentally left it in a repo I was making public. It took SendGrid 15 minutes to drop it after the repo went public.

aritmo4y ago· 1 in thread

It is one thing to put by accident your API key on your public Github repository.

And it's another (bigger) issue for Copilot to pick up that API key and put it in someone else's project.

blagie4y ago

No. Your public leak is the bigger issue.

Copilot is an issue, but I would assume malicious agents are already doing copilot-like things, just not in the open.

intricatedetail4y ago· 1 in thread

Grand Source Code theft. A permanent stain on Github?

They should scrap it and Microsoft should be ordered to sell Github because they have a conflict of interest.

For example Microsoft has access to your private repos and can do things like co pilot with your data. Who knows maybe your code powers Windows 11 now.

easton4y ago

Just like the conflict of interest they’ve had for 20 years selling TFS/Azure DevOps?

0x04y ago

Any chance Copilot could be made to cough up the DVDCSS or BluRay AACS DRM secrets?

sputknick4y ago

I see this as a problem with the developers who are committing code, and not a problem with Copilot. If you make your secrets accessible then they might be accessed. Also, rotating your keys regularly would mitigate these issues. This is a problem with humans failing to execute known security best practices, not malicious AI doing something insidious.

fxtentacle4y ago

If Copilot was trained only on public repos like they claim, then shouldn't those API keys already be disabled due to existing secret scanning tools?

For example https://docs.github.com/en/code-security/secret-security/abo...

The fact that Copilot recreates API keys that still work makes me wonder if they come from a semi-public place, because SendGrid is usually quite fast at blocking API keys that were accidentally made public.

speedgoose4y ago

People put valid secrets in their public repository all the time.

Just a quick search:

https://grep.app/search?q=%28secret%7Capi%29_%3Fkey%5Cs%3A%3...

tyingq4y ago

I wish he would have tried to track down if the keys were in a public repo before asking Sendgrid about them. If they turned out to be only on Github private repos, that would be new and interesting info.

Not saying putting keys in a private, but 3rd party hosted repo, is a terrific idea.

mvolfik4y ago

https://web.archive.org/web/20210705123028/https://twitter.c...

> COPILOT SECURITY BREACH

> SendGrid engineer reports API keys generated by the AI are not only valid but still functional.

> GitHub CEO acknowledges the issue... still waiting for them to pull the plug or make a comment. :popcorn:

Quoting https://twitter.com/pkell7/status/1411058236321681414

Paradigma114y ago

I really don't think that stuff hosted in public repos can be classified as secrets.

villgax4y ago

We are in a time where you have a crystal ball/chip to which you can whisper sweet nothings & get back answers

input_sh4y ago

There truly is an XKCD for everything: https://xkcd.com/2169/

ibraheemdev4y ago

Was the tweet taken down?

tgsovlerkhgsel4y ago

Tweet taken down (?), does anyone have a mirror?


iamlucaswolf4y ago· 18 in thread

What amazes me is how predictable(?) all of the recent issues were.

Nevertheless, Copilot is still one of the most innovative and interesting products I've seen in a while.

Olreich4y ago

rorykoehler4y ago

You make a good case for innovation through acquisitions rather than in-house development. Once the derisking aspect is factored in acquisitions suddenly look a lot more attractive.

rorykoehler4y ago

belter4y ago

Well they have Telemetry enabled by default so you should disable it:

https://code.visualstudio.com/docs/getstarted/telemetry#:~:t....

Maybe something else still goes over the wire...

visarga4y ago

Yes, it looks like unfinished work. They could have:

- implemented some regexes to filter out secrets, or even better, change the secrets to random values in the training data

- implemented a robots.txt like system so people have a method to ban the Copilot spider from their code

If they did these things before release it would have been so much better. But they are simple fixes so I see no technical obstacle.
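A minimal sketch of the first suggestion above (swapping secrets for random values before training), in Python. The key pattern is an assumption about what a SendGrid-style key looks like ("SG." followed by 22 and 43 URL-safe characters), and `scrub_secrets` is a hypothetical helper name; nothing here is claimed to be how Copilot's training data is or was prepared.

```python
import re
import secrets
import string

# Assumed shape of a SendGrid-style key: "SG." + 22 chars + "." + 43 chars.
# This pattern is an illustration, not something stated in the thread.
SENDGRID_KEY = re.compile(r"SG\.[A-Za-z0-9_-]{22}\.[A-Za-z0-9_-]{43}")

_ALPHABET = string.ascii_letters + string.digits


def _random_chars(length):
    """Return `length` random alphanumeric characters."""
    return "".join(secrets.choice(_ALPHABET) for _ in range(length))


def _random_key(_match):
    """Build a random key with the same assumed shape as the matched one."""
    return "SG." + _random_chars(22) + "." + _random_chars(43)


def scrub_secrets(source):
    """Replace every key-shaped substring with a freshly generated random value."""
    return SENDGRID_KEY.sub(_random_key, source)
```

A real filter would need one pattern per provider, and would still miss secrets that have no recognizable prefix, which is the objection raised further down the thread.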

rattray4y ago

We should keep in mind that the product is still in beta / technical preview.

naniwaduni4y ago


iamlucaswolf4y ago

Absolutely. I also believe that Copilot is getting more flak than appropriate at the moment.

What irritates me is that there are two possible scenarios here:

1) They knew about potential issues and decided to release it anyway (without at least addressing them verbally). 2) They didn't.

tasuki4y ago

If my product is in beta, is it ok for it to leak your secrets?

rst134y ago

alkonaut4y ago

This is a good point. There is a lot of outrage now, but the product when finished might have every single wrinkle removed.

skywhopper4y ago

How do you tell a “long and random string” from a base64 encoded PNG file or embedded script or…


kalium-xyz4y ago

Is it more innovative than for example: tab9?

callamdelaney4y ago

There are things that already do what Copilot does, eg Kite so it's hardly innovative imo.

nicce4y ago

Kite does only a fraction of what Copilot is currently doing. It is great at suggesting function names and parameters, but it does not really suggest complete code or generate somewhat new code.

Iv4y ago

That's the norm for a Microsoft product. Sell something full of holes, deal with it only when it starts posing an existential threat to the product

syshum4y ago

I don't know if that's true. Just because you are "smart, creative, and capable" does not mean you can predict every possible outcome or be incapable of missing the obvious

I have been on both sides of that, where I have had to point out obvious flaws in an idea to very smart people, and have had clearly obvious flaws pointed out to me in one of my ideas...

I think it is completely possible that some or even all of the issues co-pilot is facing were unknown at the time of release, even if they are obvious to some

tyingq4y ago

the84724y ago· 8 in thread

On the other hand it also means someone checked those secrets into github somewhere, so they would also be retrievable with a classic search.

henearkr4y ago

Has Copilot been trained on private repos as well?

If so, it means that you wouldn't find them by a search, but they would still be revealed by the A.I.

the84724y ago

The question is easily answered by checking their FAQ:

> What data has GitHub Copilot been trained on?

henearkr4y ago

Thanks and sorry to be so lazy (facepalm)...

intricatedetail4y ago

How can you verify it is true?


gitgud4y ago

Hmm, well they definitely crawl private repos for security vulnerabilities, you need to opt-out if you don't want them to...

So it's not too far-fetched, that they'd train AI on private repos data too...

karmasimida4y ago

You'd be surprised how many people are using GitHub without any privacy consideration.

Any free text generation AI service has the same weakness, that you can't control what your model would spill out.

that_guy_iain4y ago

It's such an issue that I believe AWS scans every commit pushed for secrets and disables them.

I know because I accidentally pushed a secret myself. Mistakes like that can happen super easily when it's a simple project and you've not set up all your things correctly.


mrfusion4y ago

This should be the top comment. We’re just shining light on an existing issue.

MontyCarloHall4y ago· 7 in thread

whydid4y ago

In theory, I agree with your point.

However, I've worked with people who struggle to write in English without introducing random punctuation, and don't "see" (or care about) the text on the screen enough to go back and fix it.

I think Copilot will be a great benefit to the lazy programmer, who understands the semantics, but just can't be bothered to get the indentation or other syntax correct.

MontyCarloHall4y ago

I 100% agree, but that’s exactly what code linters do, which have been around for decades.

schwartzworld4y ago

Not sure what language you work in, but in JS we have tooling like prettier so we don't have to worry about code formatting so much.

rorykoehler4y ago

masterphilo4y ago

Copilot, as far as I know, also does not seem to factor in the greater context of the application/code you're in when auto-completing these tasks.

That doesn't mean Copilot's input will have no value, but it just means that developers will generally need to refactor that code in a way consistent with the app they're building.

Olreich4y ago

MontyCarloHall4y ago

In principle I agree, but it often can’t be avoided. Function calls, loops, and even things as trivial as Boolean short circuiting are all examples of essential, unavoidable nonlinear code.

sergiomattei4y ago· 7 in thread

Called it.

nojito4y ago

The keys would be sniffed out by secret scanner anyway

https://docs.github.com/en/code-security/secret-security/abo...

Doesn’t make sense to implement fixes on CoPilot.

atraac4y ago

dolmen4y ago

The Copilot FAQ has mentions about its training set.

https://copilot.github.com/

nojito4y ago

It's trained on public repos.

Public repos are scanned by the secret scanner. Why is it CoPilot's responsibility to scan for the keys too?

BtM9094y ago

Can you get these keys by searching via a search engine?

karmasimida4y ago

CommonCrawl contains a considerable amount of PII if you really spend time digging, for example.

erikrothoff4y ago

Yes. There are quite a few services that scan GitHub and show leaked secrets in real time. I once found someone's Gmail password that way...

s_gourichon4y ago· 6 in thread

It does not generate secrets. The Twitter conversation does not mention that word. Most certainly, it regurgitates secrets it has seen on crawled repos. Can the title be adjusted, please?

chrisseaton4y ago

I think ‘generates’ means produces this output in this context. It’s the correct term for the technology being talked about. Nobody is confused that it’s producing new cryptographic tokens.

wccrawford4y ago

Actually, reading the title, I wondered how it could possibly do just that.

But it's not generating the secrets, it's generating code that contains those secrets.

playpause4y ago


s_gourichon4y ago

So, Co-pilot generates text including secrets it has seen.

It would be interesting to combine both approaches and (interrupted by the sudden singularity ;-).

numpad04y ago

Agreed. This is just another example of the apparent reality that Copilot is more an obfuscated database than a magical generalization of programming concepts

dang4y ago

Sure. It's a good excuse to get the word regurgitate in an HN title.

lokl4y ago· 5 in thread

In light of this, unless there is evidence to the contrary, I'm going to assume it will also regurgitate malicious code.

qayxc4y ago

> I'm going to assume it will also regurgitate malicious code.

lokl4y ago

lennoff4y ago

Oh, it does... It generates code that's vulnerable to SQL injections, XSS, etc. It was trained on such code! (hopefully this will improve)

lokl4y ago

TeMPOraL4y ago

phumberdroz4y ago· 4 in thread

The only time I would think this is a valid security issue is if those were tokens that were previously not public. But that should not be the case, right?

e12e4y ago

If you pick up a key card from the street, and enter someone's home - you'd be trespassing after all..

phumberdroz4y ago

That is a good question and I think you should be. After all you are still the Person that writes and produces the code just with the help of a tool. Similar to a lockpick. (I hope that makes sense)

holstvoogd4y ago

Lets hope so... I expect that these were accidentally committed to a public repo

However, while the keys are then already leaked, you'd have to go search for them. Copilot suggests you use them in your editor. That is not quite the same imo.

phumberdroz4y ago

I think it would be more like a friend telling you to take the bike or saying it is his bike and you can take it for a ride.

But yes I get your point but I also believe people still need to apply some sense to what co pilot suggests.

aloisdg4y ago· 4 in thread

This is why environment variables exist
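As a rough illustration of that point, here is a minimal Python sketch that reads a key from the environment instead of hard-coding it in the source. The variable name `SENDGRID_API_KEY` and the helper `load_api_key` are assumptions made for the example, not names taken from the thread or from any particular SDK.

```python
import os


def load_api_key(name="SENDGRID_API_KEY"):
    """Read an API key from the environment; `name` is an assumed variable name."""
    key = os.environ.get(name)
    if not key:
        # Fail loudly rather than falling back to a value committed in the repo.
        raise RuntimeError(name + " is not set in the environment")
    return key
```

This only helps if the file that sets the variable stays out of version control as well.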

corobo4y ago

Definitely true but in this case it does add another point against "it generates code, not copies it"

the84724y ago

corobo4y ago

To clarify, my issue was not that it output the Quake code when prompted. Of course that's obvious; it's clear the AI had no intention of outputting the code until the user provided function Q_

The issue is that it's impossible to tell if it's done that or not short of googling each line it comes out with

Anyway I've determined my reactions are now based on disappointment rather than thought, a bias that should be factored in

rightbyte4y ago

Which you put in a startup.sh, which you later commit. With a lesson learned you add those to git ignore. Speaking for a friend.

llimos4y ago· 4 in thread

cdrini4y ago

esailija4y ago

It's only because people like you pretend this glorified markov chain is something more that is causing everyone to pile on. So stop.

treesprite824y ago

> you pretend this glorified markov chain is something more

I hate this belittling attitude that HN has towards projects in certain fields. It's like if I dismissed any progress in graphics as "just glorified triangle-drawing".

TomSwirly4y ago

> They produce something absolutely frggin' amazing that can change the world

It won't change nothing. It's madlibs for code. The 90% of programmers who are mediocre will drown out the 9% who are competent and the 1% who are talented.

beshrkayali4y ago· 3 in thread

It's really kind of comical at this point. The more this copilot bs continues to be a thing, the more it's making Github seem irresponsible/careless at best.

e3bc54b24y ago

beshrkayali4y ago

WesolyKubeczek4y ago

Just until the software at the bank is rewritten using CoPilot and suddenly they can’t withdraw any money.

rsynnott4y ago· 3 in thread

I'm kind of astonished that this project got greenlit, given Microsoft's previous experiences with embarrassing AI projects (thinking particularly of Tay and Zo).

blagie4y ago

account424y ago

You think it's reasonable for someone greenlighting projects at MS to not familiarize themselves with related previous projects?


loloquwowndueo4y ago

What about Microsoft Bob :)

cabirum4y ago· 2 in thread

Can we please stop (mis)using the term "AI"? It just does not live up to most people's expectations.

armatav4y ago

Yeah, this. Modern AI is not AI - AI is synthetic machine life and essentially always has been, in fiction and non-fictional idealism.

Deep neural networks have stuck us in hope-fueled uncanny valley, and very smart people tend to become very confused about their technology when they're subject to it.

This technology has its place in heuristic programming, definitely, but is not AI.

TeMPOraL4y ago


notimrelakatos4y ago· 2 in thread

I look forward to the bright future where I have to maintain messy code from AI Rockstars.

weird-eye-issue4y ago

This is exactly what I'm afraid of. If people use this tool early in their journey of learning programming it will do them a big disservice.

avipars4y ago

The next generation of script kiddies

WillDaSilva4y ago· 2 in thread

intricatedetail4y ago

Have they produced evidence it was trained only on public repos? They should release the model and tooling so we can verify that.

treesprite824y ago

The easiest way to test this would probably be to try to get it to generate code/secrets that appear only in private repos.

rvz4y ago· 2 in thread

So GitHub Copilot has inherited all the bad practices of many Stack Overflow answers and GitHub side projects and generates them in front of you as 'assistance'.

All the API keys are still working and who knows, someone might complain about a huge fee right in here because they forgot to revoke it. Only time will tell.

I am certainly going to avoid this contraption. No thanks and most certainly no deal.

Downvoters: So are you saying GitHub Copilot DOES NOT do the following:

   Leak working API keys in the editor.

   Generate broken code AND give you the wrong implementation if you add a single typo?

   Copy and regurgitate copyrighted code verbatim.

   Guess right only 1 out of 10 tries.

   Send parts of your code when you type in the editor.

Are you VERY sure?

dolmen4y ago

> Leak working API keys in the editor.

But is this content sent to other Copilot users? AFAIK, no. The FAQ mentions the OpenAI Codex as the training set.

https://openai.com/

dolmen4y ago

I just had a look at the OpenAI website.

Just this example tells me much about the internal organization of the company.

e12e4y ago· 1 in thread

> SendGrid engineer reports API keys generated by the AI are not only valid but still functional.

> GitHub CEO acknowledges the issue... still waiting for them to pull the plug

I agree this is an issue for Copilot as well - but isn't it really on SendGrid to invalidate keys that are known to be leaked?

Yes, that's inconvenient for the affected customers - otoh they won't get billed for other people's usage - or dinged for someone spamming using their keys...

karmicthreat4y ago

I leaked one when I accidentally left one in a repo I was making public. Took 15 minutes for sendgrid to drop it after putting the repo public.

aritmo4y ago· 1 in thread

It is one thing to accidentally put your API key in your public GitHub repository.

And it's another (bigger) issue for Copilot to pick up that API key and put it in someone else's project.
