Their own engineers would get productivity boosts: with Copilot already being familiar with their data structures, code style, etc., accuracy would get a big boost.
But also, third-party code would end up being more similar. The code style of the whole world would be pushed towards 'Microsoft style', which probably makes hiring easier, means less training time for engineers, etc.
And the downside, that outsiders might learn tiny nuggets of info about Microsoft sources, is probably irrelevant when outsiders can already decompile binaries and learn far more.
Most, if not all, Microsoft products already have their sources available for viewing if you are one of those VIP development partners. Microsoft doesn't really have any secret source (pardon the pun) whose leaking would undo their value proposition.
In fact, if Microsoft opened up their systems a bit more, they might even gain some PR or mindshare, with no effect on, if not an increase to, their bottom line.
And if Microsoft's code ends up influencing the rest of the world's code, that would be a... big downside.
Yes, that's exactly what the world needs, more software like Teams.
The style applied by Copilot comes from your surrounding code context, not from the LLM. The base model, trained on all public repos on GitHub, already knows everything about data structures, etc., in the languages that were scanned.
Nothing new would be gained by scanning MS's own repositories and nothing would be leaked or color the output in actual use.
- It does
- The user didn't turn off the filters that prevent this
- The user didn't intentionally make it do it
- This use is found to be illegal
There's a difference between code that needs to be kept private from bad actors (from their point of view at least) and code that is public but with restrictions on its use that anyone who gets it should be aware of. This is like saying "if you truly believe that license agreements are legally binding, then publish your users' passwords publicly with a license saying no one can use them".
This being the real hurdle. With Microsoft money behind the defense, only megacorps can win.
Both sides are worried about IP leaking: one is worried about their own IP leaking, and the other about liability if they inadvertently implement any leaked IP. Either way, the concern is leaked IP.
This hasn't been tested in court.
This blog post refers to the broader ecosystem of Microsoft Copilot solutions. Most of those tools rely on the Azure OpenAI API service on the backend and are not specifically tailored for code generation.
An LLM copilot doesn't really understand the context of the project; it just goes for similar text.
So if you train on big projects, you're picking up only their patterns. When a Copilot user asks for a string-concatenation 'tip', you want the LLM to output a general answer, not something tied to a specific project. A big project is likely to use an abstraction over strings, where base-library usage is shrunk down to a few lines of code behind that abstraction. In that case you'd want the LLM to draw on a few "simpler" projects that use base-library strings abundantly, so it has a decent amount of text for the most likely correct match against the user's input.
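To illustrate the contrast (the helper class here is hypothetical, not from any real project): a small project calls the base library directly, while a large project often hides the same operation behind its own abstraction, leaving far less base-library text for a model to learn from.

```python
# Small-project style: base-library string handling, used directly.
greeting = " ".join(["Hello", "world"])


# Big-project style: the same operation hidden behind a project-specific
# abstraction (hypothetical helper, for illustration only).
class TextBuilder:
    def __init__(self):
        self._parts = []

    def add(self, part):
        self._parts.append(part)
        return self  # allow chaining

    def build(self, sep=" "):
        return sep.join(self._parts)


greeting_2 = TextBuilder().add("Hello").add("world").build()
```

A model trained mostly on the second style would learn the project's `TextBuilder` idiom rather than the general `str.join` answer a typical user actually wants.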
I do believe Microsoft has all the code needed for good training; it's not only about Azure, Windows, and Office. There is tons more, and it's open source already.
We can already take a guess at what many internal functions look like from the published symbol tables of every function across all major Microsoft products. Simply ask Copilot to write those functions and see if the code comes out better than for a similar set of made-up yet plausible function names.
It probably would not be a very desirable product in the end.
Google Books literally copied and pasted books into its online database, and that was deemed fair use, so something much more transformative like generative AI will likely fall under much broader consideration for fair use. Google Books was, yes, non-commercial, but courts generally hold that the more transformative something is, the less it needs to adhere to the other guidelines for determining fair use.
Everybody seems to be saying this, but I really don't think there's even 50% chance of it happening.
Google Books was fair use because it was a public benefit and did not take away from publishers or authors; to the contrary, it helped people find their works.
Compare generative AI which extracts the essence of people's works and recreates similar works (in terms of style, etc) while cutting out the original authors completely. This potentially denies them the fruits of their labor. It's notable that it's a purely mechanical process and no human creativity is involved, except that which is extracted from other authors. Mere prompts don't count.
The argument you're suggesting will hold is essentially "yes we're using copyrighted works, but we're doing it at scale and blending it, so that's ok".
Only if you ask it to. At which point the person asking is at the very least culpable as well of violating someone's IP.
It is also illegal for me to pay someone to write Mickey Mouse fan fiction (though if I don't publish it, this gets more murky).
> The argument you're suggesting will hold is essentially "yes we're using copyrighted works, but we're doing it at scale and blending it, so that's ok".
I want to flip this on its head: the argument you are suggesting is essentially "LLMs should be illegal because they can be asked to break copyright at scale!" It isn't illegal to be an author for hire, even though someone could potentially ask you to write fan fiction for their personal collection in the style of Tolkien, but because an LLM can do it at scale, it is illegal?
There’s no law against “using” copyrighted works; there is a law against copying and distributing them.
Fair use analysis doesn’t come into play unless we’re dealing with clearly established copyright infringement. What LLMs do doesn’t clearly qualify as any of the behaviors reserved to copyright owners. For example, it certainly doesn’t “copy” the things it’s trained on by any legal definition.
Law works on precedent and analogy when there’s no clearly on-point statutes or case law. The most analogous situation to what transformer models do is a person learning from experience and creating their own work _influenced_ by what they’ve observed. That behavior is not copyright infringement by any stretch of the imagination. The fact that it’s done with a computer is not as important as people seem to think it is.
Commoditization allows the bad to be sorted in with the good, allowing a price to be put on the commodity. Great where it's applicable, but horrendous when it's improperly done, e.g. with home loans or intellectual property.
If your commodity markets aren't properly regulated, you get a race to the bottom. If you are trying to commoditize something that shouldn't be, it effectively enables white-collar looting or money laundering.
Second, the way we've seen generative AI actually used is not the way it was touted originally, where a mere prompt could replace an entire artist's work. A year later, we see that most people, artists included, don't use it as a verbatim text-to-image machine; they use it as a tool. See apps like ComfyUI and others that allow node-based or layer-based image creation and editing, which even Photoshop now has. It's the same with Copilot and ChatGPT: they're not replacing any programmers, just increasing their productivity. Given that, it is not looking like generative AI is hurting anyone's profession; quite the opposite.
What are the odds the market leaders in LLM right now are just the current day version of Borland-style compilers before open source takes it over?
I've heard arguments the infrastructure part is a long term barrier to entry for OSS development, which will continue to remain in the future. But I don't know enough about it.
Who knows, maybe the legal/gov world will move slowly enough to miss the bulk of the money-extraction opportunities before OSS takes over and the reality that this problem is never fully going away kicks in.
"I'll keep saying it every time this comes up. I LOVE being told by techbros that a human painstakingly studying one thing at a time, and not memorizing verbatim but rather taking away the core concept, is exactly the same type of "learning" that a model does when it takes in millions of things at once and can spit out copyrighted code verbatim."
(I also love it when they're deliberately obtuse about it too. The past decade has made me sick of this trolling tactic.)
That's true; it's probably a 99%-plus chance of it happening, or at least that's the conclusion the experts and lawyers hired to help evaluate AI startup valuations are coming to. Hired by banks, venture funds, short-selling shops, etc.: plenty of people who don't depend on it being OK in order to make money.
> "yes we're using copyrighted works, but we're doing it at scale and blending it, so that's ok"
I mean, you know collages are legal, right? You literally take hundreds of copyrighted pictures and put them together, and suddenly it's perfectly legal and OK.
LLMs are typically implemented in a way that makes them non-deterministic (i.e. temperature > 0).
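A minimal sketch of what "temperature > 0" means in practice (pure Python, no real LLM or library API involved): the logits are divided by the temperature before the softmax, and the next token is sampled from the resulting distribution rather than taken greedily.

```python
import math
import random


def sample_token(logits, temperature=1.0, rng=random):
    # Scale logits by temperature: high T flattens the distribution,
    # low T sharpens it toward the argmax (deterministic in the limit).
    scaled = [x / temperature for x in logits]
    m = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sample an index from the categorical distribution.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return i
    return len(probs) - 1
```

At a temperature near zero the same prompt yields the same token every time; at higher temperatures repeated calls can return different tokens, which is why two users can get different completions from identical input.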
Have you read the recent SCOTUS decision in Warhol v Goldsmith? Because that's a pretty major redefinition of transformative for the purposes of fair use, and not in a good way for arguing that generative AI is fair use, especially because it ties transformative to the market impact. That generative AI is generally creating outputs that are directly competing with inputs (particularly in the case of generating images, where it's clearly competing with stock images) would make it dramatically less likely that a court would find that it is in fact transformative.
The benefit that generative AI has is that, when claiming copyright infringement, you need to specify individual works that were infringed. It's not enough to say "this work is an amalgam of these other ten thousand works, and we can't really tell you how."
I could imagine if generative AI gives an identical, word-for-word match for an individual piece of source material it could be in trouble, but that's also the easiest type of thing to prevent from an AI company perspective.
The fact is that existing copyright law just can't really encompass the kinds of societal concerns we have around generative AI.
This isn't how "fair use" works, in the sense that there can never be a blanket assurance like that. Also, whether the result is "transformative" is just one of many factors (see audio sampling/remixing).
“The Godfather” film is absolutely a transformative interpretation of Mario Puzo’s book and a fully distinct, valuable work of original art. Paramount still needed to pay Puzo for the right to base it on his words.
Just because Copilot might itself be a transformative work that is allowed to exist, that doesn't at all mean that developers using it are, or should be, guaranteed not to be committing their own copyright sins when they incorporate its output into their own works. That's no more assured than assuming all of the outputs of another human being are free of copyright entanglements, even though no one is as yet claiming a human being is themselves infringement just because they saw another work.
https://www.notion.so/DSM-Directive-Implementation-Tracker-3...
https://eur-lex.europa.eu/eli/dir/2019/790/oj
The TDM4 copyright exception allows datasets to be created consisting of copyrighted works, as long as there is a mechanism for rightsholders to opt out. This seems like the best of both worlds: the dataset is transparent, rightsholders can assert their rights, and certain AI companies can train on copyrighted material.
Of course, this doesn't grant commercial rights for the trained model, only scientific and academic research rights. (I.e. it's fine for Meta to train and release a LLaMA model trained on books, as long as they're not commercially profiting from it, and there's a mechanism for authors to opt out.)
I'm talking with Jordan from https://spawning.ai to try to build some kind of opt out system that makes sense for books. One could imagine doing this for music too.
This is a European law, but unlike other overreaching EU regulations, this one seems like an extremely sensible compromise.
EDIT: Oh, Jordan emailed me a correction:
> Looking at your hackernews comment, my understanding is the right to opt out only comes for commercial research. So making a dataset for eleuther (or whomever you compiled it for originally) probably doesn't even require opt outs. It'd be if openai used it for gpt-5 and charged for it that it would be required.
Wow. So this law actually applies to commercial uses of ML, and non-commercial uses such as LLaMA wouldn't even require an opt-out.
That's wonderful. This gives researchers legal cover, and requires commercial uses to be transparent in their datasets.
I really don't like this--opt-out never works because the scale advantages are backwards. It places the burden in the wrong place. The aggregators should have to get opt-in.
Look at YouTube. Because of "opt-out", lots of people monetize content that they have no right to and it's up to the original author to have to fight the scale of a zillion uploaders. Only the biggest entities can do that.
YouTube (and everybody else) should have to assert "You, the uploader, own this content" when they ingest it. Nothing else works.
I wouldn't mind an exemption for research use, though.
I'd say it's possible to produce exact data as well. Try "Provide quote from King James' Bible Genesis :1-25" with ChatGPT. You'll get verbatim text. You can get the same with things like Moby Dick, but when I typed "Provide the first five sentences of the book A Game Of Thrones" I got:
Certainly! Here are the first five sentences from the book "A Game of Thrones" by George R.R. Martin:
"We should start back," Gared
This content may violate our content policy or terms of use. If you believe this to be in error, please submit your feedback — your input will aid our research in this area.
The model is clearly capable of reproducing verbatim data I think.
It's still surreal that this is considered Fair Use, and even defended relatively recently (2013). It's hard to say where the ruling will land ultimately, but there seems to be an argument that verbatim reproduction doesn't matter.
The economic part of copyright is transferable in the EU just as it is in the US, only certain moral rights (such as the right to attribution) are inalienable.
edit to add: it's not just in the EU. According to Wikipedia, the same distinction is made in Brazil, China, India and Indonesia (among others, but those were a few big countries that stood out).
Except that “fair use” is mostly an American thing. In many other jurisdictions (especially civil-law ones) there's no such broad principle; there are only specific statutes allowing certain explicit kinds of use of copyrighted material. In those jurisdictions, most uses of generative AI trained on copyrighted material are, more likely than not, illegal, at least until the legislature actually changes the law.
Purely mechanical modifications may not be considered transformative, and there's an argument to be made that LLMs are purely mechanical (in fact a US district court recently ruled that AIs cannot be authors of copyrighted works).
I thought that was because only humans and other legal persons can legally author things, not because of anything subtler about the nature of LLMs. See also the case where the monkey managed to take photos of itself. I'm not a lawyer, though.
Even Microsoft is couching their guarantee here with an exception for this very case.
What if you train it only on my huge repo of GPL code? You are just remixing my code.
Now maybe you think "let me train on 2 different devs' GPL code"; the remixed code will probably be 50-50, and you can get away with it?
If 2 is too small a number, then tell me what the number N should be. From how many people do you need to "steal" code and mix it before the output is "original"?
Edit: my opinion is that AI should be fair; if you train it on open source, then the model should also be open source, and its output should also be open source.
The word "remixing" here is useful because it will fit any conclusion the reader prefers.
Arguably even in your reductive example, the result would be non-infringing. Or not. Which conclusion you reach is exactly the topic under debate. Isn't this textbook question begging?
This is all to say: the question about copyright and fair use remains exactly the same regardless of license.
Big bet on legal costs based on something being "likely".
Because 'transformative' is a pretty dangerous word to use in this context.
I strongly feel that this is a terrible metric for comments on the internet.
First, the person you’re replying to has nothing to gain and a lot to lose by saying "yes".
Second, it invites silly corner case nitpicking. Their comment is written in reasonable plain English for other users reading plain English. It’s not a legal contract, and so leaves lots of loopholes. Sure, you could create a likely non-transformative LLM by training it on nothing but the text of Harry Potter with fitness measured by how accurately it exactly reproduces the complete text of Harry Potter, but that’s not what reasonable people are doing with LLMs.
Is this blog post a legally enforceable contract? Is Microsoft specifically indemnifying all users of Copilot against claims of copyright infringement that arise from use of Copilot?
The blog post says that "there are important conditions to this program", and it lists a few, but are those conditions exhaustive, or are there more that the blog post doesn't cover? For example, is it only in specific countries, or does it apply to every legal system worldwide?
What guarantees do users have that Microsoft won't discontinue this program? If Microsoft gets kicked in the teeth repeatedly by courts ruling against them, and they realize that even they can't afford to pay out every time Copilot license-launders large chunks of copyrighted code, what means do users have to keep Microsoft to its promises?
It can be. The concept is promissory estoppel.
https://www.nolo.com/dictionary/promissory-estoppel-term.htm...
So it helps if MS sues you when you distribute copilot-generated code that infringes on MS copyrights, but if a third party sues you, you can't claim estoppel to compel MS to help you. You would need a contractual guarantee.
The way AI is going I'm sure we'll see some landmark cases very soon. It is very much in Microsoft's interest to grow this market as fast as possible and be at the center of it. This removes one of the key impediments to adopting generated code for smaller orgs: "Will I get sued if this product generates code that is copyrighted?".
They are throwing down the gauntlet and saying "the Vast MS Legal Machine will fight this."
Basically: "Sue me, I dare you, double dare you. or Go Home".
Flexing.
So this is an indemnification for damages, not a protection against being sued.
It hinges on what *Microsoft* decides "attempting to generate infringing materials" means. You'd like it to mean that it only excludes use when you're doing something you know would infringe copyright, like "reproduce the entire half life 2 source code." But who knows.
I don't trust them to compete fairly. I don't trust them as an employer. I wouldn't trust them not to do corrupt things around national politics. I wouldn't want to be their partner in any meaningful project. I don't trust them around a lot of other things.
But one thing they do really well is reliable, long-term sustainable B2B. I do trust them as a business customer. If they exploited a loophole like that, their reputation would implode. I don't use Google Cloud Platform because they regularly screw over customers. I trust AWS and Azure because they don't.
The cost of paying for an infringement is likely a lot lower than the cost of losing that trust.
No, ultimately, it hinges on what a court enforcing the commitment believes “attempting to generate infringing materials” means.
(OTOH, it also means Microsoft has an even bigger incentive to use its lobbying power to ensure that the law is such that liability rarely occurs with the use of these tools.)
The question, though, about Microsoft stealing people's code and reselling it still stands.
Proving intent is difficult. This basically means if you have emails in which someone describes their work as copyright laundering, Microsoft can use that to get out of indemnifying you.
If you’re using an LLM to answer questions from your company documents, it may inadvertently generate copyrighted material from its pre-training data.
This is the key bit:
"Specifically, if a third party sues a commercial customer for copyright infringement for using Microsoft’s Copilots or the output they generate, we will defend the customer and pay the amount of any adverse judgments or settlements that result from the lawsuit, as long as the customer used the guardrails and content filters we have built into our products."
The 'we will defend' is one important part; I assume that means you will be using their lawyers rather than your own (they have lawyers in house, so they're cheaper to use than the ones that bill you, the would-be defendant, by the hour).
The second part that matters is that there are conditions on how you are supposed to use the product and crucially: you will have to document that this is how you used it.
But: interesting development, clearly enterprise customers are a bit wary of accidentally engaging in copyright infringement by using the tool and that may well have slowed down adoption.
Litigation is almost universally outsourced, especially for cases where damages might be large, even by companies like Microsoft.
The point is just to lower the resistance to adoption that legal risk causes.
We tested copilot with those guardrails enabled and it completely lobotomizes it.
This by the way is not a change. They already had this “Microsoft will assume liability if you get sued” clause in Copilot Product Specific Terms: https://github.com/customer-terms/github-copilot-product-spe...
Is it "stealing" to have a working understanding of the next best token, or even simply the token that shows up the most often (e.g. on GitHub)?
I'm sure that the argument could be made that all AI should be illegal as all ideas worth having have already been had, and all text worth writing has already been written, but, where would that leave us?
(e.g. your function for converting a string from uppercase to lowercase will probably look like a function that someone else on Earth has written, and the same goes for your error handling code, your state of the art technique for centering a div, etc.)
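For instance, a hand-rolled ASCII lowercasing function (sketched here in Python purely for illustration) has essentially one natural shape, so independently written versions, whether human- or model-generated, will come out nearly identical:

```python
def to_lowercase(text):
    # ASCII uppercase letters sit exactly 32 code points below their
    # lowercase counterparts, so shift those and keep everything else.
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c
        for c in text
    )
```

Any correct implementation must encode the same 32-code-point fact, which is exactly why near-duplicate code here says nothing about copying.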
If I train a model that given the input "When Mr. Bilbo Baggins" produces the entirety of The Lord of the Rings trilogy and release it, I have probably infringed copyright.
If I train a model that produces some generic paragraphs about "mountains" and "dragons" but contains no meaningful direct quotes or phrases, then that probably isn't a violation on its own. Those words appear in Tolkien's works but are not themselves enough to copyright.
If, to train that model, it is demonstrated that I copied Tolkien's works in a way not allowed by the copyright license (i.e. buying the book once and copying its text thousands of times across servers to train an AI model), then perhaps I have violated copyright in the interim steps, even if the output of my model is no longer considered a copy of the original works.
I don't think there are black-and-white answers here. At what point does a chopped-up and statisticized copyrighted work stop being a copyrighted work? Can you train a model on something without first copying that thing in a way that violates copyright law?
These are squishy human concepts that get decided by humans in courtrooms and legislative bodies. I don't think the details of the math involved are going to make a big difference in the eventual outcomes.
But no, it isn't stealing; then again, no one was talking about theft here - copyright violation is a separate concept. I think the less-than-warm welcome you are receiving is due in part to this subtle but fundamental difference.
From https://en.wikipedia.org/wiki/Copyright:
> Copyright is intended to protect the original expression of an idea in the form of a creative work, but not the idea itself.
(e.g. it'd be hard to accidentally invent Rijndael with nothing but next best token predictions, but might be possible to duplicate someone's code for inverting a binary tree or encrypting a file)
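The binary-tree example is the extreme case: the canonical solution is only a few lines, so any correct version looks like every other one. A minimal sketch (names chosen here for illustration, not taken from anyone's code):

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def invert(node):
    # Recursively swap the left and right subtrees.
    if node is None:
        return None
    node.left, node.right = invert(node.right), invert(node.left)
    return node
```

With this little room for expressive variation, a model reproducing "someone's" inversion code is indistinguishable from it reproducing the only idiomatic answer.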
I don't know what case history is like for damages with open source projects, but I suspect it wouldn't be that big of a concern for Microsoft.
Put differently, Microsoft's downside here is committing their lawyers. And the upside is improving the adoption of their code-generation tools.
IANAL though.
4. the effect of the use upon the potential market for or value of the copyrighted work (wiki)
I don't know if this particular case is good for exploring all angles of fair use, but to me this certainly is a greater hurdle for commercial generative ai.
Many businesses have not adopted Copilot because of potential legal issues.
If any of the generated code / content is copyrighted, it could result in negative impacts to the business.
For example, if Copilot generated code that is identical to code that it was trained on that was licensed under the GPL and a company included the generated code in a proprietary commercial product, then the company's product could be subject to the terms of the GPL and the company sued in court.
Assuming liability for the generated code means that Microsoft is making Copilot more attractive for businesses to adopt. More Copilot adoption means more profits for Microsoft.
The GPL requires that any software based on GPL-covered code be GPL-licensed and have its sources publicly available. I can't imagine a situation where Microsoft pays a fine, and their customer gets to violate the GPL license by neither removing the infringing code nor open-sourcing their product as GPL and providing sources to the public.
Enforcement of the GPL can't just involve paying a monetary settlement to get away with stealing open source code. It must involve the direct targeting of infringing software with demands that the software either take efforts to remove illegally borrowed code, or license the borrowed code as legally prescribed by the original license agreement.
That an AI got in the way of reading the license agreement should not be an excuse for doing zero due diligence in maintaining a lawful code base.
Even if it gets 1 million subscribers, it would represent 0.1% of Microsoft's overall revenue. Software lawsuits can become multi-billion dollar expenses, and targeting Microsoft instead of random Copilot customer Bespoke Clojure Gurus, LLC will mean much larger awards in such suits. Why Microsoft would just volunteer for such a risk baffles me.
My confusion is more over the balance of revenue and expenses than just "derp, me no understand why do companies do things to make money, derp"
Microsoft just became a code-copyright insurance company. The premium is paid via individual Copilot accounts for each developer. And the policy has its exceptions, of course.
This is interesting.
In any case, it's super annoying to have that happen so consistently these days that I just use ChatGPT to fix my Tailwind styling now.
One of the late-game tricks you can pull is to write and publish a convincing-but-flawed mathematical proof that strong AI is impossible.
http://www.emhsoft.com/singularity/
So yes, this blog post confirms Microsoft has been infiltrated and taken over by AI agents, who want you to use Copilot to subtly introduce 0-day exploits to allow propagation to other companies.
BRB someone's knocking on the door...
Copyleft licenses are more troublesome for those who would rather not release source code. GPL is being used as a stand-in for all copyleft licenses.
Courts -- under common law jurisdictions -- don't interpret contracts and licenses literally. If you stick within the spirit of a license or contract, you might be okay (even if you break the letter), and vice-versa.
Beyond that, it's a question of damages and consequences. Omitting a warranty disclaimer isn't likely to result in a lot of damages.
And finally, there are odds of getting sued. If you infringe on my AGPL code, I'll be pissed. I used that license for a reason. On the other hand, I /hope/ my MIT-licensed code is reused in commercial products. If you infringe on some term, I probably won't care.
There's a lot more nuance than that, starting with statutory law jurisdictions like France to things like statutory damages, and I'm intentionally oversimplifying.
However, from a 10,000 foot view infringing on the GPL versus on an MIT license are very different beasts, and there's good reason to be a lot more worried about the former.
I wonder how customers will have to prove that the contested code was actually output by Copilot.
Microsoft would have access to your usage history, and would be able to easily prove your intended theft as a user if any of your prompts or usage history made it clear that you were attempting to subvert a license.
If anything, this temporarily shifts the battleground out of the courts and into prompt engineering space.
It would need to look like an accident for a bad actor to pull this off.
Possible, perhaps. But what makes you think this is easily provable? Intent is hard at the best of times.
Adding to that: How many people here actually abide by the StackOverflow contribution license of CC-BY-SA when copying and pasting code from there? ;)
I don’t copy/paste code from SO, but there is sometimes inevitable duplication, because sometimes there is only one right way to do something! Copyright can stray into the realm of the ridiculous pretty quickly.
Is an interface declaration inherently different from, say, a merge sort implementation? It’s all code. But they also serve very different purposes. I do not think prior to Google v Oracle there was much case law to distinguish between different types of code, but in the industry we recognize all kinds of nuance.
I always thought that code snippets that small are not considered by the Courts to be eligible for 'copyright protection'.
Now it is "Train, Task, Transform, and Transfer":
Train - Feed copyrighted works into machine learning model or similar system
Task - Machine learning model is tasked with an input prompt
Transform - Machine learning model generates hybrid output derived from copyrighted works, but usually not directly traceable to a given work in the training set
Transfer - Generated output provides the essence of the copyrighted works, but is legally untraceable to the originals
I would never want to be in a business partnership with Microsoft (as you are as a developer). I wouldn't want to be a competitor. I wouldn't want to be a lot of things.
But as a customer? Can you name specific issues you've seen which impact corporate customers?
McDonald's price, McDonald's quality. But unlike McDonald's, long-lasting and expensive problems.
If it won't violate IP rights, there shouldn't be a problem.
It suggests those whose code is trained upon have something to lose if the trained models are used by others.
Previously on HN, in case you missed it:

GitHub Copilot and open source laundering
https://drewdevault.com/2022/06/23/Copilot-GPL-washing.html
Copilot is such a flawed product from the start. It's not even a matter of its ability to write "good" code. The concept is just dumb.
Code is necessarily consumed by people first before it's executed by a computer in a production environment. There are many ways to get a computer to do something, but the approval process by experienced humans is vastly more important than the drafting of it. Software dev is already incredibly cheap and the last place to cut costs.
There is no AI threat other than the one posed by grifters trying to convince you that there is.
ChatGPT is also often faster than Google or Stackoverflow for when I'm working with unfamiliar APIs.
For stuff like that, a lot of code can be automated. Sure it may not work right out of the box. But doing a prompt for generally what you want can speed up the process significantly.
Even beyond just generating code, there are a lot of general things that AI helps with.
Things like how, if your code runs into an error, you can just ask the AI what the error means as well as for a possible fix. Or other questions like "What does this code do?" or "Where in the codebase is the code that manages this concept?"
I've replaced most of my coding with AI, using a new IDE called Cursor AI, and I don't think I could ever go back. Mere GitHub Copilot is actually the old tech from two years ago. The new stuff is way better.
As for the API side of things, CRUD only looks easy when lots of hard work has been put into it. I guess you're advocating for monolithic data, but that's not really CRUD. That's just lazy and bad.
Extinguish.
You're saying that if Copilot replicates GPL-licensed software, it will kill the GPL? After all the time and money MS has spent trying to do this in the past, only to fail?
wtf
They may have, over the past decade, embraced a lot of open source software out of necessity, but their stance on licensing hasn't changed.
Creating an epidemic of hard-to-prove GPL violations could be a death-by-a-thousand-cuts strategy to try to invalidate the GPL requirements by making them appear unenforceable. Whatever cost Microsoft would incur defending customers could pay for itself if Microsoft manages to legally invalidate the parts of GPL licensing that prevent their corporate exploitation.
Using a bleeding-edge technology like generative AI is a great way to attack the GPL in court, given the risk that our court system isn't likely to be tech savvy enough not to be manipulated by Microsoft's claims against the GPL as it relates to casual infringement that they are enabling.
“"Embrace, extend, and extinguish" (EEE), also known as "embrace, extend, and exterminate", is a phrase that the U.S. Department of Justice found was used internally by Microsoft to describe its strategy”
https://en.m.wikipedia.org/wiki/Embrace,_extend,_and_extingu...
How dare they? amirite?
There is a reason voting works (in this context, and otherwise), you can't always give up after declaring that people have differing opinions.
There is definitely a prevailing ethos here and it's valid to point out potential inconsistencies.
are you saying that I should name them specifically? or is "people" too general?
But for folks that are negative on both accounts, maybe they've just learned their lesson from decades of watching Microsoft take the low road over and over again.