Amazon has a way to scrape GitHub and feed its AI model

(dataconomy.com)

65 points · doubtfuluser · 2y ago · 59 comments

41 comments · 17 top-level

koolba · 2y ago · 6 in thread

Is the cover image itself generated via some ML model? The old guy in the middle is missing substantial parts of his arm. The box right by him also has some artifacting in the corner.

amelius · 2y ago

No, this depicts exactly the nightmarish nature of a job at an Amazon warehouse.

KineticLensman · 2y ago

Yes, the Amazon brand arrow at centre top is also broken. In fact, all of the people look wrong in some way.

batch12 · 2y ago

Yeah. It is likely an edited AI image. You can confirm by looking at the text on the box at the top left. "BMGOMa"

willwade · 2y ago

And the guy on the right... umm... what's with his face? Or is he an alien maybe? Image credit goes to https://linkmedya.com - it doesn't say it is AI-generated content, but yep, it certainly looks like it.

stamourd · 2y ago

"Featured image credit: Eray Eliaçık/Bing"

firtoz · 2y ago

And the "Bing" just links to Bing's DALL-E 3 functionality.

glimshe · 2y ago · 6 in thread

I couldn't care less about these huge tech companies stealing from one another. Let them sue themselves to extinction.

graemep · 2y ago

How is it even stealing? It's taking copies of public information, mostly under open source licences.

Amazon is causing a bit of extra server load for MS to handle.

noprocrasted · 2y ago

> it's taking copies of public information

Yet the same companies will be first to tell you that scraping their public information is against ToS or even illegal.

See the whole drama about LinkedIn scraping, etc.

7373737373 · 2y ago

> a bit of extra server load

One of my sites has been spammed by scrapers (Bytedance's Bytespider, Googlebot, Bingbot) several thousand times within just an hour, to the point of making it break. They do this without notification or asking for consent of the users creating the content they ingest and possibly use to train AI models with, and also without credit or compensation. I think the world needs strict regulation against this kind of parasitic, likely illegal behavior.

lobsterthief · 2y ago

You do know Microsoft has also used all private repositories to train its models, right? Especially for Copilot

beardyw · 2y ago

What is being scraped is not GitHub's data. It's other people's.

threecheese · 2y ago

Exactly. I permissively license my code, but not because I want to improve mega-corp’s bottom line. I’m annoyed.

I felt exactly like this when I learned that some of my Goodwill donations - the good stuff - are marked up and sold online, instead of going to low-income folks at low-income prices. It might be even worse, given that the capability they are building is intended to compete with me directly as a developer. It’s like if Goodwill started funding domestic terrorists or the local burglars’ union.

xmodem · 2y ago · 3 in thread

Ethically, Microsoft has about as much claim to be able to use the data for Copilot as anyone else.

On the other hand, maybe a MSFT v Amazon lawsuit over this could be the wake up call the world needs that maybe we should stop centralising critical infrastructure in the hands of a single company. Which is why I think they wouldn't do it - at most I could see Microsoft tightening request limits on accounts associated with Amazon.

drewcoo · 2y ago

> maybe we should stop centralising critical infrastructure in the hands of a single company

Managing your own on-prem or in-colo infrastructure sucks: it's expensive and a source of risk, which is why we moved things like source servers to a centralized model.

cyanydeez · 2y ago

We'll do that right when distributed computing finds a workable model.

xmodem · 2y ago

Yeah, I guess building a distributed version control system is basically an intractable problem.

jsnell · 2y ago · 2 in thread

I'm surprised Amazon's legal team signed off on this. It's clearly against the GitHub terms of service[0], and Amazon employees acting on the instructions from Amazon had to approve those terms. It seems pretty much identical to the LinkedIn vs. hiQ scraping case, where as I understand the fake account creation was the key point.

[0] E.g. no API key sharing for the purposes of evading rate limits, only a single free account per person or organization.

londons_explore · 2y ago

When you pay your legal teams as much as Amazon's, they probably tell you "Yeah, you'd probably lose any case, but the fine will be a couple of million dollars and you won't have to pay it for a decade, and by then you'd have cemented your market leadership".

that_guy_iain · 2y ago

What if they’re not free accounts?

hi-v-rocknroll · 2y ago · 2 in thread

MSFT's LinkedIn scraping was also a thing about 10 years ago until the magic method was taken away. :'(

altdataseller · 2y ago

You can still scrape LinkedIn today, can you not?

hi-v-rocknroll · 2y ago

No, I don't think so. Not without an account and not completely as was possible in the past.

htrp · 2y ago · 2 in thread

Disappointing that a large mega corp does the exact same thing broke developers do to get around rate limits.

shermantanktop · 2y ago

This is a large mega corp that prides itself on acting like it is broke.

amelius · 2y ago

Except they have access to one of the largest "botnets" on the planet.

neilv · 2y ago · 1 in thread

Separate from the courts, Microsoft could send a message to the AI gold rush field, about "abuse of Microsoft's resources", via ToS:

* All Amazon domain names could be banned from accounts on GitHub, or face annoying restrictions, implemented with trivial technical changes. And lawyers could send a letter to Amazon legal, about how Amazon may and may not use GitHub, including Amazon personnel having to disclose their affiliation (not hide it with Gmail), and craft some language about how those employee accounts may and may not be used.

* More harshly, but fear-instilling to individuals throughout industry, the individuals who let their accounts be used for the scraping could be banned from GitHub, for ToS violation. Not only those particular accounts, but any accounts the individuals might use. (This would hurt, not only for genuine open source participation, but also given how open source is sometimes used for job-hunting appearances, and all the current employers that ask for a candidate's "GitHub" specifically rather than open source in general.) If banning would have undesired effects, such as projects GitHub wants to host being pulled, or public reaction that it is too harsh and questions about why GitHub has so much power, there could instead be annoying restrictions.

rdtsc · 2y ago

> the individuals who let their accounts be used for the scraping could be banned from GitHub, for ToS violation.

That would work, assuming GH doesn’t make mistakes and ban someone else with the same name. That would then be embarrassing for GH. I can already see the news headline: “GitHub banned my account because my name matches that of a web scraping account from Amazon”.

foreigner · 2y ago · 1 in thread

Microsoft could sabotage Amazon's AI model by returning poisoned code to accounts registered with @amazon.com email addresses.

xmodem · 2y ago

The way git works means that you can check that you have an un-doctored clone of a repo just by checking that the commit hash matches. Which in this instance is quite unfortunate, because it would be very funny.

(barring a SHA-1 collision, of course)

EDIT: I suppose another approach could be to invent poisoned repos out of whole cloth and only show them to Amazon, but I suspect that'd be even easier to detect.
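To illustrate the hash check mentioned above, here is a minimal Python sketch of how Git derives a blob ID in SHA-1 repositories: it hashes a short header plus the raw file bytes, so any doctored byte produces a different ID. This only covers the blob level (trees and commits chain these IDs together), and the `git_blob_id` helper name is made up for the example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with these bytes."""
    # Git hashes the header "blob <size>\0" followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical file contents, before and after tampering.
original = b"print('hello')\n"
doctored = b"print('hello')  # poisoned\n"

# Any change to the bytes yields a different object ID.
print(git_blob_id(original))
print(git_blob_id(doctored))
print(git_blob_id(original) == git_blob_id(doctored))  # False
```

For the same bytes, the result should match what `git hash-object <file>` prints, assuming no content filters (such as line-ending conversion) are in play.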

raarts · 2y ago · 1 in thread

Language in this article smells like it's written or rewritten by AI.

belter · 2y ago

Agree. Looks like we have a good ear: https://youtu.be/zbo6SdyWGns?t=78

Kye · 2y ago

Is it git pull?

>> "In response, Amazon proposed a workaround: encouraging its employees to create multiple GitHub accounts and share their access credentials."

Ah, no, it's git pool.

lokimedes · 2y ago

This just rekindled my desire to self-host my git repos. The whole idea that a platform provider can use the IP I host there is obscene. That thieves steal by bounty from each other is not the story.

paradite · 2y ago

Microsoft is probably one of the few companies that can sue Amazon without worrying about retaliation from Amazon.

For example, GitLab would need to think twice before suing because they offer deployment on AWS.

threecheese · 2y ago

Can anyone share a Fermi estimation of the size of poison-pill training data required to impact code interpreter models? (of the size that AMZN might be building with this data)

I expect it would vary by language/platform popularity (size of available training code). Is it infeasible to create or generate enough code, pushed to enough repositories, to impact the correctness of a model that includes the code in its training data set?

lofaszvanitt · 2y ago

MS only provides the infra; everything else is others' hard work under the Trojan horse of open source, whatever. If they introduce limits, time to leave GitHub. This will evolve into an Elsevier vs researchers kinda situation.

chumanak · 2y ago

This article doesn’t make any sense. Why would Amazon make their employees do all this when they can easily pay for a service like Crawlbase or similar and scrape GitHub without having to create employee accounts?

rty32 · 2y ago

If GitHub cared enough about this, they would have already sued Amazon. I don't think the author needs to worry about any of this.

amadeuspagel · 2y ago

They should make this data available for everyone on AWS.
