Amazon is causing a bit of extra server load for MS to handle.
Yet the same companies will be first to tell you that scraping their public information is against ToS or even illegal.
See the whole drama about LinkedIn scraping, etc.
One of my sites has been spammed by scrapers (Bytedance's Bytespider, Googlebot, Bingbot) several thousand times within just an hour, to the point of making it break. They do this without notifying, or asking the consent of, the users who create the content they ingest and possibly use to train AI models, and also without credit or compensation. I think the world needs strict regulation against this kind of parasitic, likely illegal behavior.
I felt exactly like this when I learned that some of my Goodwill donations - the good stuff - are marked up and sold online, instead of going to low-income folks at low-income prices. It might be even worse, given that the capability they are building is intended to compete with me directly as a developer. It’s like if Goodwill started funding domestic terrorists or the local burglars union.
On the other hand, maybe a MSFT v Amazon lawsuit over this could be the wake-up call the world needs to stop centralising critical infrastructure in the hands of a single company. Which is why I think they wouldn't do it - at most I could see Microsoft tightening request limits on accounts associated with Amazon.
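To make "tightening request limits on accounts associated with Amazon" concrete, here is a minimal sketch of a limiter that counts requests per organization rather than per account, so spreading traffic across many accounts doesn't buy any extra budget. Everything here is hypothetical: the `OrgRateLimiter` class, the account-to-org mapping, and the fixed-window scheme are my own assumptions, not anything GitHub is known to run.

```python
from collections import defaultdict


class OrgRateLimiter:
    """Fixed-window limiter: all accounts mapped to one org share one budget.

    Hypothetical sketch; the account -> org mapping is assumed to exist already.
    """

    def __init__(self, limit, window_seconds, account_to_org):
        self.limit = limit
        self.window = window_seconds
        self.account_to_org = account_to_org
        # (org, window index) -> number of requests allowed so far
        self.counts = defaultdict(int)

    def allow(self, account, now):
        # Accounts with no known affiliation get a bucket of their own.
        org = self.account_to_org.get(account, account)
        key = (org, int(now // self.window))
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = OrgRateLimiter(
    limit=2,
    window_seconds=60,
    account_to_org={"alice": "bigcorp", "bob": "bigcorp"},
)
results = [
    limiter.allow("alice", now=0),   # first request for bigcorp
    limiter.allow("bob", now=1),     # second request, same shared budget
    limiter.allow("alice", now=2),   # budget exhausted for this window
    limiter.allow("carol", now=2),   # unaffiliated account, separate bucket
    limiter.allow("alice", now=61),  # next window, budget resets
]
```

The hard part in practice is the mapping itself, not the counting: the limiter only helps if you can actually tell which accounts belong to which company.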
Managing your own on-prem or in-colo infrastructure sucks: it's expensive and a source of risk, which is why we moved things like source servers to a centralized model.
[0] E.g. no API key sharing for the purposes of evading rate limits, only a single free account per person or organization.
* All Amazon domain names could be banned from accounts on GitHub, or face annoying restrictions, implemented with trivial technical changes. And lawyers could send a letter to Amazon legal, about how Amazon may and may not use GitHub, including Amazon personnel having to disclose their affiliation (not hide it with GMail), and craft some language about how those employee accounts may and may not be used.
* More harshly, and fear-instilling to individuals throughout the industry, the individuals who let their accounts be used for the scraping could be banned from GitHub for ToS violation. Not only those particular accounts, but any accounts the individuals might use. (This would hurt, not only for genuine open source participation, but also given how open source is sometimes used for job-hunting appearances, and all the current employers that ask for candidates' "GitHub" specifically rather than open source in general.) If banning would have undesired effects, such as projects GitHub wants to host being pulled, or a public reaction that it is too harsh and questions about why GitHub has so much power, there could instead be annoying restrictions.
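For what the "trivial technical changes" in the first bullet might look like, here is a rough sketch of an email-domain check at signup or login. The blocklist contents and the function names are made up for illustration; this is not how GitHub is known to handle it.

```python
# Hypothetical blocklist; the entries are illustrative only.
BLOCKED_DOMAINS = {"amazon.com", "amazon.co.uk"}


def email_domain(email):
    """Return the lowercased domain part of an email address."""
    return email.rsplit("@", 1)[-1].strip().lower()


def is_restricted(email, blocked=BLOCKED_DOMAINS):
    """True if the address is on a blocked domain or any subdomain of one."""
    domain = email_domain(email)
    return any(domain == b or domain.endswith("." + b) for b in blocked)


checks = {
    "dev@amazon.com": is_restricted("dev@amazon.com"),
    "dev@corp.amazon.com": is_restricted("dev@corp.amazon.com"),
    "dev@notamazon.com": is_restricted("dev@notamazon.com"),
    "dev@gmail.com": is_restricted("dev@gmail.com"),
}
```

Note the last case: a check like this does nothing about employees signing up with a personal address, which is exactly why the letter from legal would need to require disclosing affiliation.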
That would work, assuming GH doesn’t make mistakes and ban someone else with the same name. That would then be embarrassing for GH. I can already see the news headline: “GitHub banned my account because my name matches that of a web scraping account from Amazon”
(barring a SHA-1 collision, of course)
EDIT: I suppose another approach could be to invent poisoned repos out of whole cloth and only show them to Amazon, but I suspect that'd be even easier to detect.
>> "In response, Amazon proposed a workaround: encouraging its employees to create multiple GitHub accounts and share their access credentials."
Ah, no, it's git pool.
For example, GitLab would need to think twice before suing because they offer deployment on AWS.
I expect it would vary by language/platform popularity (size of available training code). Is it infeasible to create or generate enough code, pushed to enough repositories, to impact the correctness of a model that includes the code in its training data set?