IMO, the long tail of non-code-reviewed, written-by-someone-in-their-first-month-of-coding, barely-even-compiles noob code[0] on GitHub is going to be orders of magnitude larger than the long tail of crap in Microsoft's internal repos.
[0] Hey, everyone has to start somewhere. There's nothing wrong with your first "hello world" program being buggy - that's what being a beginner means. But it's probably not the sort of code you want to train an LLM on.