If a permissive, business-friendly license (Apache 2.0, maybe others) is found in a given repo, then that repo can be used in the training set.
Otherwise, the repo cannot be used in the training set.
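To make the rule concrete, here is a rough sketch of what such a filter might look like. Everything in it is an assumption for illustration: the function names, the list of license filenames, and the naive substring match are all hypothetical, and the allowlist contains only Apache 2.0 because that is the only license named above. A real pipeline would presumably want a proper license detector rather than string matching.

```python
from pathlib import Path

# Hypothetical allowlist: only Apache 2.0 is named in the rule above.
# Each entry maps a license id to lowercase phrases that suggest it.
PERMISSIVE_MARKERS = {
    "Apache-2.0": (
        "apache license, version 2.0",  # wording of the standard notice
        "apache license version 2.0",   # title lines of the full text
    ),
}

# Assumed filenames; real repos use many more variants.
LICENSE_FILENAMES = ("LICENSE", "LICENSE.txt", "LICENSE.md")


def detect_license(repo_root):
    """Return the id of an allowlisted license found in the repo, or None."""
    root = Path(repo_root)
    for name in LICENSE_FILENAMES:
        path = root / name
        if not path.is_file():
            continue
        raw = path.read_text(encoding="utf-8", errors="ignore")
        # Lowercase and collapse whitespace so line breaks don't hide a match.
        text = " ".join(raw.lower().split())
        for license_id, markers in PERMISSIVE_MARKERS.items():
            if any(marker in text for marker in markers):
                return license_id
    return None


def usable_for_training(repo_root):
    """Apply the rule: allowlisted license found -> usable, otherwise not."""
    return detect_license(repo_root) is not None
```

Note the default is exclusion: a repo with no license file, or one the check doesn't recognize, is left out of the training set.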
The LICENSE file would be longer than the rest of the code.
(FWIW, I agree with you theoretically, but practically it's hard to get your head around what the ramifications of that would be)
I’m really of the opinion that MS needs to document the training set and set a high bar for including additional repos.