1) Require each repository to opt in to being learned from.
2) Require any source file used for learning to have an SPDX license header (an SPDX-License-Identifier line).
3) Have a list of approved permissive licenses to avoid any proprietary or copyleft arguments.
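As a rough sketch of how those three rules could combine into a per-file check (Python, with everything hypothetical: the allowlist contents, the function names, and how a repository's opt-in would actually be recorded are all assumptions, and the expression handling is deliberately strict rather than a full SPDX parser):

```python
import re

# Hypothetical allowlist of permissive SPDX identifiers; the real list
# would be a policy decision, not something this sketch can settle.
APPROVED_LICENSES = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "ISC"}

SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def spdx_expression(source_text, max_lines=20):
    """Return the SPDX expression found near the top of a file, or None."""
    for line in source_text.splitlines()[:max_lines]:
        match = SPDX_RE.search(line)
        if match:
            # Drop a trailing comment closer such as "*/" or "-->".
            return re.sub(r"\s*(\*/|-->)\s*$", "", match.group(1)).strip()
    return None


def is_approved(expression):
    """Strict check: every license named in the expression must be approved.

    Even "A OR B" is rejected unless both A and B are on the allowlist,
    and any "WITH" exception clause is rejected outright.
    """
    if not expression:
        return False
    tokens = re.findall(r"[A-Za-z0-9.+-]+", expression)
    if any(token.upper() == "WITH" for token in tokens):
        return False
    license_ids = [t for t in tokens if t.upper() not in {"AND", "OR"}]
    return bool(license_ids) and all(t in APPROVED_LICENSES for t in license_ids)


def may_learn_from(repo_opted_in, source_text):
    """Rules 1-3: repo opted in, file has an SPDX header, license is approved."""
    return repo_opted_in and is_approved(spdx_expression(source_text))
```

With this, a file headed "// SPDX-License-Identifier: MIT" in an opted-in repository passes, while a file with no header, or one whose expression names anything outside the allowlist (a LicenseRef or a GPL variant, say), is skipped even if the repository opted in.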
Using the SPDX headings as the explicit guide would solve the problem of different parts of a project being under different licenses. QtWayland is an example: the client pieces are Proprietary/LGPL/GPL, whereas the compositor parts are Proprietary/GPL. That's not something you'd know from the license files at the root of the project (and post-6.3 they use SPDX instead of the prior license template heading).
Granted, this doesn't solve the chain-of-trust problem (is the individual publishing the code truly the copyright owner?), but I think it would be a basic start for a program like this. The opt-in nature would make things... difficult, but I think that's a fair trade-off for something like this.