edit: The exact quote, if you want to search for it, was “Training machine learning models on publicly available data is considered fair use across the machine learning community”.
edit 2: https://web.archive.org/web/20210629142841/http://copilot.gi...
> Frequently Asked Questions -> Training Set -> Why was GitHub Copilot trained on data from publicly available sources?
> Training machine learning models on publicly available data is now common practice across the machine learning community. The models gain insight and accuracy from the public collective intelligence. But this is a new space, and we are keen to engage in a discussion with developers on these topics and lead the industry in setting appropriate standards for training AI models.