World177 · 3y ago

As I remember it, when GitHub released Copilot, they were explicit that using data like this “is common practice in machine learning.” I tried to find the quote and couldn’t, though, so maybe my memory is wrong and I am remembering a blog post from another organization.

edit: The exact quote was “Training machine learning models on publicly available data is considered fair use across the machine learning community” if you want to search for it.

edit 2: https://web.archive.org/web/20210629142841/http://copilot.gi...

> Frequently Asked Questions -> Training Set -> Why was GitHub Copilot trained on data from publicly available sources?

> Training machine learning models on publicly available data is now common practice across the machine learning community. The models gain insight and accuracy from the public collective intelligence. But this is a new space, and we are keen to engage in a discussion with developers on these topics and lead the industry in setting appropriate standards for training AI models.

junon · 3y ago

Yes, I could claim that pirating music is the "standard," but that doesn't negate the fact that there are copyright laws that could land me with a fine or in jail. GitHub can claim whatever they want, but their policies and the laws surrounding them still stand.