I have a feeling they had access to a lot of code on GH; who knows how much they actually accessed. For a long time Copilot said it would use your code as training data, including context, if you didn’t explicitly opt out, so that’s already millions, maybe hundreds of millions, of lines of code scraped.
The conspiracy theorist in me wonders whether MS just quietly provided access to both public and private code to train on. They wouldn’t even have had to tell OpenAI, just said, “here’s some nice data”. It’s all secret and we can’t see the model’s inputs, so I’ll leave it at that. I mean, they’ve obviously prepared the data for Copilot, so it was there waiting to be trained on.
So yeah, I feel your enthusiasm, but if you think about it a little more, maybe it’s not so hard to imagine that what you saw is actually rather simple? Every time I write code I feel kind of depressed, because I know that almost certainly someone has already written the same thing, that it’s sitting on GitHub or somewhere else, and that I’m wasting my time.
ChatGPT just takes away the need to know where to find what you want (it has already seen almost everything the average person can think of) and gives it to you directly. Have you never thought of this before? Like you knew all the code you wanted was already out there somewhere, but you just didn’t have an interface to get to it? I’ve thought about this for quite a while, and I knew there would be big data people doing experiments who could see that probably 80-90% of the code on GitHub is pretty much identical.
Nothing is magic, right?