So if I understand this correctly, the Pile only covers code up to 2020? If I wanted anything released in the past 3 years, say something in the SOTA AI space (where a month is a lifetime), I would need the scraper again?
I don't follow how this can compare to direct, live, unrestricted access. I suppose this is just my own hatred of Microsoft shining through. Of course we should accept the status quo, because how dare we suggest Microsoft could operate in a manner that is anti-competitive.
For anyone else trying to catch up, just rent a datacenter, write a crawler, deal with all the intricacies of keeping it in sync in real-time. This sounds trivial, simple even.
I wonder why nobody is doing it? Perhaps not everyone has access to petabytes of storage space, unlimited bandwidth, unlimited proxy-jumps etc.
So the alternative is to buy github?
There are multiple private companies and public institutions that are currently training LLMs.
The work required to train an LLM actually supports a fair use argument, just as it did with Google scanning books.