Crazy how we went from the first computers spanning entire floors to fitting all of human thought in such a tight space.
https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#Si...
As part of their work on democratizing AI, they're now hoping to replicate GPT-3 and release it for free (unlike OpenAI's API).
I would encourage everyone interested to join their Discord server (https://discord.gg/BK2v3EJ) -- they're extremely friendly and I think it's a project worth contributing to.
One small comment: It would be great for this (and other) datasets to give a quick "sample data" file - preferably one that doesn't need to be downloaded to be viewed. Even a screenshot of some of the data would be useful for people browsing to get a quick understanding of the actual content, and how it is formatted. Downloading gigabytes of data just to have a look isn't practical.
I'll reproduce a line here:
{"text": "Roman Catholic Diocese of Tambacounda\n\nThe Roman Catholic Diocese of Tambacounda () is a diocese located in the city of Tambacounda in the Ecclesiastical province of Dakar in Senegal.\n\nHistory\n August 13, 1970: Established as Apostolic Prefecture of Tambacounda from the Diocese of Kaolack and Diocese of Saint-Louis du S\u00e9n\u00e9gal\n April 17, 1989: Promoted as Diocese of Tambacounda\n\nSpecial churches\n The cathedral is Cath\u00e9drale Marie Reine de l\u2019Univers in Tambacounda, which is located in the Medina Coura neighborhood of the town.\n\nLeadership\n Bishops of Tambacounda (Roman rite)\n Bishop Jean-No\u00ebl Diouf (since 1989.04.17)\n Prefects Apostolic of Tambacounda (Roman rite) \n Fr. Cl\u00e9ment Cailleau, C.S.Sp. (1970.08.13 \u2013 1986.04.24)\n\nSee also\nRoman Catholicism in Senegal\n\nReferences\n\nExternal links\n GCatholic.org\n Catholic Hierarchy \n\nCategory:Roman Catholic dioceses in Senegal\nCategory:Tambacounda\nCategory:Christian organizations established in 1970\nCategory:Roman Catholic dioceses and prelatures established in the 20th century", "meta": {"pile_set_name": "Wikipedia (en)"}}
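For anyone who wants to peek at a file in this format without opening the whole thing, here is a minimal Python sketch. It assumes the data is JSON Lines (one object per line with "text" and "meta" keys, as in the line above); the function name `peek_jsonl` and its parameters are my own, not part of any official tooling, and a compressed download would need to be decompressed first.

```python
import io
import json
from itertools import islice


def peek_jsonl(stream, n=3, width=80):
    """Return (source, text preview) pairs for the first n non-blank records.

    Assumes each line is a JSON object shaped like the sample above:
    {"text": ..., "meta": {"pile_set_name": ...}}.
    """
    non_blank = (line for line in stream if line.strip())
    previews = []
    for line in islice(non_blank, n):
        record = json.loads(line)
        source = record.get("meta", {}).get("pile_set_name", "unknown")
        # Flatten newlines so each preview fits on one line
        text = record.get("text", "").replace("\n", " ")
        previews.append((source, text[:width]))
    return previews


# A shortened stand-in for the record quoted above
sample = io.StringIO(
    '{"text": "Roman Catholic Diocese of Tambacounda\\n\\nThe Roman Catholic Diocese", '
    '"meta": {"pile_set_name": "Wikipedia (en)"}}\n'
)
for source, preview in peek_jsonl(sample):
    print(f"[{source}] {preview}")
```

Because it reads lazily and stops after n records, the same function should work on an open file handle of any size, e.g. `with open(path, encoding="utf-8") as f: peek_jsonl(f)`.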
[0]: https://json-schema.org/
https://the-eye.eu/public/AI/pile_preliminary_components/
I think The Pile is probably one of the most important AI projects of the last year or so, at least for lone wolf researchers like me. Gathering training data at scale can be excruciatingly difficult. (Perhaps someone will do something similar for GAN training one day: a large dataset for image modeling would help a lot.)
By the way, consider contributing to The Eye: https://the-eye.eu/
Without them, I’m not sure any of us would have been able to host the datasets we gathered, organize torrent seeds, field DMCA complaints, etc. So I feel The Eye deserves an explicit shoutout for being such an asset to the AI community, along with TFRC and Eleuther.
The difference with The Eye is that they thoroughly vet the legitimacy of each DMCA claim. The claimant is required to show proof that they are the legal copyright holder. And The Eye seems willing to call bluffs: on more than one occasion they have dealt with bogus DMCAs in ways where normal companies would simply give up.
Ultimately, their actions are legal. And for us in the AI community, it was something like the hand of God reaching down to bless us with a guardian angel. The reproducibility situation is getting worse each week, and much of that is due to the fact that realistic datasets can’t be distributed without fear of reprisals.
That said, I respect your feelings on the matter too. I think it’s equally valid to feel uncomfortable. I just take solace in the fact that it’s legal.
Does anyone know if OpenAI has retrained/updated GPT-3 yet?