The Facebook LLaMA paper[1] lists a number of the datasets they used.
Briefly, they include:
- CommonCrawl
- Wikipedia
- GitHub
- StackExchange
- arXiv
and "Books", which appears to be drawn from Project Gutenberg, among other sources.
[1]: https://scontent-iad3-1.xx.fbcdn.net/v/t39.8562-6/333078981_...