At some point, higher-quality tokens will matter far more than more tokens. There's no telling how much junk is in that 2T.
But I wonder if data augmentation could help? For instance, ask LLaMA 70B to reword everything in a dataset, and you could train over the same data multiple times without exact repeats.
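To make the rewording idea concrete, here is a minimal sketch. It assumes a hypothetical `toy_paraphrase` function standing in for the LLM call (a real setup would prompt a model such as LLaMA 70B instead), and `augmented_epochs` is an illustrative name, not an existing library API.

```python
import random
from typing import Callable, Iterator, List


def toy_paraphrase(text: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM rewording call: swaps a few synonyms."""
    synonyms = {"big": ["large", "huge"], "fast": ["quick", "rapid"]}
    words = []
    for word in text.split():
        options = synonyms.get(word.lower())
        words.append(rng.choice(options) if options else word)
    return " ".join(words)


def augmented_epochs(
    dataset: List[str],
    paraphrase: Callable[[str, random.Random], str],
    epochs: int,
    seed: int = 0,
) -> Iterator[List[str]]:
    """Yield one version of the dataset per epoch.

    Epoch 0 is the original text; every later epoch is a fresh rewording,
    so the model sees the same content without the same token sequence.
    """
    rng = random.Random(seed)
    for epoch in range(epochs):
        if epoch == 0:
            yield list(dataset)
        else:
            yield [paraphrase(doc, rng) for doc in dataset]


if __name__ == "__main__":
    for i, docs in enumerate(augmented_epochs(["a big fast model"], toy_paraphrase, 3)):
        print(i, docs)
```

Whether reworded copies actually help more than they hurt would depend on how faithful the rewording model is, which this sketch does not measure.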
A great idea. While we are at it, why don't we search all topics and then summarise them with an LLM? It would be like an AI-made Wikipedia, 1000x larger, indexing all things, concepts and events, or a super knowledge graph. It would create a lot of training data, and maybe add a bit of introspection to the model: it would explicitly know what it knows. That could help reduce hallucinations and improve attribution, recognition of copyrighted content, and fact checking.
I have this pet proposal that LLMs would be pretty good at helping fill out Wikidata: https://friend.computer/jekyll/update/2023/04/30/wikidata-ll... The technique of getting LLMs to write queries, instead of giving data directly, has worked really well for me so far.
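A toy illustration of the "write a query, don't give the data" pattern, not the method from the linked post. Everything here is an assumption for the sake of the example: `TOY_STORE` is a made-up in-memory triple list rather than Wikidata, and `stub_llm_write_query` is a hypothetical stand-in for a model that would normally emit something like SPARQL.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# Made-up miniature knowledge base (assumption; not real Wikidata content).
TOY_STORE: List[Triple] = [
    ("France", "capital", "Paris"),
    ("Japan", "capital", "Tokyo"),
    ("Paris", "country", "France"),
]


def stub_llm_write_query(question: str) -> Pattern:
    """Hypothetical stand-in for an LLM: turns a question into a query pattern."""
    prefix = "what is the capital of "
    q = question.strip().rstrip("?").lower()
    if q.startswith(prefix):
        country = q[len(prefix):].title()
        return (country, "capital", None)  # None marks the unknown slot
    raise ValueError("stub only understands 'capital of' questions")


def run_query(pattern: Pattern, store: List[Triple]) -> List[Triple]:
    """Return every triple that matches the pattern; None matches anything."""
    return [
        triple
        for triple in store
        if all(p is None or p == value for p, value in zip(pattern, triple))
    ]


def answer(question: str) -> List[str]:
    """The model only writes the query; the facts come from the store."""
    pattern = stub_llm_write_query(question)
    return [obj for _, _, obj in run_query(pattern, TOY_STORE)]


if __name__ == "__main__":
    print(answer("What is the capital of France?"))
    print(answer("What is the capital of Atlantis?"))
```

The appeal of this split is that a question about something missing from the store comes back empty instead of producing a plausible-sounding made-up answer.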
You are totally right - both more and better matter. There are many good papers on the importance of data quality; "Textbooks Are All You Need" is one that comes to mind - https://arxiv.org/abs/2306.11644