Better HN

brucethemoose2 2y ago · 0 points

At some point, higher quality tokens will be far more important than more tokens. No telling how much junk is in that 2T.

But I wonder if data augmentations could help? For instance, ask LLaMA 70B to reword everything in a dataset, and you can train over the same data multiple times without repeats.
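A rough sketch of what that augmentation loop might look like, in Python. Everything here is hypothetical: `toy_reword` is only a stand-in for a real paraphrasing call to a model like LLaMA 70B (it just shuffles sentence order, so it can occasionally return the original text), and `augmented_epochs` is an illustrative name, not an existing API.

```python
import random
from typing import Callable, Iterator, List


def toy_reword(text: str, rng: random.Random) -> str:
    """Placeholder for an LLM paraphrase call (assumption).

    It only shuffles sentence order, so the content is preserved but
    the surface form usually changes.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."


def augmented_epochs(
    dataset: List[str],
    reword: Callable[[str, random.Random], str],
    epochs: int,
    seed: int = 0,
) -> Iterator[List[str]]:
    """Yield one version of the dataset per epoch.

    The first epoch is the original data; each later epoch is a freshly
    reworded copy, so the model sees different text every pass.
    """
    rng = random.Random(seed)
    if epochs > 0:
        yield list(dataset)
    for _ in range(1, epochs):
        yield [reword(doc, rng) for doc in dataset]


if __name__ == "__main__":
    docs = ["Tokens are finite. Quality matters. Junk hurts training."]
    for i, epoch in enumerate(augmented_epochs(docs, toy_reword, epochs=3)):
        print(i, epoch)
```

With a real model behind `reword`, the open question is still whether the rewrites keep the facts intact, which is the same data-quality problem in a different place.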

3 comments · 2 top-level

visarga 2y ago · 1 in thread

A great idea. While we're at it, why don't we search all topics and then summarise them with an LLM? It would be like an AI-made Wikipedia, 1000x larger, indexing all things, concepts and events, or a super knowledge graph. It would create a lot of training data, and maybe add a bit of introspection to the model: it would explicitly know what it knows. That could help reduce hallucinations and improve attribution, recognition of copyrighted content, and fact checking.

gaogao 2y ago

I have this pet proposal that LLMs would be pretty nice for helping fill out WikiData https://friend.computer/jekyll/update/2023/04/30/wikidata-ll..., as the technique of getting LLMs to write queries, instead of directly giving data, has worked really well for me so far.

joshhart 2y ago

You are totally right: both more and better matter. There are many good papers on the importance of data quality; "Textbooks Are All You Need" is one that comes to mind: https://arxiv.org/abs/2306.11644
