Understanding Emergent Abilities of Language Models from the Loss Perspective

(arxiv.org)

6 points · maccaw · 2y ago · 1 comment

Does this mean that "overtraining" a midsize LLM for many more epochs on a small, representative subset of the dataset used by a larger, more performant LLM might be sufficient for matching the performance of the larger model?
