They experimented only with cosine learning rate decay schedules, but found results consistent across the different decay horizons they tried, as well as across two different types of experiment: varying the number of training tokens for a given model size, and varying model size for a given number of training FLOPs.
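For concreteness, here is a minimal sketch of a cosine learning rate decay schedule of the kind being varied; the function name, parameters, and default values are illustrative assumptions, not taken from the source.

```python
import math

def cosine_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5):
    """Illustrative cosine decay: learning rate falls from max_lr at
    step 0 to min_lr at total_steps, following a half-cosine curve.
    Changing total_steps changes the decay horizon."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Sweeping `total_steps` while holding the endpoints fixed is one way to produce the family of schedules such experiments compare.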