
0 points · riku_iki · 2y ago · 0 comments

It's just that the performance comparison is misleading then: they report marginal improvements, which is expected simply because of the model size differences.


3 comments · 1 top-level

GaggiX · 2y ago · 2 in thread

It also performs better on any other size.

They have a baseline transformer with a max size of 6B in the tables. The other models are trained on very different data, and probably trained differently.

All the MQA transformers, Hawk, and Griffin are trained on the same MassiveText dataset, so no.
