0 points
riku_iki
2y ago
Then the performance comparison is just misleading: they report marginal improvements, which is expected purely because of the difference in model sizes.
3 comments · 1 top-level
GaggiX
2y ago
It also performs better at every other size.
riku_iki
OP
2y ago
The baseline transformer in their tables tops out at 6B. The other models are trained on very different data, and probably with a different training setup.
GaggiX
2y ago
No, all the MQA transformers, Hawk, and Griffin are trained on the same MassiveText dataset.