
cs702 2y ago

Thank you. Your key point -- that so far all models with the proposed methods may have been only "grossly trained" -- is compelling. If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to make the comparisons apples-to-apples. That seems sensible to me, and it makes replication easier, but I agree we need to see more extensive testing, after more extensive pretraining, on larger models.


gliptic 2y ago

They also trained 3B with 2 trillion tokens.

> The number of training tokens is a crucial factor for LLMs. To test the scalability of BitNet b1.58 in terms of tokens, we trained a BitNet b1.58 model with 2T tokens following the data recipe of StableLM-3B [TBMR], which is the state-of-the-art open-source 3B model.

> [..]

> Our findings shows that BitNet b1.58 achieves a superior performance on all end tasks, indicating that 1.58-bit LLMs also have strong generalization capabilities.

craq 2y ago

I was hoping to agree with this, but there is no 'SOTA StableLM-3b' trained on 2T tokens. That is a big gap in the paper, because StableLM 3B was trained on 1T tokens for 4 epochs, and the benchmarks Stability AI reports for it far exceed the ones shown in the paper. You can find them in the official StableLM repository and compare them with the paper's results: https://github.com/Stability-AI/StableLM?tab=readme-ov-file#...

cs702 (OP) 2y ago

You're right. Thank you for pointing that out!
