TinyLlama project aims to pretrain a 1.1B Llama model on 3T tokens (opens in new tab)

(github.com)

201 pointscmitsakis2y ago60 comments

60 comments

42 comments · 12 top-level

GaggiX2y ago· 8 in thread

>It means you can train a chinchilla-optimal TinyLlama (1.1B param, 22B tokens) in 32 hours with 8 A100.

They are training the model on 3000/22=136 times the value of the chinchilla scale. It will be interesting to see how much it will improve after way beyond this value.

npsomaratna2y ago

Possibly a lot. See: https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...

isoprophlex2y ago

Very interesting, thanks for sharing!

pluijzer2y ago

I now come to understand that the technobable in Star Trek wasn't that well predicted, in the future we will not be reversing polarities by alligning field cores. Picard will have us align our llamas with chiwawas to get an alpacafied chinchilla model.

elpocko2y ago

Lora and Alpaca at Tanagra.

1 more reply

koprulusector2y ago

There’s should also be a tribble in there, somewhere.

sp3322y ago

Chinchilla predicts that you could get lower loss by training a larger model with that amount of data. But the model size in this case was chosen for other reasons, mostly speed of inference and cost of fine-tuning. So it's just irrelevant here.

GaggiX2y ago

Well it's relevant if you want to compare the model trained optimally using the same amount of compute and this one parameter-bound to see how much you're trading.

cypress662y ago

It's a bit amusing how people treat chinchilla scaling laws as a law of nature, when it's just about a certain architecture and dataset.

imjonse2y ago· 5 in thread

From the FAQ:

' Why would pretraining a 1.1B model for so long make sense? Doesn't it contradict the Chinchilla Scaling Law?

Above is the training loss curve taken from the Llama 2 paper. Here I quote from that paper: "We observe that after pretraining on 2T Tokens, the models still did not show any sign of saturation". That is why we believe pretraining a 1.1B model for 3T tokens is a reasonable thing to do. Even if the loss curve does not go down eventually, we can still study the phenomenon of saturation and learn something from it.'

It is something I have been wondering about: why did Meta not keep the training process going on while the loss curves seemed to go down? Could they conceivably release a Llama 2.1 being checkpoints taken a month after 2.0 was 'cut'? Maybe the expected gain is too small compared to what can be gained with fine/instruct tuning afterward anyway?

jofi12y ago

> It is something I have been wondering about: why did Meta not keep the training process going on while the loss curves seemed to go down? Could they conceivably release a Llama 2.1 being checkpoints taken a month after 2.0 was 'cut'? Maybe the expected gain is too small compared to what can be gained with fine/instruct tuning afterward anyway?

Because choosing the LR decay requires knowing the # of steps in advance. LR is too small after the 2T tokens, and changing it afterwards doesn't tend to help.

https://twitter.com/sherjilozair/status/1687837844729966592

jph002y ago

AFAIK re-warming it up and then gradually decreasing it again ought to work fine. Have you seen any research showing that it doesn't?

1 more reply

charcircuit2y ago

You could manually increase the learning rate or change the decay at any time.

naillo2y ago

The most plausible explanation I've seen (other than the carmack 'sudden grokking' beyond the cutoff idea) is that they're planning to release llama3 sooner than later with some arcitecture changes for even better performance, so it makes sense to dedicate resources there instead.

ftxbro2y ago

> It is something I have been wondering about: why did Meta not keep the training process going on while the loss curves seemed to go down?

If I remember correctly, it's because the main reason they trained multiple models was to show a scaling trend. Each model was trained using a chinchilla-optimal mix of model size, cpu amount, and parameter size. The point was to provide an empirical scaling law that could possibly be extrapolated to estimate the performance of more expensive models, like imagine a billion dollar model for which the model size, data size, and cpu amount is picked in the chinchilla optimal ratios.

On small models the chinchilla optimal scaling stops training the model even when the model is still improving.

The problem comes when people are actually using these small llama models rather than treating them as just data points. If you are actually using these models, what you want is one that is trained forever on as many tokens and training time as possible.

minimaxir2y ago· 4 in thread

A robust 1.1B model compared to a 7B model would be strongly appreciated. The bottleneck of Llama 2 7B is that inference latency is still infeasible for Production use cases unless you have a good supply of expensive A100; dropping it by an order of magnitude and letting it run on other cloud GPUs will open new opportunities.

brucethemoose22y ago

> The bottleneck of Llama 2 7B is that inference latency is still infeasible for Production use cases unless you have a good supply of expensive A100

?? A 3060 or a slightly bigger AMD/Intel GPU can stream llama 7B about as fast as someone can read, if not faster. A somewhat bigger consumer GPU can batch it and serve dozens of users.

I use 13B finetunes on my 2020 14" laptop all the time, with 6GB of VRAM and 16GB of CPU RAM.

I have seen many people on HN say this, and I can't help but wonder why the optimized, quantized llama implementations are flying under the radar.

littlestymaar2y ago

> ?? A 3060 or a slightly bigger AMD/Intel GPU can stream llama 7B about as fast as someone can read,

That's the thing: you need a whole GPU per concurrent user, this is insanely expensive if you want to run it as part of a SaaS (which is what most for-profit want to do). Of course running models locally is much better in almost every regard, but nobody is gonna be a billionaire with that…

2 more replies

snovv_crash2y ago

That's still wildly too expensive if you want to make a profitable service that is scalable beyond VC capital injections.

cypress662y ago

1.1B with 3T tokens will never be comparable to 7B with 2T tokens.

And I'm not sure what you mean by inference latency being infeasible. Most people using thsss models at home don't even bother with the 7B and go straight to 13B because it's easy to run too and much smarter. And any cloud gpu can run 13B.

Havoc2y ago· 3 in thread

What does “pretrain” mean in this context? It sounds like normal training

rodonn2y ago

GPT stands for Generative Pre-trained Transformer.

The "main" training step using huge amounts of inputs is called pre-training. The idea is that after that pre-training, you might fine tune the model for your specific use case.

Havoc2y ago

I see...that makes sense. Thanks for explaining

Filligree2y ago

As opposed to fine-tuning or in-context learning. It really is normal training.

RC_ITR2y ago· 3 in thread

Not to be a downer, but wasn’t one of OpenAI’s earliest discoveries that training small models on huge datasets leads to over-fitting?

It’s my understanding that the entire race to ever-more parameters was driven by that.

minimaxir2y ago

A workaround to overfitting is to train on so much distinct data that the model can't overfit.

Newer large datasets like the ones used here optimize for diversity. (e.g. SlimPajama is a heavily-deduped dataset)

ljlolel2y ago

Learn about the magic of double descent

RC_ITR2y ago

https://openai.com/research/deep-double-descent

Yeah, the line keeps going down as the model gets bigger. What's your point? That there's a hump in the middle?

syntaxing2y ago· 2 in thread

This sounds like a really fun project, running small models would change a lot of industries like games in their example. But how do people afford these projects?! If I am doing my numbers right, it'll cost them 50K to train this model for 3T tokens.

jlokier2y ago

That's less than a month's income for a few people on here. I recall a comment from an engineer at Nvidia a year or two ago saying $700k/year was about much they were paid, in response to someone else not believing those levels.

Get together 5 people in that position and it's less than a week's income for the group. That sounds doable as a hobby for those lucky people.

More realistically, it's within range for a grant, or use of someone else's hardware if they aren't using it, as the sibling comment from wongarsu said.

Also cloud vendors sometimes give out large batches of credits to startups and such as marketing incentive to get future customers.

wongarsu2y ago

$38k, based on the "90 days using 16 A100-40G" and lambdalabs prices.

That's a lot for a hobby, but small enough that it might be running on a university machine (the TinyLlama devs provide a way to cite them and all seem to work or study at Singapore University of Technology) or could be sponsored (no indication of that now, but "people made an awesome model in our cloud" is good advertisement). Government grants or grants in general also aren't out of the question, especially for a topic with this much hype.

kristianp2y ago· 2 in thread

Could this be used as a source of speculative tokens for larger llama models?, as per https://github.com/ggerganov/llama.cpp/pull/2926

Also, when are we going to start seeing open weights MOE models being released?

thawab2y ago

1- yes, Gorgie twetted he is looking into it[0].

2- The only 2 i know of are airoboros[1] and Hydra which is still in progress.

[0] https://x.com/ggerganov/status/1698667093711880687?s=46&t=Jp...

[1] https://github.com/jondurbin/airoboros#lmoe

kristianp2y ago

Thanks. Yes, I've seen airoboros, it aims to use a mixture of fine-tunes of the base model if I recall correctly. Not a truly pre-trained MOE, but could be useful.

Hydra, is this it? https://github.com/SkunkworksAI/hydra-moe

1 more reply

sp3322y ago· 1 in thread

The link that says you can watch cross-entropy loss live is locked or broken.

fragebogen2y ago

Works now for me https://wandb.ai/lance777/lightning_logs/reports/metric-trai...

alexedw2y ago· 1 in thread

This is silly. Look at the loss and benchmark curves for the Pythia suite of models - the smaller models certainly did saturate and in fact began worsening.

2T not saturating on a 7B is very different from 3T on a 1B.

littlestymaar2y ago

That's the point of the experiment actually…

e12e2y ago· 1 in thread

Are they upsampling - whatever that means in the context of datasets?

AFAIU slim pajama is about 627B tokens, and Starcoder:

> approximately 250 Billion tokens.

Ed: I see TFA says:

> Combined Dataset Size - Around 950B tokens

> Total Tokens During Training - 3 trillion (slightly more than 3 epochs/1430k steps)

... but I'm not seeing how one becomes three? That's more like 1 trillion than 3 trillion tokens?

emikulic2y ago

Three epochs means it sees each token three times. The dataset is ~1T like you said.

Mxbonn2y ago

Couldn't immediately find it but who sponsors/pays for the compute?

29athrowaway2y ago

A tiny llama would be hard to distinguish from an alpaca.

j / k navigate · click thread line to collapse

60 comments

42 comments · 12 top-level

GaggiX2y ago· 8 in thread

>It means you can train a chinchilla-optimal TinyLlama (1.1B param, 22B tokens) in 32 hours with 8 A100.

They are training the model on 3000/22=136 times the value of the chinchilla scale. It will be interesting to see how much it will improve after way beyond this value.

npsomaratna2y ago

Possibly a lot. See: https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...

isoprophlex2y ago

Very interesting, thanks for sharing!

pluijzer2y ago

elpocko2y ago

Lora and Alpaca at Tanagra.

1 more reply

koprulusector2y ago

There’s should also be a tribble in there, somewhere.

sp3322y ago

GaggiX2y ago

Well it's relevant if you want to compare the model trained optimally using the same amount of compute and this one parameter-bound to see how much you're trading.

cypress662y ago

It's a bit amusing how people treat chinchilla scaling laws as a law of nature, when it's just about a certain architecture and dataset.

imjonse2y ago· 5 in thread

From the FAQ:

' Why would pretraining a 1.1B model for so long make sense? Doesn't it contradict the Chinchilla Scaling Law?

jofi12y ago

Because choosing the LR decay requires knowing the # of steps in advance. LR is too small after the 2T tokens, and changing it afterwards doesn't tend to help.

https://twitter.com/sherjilozair/status/1687837844729966592

jph002y ago

AFAIK re-warming it up and then gradually decreasing it again ought to work fine. Have you seen any research showing that it doesn't?

1 more reply

charcircuit2y ago

You could manually increase the learning rate or change the decay at any time.

naillo2y ago

ftxbro2y ago

> It is something I have been wondering about: why did Meta not keep the training process going on while the loss curves seemed to go down?

On small models the chinchilla optimal scaling stops training the model even when the model is still improving.

minimaxir2y ago· 4 in thread

brucethemoose22y ago

> The bottleneck of Llama 2 7B is that inference latency is still infeasible for Production use cases unless you have a good supply of expensive A100

?? A 3060 or a slightly bigger AMD/Intel GPU can stream llama 7B about as fast as someone can read, if not faster. A somewhat bigger consumer GPU can batch it and serve dozens of users.

I use 13B finetunes on my 2020 14" laptop all the time, with 6GB of VRAM and 16GB of CPU RAM.

I have seen many people on HN say this, and I can't help but wonder why the optimized, quantized llama implementations are flying under the radar.

littlestymaar2y ago

> ?? A 3060 or a slightly bigger AMD/Intel GPU can stream llama 7B about as fast as someone can read,

2 more replies

snovv_crash2y ago

That's still wildly too expensive if you want to make a profitable service that is scalable beyond VC capital injections.

cypress662y ago

1.1B with 3T tokens will never be comparable to 7B with 2T tokens.

Havoc2y ago· 3 in thread

What does “pretrain” mean in this context? It sounds like normal training

rodonn2y ago

GPT stands for Generative Pre-trained Transformer.

The "main" training step using huge amounts of inputs is called pre-training. The idea is that after that pre-training, you might fine tune the model for your specific use case.

Havoc2y ago

I see...that makes sense. Thanks for explaining

Filligree2y ago

As opposed to fine-tuning or in-context learning. It really is normal training.

RC_ITR2y ago· 3 in thread

Not to be a downer, but wasn’t one of OpenAI’s earliest discoveries that training small models on huge datasets leads to over-fitting?

It’s my understanding that the entire race to ever-more parameters was driven by that.

minimaxir2y ago

A workaround to overfitting is to train on so much distinct data that the model can't overfit.

Newer large datasets like the ones used here optimize for diversity. (e.g. SlimPajama is a heavily-deduped dataset)

ljlolel2y ago

Learn about the magic of double descent

RC_ITR2y ago

https://openai.com/research/deep-double-descent

Yeah, the line keeps going down as the model gets bigger. What's your point? That there's a hump in the middle?

syntaxing2y ago· 2 in thread

jlokier2y ago

Get together 5 people in that position and it's less than a week's income for the group. That sounds doable as a hobby for those lucky people.

More realistically, it's within range for a grant, or use of someone else's hardware if they aren't using it, as the sibling comment from wongarsu said.

Also cloud vendors sometimes give out large batches of credits to startups and such as marketing incentive to get future customers.

wongarsu2y ago

$38k, based on the "90 days using 16 A100-40G" and lambdalabs prices.

kristianp2y ago· 2 in thread

Could this be used as a source of speculative tokens for larger llama models?, as per https://github.com/ggerganov/llama.cpp/pull/2926

Also, when are we going to start seeing open weights MOE models being released?

thawab2y ago

1- yes, Gorgie twetted he is looking into it[0].

2- The only 2 i know of are airoboros[1] and Hydra which is still in progress.

[0] https://x.com/ggerganov/status/1698667093711880687?s=46&t=Jp...

[1] https://github.com/jondurbin/airoboros#lmoe

kristianp2y ago

Thanks. Yes, I've seen airoboros, it aims to use a mixture of fine-tunes of the base model if I recall correctly. Not a truly pre-trained MOE, but could be useful.

Hydra, is this it? https://github.com/SkunkworksAI/hydra-moe

1 more reply

sp3322y ago· 1 in thread

The link that says you can watch cross-entropy loss live is locked or broken.

fragebogen2y ago

Works now for me https://wandb.ai/lance777/lightning_logs/reports/metric-trai...

alexedw2y ago· 1 in thread

This is silly. Look at the loss and benchmark curves for the Pythia suite of models - the smaller models certainly did saturate and in fact began worsening.

2T not saturating on a 7B is very different from 3T on a 1B.

littlestymaar2y ago

That's the point of the experiment actually…

e12e2y ago· 1 in thread

Are they upsampling - whatever that means in the context of datasets?

AFAIU slim pajama is about 627B tokens, and Starcoder:

> approximately 250 Billion tokens.

Ed: I see TFA says:

> Combined Dataset Size - Around 950B tokens

> Total Tokens During Training - 3 trillion (slightly more than 3 epochs/1430k steps)

... but I'm not seeing how one becomes three? That's more like 1 trillion than 3 trillion tokens?

emikulic2y ago

Three epochs means it sees each token three times. The dataset is ~1T like you said.

Mxbonn2y ago

Couldn't immediately find it but who sponsors/pays for the compute?

29athrowaway2y ago

A tiny llama would be hard to distinguish from an alpaca.

j / k navigate · click thread line to collapse