Zero-3 Offload: Scale DL models to trillion parameters without code changes

(deepspeed.ai)

97 points · ghosthamlet · 5y ago · 48 comments

joshlk5y ago· 4 in thread

GPT-NeoX is an example project that uses DeepSpeed and ZeRO-3 offloading. The wider project intends to train a GPT-3-sized model and release it freely to the world.

https://github.com/EleutherAI/gpt-neox

ma2rten5y ago

It seems like ZeRO-3 doesn't work for them:

https://github.com/EleutherAI/gpt-neox/issues/171

stellaathena5y ago

Hi! I’m the one who wrote this code. My ZeRO-3 implementation is currently not working, but I’ve spoken with DeepSpeed devs and they’ve explained to me what I’ve been doing wrong. I haven’t had time to implement the fix but I don’t see any reason to assume it won’t work.

https://github.com/microsoft/DeepSpeed/issues/846

Also, the specific problem described in that Issue was due to a bug I found in DeepSpeed that has since been corrected.

joshlk5y ago

Looks like they got it working recently https://github.com/EleutherAI/gpt-neox/pull/178

dqpb5y ago

Did you even read through the issue? I don't see anything that indicates it won't work.

dataangel5y ago· 3 in thread

ELI5? All this techno babble just sounds like "it's faster because we optimized it". What are the nontrivial, new fundamental tricks?

jonbaer5y ago

I think there is some explanation (on the previous model?) here, https://www.youtube.com/watch?v=tC01FRB0M7w

jiofih5y ago

Third paragraph or so in the overview:

> ZeRO removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency

dataangel5y ago

Yeah that would be the techno-babble. I've been working on a machine learning pipeline for 6 years and I still have no idea what this means.
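
To put rough numbers on the quoted paragraph, here is a minimal sketch. It assumes the mixed-precision Adam accounting from the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. The function name and the 7.5B-parameter / 64-GPU figures are illustrative only, and activations and temporary buffers are ignored.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state bytes per GPU; stage 0 is plain data parallelism."""
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 master weights + Adam momentum + variance
    if stage >= 1:            # stage 1 partitions optimizer states
        optim /= num_gpus
    if stage >= 2:            # stage 2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:            # stage 3 also partitions the parameters
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    n, gpus = 7.5e9, 64       # illustrative model size and GPU count
    for stage in range(4):
        gb = zero_bytes_per_gpu(n, gpus, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

With replication every GPU holds all 16 bytes per parameter; with stage 3 each GPU holds only its 1/N slice and fetches the rest when a layer needs it.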

stephenroller5y ago· 2 in thread

Support for this was also added to [Fairscale](https://fairscale.readthedocs.io/en/latest/) and [Fairseq](https://github.com/pytorch/fairseq) last week. In particular, the Fairscale implementation can be used in any PyTorch project without requiring the use of the DeepSpeed trainer.

diptanu5y ago

What are the relevant commits in Fairseq for this? I couldn't figure out the changes by looking at the commits from last week.

stephenroller5y ago

https://github.com/pytorch/fairseq/pull/3331 and https://github.com/pytorch/fairseq/pull/3327

vladf5y ago· 2 in thread

Alternatively, one could get rid of the memory used by optimizers entirely by switching to vanilla SGD.

I haven’t tried this on transformers and maybe that’s what breaks down here but in “classic” supervised settings I’ve found SGD with schedule tuning just as fast as Adam.
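
As a rough illustration of the memory at stake, a sketch that counts only the extra per-parameter buffers each optimizer keeps. The table and function name are made up for this example, and fp32 state is assumed. Mixed-precision training typically also keeps an fp32 master copy of the weights regardless of optimizer, which is not counted here.

```python
# Extra buffers each optimizer keeps per parameter, beyond weights and gradients.
STATE_BUFFERS = {
    "sgd": 0,           # vanilla SGD: no per-parameter state
    "sgd_momentum": 1,  # one momentum buffer
    "adam": 2,          # first- and second-moment estimates
}


def optimizer_state_gb(num_params, optimizer, bytes_per_value=4):
    """Optimizer-state memory in GB, assuming fp32 (4-byte) state values."""
    return STATE_BUFFERS[optimizer] * num_params * bytes_per_value / 1e9


if __name__ == "__main__":
    for name in STATE_BUFFERS:
        print(f"{name}: {optimizer_state_gb(1e9, name):.1f} GB for a 1B-parameter model")
```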

gwern5y ago

SGD doesn't work on large Transformers, no. You need something like AdamW.

The_rationalist5y ago

Mish is generally superior to RadamW https://lessw.medium.com/meet-mish-new-state-of-the-art-ai-a...

andrewprock5y ago· 2 in thread

How much data do you need to mitigate the risk of over fitting a trillion parameter model?

gwern5y ago

You ideally need ~500GB of text, or so. EleutherAI's The Pile was designed to be just big enough to fit a 1t GPT efficiently, and you can get the various scaling curves out of the OA-related scaling papers. (You want the amount of data that fits into a single epoch, because if you reuse data, you get less bang for the FLOPs buck, and FLOPS constraints are right now much more binding than data or model size.)

andrewprock5y ago

This feels off by a couple of orders of magnitude, unless a significant number of the parameters are not independent.

bevenky5y ago· 1 in thread

This is also being added to pytorch

https://github.com/pytorch/pytorch/pull/46750

minimaxir5y ago

I don't think that's the Stage 3 announced in this blog post, but it's def a framework for it.

FL33TW00D5y ago

Huggingface has been working on implementing this into their library, and it has some pretty amazing effects on the size of models you can train on a simple Colab.

https://huggingface.co/blog/zero-deepspeed-fairscale
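
For a sense of what "without code changes" looks like in practice, a hedged sketch of a DeepSpeed-style config that enables stage 3 with CPU offload. The key names follow the DeepSpeed configuration docs as I understand them and may differ between versions (older releases used different offload keys); the batch-size and accumulation values are placeholders, not recommendations.

```python
import json

# Assumed key names; check them against the DeepSpeed version you have installed.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder value
    "gradient_accumulation_steps": 8,      # placeholder value
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition optimizer states, gradients, and parameters
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Serialize to the JSON form that would be saved as a config file.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

The JSON would normally be saved to a file and handed to the trainer; the training loop itself stays untouched.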

ansk5y ago

Question for someone knowledgeable about this: if I have a model which is large -- but small enough that I can fit a single training example on GPU -- does this approach offer speedups compared to simple gradient accumulation? Or is this only useful for models which are so large that the model parameters themselves are overwhelming GPU memory?
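
For reference, "simple gradient accumulation" here means summing gradients over several small micro-batches before taking one optimizer step. A toy sketch in plain Python (a 1-D model `y = w * x` with squared error; all names and data are illustrative) showing that, with equal-sized micro-batches, the averaged result matches the full-batch gradient:

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error w.r.t. w for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average per-micro-batch gradients, processing one micro-batch at a time."""
    total, steps = 0.0, 0
    for i in range(0, len(xs), micro_batch_size):
        total += grad_mse(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
        steps += 1
    return total / steps


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(full, accum)
```

Accumulation trades time for activation memory; it does nothing about the memory taken by parameters, gradients, and optimizer state, which is the part ZeRO partitions.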

alphagrep123455y ago

Simple 10 min overview/tutorial (official) if someone is interested - https://www.youtube.com/watch?v=ovQC7FqXHXk

The_rationalist5y ago

See also zeroth order backpropagation, which allows 300X faster training while not reducing throughput that much: https://arxiv.org/abs/2011.08895 How much does ZeRO-3 affect accuracy?

singhrac5y ago

For those searching, DeepSpeed is implemented as a set of C++/CUDA extensions on top of PyTorch (compiled using their JIT).

bionhoward5y ago

please hook this up to Jax!

mchusma5y ago

This is super impressive. I could not figure out for a while who exactly was running this project, but it looks like it's Microsoft. Great work!
