https://github.com/microsoft/DeepSpeed/issues/846
Also, the specific problem described in that Issue was due to a bug I found in DeepSpeed that has since been corrected.
> ZeRO removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency.
In transformer models, a big chunk of memory goes to the parameters and the optimizer states (because vanilla SGD isn't used there). A memory optimization technique that removes the duplication of parameters on each GPU, or offloads them entirely to the CPU, makes sense.
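To put rough numbers on that, here is a back-of-the-envelope sketch of per-GPU memory for the model states only (no activations, buffers, or fragmentation). It assumes mixed-precision Adam with the byte counts used in the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 optimizer state (master weights, momentum, variance). The function name `zero_memory_gb` is made up for illustration and is not a DeepSpeed API.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    Assumes mixed-precision Adam. stage 0 is plain data parallelism;
    stages 1-3 partition optimizer states, then gradients, then parameters.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    params = 2 * num_params   # fp16 parameters
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 master weights + momentum + variance

    if stage >= 1:
        optim /= num_gpus     # stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus     # stage 2: also shard gradients
    if stage >= 3:
        params /= num_gpus    # stage 3: also shard parameters

    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    # Example: a 7.5B-parameter model on 64 GPUs.
    for stage in range(4):
        print(stage, round(zero_memory_gb(7.5e9, 64, stage), 2))
```

With vanilla SGD (no momentum) the 12-byte optimizer term would shrink a lot, which is why the optimizer state matters so much more for Adam-trained transformers.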
In computer vision, a big chunk of memory is held by the forward-layer activations, and the memory optimization technique applicable in those cases would be binomial checkpointing.
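For anyone unfamiliar with the idea, here is a toy sketch of activation checkpointing on a chain of scalar layers. Note that this is the simpler uniform-segment variant (keep one activation per segment, recompute the rest during the backward pass), not the binomial schedule itself; the layer function and all names are invented for illustration.

```python
import math


def forward_layer(w, x):
    return math.tanh(w * x)


def backward_layer(w, x, grad_out):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)


def grad_full_storage(weights, x0):
    """Gradient of the chain output w.r.t. x0, keeping every activation."""
    acts = [x0]
    for w in weights:
        acts.append(forward_layer(w, acts[-1]))
    grad = 1.0
    for i in reversed(range(len(weights))):
        grad = backward_layer(weights[i], acts[i], grad)
    return grad, len(acts)  # second value: number of activations held


def grad_checkpointed(weights, x0, segment):
    """Same gradient, but only the input of each segment is kept."""
    n = len(weights)
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # keep the input to layer i
        x = forward_layer(w, x)

    grad = 1.0
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, n)
        # Recompute the layer inputs inside this segment only.
        acts = [checkpoints[start]]
        for i in range(start, end - 1):
            acts.append(forward_layer(weights[i], acts[-1]))
        peak = max(peak, len(checkpoints) + len(acts))
        for i in reversed(range(start, end)):
            grad = backward_layer(weights[i], acts[i - start], grad)
    return grad, peak


if __name__ == "__main__":
    ws = [0.9 + 0.01 * i for i in range(16)]
    print(grad_full_storage(ws, 0.5))
    print(grad_checkpointed(ws, 0.5, segment=4))
```

The price is roughly one extra forward pass of compute in exchange for holding far fewer activations at once.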
For anyone who still thinks the blog post has substance: Saying that they partitioned the optimizer state, params, etc. to have no redundancies is kind of "duh," sort of like saying "we solved the problem using coding and algorithms." It's obvious that we want to eliminate redundancies to maximize the effective VRAM; it's not like nobody thought of not having redundancies before. The problem is that in general, distributed training is a weird balancing act between redundancy, network usage, compute, etc. The existing methods (model/pipeline/data parallelism, gradient checkpointing/accumulation, etc.) all have their pros and cons. Unless ZeRO-3 is doing something crazy, it has to be giving something up to get to zero redundancy, and knowing what that is would be very important.
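On the "what does it give up" question, the ZeRO paper's own analysis puts the cost in network traffic. A rough sketch of that accounting, assuming ring-style collectives where an all-reduce is a reduce-scatter plus an all-gather, each moving about one model's worth of elements per GPU. The function name is hypothetical, and this ignores CPU-offload traffic and latency effects entirely.

```python
def comm_volume_per_step(num_params, stage):
    """Approximate elements each GPU sends per training step.

    Stages 0-2: one reduce-scatter of gradients plus one all-gather
    (of gradients for plain data parallelism, of updated parameters
    for the partitioned stages).
    Stage 3: parameters are all-gathered for the forward pass and again
    for the backward pass, plus a reduce-scatter of gradients.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if stage < 3:
        return 2 * num_params
    return 3 * num_params


if __name__ == "__main__":
    base = comm_volume_per_step(7.5e9, 0)
    for stage in range(4):
        print(stage, comm_volume_per_step(7.5e9, stage) / base)
```

Under these assumptions, partitioning optimizer states and gradients costs no extra volume over plain data parallelism, and partitioning the parameters as well costs about 1.5x.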
If someone could ELI5 how ZeRO actually works, that would be nice.
I haven’t tried this on transformers, and maybe that’s what breaks down here, but in “classic” supervised settings I’ve found SGD with schedule tuning to be just as fast as Adam.