> My impression from reading the paper is most of the other optimizations (custom kernels, contiguous memory, checkpointing, etc) are orthogonal to the partitioning stuff
This is true; I include them as examples of the amount of engineering work involved, because using the partitioning itself as an example would require recapitulating their blog post :)
> But they seem to emphasize that this isn't what their partitioning is, and that's the part that perplexes me the most. To be specific, I'd like someone to explain how their magical zero-redundancy data parallel (termed ZeRO-DP in the paper) works and how it's different from model+pipeline parallel, and their paper is awfully sparse on that.
Again, https://www.microsoft.com/en-us/research/blog/deepspeed-extr... is a much better resource on this. There really isn't any magic going on, nor are many of these ideas (checkpointing, model state sharding, bucketing, JIT communication of new states interleaved with compute, etc.) new when considered in isolation. ZeRO is data + model + pipeline parallel, but optimized to the nines and actually usable as a production library.
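To make the sharding idea concrete, here is a minimal in-process sketch of the core trick behind ZeRO stage 1: every data-parallel rank keeps a full parameter replica, but optimizer state (here, SGD momentum) is partitioned so each rank stores only its 1/N shard. All names and the single-process simulation are illustrative assumptions, not DeepSpeed's actual API.

```python
# Sketch of ZeRO-style optimizer-state sharding (stage 1), simulated
# in one process with plain Python lists standing in for per-rank
# tensors. Illustrative only -- not DeepSpeed's implementation.

WORLD_SIZE = 4
N_PARAMS = 8          # divisible by WORLD_SIZE for simplicity
LR = 0.1

# Every rank holds a full replica of the parameters (plain data
# parallelism)...
params = [1.0] * N_PARAMS

# ...but each rank owns optimizer state for only its shard of the
# parameters. Replicating this state N times over is the redundancy
# ZeRO eliminates.
shard = N_PARAMS // WORLD_SIZE
momentum = {rank: [0.0] * shard for rank in range(WORLD_SIZE)}

def step(grads):
    """One update: each rank applies the optimizer to its own shard,
    then updated shards are all-gathered back into every replica."""
    for rank in range(WORLD_SIZE):
        lo = rank * shard
        for i in range(shard):
            momentum[rank][i] = 0.9 * momentum[rank][i] + grads[lo + i]
            params[lo + i] -= LR * momentum[rank][i]
    # In a real system an all-gather would broadcast each updated shard
    # to all ranks; in this single-process sketch params is shared, so
    # the gather is implicit.

step([0.5] * N_PARAMS)
print(params)  # every parameter updated, momentum stored only once
```

Later ZeRO stages apply the same partition-then-communicate pattern to gradients and to the parameters themselves, which is where the JIT communication interleaved with compute comes in.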