> My impression from reading the paper is most of the other optimizations (custom kernels, contiguous memory, checkpointing, etc) are orthogonal to the partitioning stuff
This is true; I include them as examples of the amount of engineering work involved, because using the partitioning itself as an example would require recapitulating their blog post :)
> But they seem to emphasize that this isn't what their partitioning is, and that's the part that perplexes me the most. To be specific, I'd like someone to explain how their magical zero-redundancy data parallel (termed ZeRO-DP in the paper) works and how it's different from model+pipeline parallel, and their paper is awfully sparse on that.
Again, https://www.microsoft.com/en-us/research/blog/deepspeed-extr... is a much better resource on this. There really isn't any magic going on, nor are many of these ideas (checkpointing, model state sharding, bucketing, JIT communication of new states interleaved with compute, etc.) new when considered in isolation. ZeRO is data + model + pipeline parallel, but optimized to the nines and actually usable as a production library.
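To make the sharding idea concrete, here is a minimal in-process sketch of the core trick behind ZeRO stage 1: every data-parallel rank keeps a full parameter replica, but optimizer state (here, SGD momentum) is partitioned so each rank stores only its 1/N shard. All names and the single-process simulation are illustrative assumptions, not DeepSpeed's actual API.

```python
# Sketch of ZeRO-style optimizer-state sharding (stage 1), simulated
# in one process with plain Python lists standing in for per-rank
# tensors. Illustrative only -- not DeepSpeed's implementation.

WORLD_SIZE = 4
N_PARAMS = 8          # divisible by WORLD_SIZE for simplicity
LR = 0.1

# Every rank holds a full replica of the parameters (plain data
# parallelism)...
params = [1.0] * N_PARAMS

# ...but each rank owns optimizer state for only its shard of the
# parameters. Replicating this state N times over is the redundancy
# ZeRO eliminates.
shard = N_PARAMS // WORLD_SIZE
momentum = {rank: [0.0] * shard for rank in range(WORLD_SIZE)}

def step(grads):
    """One update: each rank applies the optimizer to its own shard,
    then updated shards are all-gathered back into every replica."""
    for rank in range(WORLD_SIZE):
        lo = rank * shard
        for i in range(shard):
            momentum[rank][i] = 0.9 * momentum[rank][i] + grads[lo + i]
            params[lo + i] -= LR * momentum[rank][i]
    # In a real system an all-gather would broadcast each updated shard
    # to all ranks; in this single-process sketch params is shared, so
    # the gather is implicit.

step([0.5] * N_PARAMS)
print(params)  # every parameter updated, momentum stored only once
```

Later ZeRO stages apply the same partition-then-communicate pattern to gradients and to the parameters themselves, which is where the JIT communication interleaved with compute comes in.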