zachthewf
5y ago
It doesn't sound like techno-babble to me. They've partitioned storage across nodes rather than replicating it on each node, so model size now scales with the number of nodes instead of being limited to what fits on a single node.
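For anyone unfamiliar with the idea, here's a minimal sketch of that kind of parameter partitioning (my own illustration in PyTorch, not the paper's code; shard and gather_full are hypothetical names): each rank persistently stores only its 1/N slice of the parameters and all-gathers the full set only transiently when it's needed for compute.

    # Sketch of partitioning parameters across ranks instead of replicating them.
    import torch
    import torch.distributed as dist

    def shard(flat_params, rank, world_size):
        # Keep only this rank's contiguous 1/world_size slice; the full
        # parameter vector is never stored persistently on any one device.
        n = flat_params.numel() // world_size
        return flat_params[rank * n:(rank + 1) * n].clone()

    def gather_full(local_shard, world_size):
        # Transiently reassemble the full parameters (e.g. for a forward
        # pass); the full copy can be freed again right after use.
        parts = [torch.empty_like(local_shard) for _ in range(world_size)]
        dist.all_gather(parts, local_shard)
        return torch.cat(parts)

Persistent storage per rank is 1/world_size of the model, which is why capacity grows with the number of nodes.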
p1esk
5y ago
But it's not clear how they managed to improve training on a single GPU: they say they can fit a 40B model on a single V100.
liuliu
5y ago
They offload parameters, gradients, and optimizer states (such as Adam's moment and velocity terms, which are exponential moving averages of the gradient and squared gradient) into CPU memory.
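As a rough illustration of what that buys you (my own sketch, not DeepSpeed's implementation; OffloadedAdamState is a hypothetical name): the fp32 master weights and both Adam moments live in pinned CPU RAM, and only a transient fp16 copy of the weights occupies GPU memory.

    import torch

    class OffloadedAdamState:
        def __init__(self, param_gpu, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
            # All optimizer state stays in pinned CPU memory.
            self.master = param_gpu.detach().float().cpu().pin_memory()  # fp32 weights
            self.m = torch.zeros_like(self.master).pin_memory()          # moment
            self.v = torch.zeros_like(self.master).pin_memory()          # velocity
            self.lr, self.betas, self.eps, self.t = lr, betas, eps, 0

        def step(self, grad_gpu):
            # Pull the gradient to CPU, run the Adam update there, and
            # push a fresh fp16 copy of the weights back to the GPU.
            g = grad_gpu.detach().float().cpu()
            self.t += 1
            b1, b2 = self.betas
            self.m.mul_(b1).add_(g, alpha=1 - b1)
            self.v.mul_(b2).addcmul_(g, g, value=1 - b2)
            m_hat = self.m / (1 - b1 ** self.t)
            v_hat = self.v / (1 - b2 ** self.t)
            self.master.addcdiv_(m_hat, v_hat.sqrt().add_(self.eps), value=-self.lr)
            return self.master.to(grad_gpu.device, dtype=torch.float16)

The GPU then only ever holds the fp16 weights and activations for the current step; the 12 bytes per parameter of fp32 state sit in CPU RAM.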
p1esk
5y ago
They did all that before (https://arxiv.org/abs/2101.06840), but they could only fit a model with 13B weights on a single V100.
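Rough numbers, my own back-of-the-envelope rather than anything from either paper: mixed-precision Adam needs about 16 bytes per parameter, 2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 master weights, moment, and velocity). So 13B parameters is ~208 GB of training state and 40B is ~640 GB, against 16-32 GB of V100 HBM. With everything but the fp16 weights offloaded, 13B of fp16 weights (~26 GB) still fits on a 32 GB card, but 40B (~80 GB) doesn't, so getting to 40B presumably means the weights themselves no longer stay resident on the GPU either.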