zachthewf
5y ago
It doesn't sound like techno-babble to me. They've partitioned storage across nodes rather than replicating it on each node, so model size now scales with the number of nodes instead of being limited to what fits on a single node.
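For anyone unfamiliar with the idea, here's a minimal sketch of that kind of parameter partitioning (my own illustration in PyTorch, not the paper's code; shard and gather_full are hypothetical names): each rank persistently stores only its 1/N slice of the parameters and all-gathers the full set only transiently when it's needed for compute.

    # Sketch of partitioning parameters across ranks instead of replicating them.
    import torch
    import torch.distributed as dist

    def shard(flat_params, rank, world_size):
        # Keep only this rank's contiguous 1/world_size slice; the full
        # parameter vector is never stored persistently on any one device.
        n = flat_params.numel() // world_size
        return flat_params[rank * n:(rank + 1) * n].clone()

    def gather_full(local_shard, world_size):
        # Transiently reassemble the full parameters (e.g. for a forward
        # pass); the full copy can be freed again right after use.
        parts = [torch.empty_like(local_shard) for _ in range(world_size)]
        dist.all_gather(parts, local_shard)
        return torch.cat(parts)

Persistent storage per rank is 1/world_size of the model, which is why capacity grows with the number of nodes.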
p1esk
5y ago
But it's not clear how they managed to improve training on a single GPU: they say they can fit a 40B model on a single V100.
liuliu
5y ago
They offload parameters, gradients, and optimizer states (such as Adam's moment and velocity terms, which are exponential moving averages of the gradient and squared gradient) into CPU memory.
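As a rough illustration of what that buys you (my own sketch, not DeepSpeed's implementation; OffloadedAdamState is a hypothetical name): the fp32 master weights and both Adam moments live in pinned CPU RAM, and only a transient fp16 copy of the weights occupies GPU memory.

    import torch

    class OffloadedAdamState:
        def __init__(self, param_gpu, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
            # All optimizer state stays in pinned CPU memory.
            self.master = param_gpu.detach().float().cpu().pin_memory()  # fp32 weights
            self.m = torch.zeros_like(self.master).pin_memory()          # moment
            self.v = torch.zeros_like(self.master).pin_memory()          # velocity
            self.lr, self.betas, self.eps, self.t = lr, betas, eps, 0

        def step(self, grad_gpu):
            # Pull the gradient to CPU, run the Adam update there, and
            # push a fresh fp16 copy of the weights back to the GPU.
            g = grad_gpu.detach().float().cpu()
            self.t += 1
            b1, b2 = self.betas
            self.m.mul_(b1).add_(g, alpha=1 - b1)
            self.v.mul_(b2).addcmul_(g, g, value=1 - b2)
            m_hat = self.m / (1 - b1 ** self.t)
            v_hat = self.v / (1 - b2 ** self.t)
            self.master.addcdiv_(m_hat, v_hat.sqrt().add_(self.eps), value=-self.lr)
            return self.master.to(grad_gpu.device, dtype=torch.float16)

The GPU then only ever holds the fp16 weights and activations for the current step; the 12 bytes per parameter of fp32 state sit in CPU RAM.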
p1esk
5y ago
They did all that before (https://arxiv.org/abs/2101.06840), but they could only fit a model with 13B weights on a single V100.
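Rough numbers, my own back-of-the-envelope rather than anything from either paper: mixed-precision Adam needs about 16 bytes per parameter, 2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 master weights, moment, and velocity). So 13B parameters is ~208 GB of training state and 40B is ~640 GB, against 16-32 GB of V100 HBM. With everything but the fp16 weights offloaded, 13B of fp16 weights (~26 GB) still fits on a 32 GB card, but 40B (~80 GB) doesn't, so getting to 40B presumably means the weights themselves no longer stay resident on the GPU either.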