You split the big matrices into smaller matrices to dispatch the workload. But this adds communication overhead (roughly one synchronisation point per layer per token, so on the order of n_layers sequential sync points in total). In the official LLaMA implementation this is done transparently using RowParallelLinear, ColumnParallelLinear and ParallelEmbedding, see https://github.com/facebookresearch/llama/blob/main/llama/mo...
Transformers have multiple attention heads that can be computed independently, with their outputs then combined to produce the output of the layer. This allows the parameter space to be split among machines without having to transfer the parameters at each iteration.
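To make the column/row-parallel trick concrete, here's a minimal numpy sketch (my own toy example, not the official LLaMA code) of splitting a two-layer matmul across two hypothetical "devices": the column-parallel half needs no communication, and the row-parallel half needs exactly one sum (the all-reduce) to recover the single-device result:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))       # batch of 4, hidden size 8
W1 = rng.standard_normal((8, 16))     # up-projection
W2 = rng.standard_normal((16, 8))     # down-projection

# Reference: single-device forward pass.
ref = (x @ W1) @ W2

# Column-parallel: each "device" holds half of W1's columns and
# computes its slice of the intermediate activation independently.
W1_a, W1_b = np.split(W1, 2, axis=1)
h_a, h_b = x @ W1_a, x @ W1_b

# Row-parallel: each "device" holds the matching half of W2's rows
# and produces a partial output; the sum below stands in for the one
# all-reduce synchronisation point this pair of layers needs.
W2_a, W2_b = np.split(W2, 2, axis=0)
out = h_a @ W2_a + h_b @ W2_b

assert np.allclose(out, ref)  # identical to the unsplit computation
```

Each attention head's QKV projections shard the same way along the head dimension, which is why the heads can live on different machines with only that one sync per layer.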
In practice, they typically use servers with clusters of these machines, up to about 1000 GPUs in total (so around 80TB of memory, give or take a few?). This allows even the biggest models to be trained on large batches of several hundred, or even thousands, of elements (the total memory usage is _not_ proportional to the product of the number of parameters and the batch size, but it does grow as a function of both, and one of its terms is indeed that product). It makes for some very tricky engineering choices to get just the right data travelling across connections, avoiding as much as possible having to sync large amounts of data between different machines (so "chunking" things to stay in the 640GB range), with strategies such as ZeRO being published every now and then. Plus of course the practical effort to make the physical connections as fast as possible...
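For a rough sense of why the per-machine budget is so tight, here's a back-of-envelope estimate (my own approximate byte counts, not figures from any paper) assuming mixed-precision Adam training, which is commonly described as costing around 16 bytes per parameter in model and optimizer state:

```python
# Rough memory estimate for a 175B-parameter model under
# mixed-precision Adam: fp16 weights + fp16 grads + fp32 master
# weights + fp32 momentum + fp32 variance ≈ 16 bytes per parameter.
# (Approximate accounting; real runs vary by framework and config.)
params = 175e9
bytes_per_param = 2 + 2 + 4 + 4 + 4
state_tb = params * bytes_per_param / 1e12
print(f"~{state_tb:.1f} TB of model + optimizer state")  # ~2.8 TB
```

That ~2.8TB is before activations, which add the batch-size-dependent term, so even a handful of 640GB machines can't hold a naive replica each, hence ZeRO-style partitioning of optimizer state across the cluster.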
To get an idea of how hard these things are, take a look at how long the list of names in the published paper about BLOOM language model is :-)
I saw a reference that said GPT-3, with 96 decoder layers, was trained on a 400 GPU cluster, so that seems like the ballpark for a 175B parameter model. That's 50 of the hypothetical machines we talked about (well... really 100 for GPT-3, since back in those days the max was 40 or 48 GB per GPU).
I also wonder why NVIDIA (or Cerebras) isn't beefing up GPU memory. If someone sold a 1TB GPU, they could charge $100 grand easy. As I understood it, NVIDIA's GPU memory is just high-bandwidth memory (HBM), so they'd still make a profit?