You split the big matrices into smaller matrices to dispatch the workload. But this adds communication overhead (roughly one synchronisation point per layer per token, so on the order of n_layers sequential sync points in total). In the official LLaMA implementation this is done transparently using RowParallelLinear, ColumnParallelLinear and ParallelEmbedding, see https://github.com/facebookresearch/llama/blob/main/llama/mo...
Transformers have multiple attention heads that can be computed independently, with their outputs then combined to produce the output of the layer. This allows the parameter space to be split among machines without having to transfer the parameters at each iteration.
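To make the column/row-parallel trick concrete, here's a minimal numpy sketch (my own toy example, not the official LLaMA code) of splitting a two-layer matmul across two hypothetical "devices": the column-parallel half needs no communication, and the row-parallel half needs exactly one sum (the all-reduce) to recover the single-device result:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))       # batch of 4, hidden size 8
W1 = rng.standard_normal((8, 16))     # up-projection
W2 = rng.standard_normal((16, 8))     # down-projection

# Reference: single-device forward pass.
ref = (x @ W1) @ W2

# Column-parallel: each "device" holds half of W1's columns and
# computes its slice of the intermediate activation independently.
W1_a, W1_b = np.split(W1, 2, axis=1)
h_a, h_b = x @ W1_a, x @ W1_b

# Row-parallel: each "device" holds the matching half of W2's rows
# and produces a partial output; the sum below stands in for the one
# all-reduce synchronisation point this pair of layers needs.
W2_a, W2_b = np.split(W2, 2, axis=0)
out = h_a @ W2_a + h_b @ W2_b

assert np.allclose(out, ref)  # identical to the unsplit computation
```

Each attention head's QKV projections shard the same way along the head dimension, which is why the heads can live on different machines with only that one sync per layer.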
In practice, they typically use servers with clusters of these machines, up to about 1000 GPUs in total (so around 80TB of memory, give or take a few?). This allows even the biggest models to be trained on large batches of several hundred, or even thousands, of elements (the total memory usage is _not_ proportional to the product of the number of parameters and the batch size, but it does grow as a function of both, and one of its terms is indeed that product). It makes for some very tricky engineering choices to get just the right data travelling across connections, avoiding as much as possible having to sync large amounts of data between different machines (so "chunking" things to stay in the 640GB range), with strategies such as ZeRO being published every now and then. Plus of course the practical effort to make the physical connections as fast as possible...
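For a rough sense of why the per-machine budget is so tight, here's a back-of-envelope estimate (my own approximate byte counts, not figures from any paper) assuming mixed-precision Adam training, which is commonly described as costing around 16 bytes per parameter in model and optimizer state:

```python
# Rough memory estimate for a 175B-parameter model under
# mixed-precision Adam: fp16 weights + fp16 grads + fp32 master
# weights + fp32 momentum + fp32 variance ≈ 16 bytes per parameter.
# (Approximate accounting; real runs vary by framework and config.)
params = 175e9
bytes_per_param = 2 + 2 + 4 + 4 + 4
state_tb = params * bytes_per_param / 1e12
print(f"~{state_tb:.1f} TB of model + optimizer state")  # ~2.8 TB
```

That ~2.8TB is before activations, which add the batch-size-dependent term, so even a handful of 640GB machines can't hold a naive replica each, hence ZeRO-style partitioning of optimizer state across the cluster.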
To get an idea of how hard these things are, take a look at how long the list of names in the published paper about BLOOM language model is :-)
I saw a reference that said GPT-3, with 96 decoder layers, was trained on a 400 GPU cluster, so that seems like the ballpark for a 175B parameter model. That's 50 of the hypothetical machines we talked about (well... really 100 for GPT-3, since back in those days the max was 40 or 48 GB per GPU).
I also wonder why NVIDIA (or Cerebras) isn't beefing up GPU memory. If someone sold a 1TB GPU, they could charge $100 grand easy. As I understood it, NVIDIA's GPU memory is just high-bandwidth memory (HBM), so they'd still make a profit?