Not sure how they do it specifically for LLMs, but you can use what's called model or tensor parallelism, where you split a single layer across multiple GPUs or even nodes.
Under the hood it's the same distributed matrix multiplication machinery you'd build with MPI, as far as I know.
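To make that concrete, here's a toy sketch of the core idea: shard a layer's weight matrix column-wise across "devices", let each compute its partial matmul, then gather the pieces. Pure Python for illustration only; real frameworks do the sharding and the gather with NCCL/MPI collectives on actual GPUs.

```python
def matmul(A, B):
    """Plain matrix multiply on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

x = [[1.0, 2.0],
     [3.0, 4.0]]                 # activations: 2x2

W = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 2.0]]       # full weight matrix: 2x4

# Shard W column-wise: "device" 0 holds cols 0-1, "device" 1 holds cols 2-3.
W0 = [row[:2] for row in W]
W1 = [row[2:] for row in W]

# Each device computes its slice of the output independently...
y0 = matmul(x, W0)
y1 = matmul(x, W1)

# ...and an all-gather (here just a row-wise concat) reassembles the result.
y = [r0 + r1 for r0, r1 in zip(y0, y1)]

assert y == matmul(x, W)  # identical to the unsharded layer
```

The nice property is that each shard's matmul needs no communication at all; you only pay for the gather at the end (or, with row-wise sharding, an all-reduce of partial sums instead).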
I think DeepSpeed has bespoke transformer kernels that handle this stuff specifically.