The guide builds up to this final chapter (linked) on how to train a very large model, such as Llama 3.1 405B, on a large cluster with plain PyTorch.
Everything is written using the direct PyTorch APIs (other than the model code, which uses `transformers` models).
If there are topics you'd like to see covered, feel free to open an issue in the repo; contributions are welcome.
I'm investigating adding a chapter on tensor parallelism, but its support in PyTorch is still in the early stages.