Every major inference provider other than Hyperbolic Labs runs Llama 3.1 405B at FP8, FWIW (e.g. Together, Fireworks, Lepton), so comparing against FP32 is misleading, to say the least. Even Hyperbolic runs it at BF16.
Pretraining is typically done in BF16 mixed precision (with FP32 master weights), although some labs (e.g. Character AI, RIP) apparently trained in INT8: https://research.character.ai/optimizing-inference/
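For scale, here's a back-of-the-envelope sketch (the 405B parameter count is approximate) of the weight-only memory footprint at each of these precisions, ignoring KV cache, activations, and optimizer state:

    # Weight-only memory footprint of a ~405B-parameter model per dtype.
    # Assumes the nominal parameter count; real checkpoints vary slightly.
    PARAMS = 405e9

    BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP8": 1, "INT8": 1}

    for dtype, nbytes in BYTES_PER_PARAM.items():
        print(f"{dtype}: {PARAMS * nbytes / 1e9:,.0f} GB")

    # Output: FP32: 1,620 GB; BF16: 810 GB; FP8: 405 GB; INT8: 405 GB

FP8 halves the HBM the weights need relative to BF16, which is a big part of why providers serve at FP8 in the first place.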