Every major inference provider other than Hyperbolic Labs runs Llama 3.1 405B at FP8, FWIW (e.g. Together, Fireworks, Lepton), so comparing against FP32 is misleading, to say the least. Even Hyperbolic runs it at BF16.
Pretraining is typically done in BF16 mixed precision (with FP32 master weights), although some labs (e.g. Character AI, RIP) apparently trained in INT8: https://research.character.ai/optimizing-inference/
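For scale, here's a back-of-the-envelope sketch (the 405B parameter count is approximate) of the weight-only memory footprint at each of these precisions, ignoring KV cache, activations, and optimizer state:

    # Weight-only memory footprint of a ~405B-parameter model per dtype.
    # Assumes the nominal parameter count; real checkpoints vary slightly.
    PARAMS = 405e9

    BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP8": 1, "INT8": 1}

    for dtype, nbytes in BYTES_PER_PARAM.items():
        print(f"{dtype}: {PARAMS * nbytes / 1e9:,.0f} GB")

    # Output: FP32: 1,620 GB; BF16: 810 GB; FP8: 405 GB; INT8: 405 GB

FP8 halves the HBM the weights need relative to BF16, which is a big part of why providers serve at FP8 in the first place.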