0 points · npn · 16d ago · 0 comments

You're thinking in FP16. Nobody uses FP16 for inference anymore. The 400% is probably for FP4/INT4 computation.

Tensor core throughput is roughly inversely proportional to bit width across all generations (i.e., halving the precision doubles OPS), so comparing at 8-bit precision gives you the same generation-over-generation improvement ratio. A100/H100 didn't support 4-bit, if I remember correctly.

So FP4/INT4 will likely show the same ~30% improvement in OPS/W. You could get a separate improvement by reducing precision further, but going from 4-bit to 1-bit for a 4x improvement feels unlikely for now.
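To make the arithmetic above concrete, here is a minimal sketch. It assumes the idealized scaling described in this comment (OPS inversely proportional to bit width). The function name `scaled_ops` and the baseline values are hypothetical and are not real GPU specs; only the ~30% figure comes from the thread.

```python
def scaled_ops(ops_at_16bit: float, bits: int) -> float:
    """Idealized model: throughput is inversely proportional to bit width."""
    return ops_at_16bit * 16 / bits


# Hypothetical normalized OPS/W at 16-bit (illustrative values, not measured specs).
old_gen_fp16 = 1.0
new_gen_fp16 = 1.3  # the ~30% generational gain discussed in the thread

# Under this model the generational ratio is the same at every precision,
# because the precision factor cancels out of new / old.
for bits in (16, 8, 4):
    old = scaled_ops(old_gen_fp16, bits)
    new = scaled_ops(new_gen_fp16, bits)
    print(f"{bits:>2}-bit: generational gain = {new / old:.2f}x")

# The separate gain from dropping precision on the same hardware, 4-bit to 1-bit.
precision_gain_4_to_1 = scaled_ops(1.0, 1) / scaled_ops(1.0, 4)
print(f"4-bit to 1-bit on the same hardware: {precision_gain_4_to_1:.0f}x")
```

In this model a 4x jump can only come from the precision term, which is why it would need a move from 4-bit to 1-bit rather than a new hardware generation.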


EvgeniyZh · 16d ago
