> The M4 Max has 546GB/s of memory bandwidth and ~34TFLOPS (fp16) = ~68 GB/s, a ratio of ~8.02. Whereas NVIDIA RTX 4090 has 1008GB/s memory bandwidth and ~330TFLOPS (fp16) = ~660GB/s, a ratio of ~1.52.
Why compare FP16 throughput when you're running inference on INT4-quantized models? It seems like a misleading figure to use when it isn't the performance you're actually measuring.
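For reference, here's a rough sketch of the arithmetic in the quote, with the operand width pulled out as a parameter so you can see how much the result depends on the precision you pick. The bandwidth and FP16 TFLOPS figures are taken as-is from the quote. The function and variable names are mine, and the INT4 line assumes, purely for illustration, the same op rate with 0.5-byte operands. Real INT4 throughput differs per chip and isn't in the quoted numbers.

```python
# Sketch of the bandwidth-to-compute ratio from the quoted comment.
# Note: TFLOPS * bytes-per-operand is in TB/s, so the ratio below is
# GB/s of memory bandwidth per TB/s of operand traffic at peak compute.

def bandwidth_to_compute_ratio(bandwidth_gbps, tflops, bytes_per_operand):
    """Memory bandwidth (GB/s) divided by peak operand traffic (TB/s)."""
    operand_traffic_tbps = tflops * bytes_per_operand
    return bandwidth_gbps / operand_traffic_tbps

# (memory bandwidth in GB/s, FP16 TFLOPS), both copied from the quote.
CHIPS = {
    "M4 Max": (546, 34),
    "RTX 4090": (1008, 330),
}

def ratios(bytes_per_operand):
    """Ratio for every chip at the given operand width in bytes."""
    return {
        name: bandwidth_to_compute_ratio(bw, tflops, bytes_per_operand)
        for name, (bw, tflops) in CHIPS.items()
    }

fp16 = ratios(2.0)  # 2 bytes per FP16 operand, as in the quote

# ASSUMPTION (illustration only): same op rate, 0.5 bytes per INT4 operand.
# Replace the TFLOPS entries with measured INT4 rates if you have them.
int4_same_rate = ratios(0.5)

if __name__ == "__main__":
    for name in CHIPS:
        print(f"{name}: fp16={fp16[name]:.2f}, int4 (assumed)={int4_same_rate[name]:.2f}")
```

Under that same-rate assumption both ratios scale by the same factor, so the comparison only shifts if the two chips' INT4 rates differ from their FP16 rates by different amounts, which is exactly the number that's missing.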