
0 points · Kon-Peki · 4y ago · 0 comments

> Apple doesn't have matrix math accelerators in their current GPUs.

That's because the M1 has a dedicated matrix math accelerator called AMX [1]. I've used it with both Swift and pure C.

[1] https://medium.com/swlh/apples-m1-secret-coprocessor-6599492...


4 comments · 1 top-level

my123 · 4y ago · 3 in thread

AMX is indeed very nice for FP64, where consumer GPUs aren't an alternative at all.

However, for lower precisions (which is what deep learning uses), you're much better off with a GPU.

Have you actually benchmarked that? I think (someone please correct me if I'm way off here) the AMX instructions can hit ~2.8 TFLOPS (FP16) per co-processor, and there are 2 on the 7-core M1. That's 5.6 TFLOPS vs. the 4.6 TFLOPS the GPU can hit.

my123 · 4y ago

Yeah, that's within the M1 family, but once you get to dGPUs it doesn't even come close.

30 TFLOPS for a 3080 for vector FP32, but 119 TFLOPS FP16 dense with FP16 accumulate, 59.5 with FP32 accumulate, and if you exploit sparsity then that can go even higher.
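To put the figures from this thread side by side, here is a rough Python sketch. Every number is a peak value as quoted in the comments above (nothing here is measured), and the `quoted` table and `ratios` helper are just illustrative names.

```python
# Peak throughput in TFLOPS, copied from the comments in this thread.
# These are quoted claims, not benchmark results.
quoted = {
    "M1 AMX FP16 (2 x 2.8)": 2 * 2.8,
    "M1 GPU": 4.6,
    "RTX 3080 FP32 vector": 30.0,
    "RTX 3080 FP16 dense, FP32 accumulate": 59.5,
    "RTX 3080 FP16 dense, FP16 accumulate": 119.0,
}


def ratios(figures, baseline):
    """Express every figure as a multiple of the chosen baseline entry."""
    base = figures[baseline]
    return {name: value / base for name, value in figures.items()}


rel = ratios(quoted, "M1 AMX FP16 (2 x 2.8)")
for name, r in rel.items():
    print(f"{name}: {r:.1f}x the quoted AMX figure")
```

Peak numbers like these say nothing about sustained throughput on a real workload, so treat the ratios as an upper-bound comparison only.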

johndough · 4y ago

Often the limiting factor is memory bandwidth instead of raw FLOPS, so dealing with 4 times larger data types (FP64 vs FP16) is a disadvantage.
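A back-of-the-envelope sketch of that bandwidth point, in Python. It uses a simple roofline-style bound for a matrix-vector product; the 100 GB/s bandwidth and 1 TFLOPS peak are placeholder assumptions, not figures for any real chip, and `gemv_bounds` is a made-up helper name.

```python
# Bytes per element for each floating-point format.
BYTES = {"fp16": 2, "fp32": 4, "fp64": 8}


def gemv_bounds(n, dtype, bandwidth_gbs, peak_tflops):
    """Lower bounds (seconds) for an n x n matrix-vector product.

    Returns (t_compute, t_memory); the slower of the two limits the kernel.
    """
    flops = 2 * n * n  # one multiply and one add per matrix entry
    # Read the matrix and the input vector, write the output vector.
    bytes_moved = BYTES[dtype] * (n * n + 2 * n)
    t_compute = flops / (peak_tflops * 1e12)
    t_memory = bytes_moved / (bandwidth_gbs * 1e9)
    return t_compute, t_memory


# Placeholder hardware numbers (assumptions for illustration only).
N, BANDWIDTH_GBS, PEAK_TFLOPS = 4096, 100, 1

for dtype in ("fp16", "fp64"):
    t_compute, t_memory = gemv_bounds(N, dtype, BANDWIDTH_GBS, PEAK_TFLOPS)
    bound = "memory" if t_memory > t_compute else "compute"
    print(f"{dtype}: compute {t_compute:.2e} s, memory {t_memory:.2e} s ({bound}-bound)")
```

With these placeholder numbers both cases are memory-bound, so the FP64 run takes about 4 times as long as the FP16 one regardless of how many FLOPS the unit can do.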
