Have you actually benchmarked that? I think (someone please correct me if I'm way off here) the AMX instructions can hit ~2.8 TFLOPS (FP16) per coprocessor, and there are 2 on the 7-core M1. That's 5.6 TFLOPS vs. the 4.6 TFLOPS the GPU can hit.
Yeah, that holds within the M1 family, but compare it against dGPUs and it doesn't even come close.
A 3080 does 30 TFLOPS of vector FP32, but 119 TFLOPS of dense FP16 with FP16 accumulate (59.5 TFLOPS with FP32 accumulate), and if you exploit sparsity that can go even higher.
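
For scale, here's a quick back-of-the-envelope comparison using only the peak numbers quoted in this thread. These are claimed peaks, not benchmark results, and the variable names are just made up for the sketch:

```python
# Peak throughput figures as quoted in this thread (TFLOPS).
# These are claimed peaks, NOT measurements.
AMX_FP16_PER_UNIT = 2.8   # per AMX coprocessor (unverified claim from above)
AMX_UNITS = 2             # coprocessors on the M1, per the comment above
M1_GPU = 4.6              # M1 GPU figure quoted above

RTX_3080 = {
    "vector FP32": 30.0,
    "dense FP16, FP16 accumulate": 119.0,
    "dense FP16, FP32 accumulate": 59.5,
}

# Combined AMX peak: 2 coprocessors at 2.8 TFLOPS each.
amx_total = AMX_FP16_PER_UNIT * AMX_UNITS

print(f"M1 AMX total: {amx_total:.1f} TFLOPS "
      f"({amx_total / M1_GPU:.2f}x the M1 GPU)")

# How far ahead the 3080 is in each mode, relative to the AMX total.
for mode, tflops in RTX_3080.items():
    print(f"3080 {mode}: {tflops} TFLOPS "
          f"({tflops / amx_total:.1f}x the AMX total)")
```

Even the slowest 3080 mode listed here is several times the combined AMX figure, which is the whole point: the AMX-vs-GPU question only stays close inside the M1 family.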