To perform a multiplication on CPU, even SIMD, that values have to fetched and converted to a form the CPU has multipliers for. This means smaller numeric types penalised. For a 128-bit memory bus, an NPU can fetch 32 4-bit values per transfer; the best case for a CPU is 16 8-bit values.
Details are scant on Microsoft's NPU, but it probably has many parallel multipliers; either in the form of tensor cores or a systolic array. The effective number of matmul's per second (or per memory operation) is higher.