TLDR: SIMD intrinsics are great. But, their performance in debug builds is surprisingly bad.
The comments in the SSE2 version are a bit odd as it references MMX, and the Pentium M and Efficeon CPUs. Those CPUs are ancient -- 2003/2004 era. The vectorized code you have also uses SSE2 and not MMX, which is important since SSE2 is double the width and has different performance characteristics from MMX. IIRC, Intel CPUs didn't start supporting SHA until ~2019 with Ice Lake, so the target for non-hardware-accelerated vectorized SHA1 for Intel CPUs would be mostly Skylake-based.