MESI takes a non-trivial amount of time to run. It's usually mediated at the L3 cache, which is frequently clocked slower than the main core, so it's a non-trivial number of cycles for a MESI state transition to happen (think at least ~40 cycles).
Compare that with a cacheline that's already Exclusive on a core: Writing to this cacheline is a pure L1 cache write at ~1 cycle, no MESI involved, no L3 cache involved.
Note that the typical userspace space mutex (some memory location that's modified by compare-exchange) is implicitly relying on MESI whenever two cores race for the lock, or whenever a lock is release on one core and then taken on another, etc etc.
So if your userspace coherency can be handled via some sharding that avoids needing MESI to run, it can go much faster than relying on the CPU's "internal mutex" aka MESI.
To directly address "is it possible that a hardware implementation of an algorithm could be slower than its software variant": The point is that with RSEQ, you can write a _different_ algorithm that's faster than an implementation relying on hardware mediated MESI. Instead of having memory that's accessed for a bunch of different cores, there's a chunk of memory per core, and each core almost always writes only to it's allocated chunk of memory.