Every vectorized string search algorithm (including those that treat an unsigned long as a "vector" of 8 bytes) currently needs three phases: a prelude that searches up to the first alignment boundary, a bulk phase that works on well-aligned blocks, and a postlude that searches the tail of the string, where the tail is shorter than the block size/alignment.
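To make the three phases concrete, here is a minimal sketch of a byte search that uses a 64-bit word as the "vector". The names `find_byte_aligned` and `has_zero_byte` are made up for illustration, and the block size of 8 is an assumption; a real implementation would also locate the match inside the flagged block instead of rescanning it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* SWAR test: nonzero iff at least one byte of w is zero. */
static int has_zero_byte(uint64_t w) {
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

/* Hypothetical helper: index of the first byte equal to c in s[0..n),
 * or n if there is none. */
size_t find_byte_aligned(const unsigned char *s, size_t n, unsigned char c) {
    const uint64_t pattern = 0x0101010101010101ULL * c; /* c in every byte */
    size_t i = 0;

    /* Prelude: byte by byte until s + i sits on an 8-byte boundary. */
    while (i < n && ((uintptr_t)(s + i) % sizeof(uint64_t)) != 0) {
        if (s[i] == c) return i;
        i++;
    }

    /* Bulk: whole 8-byte blocks, all at aligned addresses.
     * memcpy is used for the load only to stay clear of aliasing rules. */
    while (n - i >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);
        if (has_zero_byte(w ^ pattern)) break; /* match is in this block */
        i += sizeof(uint64_t);
    }

    /* Postlude: the tail shorter than one block, plus (in this sketch)
     * a rescan of the block that was flagged above. */
    for (; i < n; i++) {
        if (s[i] == c) return i;
    }
    return n;
}
```

For a buffer that starts one byte past an aligned address, the prelude alone can run up to seven iterations before the first block is examined.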
Using unaligned loads, you can get rid of the prelude, including the associated branches and intptr arithmetic, and just have to deal with the tail.
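As a rough sketch of what that looks like, the same kind of byte search can start its block loop at the first byte, whatever its address. The function name `find_byte_unaligned` is hypothetical. The load is written as a `memcpy` into a local, which is an assumption about how to express it: that is the usual way to spell an unaligned load in standard C today, since dereferencing a misaligned `uint64_t *` is the UB in question, and compilers commonly lower it to a single load on targets that allow unaligned access.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* SWAR test: nonzero iff at least one byte of w is zero. */
static int has_zero_byte(uint64_t w) {
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

/* Hypothetical helper: index of the first byte equal to c in s[0..n),
 * or n if there is none. No prelude and no pointer-to-integer arithmetic. */
size_t find_byte_unaligned(const unsigned char *s, size_t n, unsigned char c) {
    const uint64_t pattern = 0x0101010101010101ULL * c; /* c in every byte */
    size_t i = 0;

    /* Bulk: 8-byte blocks starting at s itself, aligned or not. */
    while (n - i >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w); /* stands in for the unaligned load */
        if (has_zero_byte(w ^ pattern)) break; /* match is in this block */
        i += sizeof(uint64_t);
    }

    /* Tail: fewer than 8 bytes left, plus (in this sketch) a rescan of
     * the block that was flagged above. */
    for (; i < n; i++) {
        if (s[i] == c) return i;
    }
    return n;
}
```

The only remaining special case is the tail, which is the point being made above.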
If you're comparing short-ish strings, almost all of the time is spent in the prelude and postlude, even if the entire substring fits in a register. This is a silly language limitation when the hardware can easily support the unaligned load.
In particular, it doesn't seem justified that what amounts to at most a tiny inefficiency in hardware turns into a very expensive class of bugs (UB).