A really robust memmove library routine should handle about eleven different factors, one of which is alignment. I don't know of ANY library that handles that right, probably because it's so hard. E.g. an unaligned source and an unaligned destination with different alignments is very hard. Usually they settle on aligning the destination (unaligned cache writes are more expensive). The true solution is to load the partial source word, then loop loading whole aligned source words, shifting and merging values across multiple registers to create aligned destination words to store.
That all requires about 16 different unrolled code loops to cover all the cases. Nobody bothers. So nobody ever got the best performance out of a general memmove anywhere. Sigh.
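
To make the shift-and-merge idea concrete, here is a minimal sketch in C, not a production routine. The name `copy_shift_merge` is made up for illustration. It assumes a little-endian machine, 64-bit words, a forward copy, and buffers that do not overlap, so it covers only one of the many cases a real memmove has to deal with. It uses a single loop with a variable shift count; a tuned version would presumably have one unrolled loop per misalignment so that each shift count is a constant.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { W = sizeof(uint64_t) };  /* word size in bytes (assumed 8) */

/* Hypothetical helper: forward copy of n bytes, non-overlapping buffers.
 * Assumes little-endian byte order. */
static void *copy_shift_merge(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Byte-copy until the destination is word-aligned. */
    while (n > 0 && (uintptr_t)d % W != 0) {
        *d++ = *s++;
        n--;
    }

    size_t off = (uintptr_t)s % W;  /* source misalignment once dst is aligned */

    if (off != 0 && n >= W) {
        size_t have = W - off;      /* source bytes before the next aligned word */
        uint64_t carry = 0;

        /* Load the partial source word byte by byte, so nothing is read
         * from before the start of the source buffer. */
        for (size_t i = 0; i < have; i++)
            carry |= (uint64_t)s[i] << (8 * i);
        s += have;                  /* s is now word-aligned */

        /* n counts bytes not yet stored; 'have' of them sit in carry.
         * Loop while a whole aligned source word is still in bounds. */
        while (n - have >= W) {
            uint64_t next, out;
            memcpy(&next, s, W);                 /* aligned whole-word load */
            out = carry | (next << (8 * have));  /* merge into one dest word */
            carry = next >> (8 * off);           /* leftover bytes for next round */
            memcpy(d, &out, W);                  /* aligned whole-word store */
            s += W;
            d += W;
            n -= W;
        }
        s -= have;                  /* rewind to the first byte not yet stored */
    } else {
        /* Source and destination share alignment (or too little data):
         * plain word-at-a-time copy. */
        while (n >= W) {
            memcpy(d, s, W);
            s += W;
            d += W;
            n -= W;
        }
    }

    /* Tail: copy whatever is left one byte at a time. */
    while (n-- > 0)
        *d++ = *s++;

    return dst;
}
```

The word loads and stores go through `memcpy` only to stay within C's aliasing rules; compilers typically turn each one into a single load or store. Backward copies for overlapping buffers, big-endian shifts, and wider vector registers would each need their own variant, which is where the loop count starts to multiply.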