It faulted the first time I used it. I fixed the bug (alignment of the source), ran it again, and it faulted again.
So I spent 10 minutes writing a test: copy every length from 0 to 128 bytes, from every source buffer offset 0 to 128, to every destination buffer offset 0 to 128. Simple, overkill, right?
11 bugs later, the damned memory copy thing worked. 11.
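For anyone who wants to try the same thing, here is a rough sketch of that kind of exhaustive test in C. It is a reconstruction, not the original test: the `copy_fn` signature, the guard byte value, and the names `run_copy_test`, `reference_copy` and `buggy_copy` are all made up for illustration. It checks the copied bytes and also that nothing outside the destination range got touched.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define GUARD 0xA5u  /* fill value for bytes the copy must not touch (arbitrary) */

/* Assumed signature of the routine under test (hypothetical). */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Plain byte loop, used here as a known-good stand-in. */
void reference_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
}

/* Deliberately broken: copies whole 4-byte groups and forgets the tail. */
void buggy_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n & ~(size_t)3);
}

/* Try every length, source offset and destination offset in 0..max
   (max is clamped to 128). Returns the number of failing combinations. */
unsigned long run_copy_test(copy_fn copy, size_t max)
{
    enum { CAP = 128 };
    /* Largest offset + largest length + one trailing guard byte. */
    static unsigned char src[2 * CAP + 1], dst[2 * CAP + 1];
    unsigned long failures = 0;

    if (max > CAP)
        max = CAP;

    /* Position-dependent pattern so shifted or swapped bytes show up. */
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(i + 1);

    for (size_t len = 0; len <= max; len++)
        for (size_t soff = 0; soff <= max; soff++)
            for (size_t doff = 0; doff <= max; doff++) {
                memset(dst, GUARD, sizeof dst);
                copy(dst + doff, src + soff, len);

                /* Inside the range: must match the source.
                   Outside the range: must still be the guard value. */
                int ok = 1;
                for (size_t i = 0; i < sizeof dst; i++) {
                    unsigned char want = (i >= doff && i < doff + len)
                                             ? src[soff + (i - doff)]
                                             : (unsigned char)GUARD;
                    if (dst[i] != want) {
                        ok = 0;
                        break;
                    }
                }
                if (!ok) {
                    if (failures < 10)  /* report only the first few */
                        fprintf(stderr, "FAIL len=%zu src+%zu dst+%zu\n",
                                len, soff, doff);
                    failures++;
                }
            }
    return failures;
}
```

Call `run_copy_test(your_copy, 128)` from a test main; 0 means every combination passed. The buffers are static, so their alignment is whatever the compiler picks. If the routine cares about absolute alignment, align the buffers explicitly.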
The next thing to ask is: what did I learn from those bugs? What I learned is to accept NO CODE as bug-free, no matter the source, no matter how authoritative the base it came from.
The other lesson: why oh why don't CPU designers put a damned memory-copy instruction into the machine? We all need it, all the time, for every project, and we all hack something together that works until it doesn't. Sigh.