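| CPU | Cycles per bus access | Transfer width | Theoretical minimum cycles per byte for block move |
|---|---|---|---|
| Z80 instruction fetch | 4 | byte | |
| Z80 data read/write | 3 | byte | 6 |
| 80(1)88, V20 | 4 | byte | 8 |
| 80(1)86, V30 | 4 | byte/word | 4 |
| 80286, 80386 SX | 2 | byte/word | 2 |
| 80386 DX | 2 | byte/word/dword | 1 |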
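As a rough sketch of where the last column comes from: assume every unit moved costs exactly one read access and one write access, always at the widest transfer the bus allows, with no instruction fetches and no wait states. The function name and the dictionary below are illustrative only; the cycle counts and widths are the ones listed in the table.

```python
def min_cycles_per_byte(cycles_per_access, bytes_per_access):
    """Lower bound for a block move: one read plus one write per unit."""
    return 2 * cycles_per_access / bytes_per_access

# (cycles per bus access, widest transfer in bytes), as listed in the table
BUSES = {
    "Z80 data read/write": (3, 1),
    "80(1)88, V20": (4, 1),
    "80(1)86, V30": (4, 2),
    "80286, 80386 SX": (2, 2),
    "80386 DX": (2, 4),
}

for cpu, (cycles, width) in BUSES.items():
    print(f"{cpu}: {min_cycles_per_byte(cycles, width):g} cycles per byte")
```

This is only a bound: it ignores alignment, so a word or dword transfer to an odd address would cost more than the model says.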
LDIR (etc.) are 2 bytes long, and the Z80 fetches both opcode bytes again on every iteration, so that's 8 extra clocks per iteration. Updating the address and count registers also had some overhead. The microcode loop used by the 8086/8088 had overhead as well; this was improved in the following generations. After that the block move instructions became somewhat neglected, since compilers and runtime libraries preferred to use sequences of vector instructions instead.
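To put numbers on the Z80 case, here is a small accounting sketch. The per-machine-cycle breakdown (4, 4, 3, 5, 5 T-states for a repeating iteration, 16 T-states for the final one) is the commonly published LDIR timing and is an assumption brought in here, not something stated above; the names are made up for illustration.

```python
# T-states per machine cycle of one repeating LDIR iteration.
# Assumed from the commonly published Z80 timing tables.
LDIR_REPEAT = {
    "fetch ED prefix": 4,
    "fetch B0 opcode": 4,
    "read (HL)": 3,
    "write (DE), update HL/DE/BC": 5,
    "rewind PC to repeat": 5,
}
LDIR_LAST = 16  # the final iteration skips the 5 T-state PC rewind

BUS_MINIMUM = 3 + 3   # one data read plus one data write
OPCODE_FETCH = 4 + 4  # both opcode bytes fetched again each iteration


def ldir_t_states(n_bytes):
    """Total T-states for LDIR copying n_bytes (1..65536), no wait states."""
    if not 1 <= n_bytes <= 65536:
        raise ValueError("LDIR moves between 1 and 65536 bytes")
    per_iteration = sum(LDIR_REPEAT.values())
    return per_iteration * (n_bytes - 1) + LDIR_LAST


per_iteration = sum(LDIR_REPEAT.values())
other_overhead = per_iteration - BUS_MINIMUM - OPCODE_FETCH
print(per_iteration, BUS_MINIMUM, OPCODE_FETCH, other_overhead)
print(ldir_t_states(256))
```

Under these assumptions a repeating iteration is 21 T-states: 6 for the data accesses, 8 for the opcode fetches, and 7 for the register updates and the repeat itself.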
And with modern processors there are a lot of complications due to cache lines and paging, so there's always some unavoidable overhead at the start to align everything properly, even if the transfer rate after that is close to optimal.
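A minimal sketch of that start-up cost, assuming a 64-byte cache line and a copy routine that first aligns the destination, then moves whole lines, then mops up the remainder. The function name, the line size, and the choice to align on the destination are assumptions for illustration, not a description of any particular library.

```python
def split_copy(dst_addr, length, line=64):
    """Split a copy into (head, body, tail) byte counts.

    head: bytes copied one by one until dst_addr reaches a line boundary
    body: bytes copied as whole aligned lines (the near-optimal part)
    tail: leftover bytes after the last whole line
    """
    head = min(length, (-dst_addr) % line)
    body = (length - head) // line * line
    tail = length - head - body
    return head, body, tail


print(split_copy(0x1003, 1000))  # unaligned start
print(split_copy(0x1000, 128))   # already aligned, whole lines only
print(split_copy(5, 10))         # too short to ever reach a boundary
```

For short copies the head and tail dominate, which is why the fixed overhead matters even when the body runs at close to the bus limit.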