Also, that "awful" 1MB memcpy is likely sitting entirely in L3 cache these days. But even if it weren't in cache, we're talking about an operation that takes roughly 50 microseconds (1MB read + 1MB written == 25 microseconds + 25 microseconds, which assumes something on the order of 40GB/s of memory bandwidth).
Given that modern CPUs have 16MB or more of L3 cache, and some mainstream desktop CPUs have 1MB of L2 cache... it's very possible that this memcpy is far faster in practice than you'd imagine.
1MB is big, it's a million bytes. But CPUs work on the scale of billions of operations per second, so 1MB is actually rather small by modern standards. It's surprisingly difficult to get intuition correct these days...