Good question! There are two separate issues with putting the GPU in the same package as the CPU. One is the memcpy bandwidth issue, which unified memory does indeed entirely mitigate (assuming the app is smart enough to exploit it). But the round-trip times seem to be dominated by context switches rather than data movement. I have an M1 Max here, and just measured ~200µs for a very simple dispatch (just clearing 16k of memory).
I personally believe it may be possible to reduce latency using techniques similar to io_uring, but it may not be simple. A likely major reason for the round trips is that a trusted process (part of the GPU driver) must validate inputs from untrusted user code before the work is presented to the GPU hardware.
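To make the io_uring analogy concrete, here is a toy sketch (not any real driver API; all names are made up) of the shape such a design could take: untrusted code appends commands to a submission ring without a syscall per dispatch, and a trusted validator drains the ring in batches, bounds-checking each command before anything reaches the hardware. The latency win would come from amortizing the context switch over a batch instead of paying it per dispatch.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Dispatch:
    """Hypothetical GPU command: touch `length` bytes of `buffer_id` at `offset`."""
    buffer_id: int
    offset: int
    length: int

class SubmissionRing:
    """Stand-in for a shared-memory ring; user code writes here syscall-free."""
    def __init__(self):
        self.ring = deque()

    def submit(self, cmd: Dispatch):
        self.ring.append(cmd)  # no context switch on the submit path

class TrustedValidator:
    """Driver-side component; the only party allowed to talk to hardware."""
    def __init__(self, buffer_sizes):
        self.buffer_sizes = buffer_sizes  # buffer_id -> allocated size

    def drain(self, ring: SubmissionRing):
        accepted = []
        while ring.ring:
            cmd = ring.ring.popleft()
            size = self.buffer_sizes.get(cmd.buffer_id)
            # Reject out-of-range accesses from untrusted user code.
            if size is not None and 0 <= cmd.offset and cmd.offset + cmd.length <= size:
                accepted.append(cmd)
        return accepted  # would be handed to the GPU as one batch

ring = SubmissionRing()
ring.submit(Dispatch(buffer_id=1, offset=0, length=16 * 1024))       # in bounds
ring.submit(Dispatch(buffer_id=1, offset=60_000, length=16 * 1024))  # overflows
validator = TrustedValidator({1: 64 * 1024})
ok = validator.drain(ring)
print(len(ok))  # → 1 (the overflowing dispatch is rejected)
```

The hard part on real hardware is doing that validation without the per-dispatch kernel round trip, which is exactly where the "may not be simple" caveat lives.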