Absolutely agree. Cache lines should be packed with data that is used together, and streaming (sequential) memory access patterns should be favored.
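To make the cache line packing point concrete, here's a rough sketch of the usual array-of-structs vs. struct-of-arrays trade-off. The `particle_aos` / `particles_soa` types and their fields are made up for illustration, and the 32-byte size assumes 4-byte `float` and `int`:

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structs (hypothetical layout): 32 bytes per particle on
 * typical platforms. A pass that only reads x uses 4 of every 32 bytes
 * fetched, so most of each cache line is wasted. */
struct particle_aos {
    float x;        /* hot: read on every pass */
    float y, z;     /* cold for this pass */
    float mass;
    int   id;
    float pad[3];
};

/* Struct-of-arrays (hypothetical layout): each field is contiguous, so a
 * pass over x streams through memory and uses every byte it fetches. */
struct particles_soa {
    float *x;
    float *mass;
};

/* Strided access: touches one float per 32-byte element. */
float sum_x_aos(const struct particle_aos *p, size_t n)
{
    assert(p != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += p[i].x;
    return sum;
}

/* Streaming access: consecutive floats, friendly to the hardware
 * prefetcher and to auto-vectorization. */
float sum_x_soa(const struct particles_soa *p, size_t n)
{
    assert(p != NULL);
    assert(p->x != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += p->x[i];
    return sum;
}
```

Both functions compute the same thing. The only difference is how much of each fetched cache line is actually useful, which is what matters once the loop is memory bound.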
> CPU memory movement is still subpar compared to GPUs. AVX512 finally implements "scatter" operations, but GPUs have had highly-optimized "gather-scatter"...
Well, it goes both ways. CPU gather/scatter may be slow, but GPU memory access latency is astronomically high: we're talking hundreds of nanoseconds, approaching microseconds under load. Of course, GPUs mask that latency with a ton of hardware threads. CPUs are the memory access latency kings by far. GPUs do have amazing memory controllers for gather/scatter access patterns, as long as the high latency is acceptable.
> Intel really needs to write more instructions like "pshufb" to handle more ways for register-to-register movement.
Yeah, it'd be useful, but it's not so critical when you're memory bound anyway. I often find myself with a lot of "free computation slots" for data shuffling while the CPU is stalled waiting on memory.