> > Intel CPUs do NOT have this feature, outside of vpshufb
> Yes they have.
No they don't, but it's very tricky to see why. You're blinded by vpshufb, and can't see how it can fail to solve some problems.
Let's take stream compaction as an example problem.
http://www.cse.chalmers.se/%7Euffe/streamcompaction.pdf
Stream compaction can be used to remove redundant whitespace from strings, or to "compress" a raytracer's rays so that they can all be read by a simple load (as opposed to a gather/scatter operation). How do you perform stream compaction?
Well, it's a straightforward scatter operation: https://i.imgur.com/aIoO8dm.png
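To make the "it's just a scatter" point concrete, here is a minimal scalar sketch in Python (not SIMD code, and not taken from the linked paper). The function name `compact_scatter` and the `keep` flag list are made up for illustration. Each surviving element gets its destination slot from an exclusive prefix sum over the keep flags, then writes itself there.

```python
def compact_scatter(values, keep):
    """Stream compaction as an exclusive prefix sum followed by a scatter.

    `values` and `keep` are equal-length lists; `keep[i]` says whether
    element i survives. Both names are illustrative assumptions.
    """
    # Exclusive prefix sum of the keep flags: dest[i] is the number of
    # surviving elements before position i, i.e. element i's output slot.
    dest = []
    count = 0
    for k in keep:
        dest.append(count)
        if k:
            count += 1

    out = [None] * count
    # Scatter step: every surviving "lane" i writes to out[dest[i]].
    for i, v in enumerate(values):
        if keep[i]:
            out[dest[i]] = v
    return out


# Simplified whitespace example: drop every space (not just redundant ones).
text = list("a  b c")
flags = [ch != " " for ch in text]
print("".join(compact_scatter(text, flags)))  # prints "abc"
```

The important part is the last loop: the index vector tells each input where to go, not each output where to read from.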
Now, how do you do this using vpshufb? You can't. vpshufb is backwards, and won't efficiently solve this stream compaction problem. GPUs can solve this problem very efficiently, but Intel's current implementation of AVX512 is missing the "backwards" vpshufb instruction needed to perform this operation.
Or as I've been trying to say: vpshufb is equivalent to a "gather" over SIMD Registers. But Intel is MISSING a scatter over SIMD Registers. The instruction is just... not there. I've looked for it, and it doesn't exist. As such, GPU-code (such as the stream compaction algorithm) CANNOT be implemented efficiently on a CPU.
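A rough model of the gather/scatter distinction, as a Python sketch. This is a simplification: real vpshufb works on bytes within 128-bit lanes and has a zeroing bit, none of which is modeled here. `gather`, `scatter`, and `invert` are hypothetical helper names, and `scatter` stands in for the "backwards vpshufb" that the text says does not exist.

```python
def gather(src, idx):
    # vpshufb-style permute: output lane i READS from src[idx[i]].
    return [src[j] for j in idx]


def scatter(src, idx, width):
    # Hypothetical "backwards vpshufb": input lane i WRITES to out[idx[i]].
    out = [0] * width
    for i, j in enumerate(idx):
        out[j] = src[i]
    return out


def invert(idx):
    # Turning a scatter index into a gather index is itself a scatter,
    # which is why a gather-only instruction set does not get it for free.
    inv = [0] * len(idx)
    for i, j in enumerate(idx):
        inv[j] = i
    return inv


src = [10, 20, 30, 40]
p = [2, 0, 3, 1]
print(scatter(src, p, 4))      # prints [20, 40, 10, 30]
print(gather(src, p))          # prints [30, 10, 40, 20]
print(gather(src, invert(p)))  # prints [20, 40, 10, 30]
```

So a gather can reproduce a scatter only after the index vector has been inverted, and that inversion is the step with no register-to-register equivalent in this model.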
I mean, you can use vpscatterdd to implement it, but again... vpscatterdd is 17 clock cycles. That's way too slow.
> Quite expected because RAM in general is much slower than registers. Even L1 cache on CPU / groupshared on GPU. On both CPUs and GPUs, you want to do as much work as possible with the data in registers.
The L1 cache has the bandwidth to do it, it just isn't wired up correctly for this mechanism. Intel's load/store units can read/write 32 bytes at a time, in parallel across 2x load units + 1x store unit.
But the thing is: reading or writing many small 4-byte "chunks" here and there (i.e. a vpgatherdd / vpscatterdd operation) does not touch one contiguous group of 32 bytes. Therefore, Intel's cores lose a LOT of bandwidth in this case.
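As a toy illustration of the bandwidth loss (a counting model only, not a description of how Intel's load/store pipeline actually schedules gathers), the sketch below counts how many distinct 32-byte blocks a set of 4-byte accesses touches. The function name `blocks_touched` and the example offsets are assumptions made for illustration. The 32-byte and 4-byte sizes come from the text above.

```python
def blocks_touched(byte_offsets, access_bytes=4, block_bytes=32):
    """Count distinct aligned blocks covered by a set of small accesses."""
    blocks = set()
    for off in byte_offsets:
        first = off // block_bytes
        last = (off + access_bytes - 1) // block_bytes
        # An access may straddle a block boundary, so add every block it covers.
        blocks.update(range(first, last + 1))
    return len(blocks)


contiguous = [4 * i for i in range(8)]   # eight dwords back to back
scattered = [100 * i for i in range(8)]  # eight dwords, 100 bytes apart
print(blocks_touched(contiguous))  # prints 1
print(blocks_touched(scattered))   # prints 8
```

In this model, the same eight dwords cost one 32-byte access when contiguous and eight when scattered, even though the number of useful bytes is identical.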
GPUs, on the other hand, have a great many load/store units. Effectively one per SIMD unit. As such, reads / writes to LDS "shared" memory on AMD GPUs can be done 32 at a time.
So the equivalent "vpgatherdd" over LDS cache will execute in something like 2 clock ticks on AMD GPUs (assuming no bank conflicts), while it'd take 9 clock ticks on Intel cores.
Again, LDS cache is so fast on GPUs that any LDS load/store is effectively as fast as a vpshufb instruction. (Not quite: vpshufb is 1 clock tick, and doesn't have to worry about bank conflicts. So vpshufb is still faster... but GPUs' gather/scatter capabilities are downright incredible.)
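For anyone unfamiliar with the "no bank conflicts" caveat, here is a simplified Python model of it. The assumptions are mine, not taken from any vendor documentation quoted here: 32 banks (matching the "32 at a time" figure above), each 4 bytes wide, with lanes that hit the same word counted as a single access. `lds_passes` is a hypothetical name, and the result is a count of serialized passes, not a clock-cycle figure.

```python
from collections import defaultdict


def lds_passes(byte_addrs, num_banks=32, bank_bytes=4):
    """Worst-case number of distinct words requested from any one bank."""
    words_per_bank = defaultdict(set)
    for addr in byte_addrs:
        word = addr // bank_bytes
        # Consecutive words map to consecutive banks, wrapping around.
        words_per_bank[word % num_banks].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


conflict_free = [4 * lane for lane in range(32)]   # one word per bank
worst_case = [128 * lane for lane in range(32)]    # every lane hits bank 0
print(lds_passes(conflict_free))  # prints 1
print(lds_passes(worst_case))     # prints 32
```

Under these assumptions, an arbitrary gather or scatter runs at full rate as long as the lanes spread across banks, which is the property the register-permute comparison is leaning on.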
How long do you think it will be before Intel implements a true crossbar so that the vpgatherdd and vpscatterdd instructions can actually execute quickly on a CPU?
-----------
GPUs actually implement a richer language for data movement. Intel could very easily fix this problem by adding a "backwards vpshufb" instruction, but I'm not aware of anything that exists like that... or any plans to implement something like that.