GPUs have a crossbar between OpenCL __local memory and every work-item in a workgroup. It's so innate to the GPU that it's almost implicit. In terms of OpenCL, the code looks like this:
__local uint gatherFoo[64];
gatherFoo[get_local_id(0)] = fooBar();
barrier(CLK_LOCAL_MEM_FENCE); // make every work-item's write visible before reading
myFoo = gatherFoo[generate_index()];
The above is roughly equivalent to vpshufb, where generate_index() is the parameter to vpshufb.
__local uint scatterFoo[64];
scatterFoo[generate_index()] = fooBar();
barrier(CLK_LOCAL_MEM_FENCE); // make every work-item's write visible before reading
myFoo = scatterFoo[get_local_id(0)];
The above is the equivalent of a "backwards vpshufb". AVX512 is missing this equivalent.
GPUs don't really "need" a dedicated instruction to do this, because __local memory is a full-speed crossbar when workgroups are of size 32 (NVIDIA) or 64 (AMD), the native SIMD size. Its performance characteristics are not quite equivalent to vpshufb, but it's still very, very, very fast in practice.
> __shfl_sync() in CUDA does exactly the same as _mm256_permutevar8x32_ps() in AVX2 or _mm512_permutexvar_ps in AVX512.
There's little reason to use __shfl_sync(), because it goes through CUDA Shared memory anyway. Ditto with AMD's __amdgcn_ds_permute() and __amdgcn_ds_bpermute() intrinsics.
EDIT: I guess __shfl_sync() and the __amdgcn_ds_permute / bpermute instructions save a step. They compile to smaller, more concise assembly. But I expect the overall performance to not be much different from using LDS / Shared Memory explicitly.
------------------
> I’m just not sure it’s worth it. People who are OK with GPU programming model are already using GPUs because way more powerful. AVX-512 theoretical max is 64 FLOPs/cycle, a modern $600 CPU i7-7820X with good enough cooling is capable of 1.8 TFlops single precision. A generation old $600 GPU 1080Ti is capable of 10.6 TFLops. Huge difference.
People don't program CPUs for FLOPs. They program on CPUs for minimum latency.
SIMD compute is useful on a CPU because it stays in L1 cache. L1 cache is 64kB, more than enough to have a good SIMD processor accelerate some movement. CPUs even have full bandwidth to L2 cache, which is huge these days (512kB on EPYC to 1MB on Skylake-server).
CPU-based SIMD won't ever be as big or as broad as GPU-based SIMD. But... CPU-based SIMD should become easier as Intel figures out how to adopt OpenCL or CUDA programming paradigms.
There are already many problems implemented in CPU-AVX512 which execute faster than the 15.8GB/s that a PCIe 3.0 x16 bus will give you. Therefore, it's more efficient to execute the whole problem on the CPU, rather than transfer the data to the GPU.