Like total latency: there's no point in offloading something you can finish processing on the CPU faster than you can transfer it to the GPU and back.
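A back-of-the-envelope version of that break-even check could look like the sketch below. The function name, its parameters and the numbers in the usage note are made up for illustration; they are estimates you plug in, not measurements.

```python
def offload_pays_off(bytes_in, bytes_out, cpu_seconds, gpu_seconds,
                     link_bytes_per_s, launch_overhead_s=0.0):
    """Return True if transfer + GPU compute beats staying on the CPU.

    Every input is a rough estimate supplied by the caller (hypothetical
    helper, not part of any library).
    """
    # Time spent copying the input to the GPU and the result back.
    transfer_s = (bytes_in + bytes_out) / link_bytes_per_s
    gpu_total_s = transfer_s + gpu_seconds + launch_overhead_s
    return gpu_total_s < cpu_seconds
```

With 64 MiB each way over an assumed 12 GB/s link, the copies alone take about 11 ms, so a job the CPU finishes in 8 ms is not worth offloading even if the GPU kernel itself takes only 1 ms.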
Some systems just don't have GPUs, and there's nothing you can do about it.
Sometimes CPUs are simply much faster due to a branchy serial algorithm. However, you might still be able to use SIMD to get some speedup.
Sometimes I end up going single threaded SIMD, if the whole system is memory bandwidth limited anyways. Work stealing queues can also be great. Thread per CPU core pulling work from a common pool. You might be able to do some rough data locality based scheduling to reuse cache hierarchy contents.
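To make the "threads pulling work from a common pool" idea concrete, here is a minimal sketch. It uses one shared queue rather than true work stealing (which would use per-thread deques), and `run_pool`, `chunks` and `process` are names invented for this example. Queuing the chunks in ascending order is only a crude stand-in for locality-aware scheduling.

```python
import queue
import threading


def run_pool(chunks, process, num_workers=4):
    """Process chunks with worker threads pulling from one shared queue.

    `chunks` is a list of (start, end) index ranges. They are queued in
    ascending order so neighbouring ranges tend to be handled close
    together in time.
    """
    work = queue.Queue()
    for chunk in sorted(chunks):
        work.put(chunk)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                chunk = work.get_nowait()
            except queue.Empty:
                return  # pool drained, this worker is done
            value = process(chunk)
            with lock:  # protect the shared results dict
                results[chunk] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

`run_pool(chunks, fn)` returns a dict keyed by chunk. In standard CPython the GIL keeps pure-Python work from running in parallel, so this only shows the structure; in C++ or Rust the same shape would use one thread per core.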
Overall, I feel the biggest challenges often come from cache and memory bandwidth management. CPUs are fast, but SDRAM is not. You don't want different threads fighting for CPU socket local resources and even less for global ones. I usually do rough estimates of required bandwidth and computation, write some prototypes and do a lot of profiling, including taking a good hard look at the CPU counters.
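The "rough estimates of required bandwidth and computation" step can be as simple as the sketch below: compare the time to move the data with the time to compute on it. `estimate_bound` is a hypothetical helper, and the peak figures are whatever you assume for your machine.

```python
def estimate_bound(bytes_moved, flops, peak_bytes_per_s, peak_flops_per_s):
    """Compare the time to move the data with the time to compute on it.

    Returns (memory_seconds, compute_seconds, limiting_resource).
    Real kernels rarely reach peak numbers, so treat both times as
    optimistic lower bounds.
    """
    memory_s = bytes_moved / peak_bytes_per_s
    compute_s = flops / peak_flops_per_s
    limit = "memory" if memory_s > compute_s else "compute"
    return memory_s, compute_s, limit
```

For example, summing a billion 32-bit floats moves about 4 GB and does about 1 GFLOP of work. With an assumed 40 GB/s of memory bandwidth and 100 GFLOP/s of compute, the estimate says the loop is memory bound by roughly 10x, which is the point where profiling and CPU counters take over from guessing.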
I'm not trying to say anything in particular, except that the solution space has options and there are no silver bullets. The solutions you suggested can also be great.
---------
> Overall, I feel the biggest challenges often come from cache and memory bandwidth management. CPUs are fast, but SDRAM is not. You don't want different threads fighting for CPU socket local resources and even less for global ones. I usually do rough estimates of required bandwidth and computation, write some prototypes and do a lot of profiling, including taking a good hard look at the CPU counters.
I think memory-layout is the #1 issue these days. CPUs / GPUs have so much compute available that it's almost impossible to actually achieve high utilization. In most cases, you're sitting around just waiting for memory...
CPU memory movement is still subpar compared to GPUs. AVX512 finally implements "scatter" operations, but GPUs have had highly-optimized "gather-scatter" to __local or __shared__ memory for years (e.g. GPUs have 32 banks and 32 load/store units per GPU compute unit or NVIDIA SM: that's either 1/2 or 1 load/store unit per GPU shader. AVX512 Skylake, however, has 3 load/store units across 16 SIMD lanes...)
Intel really needs to write more instructions like "pshufb" to handle more ways for register-to-register movement. It seems like a lot of data-movement in the AVX world is still best handled by AVX -> L1 cache -> back into AVX register (which is limited by the very few load/store units in modern CPUs).
Yeah, you can cheat a lot of cases through pshufb, but that instruction doesn't always work. There's something to be said for the brute-force option of sticking 32 load/store units on each GPU compute unit for all the threads to leverage.
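For readers who haven't used it, here is a small pure-Python model of what the 128-bit form of pshufb does. This is only an illustration of the semantics, not how you would call it in practice (from C you would normally go through the `_mm_shuffle_epi8` intrinsic).

```python
def pshufb(src, mask):
    """Model of the 128-bit PSHUFB byte shuffle on 16-byte sequences."""
    if len(src) != 16 or len(mask) != 16:
        raise ValueError("both operands must be 16 bytes")
    out = bytearray(16)
    for i, m in enumerate(mask):
        # A control byte with its high bit set zeroes the output byte;
        # otherwise its low 4 bits select one of the 16 source bytes.
        out[i] = 0 if m & 0x80 else src[m & 0x0F]
    return bytes(out)
```

A mask of `bytes(range(15, -1, -1))` reverses the register, for instance. The model also shows one reason the instruction doesn't cover every case: each output byte can only come from the same 16-byte source, so anything that has to cross that boundary needs extra instructions or a round trip through memory.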
Absolutely agree. Cache lines should be packed with data that is useful together. Memory streaming access patterns should be favored.
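A quick way to see why packing matters is to count the cache lines a loop touches. The sketch below is a toy model that assumes 64-byte lines and an aligned start address; `cache_lines_touched` is a name invented for this example.

```python
def cache_lines_touched(count, stride_bytes, field_bytes, line_bytes=64):
    """Count distinct cache lines read when visiting one field of `count`
    records laid out `stride_bytes` apart, starting at an aligned address.
    """
    lines = set()
    for i in range(count):
        first = i * stride_bytes          # first byte of the field
        last = first + field_bytes - 1    # last byte of the field
        lines.update(range(first // line_bytes, last // line_bytes + 1))
    return len(lines)
```

Reading one 4-byte field out of 1024 records that are 64 bytes each (array of structs) touches 1024 lines. Storing that field contiguously (struct of arrays, stride 4) touches 64 lines for the same data, which is also the streaming access pattern that prefetchers like.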
> CPU memory movement is still subpar compared to GPUs. AVX512 finally implements "scatter" operations, but GPUs have had highly-optimized "gather-scatter"...
Well, it goes both ways. CPU gather/scatter may be slow, but GPU memory access latency is astronomically high — talking about microseconds. Of course GPUs mask the latency with a ton of hardware threads. CPUs are memory access latency kings by far. GPUs do have amazing memory controllers when you have gather/scatter access patterns, as long as high latency is acceptable.
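The latency-masking trade-off follows from Little's law: the amount of data that must be in flight equals latency times throughput. A small sketch, with hypothetical function names and illustrative numbers rather than measured ones:

```python
def bytes_in_flight(latency_s, bandwidth_bytes_per_s):
    """Little's law: outstanding data = latency * throughput."""
    return latency_s * bandwidth_bytes_per_s


def requests_in_flight(latency_s, bandwidth_bytes_per_s, request_bytes=64):
    """Number of request_bytes-sized accesses that must be outstanding
    at once to sustain the given bandwidth."""
    return bytes_in_flight(latency_s, bandwidth_bytes_per_s) / request_bytes
```

With an assumed 100 ns latency and 50 GB/s, a CPU needs on the order of 78 cache-line requests outstanding to saturate memory. Taking the microsecond figure above at face value with an assumed 500 GB/s, a GPU needs several thousand, which is why it leans on so many hardware threads.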
> Intel really needs to write more instructions like "pshufb" to handle more ways for register-to-register movement.
Yeah, it'd be useful, but not so critical when you're memory bound anyways. I often find myself having a lot of "free computation slots" for data shuffling while the CPU is waiting for memory, in other words during memory stalls.