
dragontamer 7y ago

> Can they? AFAIK they can only permute values between lanes but not scatter/gather.

GPUs have a crossbar between OpenCL __local memory (CUDA calls it __shared__ memory) and every work-item in a workgroup. It's so innate to the GPU that it's almost implicit. In terms of OpenCL, the code looks like this:

    __local uint gatherFoo[64];
    gatherFoo[get_local_id(0)] = fooBar();
    barrier(CLK_LOCAL_MEM_FENCE);  // every lane's write must land before any lane reads
    myFoo = gatherFoo[generate_index()];

The above is roughly equivalent to vpshufb, where generate_index() is the parameter to vpshufb.

    __local uint scatterFoo[64];
    scatterFoo[generate_index()] = fooBar();
    barrier(CLK_LOCAL_MEM_FENCE);  // every lane's write must land before any lane reads
    myFoo = scatterFoo[get_local_id(0)];

The above is the equivalent of a "backwards vpshufb". AVX512 has no register-to-register equivalent of this.
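To make the difference between the two snippets concrete, here is a minimal scalar sketch in plain C. It is only a model of the lane semantics, not GPU or AVX code, and the names `lane_gather` and `lane_scatter` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather: lane i READS from lane idx[i] (the vpshufb-style direction). */
void lane_gather(uint32_t *dst, const uint32_t *src,
                 const uint32_t *idx, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++) {
        assert(idx[i] < lanes);
        dst[i] = src[idx[i]];
    }
}

/* Scatter: lane i WRITES to lane idx[i] (the "backwards vpshufb" direction).
   If two lanes target the same slot, the later lane wins in this model;
   slots that no lane targets are left untouched. */
void lane_scatter(uint32_t *dst, const uint32_t *src,
                  const uint32_t *idx, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++) {
        assert(idx[i] < lanes);
        dst[idx[i]] = src[i];
    }
}
```

When `idx` is a permutation, scattering the result of a gather with the same `idx` restores the original data, which is why the two are described as forwards and backwards versions of the same shuffle.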

GPUs don't really "need" a dedicated instruction to do this, because __local memory is a full-speed crossbar when workgroups are of size 32 (NVidia) or 64 (AMD), the native SIMD size. Its performance characteristics are not quite equivalent to vpshufb, but it's still very, very, very fast in practice.

> __shfl_sync() in CUDA does exactly the same as _mm256_permutevar8x32_ps() in AVX2 or _mm512_permutexvar_ps in AVX512.

There's little reason to use __shfl_sync(), because it goes through CUDA Shared memory anyway. Ditto with AMD's __amdgcn_ds_permute() and __amdgcn_ds_bpermute() intrinsics.

EDIT: I guess __shfl_sync() and the __amdgcn_ds_permute / bpermute instructions save a step. They compile to smaller, more concise assembly. But I expect the overall performance to not be much different from using LDS / Shared Memory explicitly.

------------------

> I’m just not sure it’s worth it. People who are OK with GPU programming model are already using GPUs because way more powerful. AVX-512 theoretical max is 64 FLOPs/cycle, a modern $600 CPU i7-7820X with good enough cooling is capable of 1.8 TFlops single precision. A generation old $600 GPU 1080Ti is capable of 10.6 TFLops. Huge difference.

People don't program CPUs for FLOPs. They program on CPUs for minimum latency.

SIMD compute is useful on a CPU because the data stays in L1 cache. L1 cache is 64kB, more than enough for a good SIMD unit to accelerate some data movement. CPUs even have full bandwidth to L2 cache, which is huge these days (512kB on EPYC to 1MB on Skylake-server).

CPU-based SIMD won't ever be as big or as broad as GPU-based SIMD. But... CPU-based SIMD should become easier as Intel figures out how to adopt OpenCL or CUDA programming paradigms.

There are already many problems implemented in CPU-AVX512 which execute faster than the 15.8GB/s that a PCIe x16 bus will give you. Therefore, it's more efficient to execute the whole problem on the CPU, rather than transfer the data to the GPU.
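The break-even argument can be written down as a small sketch. This is a simplified model with assumptions that are not in the comment above: transfer and GPU compute do not overlap, copying results back is ignored, and `offload_pays_off` is a hypothetical helper name.

```c
#include <assert.h>

/* Returns 1 if shipping the data over PCIe and computing on the GPU is
   faster than computing in place on the CPU, 0 otherwise.
   All rates are GB/s of input consumed. Assumes transfer and GPU compute
   run back to back, and ignores the cost of returning results. */
int offload_pays_off(double cpu_gbps, double gpu_gbps, double pcie_gbps)
{
    assert(cpu_gbps > 0.0 && gpu_gbps > 0.0 && pcie_gbps > 0.0);
    double cpu_sec_per_gb = 1.0 / cpu_gbps;
    double gpu_sec_per_gb = 1.0 / pcie_gbps + 1.0 / gpu_gbps;
    return gpu_sec_per_gb < cpu_sec_per_gb;
}
```

With a 15.8 GB/s bus, a CPU kernel that already streams 20 GB/s can never lose to the GPU in this model, no matter how large `gpu_gbps` is, because the transfer term alone exceeds the CPU time.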


Const-me 7y ago

> There's little reason to use __shfl_sync(), because it goes through CUDA Shared memory anyway.

NVidia says the opposite is true. Here's a link: https://devblogs.nvidia.com/using-cuda-warp-level-primitives...

The data exchange is performed between registers, and more efficient than going through shared memory, which requires a load, a store and an extra register to hold the address.

If you count shared memory scatter/gather, CPU SIMD already has both. Scatter is very recent; it only appeared in AVX512. Gather has been available for 5 years now: _mm256_i32gather_ps was introduced in AVX2, although it's not particularly fast.

> They program on CPUs for minimum latency.

Not just that. I code for CPU SIMD very often, and only occasionally for GPGPU, even for code that would work very well on GPUs. The main reason for me is compatibility. I mostly work on desktop software, and picking CUDA decreases the userbase by a factor of 2, which is often not an option. But yeah, another reason is that CPU SIMD is fast enough already and spending time on PCIe I/O doesn't pay off.

Update: another reason why I don't code GPGPU more is the different programming model. The GPU programming model makes writing device code easy, and like you mentioned earlier it even has good scalability built in, i.e. in many cases compute shaders need not be aware of the warp size.

But the downside is upfront engineering costs.

I have to keep my data in a very small number of contiguous buffers. I have to upload these buffers to the GPU. I have to know in advance how much VRAM I need for output data.

I find this part much easier for CPU SIMD. On the CPU I only need to design the lowest level of my data structures accordingly, but I can use anything at all on the higher levels of the structures: hash maps, trees, linked graphs, they all work just fine, as long as their lower-level nodes are not too small and are aligned, dense, and composed of these 128/256-bit SIMD vectors.
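As an illustration of that layout, here is a minimal sketch in C11. The `Block` type and `sum_blocks` function are hypothetical, and a linked list stands in for whatever higher-level structure is actually used; only the leaf is shaped for SIMD.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical leaf node: 8 floats = 256 bits, aligned to 32 bytes so that
   a single aligned 256-bit load can cover the whole payload. */
typedef struct Block {
    alignas(32) float lanes[8];
    struct Block *next;   /* the higher level is an ordinary linked list */
} Block;

/* Walk the list. The inner loop runs over a dense, aligned block, which is
   the part a compiler or hand-written intrinsics can map onto SIMD. */
float sum_blocks(const Block *head)
{
    float acc[8] = {0};
    for (const Block *b = head; b != NULL; b = b->next)
        for (size_t i = 0; i < 8; i++)
            acc[i] += b->lanes[i];

    float total = 0.0f;
    for (size_t i = 0; i < 8; i++)
        total += acc[i];
    return total;
}
```

The pointer chasing between nodes stays scalar; the per-node work is where the vector width pays off, which is why the nodes should not be too small.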

dragontamer (OP) 7y ago

> If you count shared memory scatter/gather, CPU SIMD already has both. Scatter is very recent; it only appeared in AVX512. Gather has been available for 5 years now: _mm256_i32gather_ps was introduced in AVX2, although it's not particularly fast.

You can say that again. It's still not very fast btw.

https://www.agner.org/optimize/instruction_tables.pdf

VPSCATTERDD ZMM-register is measured to be 17-clock cycles (!!!), while VPGATHERDD ZMM-register is 9-clock cycles. Gather/scatter on CPUs is very, very slow!

It's actually faster to gather/scatter through a for-loop than to use the VPGATHERDD or VPSCATTERDD instructions.

In contrast, Shared Memory on GPUs is a full crossbar on NVidia and AMD.

-------------

I think my details are getting a bit AMD specific. Lemme do a citation:

https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

> As a previous post briefly described, GCN3 includes two new instructions: ds_permute_b32 and ds_bpermute_b32 . They use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location.

The important tidbit is that the Load/Store units of AMD's GPU core can support a gather/scatter to separate LDS memory banks at virtually no cost. This is a full crossbar that allows GPUs to move data between the lanes of a SIMD register.

Intel CPUs do NOT have this feature, outside of vpshufb. I'm arguing that GPU Shared memory is of similar performance to vpshufb (a little bit slower, but still way faster than even a CPU's Gather/Scatter).

So yes, the "bpermute" and "permute" instructions on AMD are a full-crossbar and can execute within 2-clock cycles (if there are no bank conflicts). That's 64-dwords that can be shuffled around in just 2-clock cycles.
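The "no bank conflicts" condition can be checked with a small scalar model. This is a sketch under assumptions that are not stated in the thread: 32 banks, each one dword wide, with consecutive dwords in consecutive banks. `max_bank_conflict` is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BANKS 32u  /* assumption: 32 banks, each one dword wide */

/* Returns the largest number of lanes whose dword index maps to the same
   bank. In this model, k lanes on one bank take k passes to service.
   Simplification: lanes that read the SAME address are still counted as
   conflicting, although real hardware may broadcast instead. */
unsigned max_bank_conflict(const uint32_t *dword_idx, size_t lanes)
{
    unsigned hits[NUM_BANKS] = {0};
    unsigned worst = 0;
    for (size_t i = 0; i < lanes; i++) {
        unsigned h = ++hits[dword_idx[i] % NUM_BANKS];
        if (h > worst)
            worst = h;
    }
    return worst;
}
```

Under these assumptions, a stride of 32 dwords is the worst case because every lane lands on bank 0, while any index pattern that spreads lanes evenly over the banks is serviced in the minimum number of passes.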

In contrast, Intel's Gather/scatter to L1 cache is 9-clocks or 17-clocks respectively.

-----------

The important thing here is to do a high-speed permute. The programmer can choose to use vpgatherdd, vpshufb, GPU permute, GPU bpermute, GPU LDS Memory, etc. etc. It doesn't matter for program correctness: it all does the same thing.

But GPUs have the highest-performance shuffle and permute operators. Even if you go through LDS Memory. In fact, the general permute operators of AMD GPUs just go through LDS memory, that's their underlying implementation!

Const-me 7y ago

> while VPGATHERDD ZMM-register is 9-clock cycles.

Quite expected because RAM in general is much slower than registers. Even L1 cache on CPU / groupshared on GPU. On both CPUs and GPUs, you want to do as much work as possible with the data in registers.

The reason why it’s OK on GPU is massive parallelism hiding latency, i.e. the hardware computes other threads instead of just waiting for data to arrive. But even on CPUs, if the requested data is not too far away (in L1 or L2), hyperthreading in most modern CPUs does an acceptable job in this situation.

> Intel CPUs do NOT have this feature, outside of vpshufb

Yes they have. If you prefer assembly, look at the vpermps instruction. It can permute lanes arbitrarily, even across 128-bit lanes (unlike vpshufb/vshufps/etc.), with the permutation indices taken from another vector register. It's quite fast: on Haswell/Broadwell/Skylake it’s 3 cycles latency, 1 cycle throughput, single micro-op.
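The difference between the two instructions is easiest to see in scalar reference models of their 256-bit forms. The sketch below is plain C, not intrinsics, and the `model_*` names are made up; it follows the documented behaviour of `_mm256_permutevar8x32_ps` and `_mm256_shuffle_epi8`. In both models `dst` must not alias `src`.

```c
#include <assert.h>
#include <stdint.h>

/* Model of 256-bit vpermps: every output dword may come from ANY of the
   8 input dwords, so data can cross the 128-bit boundary. */
void model_vpermps(float dst[8], const float src[8], const uint32_t idx[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = src[idx[i] & 7];   /* only the low 3 index bits are used */
}

/* Model of 256-bit vpshufb: bytes are shuffled independently inside each
   128-bit half, so a byte can never move from one half to the other. */
void model_vpshufb(uint8_t dst[32], const uint8_t src[32],
                   const uint8_t idx[32])
{
    for (int i = 0; i < 32; i++) {
        int half = i & 16;          /* 0 for the low half, 16 for the high half */
        /* a set top bit in the index zeroes the output byte */
        dst[i] = (idx[i] & 0x80) ? 0 : src[half + (idx[i] & 0x0F)];
    }
}
```

With an all-zero index vector, `model_vpermps` broadcasts dword 0 to all eight lanes, while `model_vpshufb` broadcasts byte 0 only within the low half and byte 16 within the high half.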
