Const-me · 7y ago

> Those are fundamental design tradeoffs of the GPU architecture. Not of PTX Assembly language.

PTX is not general purpose, it was specifically designed for that architecture, and it incorporates these tradeoffs.

> far more advanced than any SIMD I've seen implemented on a CPU.

Less advanced than normal non-SIMD branching available to CPU cores. A lot of practical algorithms need both SIMD compute and scalar branching.

> That's what the CudaMalloc function does.

cudaMalloc can't be called from GPU code. You can't allocate RAM at all from inside your algorithm: there's no stack, no heap, nothing. You have to know in advance how much RAM you need. For some practical problems this is a showstopper, e.g. try to implement unzip in CUDA.

> What do you mean by "limited write access"?

You only have fixed-length arrays/textures. Global memory barriers are very slow. There are producer-consumer buffers, but practically speaking they're merely thread-safe cursors over a statically sized buffer.
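To make the "thread-safe cursor over a statically sized buffer" point concrete, here is a minimal sketch in plain C11 rather than GPU code. The names `AppendBuffer`, `append_push`, `append_count` and the capacity of 8 are hypothetical and chosen only for illustration; the atomic counter stands in for the hardware counter a GPU append buffer would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define APPEND_CAPACITY 8  /* fixed up front; it cannot grow later */

typedef struct {
    atomic_size_t cursor;             /* next free slot, bumped atomically */
    uint32_t items[APPEND_CAPACITY];  /* statically sized storage */
} AppendBuffer;

/* Reserve one slot and store value; returns false once the buffer is full. */
bool append_push(AppendBuffer *buf, uint32_t value) {
    assert(buf != NULL);
    size_t slot = atomic_fetch_add(&buf->cursor, 1);
    if (slot >= APPEND_CAPACITY) {
        return false;  /* overflow: the producer has nowhere to put the item */
    }
    buf->items[slot] = value;
    return true;
}

/* Number of items actually stored (the cursor may overshoot the capacity). */
size_t append_count(AppendBuffer *buf) {
    assert(buf != NULL);
    size_t n = atomic_load(&buf->cursor);
    return n < APPEND_CAPACITY ? n : APPEND_CAPACITY;
}
```

Once the capacity is exhausted, every later push fails, which is the limitation being described: the size has to be guessed before the algorithm runs.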

GPUs can multiply dense matrices very fast, but many practical problems are different, e.g. for sparse matrices GPUs don't deliver much value performance-wise. SIMD on CPU is often very useful for such problems, but yes, the programming model is different: lower level and more complex. No free lunches.

> Programmers don't want to "think" in SIMD.

I'm a programmer and I like SIMD.

> even though vpshufb is effectively a gather instruction over an AVX register

On a GPU it would be, because you don't care about latency; you just spawn more threads.

On a CPU it's not, because shuffle_epi8 is a 1-cycle latency instruction and RAM access is much slower. If you think they're equivalent, you'll miss the performance difference.
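The distinction can be sketched with two scalar models in C. This is only an illustration of the semantics, not of the real timing: `shuffle_epi8_model` mimics the 128-bit byte shuffle (indices are confined to the 16 bytes already in the register), while `gather_u8_model` indexes a table of arbitrary size in memory. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 128-bit vpshufb / _mm_shuffle_epi8: every output byte is
   looked up inside the 16-byte source register, never in RAM. */
void shuffle_epi8_model(const uint8_t src[16], const uint8_t idx[16],
                        uint8_t out[16]) {
    assert(src != out);  /* this model does not support in-place use */
    for (size_t lane = 0; lane < 16; ++lane) {
        if (idx[lane] & 0x80) {
            out[lane] = 0;                      /* high bit set: lane is zeroed */
        } else {
            out[lane] = src[idx[lane] & 0x0F];  /* low 4 bits pick a source byte */
        }
    }
}

/* Scalar model of a memory gather: same indexing shape, but the table lives
   in RAM and can be arbitrarily large, so each lane may miss the cache. */
void gather_u8_model(const uint8_t *table, size_t table_len,
                     const uint32_t idx[16], uint8_t out[16]) {
    for (size_t lane = 0; lane < 16; ++lane) {
        assert(idx[lane] < table_len);
        out[lane] = table[idx[lane]];
    }
}
```

The two loops look almost identical, which is why the instructions are easy to conflate; the difference is entirely in where the source data lives.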

> CPUs are missing very, very few assembly instructions before they can run like a GPU

Even if you add these few instructions, GPUs will still be much faster, by orders of magnitude. The hardware is too different, and it's not about the instructions: CPUs spend transistors on minimizing latency (caches and their synchronization, branch prediction, speculative execution, etc.), while GPUs don't care about latency.

It's not the instruction set that allowed the simple programming model on GPUs. It's the fundamentally different tradeoffs in hardware.

dragontamer · 7y ago

I think I overcomplicated my previous post. Lemme cut back the cruft and simplify. I think CPUs (Specifically AVX512) should implement the following instructions:

1. Barriers and Workgroups -- Scale SIMD UP, not downwards. The variable-length vector (discussed in this article) is backwards to the current programming model. GPU Programmers combine lanes with the concept of an OpenCL workgroup or CUDA Thread Block, and it works pretty well in my experience.

2. Implement AMD GCN-style branching with S_CBRANCH_FORK and S_CBRANCH_JOIN. This will accelerate branching when SIMD-lanes diverge in execution paths.

3. Implement "backwards vpshufb". GPUs can gather or scatter values between lanes, while CPUs can only gather data between lanes (with vpshufb). Intel AVX512 is missing an obvious and very important instruction for high-speed communication between SIMD lanes.
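For item 2, a rough scalar model in C of how masked SIMD execution handles a divergent branch may help. This is a sketch of the general execution-mask idea only, not of the actual S_CBRANCH_FORK / S_CBRANCH_JOIN semantics; `masked_branch` and the 8-lane width are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Runs `x < 0 ? -x : 2 * x` over all lanes the way a masked SIMD machine
   would: each side of the branch runs at most once for the whole vector,
   with lanes switched off by an execution mask.
   Returns how many sides of the branch were actually executed. */
int masked_branch(const int32_t x[LANES], int32_t out[LANES]) {
    const uint32_t all = (1u << LANES) - 1u;
    uint32_t then_mask = 0;
    for (int lane = 0; lane < LANES; ++lane) {
        if (x[lane] < 0) {
            then_mask |= 1u << lane;  /* vector compare produces a lane mask */
        }
    }
    const uint32_t else_mask = all & ~then_mask;
    int passes = 0;

    if (then_mask != 0) {  /* skip this side entirely if no lane takes it */
        for (int lane = 0; lane < LANES; ++lane) {
            if ((then_mask >> lane) & 1u) out[lane] = -x[lane];
        }
        ++passes;
    }
    if (else_mask != 0) {  /* same for the other side */
        for (int lane = 0; lane < LANES; ++lane) {
            if ((else_mask >> lane) & 1u) out[lane] = 2 * x[lane];
        }
        ++passes;
    }
    assert(passes >= 1);
    return passes;
}
```

When all lanes agree, only one side runs; when they diverge, both sides run back to back, which is the cost that dedicated fork/join support is meant to manage.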

Const-me (OP) · 7y ago

I agree these changes would be nice. OpenCL and similar would probably work faster on CPUs with these instructions.

I’m just not sure it’s worth it. People who are OK with the GPU programming model are already using GPUs because they're way more powerful. The AVX-512 theoretical max is 64 FLOPs/cycle per core; a modern $600 CPU, the i7-7820X, with good enough cooling is capable of 1.8 TFLOPS single precision. A generation-old $600 GPU, the 1080 Ti, is capable of 10.6 TFLOPS. Huge difference.
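The 1.8 TFLOPS figure can be reproduced with simple arithmetic. The sketch below assumes 8 cores and a 3.6 GHz clock for the i7-7820X (neither number is stated in the comment) and takes the 64 FLOPs/cycle per core from the text; `peak_tflops` is a hypothetical helper name.

```c
#include <assert.h>

/* Peak throughput in TFLOPS:
   cores * FLOPs per cycle per core * clock in GHz gives GFLOPS,
   and dividing by 1000 converts that to TFLOPS. */
double peak_tflops(int cores, double flops_per_cycle, double clock_ghz) {
    assert(cores > 0 && flops_per_cycle > 0.0 && clock_ghz > 0.0);
    return cores * flops_per_cycle * clock_ghz / 1000.0;
}
```

With those assumptions, `peak_tflops(8, 64.0, 3.6)` comes out at roughly 1.84 TFLOPS, so the quoted 10.6 TFLOPS for the GPU is a little under 6x higher on paper.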

> GPUs can gather or scatter values between lanes

Can they? AFAIK they can only permute values between lanes but not scatter/gather.

__shfl_sync() in CUDA does exactly the same as _mm256_permutevar8x32_ps() in AVX2 or _mm512_permutexvar_ps in AVX512.

dragontamer · 7y ago

> Can they? AFAIK they can only permute values between lanes but not scatter/gather.

GPUs have a crossbar between OpenCL __local memory and every work-item in a workgroup. It's so innate to the GPU that it's almost implicit. In terms of OpenCL, the code looks like this:

    __local uint gatherFoo[64];
    gatherFoo[get_local_id(0)] = fooBar();
    barrier(CLK_LOCAL_MEM_FENCE);  // make every lane's write visible before reading
    myFoo = gatherFoo[generate_index()];

The above is roughly equivalent to vpshufb, where generate_index() is the parameter to vpshufb.

    __local uint scatterFoo[64];
    scatterFoo[generate_index()] = fooBar();
    barrier(CLK_LOCAL_MEM_FENCE);  // make every lane's write visible before reading
    myFoo = scatterFoo[get_local_id(0)];

The above is equivalent to a "backwards vpshufb". AVX512 is missing this equivalent.
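The two directions can be modelled in scalar C to show how they differ. This is a sketch with hypothetical function names, not a claim about how any hardware implements them. It also shows one possible CPU-side workaround when the index vector is a permutation: invert the indices, then use the gather direction that does exist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather: each lane pulls from the lane named by idx (the vpshufb direction). */
void lane_gather(const uint32_t *src, const uint32_t *idx, uint32_t *out,
                 size_t lanes) {
    for (size_t i = 0; i < lanes; ++i) {
        assert(idx[i] < lanes);
        out[i] = src[idx[i]];
    }
}

/* Scatter: each lane pushes to the lane named by idx (the "backwards" direction).
   Lanes that no index points at keep whatever out already held. */
void lane_scatter(const uint32_t *src, const uint32_t *idx, uint32_t *out,
                  size_t lanes) {
    for (size_t i = 0; i < lanes; ++i) {
        assert(idx[i] < lanes);
        out[idx[i]] = src[i];
    }
}

/* When idx is a permutation, inverting it turns a scatter into a gather. */
void invert_permutation(const uint32_t *idx, uint32_t *inv, size_t lanes) {
    for (size_t i = 0; i < lanes; ++i) {
        assert(idx[i] < lanes);
        inv[idx[i]] = (uint32_t)i;
    }
}
```

The inversion step is itself a scatter, so this workaround only pays off when the same index vector is reused many times.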

GPUs don't really "need" a dedicated instruction to do this, because __local memory is a full-speed crossbar when workgroups are of size 32 (NVidia) or 64 (AMD), the native SIMD size. Its performance characteristics are not quite equivalent to vpshufb, but it's still very, very, very fast in practice.

> __shfl_sync() in CUDA does exactly the same as _mm256_permutevar8x32_ps() in AVX2 or _mm512_permutexvar_ps in AVX512.

There's little reason to use __shfl_sync(), because it goes through CUDA Shared memory anyway. Ditto with AMD's __amdgcn_ds_permute() and __amdgcn_ds_bpermute() intrinsics.

EDIT: I guess __shfl_sync() and the __amdgcn_ds_permute / bpermute instructions save a step. They're smaller in assembly language and more concise. But I expect the overall performance to not be much different from using LDS / Shared Memory explicitly.

------------------

> I’m just not sure it’s worth it. People who are OK with the GPU programming model are already using GPUs because they're way more powerful. The AVX-512 theoretical max is 64 FLOPs/cycle per core; a modern $600 CPU, the i7-7820X, with good enough cooling is capable of 1.8 TFLOPS single precision. A generation-old $600 GPU, the 1080 Ti, is capable of 10.6 TFLOPS. Huge difference.

People don't program CPUs for FLOPs. They program on CPUs for minimum latency.

SIMD Compute is useful on a CPU because it stays in L1 cache. L1 cache is 64kB, more than enough to have a good SIMD processor accelerate some movement. CPUs even have full bandwidth to L2 cache, which is huge these days (512kB on EPYC to 1MB on Skylake-server)

CPU-based SIMD won't ever be as big or as broad as GPU-based SIMD. But... CPU-based SIMD should become easier as Intel figures out how to adopt OpenCL or CUDA programming paradigms.

There are already many problems implemented in CPU-AVX512 which execute faster than the 15.8GB/s that a PCIe x16 bus will give you. Therefore, it's more efficient to execute the whole problem on the CPU, rather than transfer the data to the GPU.
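A back-of-the-envelope model of this argument, as a sketch in C: it assumes the data crosses the bus once at the 15.8 GB/s quoted above, that transfer and GPU compute do not overlap, and that results need no copy back. All function names and the throughput parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define PCIE_X16_BYTES_PER_SEC 15.8e9  /* figure quoted in the comment above */

/* Time to process the data in place on the CPU. */
double cpu_seconds(double bytes, double cpu_bytes_per_sec) {
    assert(cpu_bytes_per_sec > 0.0);
    return bytes / cpu_bytes_per_sec;
}

/* Time to ship the data over PCIe and then process it on the GPU
   (assumption: no overlap between transfer and compute). */
double gpu_seconds(double bytes, double gpu_bytes_per_sec) {
    assert(gpu_bytes_per_sec > 0.0);
    return bytes / PCIE_X16_BYTES_PER_SEC + bytes / gpu_bytes_per_sec;
}

/* True only when the GPU path, including the transfer, beats the CPU path. */
bool offload_pays_off(double bytes, double cpu_bytes_per_sec,
                      double gpu_bytes_per_sec) {
    return gpu_seconds(bytes, gpu_bytes_per_sec) <
           cpu_seconds(bytes, cpu_bytes_per_sec);
}
```

Under this model, a CPU kernel that sustains more than 15.8 GB/s finishes before the transfer alone would, so no GPU speed can make offloading worthwhile.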

verall · 7y ago

PTX is NOT architecture-specific; it is JIT-compiled into native code by the GPU driver...