
0 points · Const-me · 7y ago · 0 comments

Only graphics shaders.

Advanced compute shaders very often need to synchronize local data within a warp. When you program such shaders you have to be aware of the SIMD width (32 on NVidia and Intel, 64 on AMD), and design both algorithms and data structures accordingly. Failing to do so often has a significant performance cost: I saw up to a 10x speedup after implementing a cooperative algorithm instead of a straightforward one.
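To make "cooperative" versus "straightforward" concrete, here is a minimal sketch in plain C++ that emulates the lanes of one warp on the CPU. It is not the code behind the 10x figure above. The names `kLanes`, `sum_serial` and `sum_cooperative` are made up for illustration, and the inner loop stands in for what a real shader would do with a warp shuffle or similar cross-lane primitive.

```cpp
#include <array>
#include <cstddef>

// Hypothetical lane count; 32 matches the NVidia width quoted above.
constexpr std::size_t kLanes = 32;

// Straightforward version: a single lane walks every value serially.
inline int sum_serial(const std::array<int, kLanes>& v) {
    int total = 0;
    for (int x : v) total += x;
    return total;
}

// Cooperative version: emulates a tree reduction across the lanes.
// In each step, lane i adds the value held by lane i + offset. On real
// hardware all lanes of a step run at once, so the whole reduction takes
// log2(kLanes) steps instead of kLanes - 1 serial additions.
inline int sum_cooperative(std::array<int, kLanes> v) {
    static_assert((kLanes & (kLanes - 1)) == 0,
                  "lane count must be a power of two");
    for (std::size_t offset = kLanes / 2; offset > 0; offset /= 2) {
        // Lanes below offset only read lanes at or above offset, which are
        // not written in this step, so the serial loop matches lockstep.
        for (std::size_t lane = 0; lane < offset; ++lane) {
            v[lane] += v[lane + offset];
        }
    }
    return v[0];  // lane 0 ends up holding the sum of all lanes
}
```

Both functions return the same sum. The difference is that the cooperative one bakes the lane count into the algorithm, which is exactly the width awareness described above.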


5 comments · 2 top-level

DesiLurker · 7y ago · 3 in thread

I have done a bit of SIMD programming in the past & I can tell you that it's quite common to expect to rewrite your SIMD-optimized code as a new architecture comes along. Sometimes you do it even as you go between different iterations of the same arch (like Cortex-A8 vs A9) because of different instruction timing (and sometimes bugs). In general, asking hardware to auto-optimize around your code doesn't work except for simple problems & even then you are likely leaving performance on the table.

What I really want is lots of different types of vector & permute ops and the ability to reconfigure the SIMD unit on a dime when I am done with a specific type of compute (like crypto) and switch to another type (like signal processing).

atq2119 · 7y ago

Well, that makes sense if you want to squeeze the last bit of performance out of highly optimized code.

However, there's a case to be made that, in order to utilize our hardware better, we should be using vector units much more often. To make that feasible, we need a good programming paradigm that doesn't have to be rewritten for a different architecture. If that ends up not utilizing the hardware perfectly, that's okay: using a 256-bit vector unit even at 50% of the potential performance is still many times faster than scalar code.

dragontamer · 7y ago

GPU Coders haven't changed their code in the last 10 years, even as NVidia changed their architecture repeatedly.

PTX Assembly from NVidia still runs on today's architectures. I think this variable-length issue they focus on so much is a bit of a red herring: NVidia always was 32-way SIMD but the PTX code remains portable nonetheless.

The power is that PTX Assembly (and AMD's GCN Assembly) has a scalar model of programming, but its execution is vectorized. So you write scalar code, but the programmer knows (and assumes) that it runs in a parallel context. EDIT: I guess PTX is technically an intermediate language that the driver compiles for the actual hardware: the number of registers is not fixed, etc. etc. Nonetheless, the general "SIMD-ness" of PTX is static, and has survived a decade of hardware changes.

There are a few primitives needed for this to work: OpenCL's "Global Index" and "Local Index" for example. "Global Index" is where you are in the overall workstream, while "Local Index" is useful because intra-workgroup communications are VERY VERY FAST.

And... that's about it? Really. I guess there are a bunch of primitives (the workgroup swizzle operations, "ballot", barrier, etc. etc.), but the general GPU model is actually kinda simple.
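A rough sketch of that model, emulated in plain C++ rather than written in PTX or OpenCL: the "kernel" is scalar code for a single element, and a launcher supplies the global and local index. `WorkItem`, `saxpy_kernel` and `launch_saxpy` are hypothetical names for illustration, not any real runtime's API, and the nested loops only stand in for what the hardware would run in lockstep.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the indices a GPU runtime hands each work-item.
struct WorkItem {
    std::size_t global_id;  // position in the overall workstream
    std::size_t local_id;   // position inside the workgroup
};

// The "kernel": scalar code that handles exactly one element.
// Nothing in it mentions the SIMD width.
inline void saxpy_kernel(WorkItem item, float a,
                         const std::vector<float>& x,
                         std::vector<float>& y) {
    const std::size_t i = item.global_id;
    // Guard the padding items of the last, partially filled workgroup.
    if (i < x.size() && i < y.size()) {
        y[i] = a * x[i] + y[i];
    }
}

// Emulated launch: on a GPU the items of one group would execute together;
// here they simply run one after another.
inline void launch_saxpy(std::size_t group_size, float a,
                         const std::vector<float>& x,
                         std::vector<float>& y) {
    if (group_size == 0) return;  // nothing sensible to launch
    const std::size_t groups = (y.size() + group_size - 1) / group_size;
    for (std::size_t g = 0; g < groups; ++g) {
        for (std::size_t l = 0; l < group_size; ++l) {
            saxpy_kernel(WorkItem{g * group_size + l, l}, a, x, y);
        }
    }
}
```

Changing `group_size` changes how the work is carved up but not the kernel, which is the portability property being described.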

-----------

I see a lot of these CPU architecture changes, but none of them really seem to be trying to learn from NVidia's or AMD's model. A bit of PTX Assembly or GCN Assembly would probably do the next generation of CPU architects some good.


Const-me (OP) · 7y ago

> that doesn't have to be rewritten for a different architecture

CPU architectures are quite stable. SSE2 is almost 20 years old now. You can't even run modern Windows on a system which doesn't support it.

Vectorize to SSE and you'll get your 50% of potential performance. You can do it without any new paradigms: C and C++ have supported SSE intrinsics for decades already, and other languages are catching up.
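For readers who have not seen intrinsics, a minimal sketch of what "vectorize to SSE" can look like in C++. The function name `add_arrays` and the `HAVE_SSE2` macro are made up for this example, and the preprocessor guard is an assumption so that the code still builds (as plain scalar code) on non-x86 targets.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 header; also provides the SSE float intrinsics
#define HAVE_SSE2 1     // hypothetical macro, local to this sketch
#endif

// Adds two float arrays element-wise, four floats per instruction where
// SSE is available, with a scalar loop for the remaining tail elements.
inline void add_arrays(const float* a, const float* b, float* out,
                       std::size_t n) {
    std::size_t i = 0;
#ifdef HAVE_SSE2
    for (; i + 4 <= n; i += 4) {
        // Unaligned loads and stores: no alignment requirement on the inputs.
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The vector width (four floats per 128-bit register) is fixed in the source, which is the trade-off the rest of the thread is arguing about.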


Joky · 7y ago

If you want to draw an analogy with GPU compute shaders, I think it is more accurate to compare to how GPUs can scale the number of cores without (potentially) the need to recompile, as long as enough blocks are scheduled.

This is orthogonal to the fact that these are warp-size aware, I believe.
