SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011) (opens in new tab)

(yosefk.com)

138 pointsshipp022y ago37 comments

37 comments

28 comments · 7 top-level

narrowbyte2y ago· 8 in thread

quite interesting framing. A couple things have changed since 2011

- SIMD (at least intel's AVX512) does have usable gather/scatter, so "Single instruction, multiple addresses" is no longer a flexibility win for SIMT vs SIMD

- likewise for pervasive masking support and "Single instruction, multiple flow paths"

In general, I think of SIMD as more flexible than SIMT, not less, in line with this other post https://news.ycombinator.com/item?id=40625579. SIMT requires staying more towards the "embarrassingly" parallel end of the spectrum, SIMD can be applied in cases where understanding the opportunity for parallelism is very non-trivial.

raphlinus2y ago

One of the other major things that's changed is that Nvidia now has independent thread scheduling (as of Volta, see [1]). That allows things like individual threads to take locks, which is a pretty big leap. Essentially, it allows you to program each individual thread as if it's running a C++ program, but of course you do have to think about the warp and block structure if you want to optimize performance.

I disagree that SIMT is only for embarrassingly parallel problems. Both CUDA and compute shaders are now used for fairly sophisticated data structures (including trees) and algorithms (including sorting).

[1]: https://developer.nvidia.com/blog/inside-volta/#independent_...

yosefk2y ago

It's improtant that GPU threads support locking and control flow divergence and I don't want to minimize that, but threads within a warp diverging still badly loses throughput, so I don't think the situation I'd fundamentally different in terms of what the machine is good/bad at. We're just closer to the base architecture's local maximum of capabilities, as one would expect for a more mature architecture; various things it could be made to support it now actually supports because there was time to add this support

narrowbyte2y ago

I intentionally said "more towards embarrassingly parallel" rather than "only embarrassingly parallel". I don't think there's a hard cutoff, but there is a qualitative difference. One example that springs to mind is https://github.com/simdjson/simdjson - afaik there's no similarly mature GPU-based JSON parsing.

1 more reply

xoranth2y ago

> That allows things like individual threads to take locks, which is a pretty big leap.

Does anyone know how those get translated into SIMD instructions. Like, how do you do a CAS loop for each lane where each lane can individually succeed or fail? What happens if the lanes point to the same location?

1 more reply

majke2y ago

Last time i looked at intel scatter/gather I got the impression it only works for a very narrow use case, and getting it to perform wasn’t easy. Did I miss something?

narrowbyte2y ago

The post says, about SIMT / GPU programming, "This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it – similarly to any other processor."

I would say that for SIMD the situation is basically the same. gather/scatter don't magically make the memory hierarchy a non-issue, but they're no longer adding any unnecessary pain on top.

1 more reply

majke2y ago

vpgatherdd - I think that for newer CPUs it is faster than many loads + inserts, but if you are going to fault a lot, then it becomes slow.

> The VGATHER instructions are implemented as micro-coded flow. Latency is ~50 cycles.

https://www.intel.com/content/www/us/en/content-details/8141...

ribit2y ago

Modern GPUs are exposing the SIMD behind the SIMT model and heavily investing into SIMD features such as shuffles, votes, and reduces. This leads to an interesting programming model. One interesting challenge is that flow control is done very differently on different hardware. AMD has a separate scalar instruction pipeline which can set the SIMD mask. Apple uses an interesting per-lane stack counter approach where value of zero means that the lane is active and non-zero value indicates how many blocks need to be exited for the thread to become active again. Not really sure how Nvidia does it.

Remnant442y ago· 4 in thread

I think the principle things that have changed since this article was written is mostly each category taking inspiration from the other.

For example, SIMD instructions gained gather/scatter and even masking of instructions for divergent flow (in avx512 that consumers never get to play with). These can really simplify writing explicit SIMD and make it more GPU-like.

Conversely, GPUs gained a much higher emphasis on caching, sustained divergent flow via independent program counters, and subgroup instructions which are essentially explicit SIMD in disguise.

SMT on the other hand... seems like it might be on the way out completely. While still quite effective for some workloads, it seems like quite a lot of complexity for only situational improvements in throughput.

yosefk2y ago

The basic architecture still matters. GPUs still lose throughput upon divergence regardless of their increased ability to run more kinds of divergent flows correctly due to having separate PCs, and SIMD still has more trouble with instruction latency (including due to bank conflict resolution in scatter/gather) than barrel threaded machines, etc. This is not to detract from the importance of the improvements to the base architecture made over time

Remnant442y ago

agreed! The basic categories remain, just blurring a bit at the edges.

anonymoushn2y ago

After primarily using AVX2, I don't think masked instructions and scatter/gather are particularly useful. Emulating masked computations with a blend is cheap. Emulating compress and some missing shuffles is expensive. Masked stores and loads don't really help with anything except for an edge case where they don't cause page faults on the part that was masked out.

petermcneeley2y ago

On the gpu a masked out load is a nop. It certainly is better. And scatter functionality is probably quite painful to emulate without the intrinsics.

Const-me2y ago· 3 in thread

> How many register sets does a typical SMT processor have? Er, 2, sometimes 4

Way more of them. Pipelines are deep, and different in-flight instructions need different versions of the same registers.

For example, my laptop has AMD Zen3 processor. Each core has 192 scalar physical registers, while only ~16 general-purpose scalar registers defined in the ISA. This gives 12 register sets; they are shared by both threads running on the core.

Similar with SIMD vector registers. Apparently each core has 160 32-byte vector registers. Because AVX2 ISA defines 16 vector register, this gives 10 register sets per core, again shared by 2 threads.

yosefk2y ago

I meant register sets visible to software. The fact that there's even more hardware not visible to software that you need for a thread to run fast just means that the cost of adding another thread is even higher

atq21192y ago

Depends on whether that hardware is shared between threads or not. Physical registers are usually shared.

xoranth2y ago

I believe the author is referring to how many logical threads/hyperthreads can a core run (for AMD and Intel, two. I believe POWER can do 8, Sparc 4).

The extra physical registers are there for superscalar execution, not for SMT/hyperthreading.

HALtheWise2y ago· 3 in thread

For a SIMD architecture that supports scatter/gather and instruction masking (like Arm SVE), could a compiler or language allow you to write "Scalar-style code" that compiles to SIMD instructions? I guess this is just auto-vectorization, but I'd be interested in explicit tagging of code regions, possibly in combination with restrictions on what operations are allowed.

yosefk2y ago

Check out ispc/spmd by Matt Pharr, a very interesting take on this subject

doophus2y ago

Yes, have a look at ISPC - it's amazing. I especially like that it can generate code for multiple architectures and then select the best implementation at runtime for the CPU it's running on.

xoranth2y ago

Do you know any good tutorial for ISPC? Documentation is a bit sparse.

jabl2y ago· 3 in thread

A couple of related questions:

- It has been claimed that several GPU vendors behind the covers convert the SIMT programming model (graphics shaders, CUDA, OpenCL, whatever) into something like a SIMD ISA that the underlying hardware supports. Why is that? Why not have something SIMT-like as the underlying HW ISA? Seems the conceptual beauty of SIMT is that you don't need to duplicate the entire scalar ISA for vectors like you need with SIMD, you just need a few thread control instructions (fork, join, etc.) to tell the HW to switch between scalar or SIMT mode. So why haven't vendors gone with this? Is there some hidden complexity that makes SIMT hard to implement efficiently, despite the nice high level programming model?

- How do these higher level HW features like Tensor cores map to the SIMT model? It's sort of easy to see how SIMT handles a vector, each thread handles one element of the vector. But if you have HW support for something like a matrix multiplication, what then? Or does each SIMT thread have access to a 'matmul' instruction, and all the threads in a warp that run concurrently can concurrently run matmuls?

xoranth2y ago

It is the same reason in software sometimes you batch operations:

When you add two numbers, the GPU needs to do a lot more stuff besides the addition.

If you implemented SIMT by having multiple cores, you would need to do the extra stuff once per core, so you wouldn't save power (and you have a fixed power budget). With SIMD, you get $NUM_LANES additions, but you do the extra stuff only once, saving power.

(See this article by OP, which goes into more details: https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.ht... )

ribit2y ago

How would you envision that working at the hardware level? GPUs are massively parallel devises, they need to keep the scheduler and ALU logic as simple and compact as possible. SIMD is a natural way to implement this. In real world, SIMT is just SIMD with some additional capabilities for control flow and a programming model that focuses on SIMD lanes as threads of execution.

What’s interesting is that modern SIMT is exposing quite a lot of its SIMD underpinnings, because that allows you to implement things much more efficiently. A hardware-accelerated SIMD sum is way faster than adding values in shared memory.

avianes2y ago

> GPUs are massively parallel devises, they need to keep the scheduler and ALU logic as simple and compact as possible

The simplest hardware implementation is not always the more compact or the more efficient. This is a misconception, example bellow.

> SIMT is just SIMD with some additional capabilities for control flow ..

In the Nvidia uarch, it does not. The key part of the Nvidia uarch is the "operand-collector" and the emulation of multi-ports register-file using SRAM (single or dual port) banking. In a classical SIMD uarch, you just retrieve the full vector from the register-file and execute each lane in parallel. While in the Nvidia uach, each ALU have an "operand-collector" that track and collect the operands of multiple in-flight operations. This enable to read from the register-file in an asynchronous fashion (by "asynchronous" here I mean not all at the same cycle) without introducing any stall.

When a warp is selected, the instruction is decoded, an entry is allocated in the operand-collector of each used ALU, and the list of register to read is send to the register-file. The register-file dispatch register reads to the proper SRAM banks (probably with some queuing when read collision occur). And all operand-collectors independently wait for their operands to come from the register-file, when an operand collector entry has received all the required operands, the entry is marked as ready and can now be selected by the ALU for execution.

That why (or 1 of the reason) you need to sync your threads in the SIMT programing model and not in an SIMD programming model.

Obviously you can emulate an SIMT uarch using an SIMD uarch, but a think it's missing the whole point of SIMT uarch.

Nvidia do all of this because it allow to design a more compact register-file (memories with high number of port are costly) and probably because it help to better use the available compute resources with masked operations

1 more reply

mkoubaa2y ago

This type of parallelism is sort of like a flops metric. Optimizing the amount of wall time the GPU is actually doing computation is just as important (if not more). There are some synchronization and pipelining tools in CUDA and Vulkan but they are scary at first glance.

James_K2y ago

> Programmable NVIDIA GPUs are very inspiring to hardware geeks, proving that processors with an original, incompatible programming model can become widely used.

Got me laughing at the first line.

j / k navigate · click thread line to collapse

37 comments

28 comments · 7 top-level

narrowbyte2y ago· 8 in thread

quite interesting framing. A couple things have changed since 2011

- SIMD (at least intel's AVX512) does have usable gather/scatter, so "Single instruction, multiple addresses" is no longer a flexibility win for SIMT vs SIMD

- likewise for pervasive masking support and "Single instruction, multiple flow paths"

raphlinus2y ago

[1]: https://developer.nvidia.com/blog/inside-volta/#independent_...

yosefk2y ago

narrowbyte2y ago

1 more reply

xoranth2y ago

> That allows things like individual threads to take locks, which is a pretty big leap.

1 more reply

majke2y ago

Last time i looked at intel scatter/gather I got the impression it only works for a very narrow use case, and getting it to perform wasn’t easy. Did I miss something?

narrowbyte2y ago

The post says, about SIMT / GPU programming, "This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it – similarly to any other processor."

I would say that for SIMD the situation is basically the same. gather/scatter don't magically make the memory hierarchy a non-issue, but they're no longer adding any unnecessary pain on top.

1 more reply

majke2y ago

vpgatherdd - I think that for newer CPUs it is faster than many loads + inserts, but if you are going to fault a lot, then it becomes slow.

> The VGATHER instructions are implemented as micro-coded flow. Latency is ~50 cycles.

https://www.intel.com/content/www/us/en/content-details/8141...

ribit2y ago

Remnant442y ago· 4 in thread

I think the principle things that have changed since this article was written is mostly each category taking inspiration from the other.

Conversely, GPUs gained a much higher emphasis on caching, sustained divergent flow via independent program counters, and subgroup instructions which are essentially explicit SIMD in disguise.

yosefk2y ago

Remnant442y ago

agreed! The basic categories remain, just blurring a bit at the edges.

anonymoushn2y ago

petermcneeley2y ago

On the gpu a masked out load is a nop. It certainly is better. And scatter functionality is probably quite painful to emulate without the intrinsics.

Const-me2y ago· 3 in thread

> How many register sets does a typical SMT processor have? Er, 2, sometimes 4

Way more of them. Pipelines are deep, and different in-flight instructions need different versions of the same registers.

yosefk2y ago

atq21192y ago

Depends on whether that hardware is shared between threads or not. Physical registers are usually shared.

xoranth2y ago

I believe the author is referring to how many logical threads/hyperthreads can a core run (for AMD and Intel, two. I believe POWER can do 8, Sparc 4).

The extra physical registers are there for superscalar execution, not for SMT/hyperthreading.

HALtheWise2y ago· 3 in thread

yosefk2y ago

Check out ispc/spmd by Matt Pharr, a very interesting take on this subject

doophus2y ago

Yes, have a look at ISPC - it's amazing. I especially like that it can generate code for multiple architectures and then select the best implementation at runtime for the CPU it's running on.

xoranth2y ago

Do you know any good tutorial for ISPC? Documentation is a bit sparse.

jabl2y ago· 3 in thread

A couple of related questions:

xoranth2y ago

It is the same reason in software sometimes you batch operations:

When you add two numbers, the GPU needs to do a lot more stuff besides the addition.

(See this article by OP, which goes into more details: https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.ht... )

ribit2y ago

avianes2y ago

> GPUs are massively parallel devises, they need to keep the scheduler and ALU logic as simple and compact as possible

The simplest hardware implementation is not always the more compact or the more efficient. This is a misconception, example bellow.

> SIMT is just SIMD with some additional capabilities for control flow ..

That why (or 1 of the reason) you need to sync your threads in the SIMT programing model and not in an SIMD programming model.

Obviously you can emulate an SIMT uarch using an SIMD uarch, but a think it's missing the whole point of SIMT uarch.

1 more reply

mkoubaa2y ago

James_K2y ago

> Programmable NVIDIA GPUs are very inspiring to hardware geeks, proving that processors with an original, incompatible programming model can become widely used.

Got me laughing at the first line.

j / k navigate · click thread line to collapse