- SIMD (at least intel's AVX512) does have usable gather/scatter, so "Single instruction, multiple addresses" is no longer a flexibility win for SIMT vs SIMD
- likewise for pervasive masking support and "Single instruction, multiple flow paths"
In general, I think of SIMD as more flexible than SIMT, not less, in line with this other post https://news.ycombinator.com/item?id=40625579. SIMT requires staying more towards the "embarrassingly" parallel end of the spectrum, SIMD can be applied in cases where understanding the opportunity for parallelism is very non-trivial.
I disagree that SIMT is only for embarrassingly parallel problems. Both CUDA and compute shaders are now used for fairly sophisticated data structures (including trees) and algorithms (including sorting).
[1]: https://developer.nvidia.com/blog/inside-volta/#independent_...
Does anyone know how those get translated into SIMD instructions. Like, how do you do a CAS loop for each lane where each lane can individually succeed or fail? What happens if the lanes point to the same location?
I would say that for SIMD the situation is basically the same. gather/scatter don't magically make the memory hierarchy a non-issue, but they're no longer adding any unnecessary pain on top.
> The VGATHER instructions are implemented as micro-coded flow. Latency is ~50 cycles.
https://www.intel.com/content/www/us/en/content-details/8141...
For example, SIMD instructions gained gather/scatter and even masking of instructions for divergent flow (in avx512 that consumers never get to play with). These can really simplify writing explicit SIMD and make it more GPU-like.
Conversely, GPUs gained a much higher emphasis on caching, sustained divergent flow via independent program counters, and subgroup instructions which are essentially explicit SIMD in disguise.
SMT on the other hand... seems like it might be on the way out completely. While still quite effective for some workloads, it seems like quite a lot of complexity for only situational improvements in throughput.
Way more of them. Pipelines are deep, and different in-flight instructions need different versions of the same registers.
For example, my laptop has AMD Zen3 processor. Each core has 192 scalar physical registers, while only ~16 general-purpose scalar registers defined in the ISA. This gives 12 register sets; they are shared by both threads running on the core.
Similar with SIMD vector registers. Apparently each core has 160 32-byte vector registers. Because AVX2 ISA defines 16 vector register, this gives 10 register sets per core, again shared by 2 threads.
The extra physical registers are there for superscalar execution, not for SMT/hyperthreading.
- It has been claimed that several GPU vendors behind the covers convert the SIMT programming model (graphics shaders, CUDA, OpenCL, whatever) into something like a SIMD ISA that the underlying hardware supports. Why is that? Why not have something SIMT-like as the underlying HW ISA? Seems the conceptual beauty of SIMT is that you don't need to duplicate the entire scalar ISA for vectors like you need with SIMD, you just need a few thread control instructions (fork, join, etc.) to tell the HW to switch between scalar or SIMT mode. So why haven't vendors gone with this? Is there some hidden complexity that makes SIMT hard to implement efficiently, despite the nice high level programming model?
- How do these higher level HW features like Tensor cores map to the SIMT model? It's sort of easy to see how SIMT handles a vector, each thread handles one element of the vector. But if you have HW support for something like a matrix multiplication, what then? Or does each SIMT thread have access to a 'matmul' instruction, and all the threads in a warp that run concurrently can concurrently run matmuls?
When you add two numbers, the GPU needs to do a lot more stuff besides the addition.
If you implemented SIMT by having multiple cores, you would need to do the extra stuff once per core, so you wouldn't save power (and you have a fixed power budget). With SIMD, you get $NUM_LANES additions, but you do the extra stuff only once, saving power.
(See this article by OP, which goes into more details: https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.ht... )
What’s interesting is that modern SIMT is exposing quite a lot of its SIMD underpinnings, because that allows you to implement things much more efficiently. A hardware-accelerated SIMD sum is way faster than adding values in shared memory.
The simplest hardware implementation is not always the more compact or the more efficient. This is a misconception, example bellow.
> SIMT is just SIMD with some additional capabilities for control flow ..
In the Nvidia uarch, it does not. The key part of the Nvidia uarch is the "operand-collector" and the emulation of multi-ports register-file using SRAM (single or dual port) banking. In a classical SIMD uarch, you just retrieve the full vector from the register-file and execute each lane in parallel. While in the Nvidia uach, each ALU have an "operand-collector" that track and collect the operands of multiple in-flight operations. This enable to read from the register-file in an asynchronous fashion (by "asynchronous" here I mean not all at the same cycle) without introducing any stall.
When a warp is selected, the instruction is decoded, an entry is allocated in the operand-collector of each used ALU, and the list of register to read is send to the register-file. The register-file dispatch register reads to the proper SRAM banks (probably with some queuing when read collision occur). And all operand-collectors independently wait for their operands to come from the register-file, when an operand collector entry has received all the required operands, the entry is marked as ready and can now be selected by the ALU for execution.
That why (or 1 of the reason) you need to sync your threads in the SIMT programing model and not in an SIMD programming model.
Obviously you can emulate an SIMT uarch using an SIMD uarch, but a think it's missing the whole point of SIMT uarch.
Nvidia do all of this because it allow to design a more compact register-file (memories with high number of port are costly) and probably because it help to better use the available compute resources with masked operations
Got me laughing at the first line.