That is just a software view provided by the CUDA compiler and the NVIDIA device driver, which take care of distributing the computation over SIMD lanes and real hardware threads (called warps in NVIDIA's obfuscating jargon).
The GPU processor is just a processor with SIMD and FGMT (fine-grained multi-threading). There is nothing special about it from this point of view.
For any CPU you can write a compiler that provides the "SIMT" software model, and such compilers exist for Intel/AMD CPUs (e.g. ispc).
SIMT is just one of many useless synonyms created by NVIDIA for traditional terms that had been used for decades in computing publications. It is the same as what Hoare called an "array of processes" (in August 1978, later implemented in the language Occam), and in practice the same as the "parallel for" of OpenMP, just with slightly different syntax (i.e. CUDA separates the header and the body of the parallel FOR in the source file).
It's less a difference in instruction set capability and more a difference in mentality.
Like, for SIMD, you have to say "ok, we're working in vector land now" and start doing vector loads into vector registers to do vector ops on them. Otherwise, the standard variables your program uses are scalars and you get less parallelism. On a GPU this is flipped: the regular registers are vectors, and the scalar ones (if you have any) are the weird ones. And because of this, the code you write is (more or less) scalar code where everything just so happens to magically get done sixteen times at once.
As you can imagine, this isn't foolproof, and there are a lot of other things that have to change for GPU programming to be viable. Like, conditional branching has to be scalar, since the instruction pointer register is still scalar. But you can have vectors of condition flags (aka "predicates"), and make all the operations take a predicate register that tells which specific lanes should and shouldn't execute. Any scalar conditional can be compiled into predicates, so long as you're OK with having to chew through all the instructions on both branches[0].
[0] A sufficiently smart shader compiler could emit a runtime check for the predicate being all-false or all-true and a scalar jump over the instructions that won't execute. Whether or not that's a good idea is another question.