
0 points · gpuhacker · 7y ago · 0 comments

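It's true that the programming model allows that, but underneath, all threads within a warp execute the same instructions in lockstep. However, if there is a branch and threads in the same warp take different sides of it, the warp runs both paths one after the other, and the threads that are not on the current path are predicated off so their instructions have no effect. This is called warp divergence, and it can become a performance issue because the warp pays for both paths. If possible, branch on threadIdx only at warp-size granularity (multiples of 32 on current NVIDIA GPUs), so that all threads in a warp take the same path. There's a cool slide deck on implementing a parallel sum algorithm that explains this really well.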
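To make that concrete, here's a rough sketch of the two branching patterns in CUDA. It is only an illustration, not taken from the slide deck: the kernel and helper names are made up, the warp size of 32 is assumed, and the block size is assumed to be a multiple of 32 so that warps line up with `threadIdx.x / 32`. With branch bodies this trivial the compiler may well emit a select instead of a real branch, so treat it as showing the pattern rather than as a benchmark.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;  // assumption: 32 threads per warp

// Hypothetical helper: the condition flips between neighbouring threads,
// so every warp contains threads on both sides of the branch.
__host__ __device__ inline bool first_path_divergent(int tid) {
    return tid % 2 == 0;
}

// Hypothetical helper: the condition is constant over each group of 32
// consecutive threads, so a whole warp takes the same side of the branch.
__host__ __device__ inline bool first_path_aligned(int tid) {
    return (tid / kWarpSize) % 2 == 0;
}

__global__ void divergent_kernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (first_path_divergent(threadIdx.x)) {
        out[i] = 1.0f;  // even threads of each warp
    } else {
        out[i] = 2.0f;  // odd threads of the same warp
    }
}

__global__ void aligned_kernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (first_path_aligned(threadIdx.x)) {
        out[i] = 1.0f;  // even-numbered warps
    } else {
        out[i] = 2.0f;  // odd-numbered warps
    }
}

int main() {
    const int n = 1 << 20;
    const int block = 256;  // multiple of kWarpSize
    const int grid = (n + block - 1) / block;

    float* d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    std::vector<float> h(n);

    divergent_kernel<<<grid, block>>>(d_out, n);
    cudaMemcpy(h.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    assert(h[0] == 1.0f && h[1] == 2.0f);  // pattern alternates per thread

    aligned_kernel<<<grid, block>>>(d_out, n);
    cudaMemcpy(h.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    assert(h[0] == 1.0f && h[31] == 1.0f && h[32] == 2.0f);  // per warp

    cudaFree(d_out);
    printf("both kernels produced the expected pattern\n");
    return 0;
}
```

Both kernels do the same amount of useful work per thread. The difference is that in `divergent_kernel` every warp has threads on both sides of the `if`, while in `aligned_kernel` each warp only ever sees one side.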


2 comments · 1 top-level

dialecticDolt · 7y ago · 1 in thread

Do you have any idea what the source of the slide deck could have been? It sounds very interesting and I'd love to see it.

gpuhacker (OP) · 7y ago

Sure! I meant this one: https://developer.download.nvidia.com/assets/cuda/files/redu...
