It's true that the programming model allows that, but underneath all threads within the warp will execute the same instructions. However if there is a branch, some threads can be predicated so their instructions have no effect. This is called warp divergence and can become a performance issue. If possible branch only on threadidx using multiples of the warp size. There's a cool slide deck on implementing a parallel sum algorithm that explains this really well.