The subtlety is that eliminating branching isn't sufficient to get constant-time code. A simple example is the trigonometric and transcendental opcodes: they don't branch (at the assembly level), but on x86 they take a variable amount of time depending on the input operand. Very few algorithms actually use these opcodes though, so a more relevant concern is memory access, whose latency varies with whether the address hits or misses the cache. Even if you have that nailed down, integer operations like multiplication and especially division can, on some processors, take a variable amount of time depending on the input.
Writing truly constant-time code on modern processors is difficult at best, and the result is usually less efficient than variable-time code.
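To make the memory access point concrete, here is a minimal sketch in C of a table lookup whose access pattern does not depend on a secret index. It reads every entry and keeps the wanted one with a mask, instead of indexing the table directly. The names `ct_eq_mask` and `ct_lookup` are hypothetical, chosen for illustration. This only shows the idea at the source level: a compiler is free to turn it back into branches, so the generated assembly would still need checking on the actual target.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 0xFFFFFFFF if a == b, 0 otherwise,
 * using only bitwise and arithmetic operations (no comparison branch). */
static uint32_t ct_eq_mask(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;                       /* zero iff a == b */
    uint32_t nonzero = (x | (0u - x)) >> 31;  /* 1 if x != 0, else 0 */
    return nonzero - 1u;                      /* all ones if equal, else 0 */
}

/* Hypothetical lookup: touches every table entry in the same order
 * regardless of secret_index, so the sequence of addresses accessed
 * carries no information about it. Returns 0 if the index is out of range. */
uint32_t ct_lookup(const uint32_t *table, size_t len, uint32_t secret_index)
{
    uint32_t result = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t mask = ct_eq_mask((uint32_t)i, secret_index);
        result |= table[i] & mask;            /* keeps only the matching entry */
    }
    return result;
}
```

Note the cost: a direct `table[secret_index]` is one load, while this version does `len` loads plus the masking work on every call, which is the kind of efficiency loss mentioned above.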