Also note that your while loop isn't free. Depending on how sparse the input is, it's probably slower than the bit-twiddling version.
The test and branch run on every iteration, and because the outcome depends on the data, the branch is essentially guaranteed to be mispredicted a lot.
Those mispredictions are likely to cost far more than the n cycles the loop theoretically saves.
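
To make the comparison concrete, here's a rough sketch. I'm assuming your loop is something like a clear-the-lowest-set-bit popcount; the function names `popcount_loop` and `popcount_swar` are mine, not from your code, so adapt as needed.

```c
#include <stdint.h>

/* Loop version (assumed shape of the original): one data-dependent
 * test-and-branch per set bit, plus one more for the loop exit. */
static unsigned popcount_loop(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Bit-twiddling (SWAR) version: no branches, so the same instructions
 * execute regardless of the input value. */
static unsigned popcount_swar(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums  */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums  */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums  */
    return (uint32_t)(x * 0x01010101u) >> 24;         /* add the 4 bytes */
}
```

The loop does less arithmetic when few bits are set, but its trip count changes with every input, which is exactly what the branch predictor struggles with. Which one actually wins depends on your data and CPU, so benchmark both with representative input.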