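Well, as long as you use the test-then-CAS approach (do a plain load first and only attempt the CAS when the value looks right) instead of a pure CAS loop, not every loop iteration results in cache line bouncing. Waiters spin on a shared copy of the line in their own cache, and the line only moves when someone actually writes to it.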
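To make the test-then-CAS idea concrete, here is a minimal sketch of a spinlock built that way. The names `TtasLock` and `run_counter_demo` are made up for illustration, and the `yield()` in the spin is just a portable stand-in for whatever pause/backoff hint you'd use on real hardware.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical spinlock illustrating test-then-CAS
// (also known as test-and-test-and-set).
class TtasLock {
    std::atomic<bool> locked_{false};

public:
    void lock() {
        for (;;) {
            // Test: a plain load. While the lock is held, this reads a
            // shared copy of the cache line and does not request
            // exclusive ownership of it.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();  // stand-in for a pause/backoff hint
            }
            // CAS: only attempted once the lock looked free, so the
            // expensive read-for-ownership happens far less often.
            bool expected = false;
            if (locked_.compare_exchange_weak(expected, true,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
                return;
            }
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Hypothetical helper: several threads bump a plain counter under the lock.
long run_counter_demo(int threads, int iters) {
    TtasLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

A pure-CAS version would drop the inner `while` and hammer `compare_exchange_weak` directly, which makes every waiter fight for exclusive ownership of the line on every iteration.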
Plus, Intel has introduced the MWAIT[0] instruction (used together with MONITOR) to implement something similar to a futex in hardware, i.e. the hyperthread can sleep until another core writes to the cache line in question.