On x86, any CAS on a misaligned address that crosses a cache line boundary can either fault, in the best case (if the OS has disabled the misfeature), or cost thousands of clock cycles on all cores. So it "works" only for small values of "works".
That's for crossing a cache line boundary, but a 128-bit CAS doesn't work at all when it's unaligned (cmpxchg16b needs a 16-byte-aligned operand), so you can't do things like swap two pointers, then move down 64 bits and swap two more pointers.
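
To make the two cases concrete, here's a rough sketch of the address checks involved. The helper names are made up for illustration, and the 64-byte line size is an assumption (typical for current x86 parts, but not guaranteed everywhere).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_SIZE 64u

/* Hypothetical helper: true if the operand [addr, addr + size)
 * straddles two cache lines, i.e. a locked op on it is a split lock. */
bool crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE_SIZE);
    return (addr / CACHE_LINE_SIZE) != ((addr + size - 1) / CACHE_LINE_SIZE);
}

/* Hypothetical helper: true if addr meets the 16-byte alignment
 * that a 128-bit CAS (cmpxchg16b) requires. */
bool ok_for_cas128(uintptr_t addr)
{
    return (addr % 16u) == 0;
}
```

With these, an 8-byte CAS at 0x103C crosses a line (slow or faulting), while one at 0x1038 does not. For the pointer-pair case, 0x1000 passes `ok_for_cas128` but 0x1008, the "move down 64 bits" address, fails it even though it sits well inside a single cache line.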