Implementing locks does not require loops of this kind, which can greatly increase overhead; the only loops needed are loops that perform simple loads to detect changes, or an invocation of FUTEX_WAIT, which is equivalent to such a loop.
Besides loops that wait for changes, any kind of lock can be implemented with atomic read-modify-write instructions (e.g. XCHG, LOCK XADD, LOCK BTS and so on on x86, and the equivalent instructions on Armv8.1-A or later ISAs) that are not used in loops, so their overhead is predictable. For example, a futex can be used by a thread that waits for multiple events, if the other threads use a locked bit-test-and-set on the futex variable to signal the occurrence of an event, with each event assigned to a distinct bit.
CMPXCHG and the equivalent load-locked/store-conditional instructions are needed far less often than some people use them. The culprit is a widely quoted research paper showing that these instructions are more universal than simple atomic fetch-and-op instructions, in that they allow the implementation of arbitrary lock-free algorithms; but the fact that they can do more does not mean they should also be used when their extra power is not necessary, because that power is paid for dearly with non-deterministic overhead.
A simple atomic instruction has an overhead much greater than an access to the L1 or L2 cache, but typically similar to that of a simple access to the L3 cache and significantly lower than that of a simple access to main memory, which remains the most expensive operation in modern CPUs.
Moreover, while mutual exclusion can be implemented reasonably efficiently with locks, it too is used far more often than necessary. It is possible to implement shared buffers or message queues that use neither mutual exclusion nor optimistic accesses that may need to be retried (a.k.a. lock-free accesses); instead, they use dynamic partitioning of the shared resource, which allows concurrent accesses without interference.
Organizing the cooperation between threads around such shared buffers or message queues is frequently much better than using mutual exclusion, which stalls all contending threads and serializes their execution, and also much better than lock-free access, which may need an unpredictable number of retries when contention is high.