It should also be hardware-assisted (if it's impossible to stop it at the hardware level). Atomic counters were a good first step, but the state of the art there hasn't moved in a long time. We don't have atomic 512-bit update operations, do we? If we did, a lot of kernel data structures could be updated atomically, and that alone would fix a lot of contention problems.
The hardware could also work in a CRDT/OT-like manner: accumulate update requests for a memory location in small queues that get flushed after 10 nanoseconds or when the mini queue fills up. This could help with a good chunk of the buggy scenarios as well.
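To make the queue idea a bit more concrete, here is a toy software model of it. This is only a sketch: the class name `CombiningQueue`, the capacity of 4, and the simulated clock are my own assumptions, not how any real CPU works. It also only handles additive updates, which commute, so that is the easy CRDT-style case.

```python
from dataclasses import dataclass, field


@dataclass
class CombiningQueue:
    """Toy model of a per-address update queue (all names are hypothetical)."""
    capacity: int = 4      # assumed queue size: flush once this many updates are pending
    window_ns: int = 10    # flush once the oldest pending update is this old
    value: int = 0         # stands in for the memory location
    commits: int = 0       # number of times the location was actually written
    pending: list = field(default_factory=list)
    oldest_ns: int = 0     # simulated arrival time of the oldest pending update

    def submit(self, delta: int, now_ns: int) -> None:
        self.tick(now_ns)                  # first expire anything that timed out
        if not self.pending:
            self.oldest_ns = now_ns
        self.pending.append(delta)
        if len(self.pending) >= self.capacity:
            self.flush()

    def tick(self, now_ns: int) -> None:
        # Time-based flush: the oldest pending update has waited long enough.
        if self.pending and now_ns - self.oldest_ns >= self.window_ns:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            # Additions commute, so the order of the queued updates is irrelevant.
            self.value += sum(self.pending)
            self.commits += 1
            self.pending.clear()


def demo() -> CombiningQueue:
    q = CombiningQueue()
    for t in range(8):         # eight increments arriving 1 ns apart
        q.submit(1, now_ns=t)
    q.submit(5, now_ns=20)     # a lone straggler
    q.tick(now_ns=30)          # its window expires, so it gets flushed
    return q
```

In `demo()`, nine update requests end up as three writes to the location: two flushes because the queue filled up and one because the window expired. Non-commutative updates (plain stores, compare-and-swap) would need real merge rules, which is where the hard part of the CRDT/OT analogy lives.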
Not saying we can make it perfect tomorrow -- of course we can't. What I am saying is that nobody [who can truly make a difference] is even trying.
It gets a bit disappointing and grim after you have been in the profession for a while, you know?