You are correct, but for simplicity we can assume that the CPU executes one operation at a time and still have a valid model for reasoning about data races.
> But ultimately, all inter-thread communication boils down to programs (or libraries) using barriers for acquire/release semantics and/or using compare/swap and atomic read-modify-write.
Interesting. I read about STM and it reminded me of "modification box" algorithm for modifying immutable trees, which also uses versions. So it is another clever but complicated trick. Also it seems that it requires locking? That's not very good given that locks often need syscalls.