> not marginally higher cost of atomic inc/dec vs plain inc/dec.
Note that the difference is not so marginal, and it's not just in the hardware instructions: non-atomic operations generally leave the compiler free to apply more optimizations.
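To make the compiler point concrete, here's a minimal sketch in C++. The function names `bump_plain` and `bump_atomic` are made up for illustration, and the codegen notes in the comments describe what mainstream compilers (GCC, Clang) typically do at `-O2` on x86-64, not something the standard guarantees.

```cpp
#include <atomic>
#include <cstdint>

// Plain counter: the optimizer can see the whole loop and is free to
// collapse it into a single `counter + n`, with no loop left at all.
std::uint64_t bump_plain(std::uint64_t counter, std::uint64_t n) {
    for (std::uint64_t i = 0; i < n; ++i) {
        ++counter;
    }
    return counter;
}

// Atomic counter: even with relaxed ordering, current compilers typically
// keep every iteration as its own read-modify-write (a `lock`-prefixed
// instruction on x86-64) instead of merging them into one add.
std::uint64_t bump_atomic(std::atomic<std::uint64_t>& counter, std::uint64_t n) {
    for (std::uint64_t i = 0; i < n; ++i) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }
    return counter.load(std::memory_order_relaxed);
}
```

Both functions compute the same result; the difference only shows up in the generated code. The same thing happens at smaller scale with refcounts: a plain inc/dec pair the compiler can prove is redundant may just disappear, while the atomic pair usually stays.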
The actual intrinsic is something like 8-9 cycles on Zen 4 or Ice Lake (vs. 1 for a plain add). That matters if you're banging on it in a hot loop, but otherwise it's not a ton. (If refcounting is hot in your design, your design is bad.)
It's comparable to, like, two integer multiplies or a single integer division. Yes, there is some effect on program order.