In particular, x86's total store ordering (TSO) memory model causes some memory fences to disappear at the machine-code level: orderings that TSO already guarantees need no fence instruction, so the compiler emits none. AArch64's relaxed memory model allows for lower cache-synchronization overhead, but code with correct memory fences loses this information once it is compiled to x86. Binary translators targeting AArch64 therefore have to either translate overly conservatively or run in a higher-overhead TSO mode. These days, hardware memory fences tend to come in acquire/release/full flavors, which better match the C++ and Java memory models, but some hardware instead has load/store/full flavors. Binary translation across these flavors means either changing all fences to full fences or relying on static analysis that's far beyond anything I'm aware of existing at this time.
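To make the "disappearing fence" concrete, here is a minimal C++ sketch of release/acquire message passing. The names `payload`, `ready`, `producer`, `consumer`, and `run_message_passing` are hypothetical and chosen only for illustration, and the instruction names in the comments describe what mainstream compilers typically emit, not a guarantee for every toolchain.

```cpp
#include <atomic>
#include <thread>

// Hypothetical shared state for illustration only.
int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the data

void producer() {
    payload = 42;
    // Release store: earlier writes may not be reordered after it.
    // Typical x86-64 codegen: a plain `mov` (TSO already orders stores).
    // Typical AArch64 codegen: `stlr`, an explicit store-release.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire load: later reads may not be reordered before it.
    // Typical x86-64 codegen: a plain `mov`.
    // Typical AArch64 codegen: `ldar` (or `ldapr` where available).
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the producer publishes
    }
    return payload;  // sees 42 via the release/acquire pairing
}

// Runs one producer and one consumer and returns what the consumer read.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = 0;
    std::thread writer(producer);
    std::thread reader([&seen] { seen = consumer(); });
    writer.join();
    reader.join();
    return seen;
}
```

On x86-64, the release store and the acquire load usually look exactly like relaxed accesses in the generated machine code, so a translator that sees only the x86 binary cannot tell which of the many plain `mov` instructions carried ordering intent and must treat all of them as ordered.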