The reason those memory operations can be observed in an order different from the one in the machine code is precisely that the processor executes them in parallel and potentially out of order. On x86, the hardware does magic to hide this in almost all cases (the one reordering it allows is a later load passing an earlier store to a different address). ARM puts the responsibility on the programmer.
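To make "the responsibility is on the programmer" concrete, here is a minimal sketch of the classic message-passing pattern in portable C11. The names `payload`, `ready`, `publish` and `try_consume` are made up for illustration. With typical compilers the release store and acquire load become `stlr`/`ldar` on AArch64, while on x86 they are plain `mov` instructions because the hardware already gives you that ordering.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state: a payload plus a flag that announces it. */
static int payload;
static atomic_bool ready;

/* Producer side: write the data, then publish the flag with release
 * ordering so the payload store cannot become visible after the flag. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: the acquire load of the flag keeps the payload read
 * from being satisfied before the flag is seen as set.
 * Returns true and fills *out only if data has been published. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

If both accesses to `ready` were relaxed instead, an ARM core running `try_consume` would be allowed to see the flag set and still read a stale `payload`.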
But all that stuff is specified (even if it's hard to reason about). What's happening here is outside the specification: something about how that cache invalidate and barrier interact can be messed up by an interrupt. But we don't know what it is, because it seems like ARM didn't tell anyone.
Basically, as I see it, any OS author writing interrupt entry code on ARM64 (I work on Zephyr, though not on the ARM port) needs to put a barrier instruction on the entry path for safety, because at least some hardware misbehaves without it. That said, almost all real OSes are going to have one anyway for locking purposes (i.e. you have to take a spinlock to interact with OS state somewhere, and that requires a barrier on SMP ARM systems). It's likely that this Nintendo sequence is part of some kind of micro-optimized path and not a general-purpose ISR.
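A rough sketch of what I mean, in portable C11 rather than real kernel code. Everything here (`isr_entry`, `os_lock`, `os_tick_count`, `os_ticks`) is hypothetical and not taken from Zephyr or any other OS. The big assumption is that a full fence is the right kind of barrier for the entry path: since the hardware behavior is undocumented, I can't say whether the fix needs a `dmb`, a `dsb` or something else. A sequentially consistent fence typically compiles to `dmb ish` on AArch64.

```c
#include <stdatomic.h>

/* Hypothetical kernel state guarded by a hypothetical spinlock. */
static atomic_flag os_lock = ATOMIC_FLAG_INIT;
static unsigned long os_tick_count;

static void os_lock_acquire(void)
{
    /* Acquire ordering: accesses inside the critical section cannot
     * move ahead of the point where the lock is taken. */
    while (atomic_flag_test_and_set_explicit(&os_lock, memory_order_acquire))
        ; /* spin until the holder releases */
}

static void os_lock_release(void)
{
    /* Release ordering: accesses inside the critical section become
     * visible before the lock appears free. */
    atomic_flag_clear_explicit(&os_lock, memory_order_release);
}

/* Hypothetical interrupt entry hook. */
void isr_entry(void)
{
    /* Defensive full fence on entry. Assumption: a fence like this is
     * what the hardware needs; the exact requirement is undocumented. */
    atomic_thread_fence(memory_order_seq_cst);

    os_lock_acquire();
    os_tick_count++;
    os_lock_release();
}

/* Read the counter under the same lock. */
unsigned long os_ticks(void)
{
    os_lock_acquire();
    unsigned long n = os_tick_count;
    os_lock_release();
    return n;
}
```

The point is that the lock acquire already carries ordering semantics of its own, so a general-purpose ISR that touches shared OS state ends up with a barrier on its path whether or not the author added the explicit fence.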