Knowing the Tegra chip in question, I'd bet it's probably not ARM's fault. Tegra X1, unlike pretty much every other SoC, did big.LITTLE via cluster migration with a custom cache coherence system, instead of just exposing a bunch of heterogeneous cores. It turns out that their custom cache coherence was unfixably broken and would randomly corrupt memory when migrating between the big and little cores, so everyone was forced to just entirely disable one set of cores. At least NVIDIA managed to fix this one in software?
You are most likely right. I would expect barriers to be needed unconditionally. This patch is likely working around some hole in the SoC where the usual barrier is not sufficient for some instructions and some additional ritual needs to be performed.