The post says, about SIMT / GPU programming, "This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it – similarly to any other processor."
I would say that for SIMD the situation is basically the same. Gather/scatter instructions don't magically make the memory hierarchy a non-issue, but with them the instruction set is no longer adding any unnecessary pain on top.
Barrel-threaded machines like GPUs have an easier time hiding the latency of bank conflict resolution when gathering/scattering against local memory/cache than a machine running a single instruction thread. So I'm pretty sure they have a fundamental advantage in scatter/gather throughput, and that the advantage gets bigger with a larger number of vector lanes.
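
To make the bank-conflict point a bit more concrete, here is a rough sketch of a toy cost model. It is not any real GPU's or CPU's behavior. The function name `gather_cycles` and the model itself are assumptions for illustration: word address `a` lives in bank `a % num_banks`, each bank serves one distinct word per cycle, and lanes that read the same word share a single access.

```python
from collections import defaultdict

def gather_cycles(addresses, num_banks):
    """Cycles to service one gather under a toy banked-memory model (hypothetical)."""
    # Assumption: word address a maps to bank a % num_banks.
    words_per_bank = defaultdict(set)
    for a in addresses:
        # Lanes hitting the same word in a bank are merged into one access.
        words_per_bank[a % num_banks].add(a)
    # Banks work in parallel, so the busiest bank decides the total cycle count.
    return max((len(words) for words in words_per_bank.values()), default=0)

lanes = 32
unit_stride = list(range(lanes))            # every lane hits a different bank
stride_32 = [i * 32 for i in range(lanes)]  # every lane hits bank 0

print(gather_cycles(unit_stride, 32))  # 1
print(gather_cycles(stride_32, 32))    # 32
```

In this model the worst case costs as many cycles as there are lanes, so the potential stall grows with vector width. A single instruction thread has to sit through those extra cycles, while a barrel-threaded machine can, in principle, issue instructions from other threads in the meantime.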