So they attacked the italicized portion and simplified the hardware, mostly by eliminating memory-layer non-determinism, using time-synchronized global memory instructions as part of the ISA(?).
This apparently reduced the compiler problem to something manageable (but no doubt still "fun")... and voilà, performance.
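
To make the "manageable compiler problem" bit concrete, here's a toy sketch of my own (not their actual compiler or ISA). The assumption is that every operation, memory access included, has a fixed latency known at compile time. The op names and cycle counts in `LATENCY` are made up for illustration. With that assumption the compiler can compute the exact cycle at which every result is ready, so the hardware needs no runtime stall or cache-miss handling.

```python
# Toy static scheduler. Assumption: every latency is a compile-time constant.
# The op kinds and cycle counts below are hypothetical, for illustration only.
LATENCY = {"load": 4, "mul": 2, "add": 1, "store": 4}


def schedule(ops):
    """Assign an issue cycle to each op, one op issued per cycle.

    ops: list of (name, kind, deps) tuples, listed in dependency order.
    Returns {name: (issue_cycle, done_cycle)}.
    """
    times = {}
    next_free = 0  # earliest cycle at which the single issue slot is free
    for name, kind, deps in ops:
        # Inputs are ready at a known cycle because latencies never vary.
        ready = max((times[d][1] for d in deps), default=0)
        issue = max(ready, next_free)
        times[name] = (issue, issue + LATENCY[kind])
        next_free = issue + 1
    return times


program = [
    ("a", "load", []),
    ("b", "load", []),
    ("c", "mul", ["a", "b"]),
    ("d", "add", ["c", "a"]),
    ("e", "store", ["d"]),
]

plan = schedule(program)
for name, (issue, done) in plan.items():
    print(f"{name}: issue at cycle {issue}, done at cycle {done}")
```

If a load could take anywhere from 4 to 400 cycles (cache hit vs. miss), none of these numbers could be fixed ahead of time, and the hardware would need scoreboards, reorder logic and so on to cope. A real compiler would also have to juggle many functional units, data movement and placement, which is presumably where the "fun" comes in.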