As an ML infra guy I have had to debug a lot of failing jobs over the years, and randomized data pipelines are among the hardest. Sometimes there will be a "record-of-death" that randomly gets shuffled into a batch, but it only causes problems when it is (extremely rarely) co-batched with a few other specific records.
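The usual trick for making that kind of failure debuggable is to derive the shuffle from a logged seed instead of global RNG state, so the exact batch composition can be replayed. A minimal sketch (the helper names `batches` and `batch_containing` are my own, not from any particular library):

```python
import random

def batches(record_ids, batch_size, seed):
    """Deterministically shuffle record ids into batches.

    Logging the seed makes a "record-of-death" failure reproducible:
    rerunning with the same seed yields the same batch composition,
    so you can bisect which co-occurring records trigger the bug.
    """
    ids = list(record_ids)
    # Seeded private RNG: independent of global random state.
    random.Random(seed).shuffle(ids)
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

def batch_containing(record_ids, batch_size, seed, suspect):
    """Replay the shuffle and return the batch a suspect record landed in."""
    for b in batches(record_ids, batch_size, seed):
        if suspect in b:
            return b
    return None
```

From there you can rerun just the offending batch (or pairs of records within it) in isolation rather than re-running the whole job and hoping the shuffle recurs.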
I guess I'll just have to update my priors and accept that inline synchronous randomization with random reads is a useful enough access pattern in HPC that it should be optimized for. It's certainly a lot more work and complexity, hence my question of just how necessary it is.