Consider that most of us can read text at a normal, unhurried speed of 200wpm to 300wpm. As we read, we also hear an "inner voice" sounding out the written words. This subvocalization is therefore running at up to ~300wpm in our head.
However, many people speak out loud at only ~100wpm. For many of us, accelerating speech from 1x to 3x simply makes the speaker sound out the words at 300wpm. Since that's the same rate as our subvocalization, the meaning is not hard to follow. Most humans can't move their mouths fast enough to talk at 300wpm, but with digital technology they don't have to.
Spoken recordings at 100wpm are mind-numbingly slow, and they just make my mind wander.
I think the authors should have surveyed a hundred YouTube users who always take advantage of the 2x speed option. They would have been surprised to learn that people can follow the meaning of the words at 200wpm very easily.