Anyone who has commuted on public transport probably knows this intuitively. (Using a kick scooter instead of walking cut my travel time by a good 5%, which was excellent given that I still needed to be on a bus, where the scooter made no difference.)
For example, some stats from Whisper [0] (audio transcription, 30 seconds) show the following for the medium model (see the other models in the link):
---
GPU medium fp32 Linear 1.7s
CPU medium fp32 nn.Linear 60.7s
CPU medium qint8 (quant) nn.Linear 23.1s
---
So the same fp32 model runs 35.7 times faster on the GPU than on the CPU, and the GPU is still 13.6 times faster than the "optimized" (qint8-quantized) CPU model.
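For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes those ratios from the three timings in the table above. The dictionary keys and the `speedup` helper are my own naming, not anything from the linked repo.

```python
# Timings in seconds, copied from the table above (Whisper medium model).
# The key names are made up for this sketch.
timings = {
    "gpu_fp32": 1.7,
    "cpu_fp32": 60.7,
    "cpu_qint8": 23.1,
}


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times faster the fast run is than the slow run."""
    return slow_seconds / fast_seconds


gpu_vs_cpu_fp32 = speedup(timings["cpu_fp32"], timings["gpu_fp32"])
gpu_vs_cpu_qint8 = speedup(timings["cpu_qint8"], timings["gpu_fp32"])

print(f"GPU vs CPU fp32:  {gpu_vs_cpu_fp32:.1f}x")   # 35.7x
print(f"GPU vs CPU qint8: {gpu_vs_cpu_qint8:.1f}x")  # 13.6x
```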
I was expecting around an order of magnitude of improvement.
Then again, I do not know whether, in the case of this article, the entire model was on the GPU or just a fraction of it (22 layers) with the remainder on the CPU, which might explain the result. Apparently that's the case, but I don't know much about this stuff.
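To put a rough number on the partial-offload idea, here is a small sketch of Amdahl's law in Python. It is only a back-of-the-envelope model: it assumes the offloaded fraction is measured in share of runtime, borrows the 35.7x figure from the Whisper numbers as the speedup of the accelerated part, and ignores transfer overhead between CPU and GPU. The fractions in the loop are hypothetical, not taken from the article.

```python
def amdahl_speedup(accelerated_fraction: float, part_speedup: float) -> float:
    """Overall speedup when only a fraction of the runtime is accelerated.

    accelerated_fraction: share of the original runtime that gets faster (0..1).
    part_speedup: how many times faster that share becomes.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if part_speedup <= 0.0:
        raise ValueError("part_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / part_speedup)


# Hypothetical offload fractions; 35.7 is the GPU/CPU ratio from the Whisper table.
for fraction in (0.25, 0.5, 0.75, 1.0):
    overall = amdahl_speedup(fraction, 35.7)
    print(f"{fraction:.0%} of runtime on GPU -> {overall:.2f}x overall")
```

Under these assumptions, even with half the runtime on the GPU the overall speedup stays below 2x, which is the scooter-and-bus situation all over again.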
[0] https://github.com/MiscellaneousStuff/openai-whisper-cpu