Except a typical desktop system is usually a mix of low-latency and high-latency audio streams. You're playing music, and you're typing on a 'clacky' virtual keyboard. The user doesn't want 100 ms of lag after each finger tap before they hear the audible feedback. Yet when no typing is happening, the CPU doesn't want to be waking up 10x per second just to fill audio buffers.
The solution is to fill a 5-minute buffer with 5 minutes of your MP3 and send the CPU to sleep. Then, if the user taps the keyboard, rewind that buffer to just ahead of the current playback position, mix in the 'clack' sound effect, and continue.
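
To make the rewind idea concrete, here's a toy sketch in Python. It isn't any real audio server's API: `RewindableBuffer`, `advance`, `mix_in` and the `margin` parameter are all names made up for illustration, and samples are assumed to be signed 16-bit integers in a plain list. It only shows the core trick of rewriting samples that are already queued but not yet played.

```python
from typing import List


class RewindableBuffer:
    """Toy model of a big playback buffer that can be rewritten ahead of the play cursor."""

    def __init__(self, samples: List[int]) -> None:
        self.samples = list(samples)  # pre-filled audio, e.g. decoded music
        self.play_pos = 0             # index the (simulated) hardware reads next

    def advance(self, n: int) -> List[int]:
        """Simulate the hardware consuming up to n samples."""
        out = self.samples[self.play_pos:self.play_pos + n]
        self.play_pos += len(out)
        return out

    def mix_in(self, effect: List[int], margin: int = 0) -> int:
        """Rewind to `margin` samples ahead of the play cursor and mix `effect`
        into the samples already queued there. Returns the start index."""
        start = self.play_pos + margin
        for i, s in enumerate(effect):
            idx = start + i
            if idx >= len(self.samples):
                break  # effect runs past the end of the queued audio
            mixed = self.samples[idx] + s
            # Clamp to the signed 16-bit range instead of wrapping around.
            self.samples[idx] = max(-32768, min(32767, mixed))
        return start


def demo() -> List[int]:
    buf = RewindableBuffer([100] * 10)   # "music" queued far ahead
    buf.advance(4)                       # hardware has played 4 samples
    buf.mix_in([50, 50, 50], margin=1)   # key tap: overwrite just ahead
    return buf.samples
```

The `margin` stands in for the small safety gap you'd need in practice so you don't rewrite samples the hardware is about to fetch. The latency of the 'clack' is then roughly that margin, not the length of the whole buffer.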