I'm surprised an ESP32 is powerful enough for this task. I guess I underestimated it (or I overestimated the power needed to decode common audio formats in real time :)