Tone, emphasis, speed, and accent are all very important parts of how humans communicate verbally.
Before today, voice mode was strictly audio>text, then text>audio. All of that information was destroyed along the way.
Now the same model takes in audio tokens and spits back out audio tokens directly.
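To make the difference concrete, here is a toy sketch in Python. Nothing in it is a real API: `Utterance`, `Heard`, `cascaded_pipeline`, and `end_to_end` are made-up names, and the plain string fields stand in for information a real model would carry in audio tokens. It only illustrates where the old setup throws information away.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    """Hypothetical stand-in for a spoken input."""
    words: str
    tone: str
    speed: str
    accent: str


@dataclass
class Heard:
    """What the responding model actually gets to see."""
    words: str
    tone: Optional[str] = None
    speed: Optional[str] = None
    accent: Optional[str] = None


def cascaded_pipeline(u: Utterance) -> Heard:
    # Old setup: audio > text first. A transcript keeps only the words,
    # so everything else is gone before the model ever runs.
    transcript = u.words
    return Heard(words=transcript)


def end_to_end(u: Utterance) -> Heard:
    # New setup (toy version): the model consumes the audio itself,
    # so tone, speed, and accent are still available to it.
    return Heard(words=u.words, tone=u.tone, speed=u.speed, accent=u.accent)


spoken = Utterance("oh great, another meeting", tone="sarcastic", speed="slow", accent="Scottish")
print(cascaded_pipeline(spoken))
print(end_to_end(spoken))
```

Both paths get the same words, but only the second one can tell the speaker was being sarcastic.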
Watch this demo. It's the best example of the kind of thing that would be flat-out impossible with the previous setup.
https://www.youtube.com/live/DQacCB9tDaw?si=2LzQwlS8FHfot7Jy