When GPT-2/3/3.5/4 came out, it was fairly easy to see the progression just from reading model outputs: each one was getting better and better at text. That was pretty amazing, but in a very intellectual way, since reading is typically a very "intellectual," "front-brain" type of activity.
But this voice stuff really does make it much more emotional. I don't know about you, but the first time I used GPT's voice mode I noticed that I felt something, in a very un-intellectual, un-cerebral way: the feeling that there is a spirit embodying the computer. Of course, with LLMs there always is a spirit embodying the computer (or there never is, depending on your philosophical beliefs).
The Suno demos that popped up recently should have clued us all in that this kind of emotional range was possible with these models. This announcement is not so much a step function in model capabilities as it is a step function in HCI. People are just not used to their interactions with a computer being emotional like this. I'm excited and concerned in equal parts; many people won't be truly prepared for what is coming. An AI companion that really, truly makes you feel things is on the horizon.
Us nerds who habitually read text have had that since roughly GPT-3, but now the door has been blown open.