You're right. Ultimately the only way this will really work is as an end-to-end model. Text will only get you so far. We could approximate it now with screenplay-like emotion annotations on text, which LLMs should both easily understand and be able to produce themselves (though you'd have to train a new speech recognition system to produce them). But end-to-end will be required eventually to reach human level fluency.