Then you take that and prompt the model to add inflection, pacing, and so on to the text to reflect it. You feed the result into the speech model.
It seems like the model could definitely do the first part (“based on this text, this character might be feeling X”); the second part (“mark up the dialogue”) seems easier; and the third part, the speech itself, seems doable already based on another comment.
So we are pretty close already? Whatever actors are doing can be approximated through prompting, including the director iterating with the “actors”.
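
For what it's worth, here is a rough sketch of those three steps in Python. Everything in it is made up for illustration: `infer_emotion`, `mark_up`, `perform`, the `PROSODY` table, and the `llm`/`tts` callables are hypothetical stand-ins, not any real API. It also assumes the speech model accepts SSML, and it swaps the second prompt for a fixed lookup table just to keep the example runnable.

```python
from typing import Callable, Dict
from xml.sax.saxutils import escape

# Hypothetical stand-ins: a language model (prompt in, text out)
# and a speech model (SSML in, audio bytes out).
LLM = Callable[[str], str]
TTS = Callable[[str], bytes]

# Assumed mapping from an emotion label to SSML prosody settings.
# A real system would likely prompt the model for this markup instead.
PROSODY: Dict[str, Dict[str, str]] = {
    "angry": {"rate": "fast", "pitch": "high"},
    "sad": {"rate": "slow", "pitch": "low"},
    "neutral": {"rate": "medium", "pitch": "medium"},
}


def infer_emotion(llm: LLM, context: str, line: str) -> str:
    """Step 1: ask the model what the character is probably feeling."""
    prompt = (
        f"Scene so far:\n{context}\n\n"
        f"Line: {line}\n"
        f"Answer with one word from: {', '.join(PROSODY)}."
    )
    label = llm(prompt).strip().lower()
    # Fall back to neutral if the model answers outside the known labels.
    return label if label in PROSODY else "neutral"


def mark_up(line: str, emotion: str) -> str:
    """Step 2: wrap the line in SSML prosody tags that reflect the emotion."""
    p = PROSODY[emotion]
    return (
        f'<speak><prosody rate="{p["rate"]}" pitch="{p["pitch"]}">'
        f"{escape(line)}</prosody></speak>"
    )


def perform(llm: LLM, tts: TTS, context: str, line: str) -> bytes:
    """Step 3: hand the marked-up line to the speech model."""
    emotion = infer_emotion(llm, context, line)
    return tts(mark_up(line, emotion))
```

The "director" part would then be a loop around `perform`: listen to the take, adjust the prompt or the markup, and try again.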