With LLMs proving to be very good at generating code, it may be reasonable to assume they can get good at generating SSML as well.
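For anyone who hasn't seen SSML: it is just XML, so the output such a model would need to produce looks a lot like code. Here is a minimal sketch of the kind of target markup, built by hand in Python. The `annotate` helper, the segment tuples, and the specific prosody values are all made up for illustration; only the element names (`speak`, `prosody`, `s`, `break`) come from the W3C SSML spec, and individual TTS engines may support only a subset of them.

```python
import xml.etree.ElementTree as ET


def annotate(segments):
    """Build an SSML string from (text, prosody_attrs, pause_ms) tuples.

    Hypothetical helper: in the scenario above, an LLM would be choosing
    the prosody attributes and pauses instead of a human.
    """
    speak = ET.Element("speak")
    for text, prosody, pause_ms in segments:
        # Wrap the text in <prosody> when attributes are given, else a plain <s>.
        if prosody:
            node = ET.SubElement(speak, "prosody", prosody)
        else:
            node = ET.SubElement(speak, "s")
        node.text = text
        # Optional pause after the segment.
        if pause_ms:
            ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


ssml = annotate([
    ("The house was silent.", {"rate": "slow", "pitch": "low"}, 600),
    ("Then the door flew open!", {"rate": "fast", "volume": "loud"}, 0),
])
print(ssml)
```

The hard part is obviously not emitting the tags but deciding which values fit the passage, which is exactly the interpretation step in question.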
I'm not sure whether there is a more direct way to carry the tone, context, emotion, etc. of the prose over into the qualities of the generated voice.
If we trained some models on ebooks paired with their professionally produced, human-narrated audiobooks, then with enough variety and volume of training data the models might capture the essence of that human interpretation of written text. Just maybe?
Amazon, with its huge combined Audible and Kindle catalog, has an enormous corpus for this, provided it can use it without violating any rights. They already have "Whispersync", a feature that syncs the text of a Kindle ebook with the words in the corresponding Audible audiobook.