That's a bit basic and random. Some models have the features you describe. With the better models you get a slightly different voice for text in quotes.
But what sets good audio books apart is that you have
* different voices for the narrator and each character
* different emotions and/or speed in certain situations.
I guess you could use an LLM to "understand" and annotate an existing book, if there's a suitable markup, and then use TTS to create an audio book from the annotated text and so automate most of the process.
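
To make the idea a bit more concrete, here's a rough sketch of the second half of that pipeline. It assumes the LLM has already split the book into segments tagged with a speaker and a speaking rate, and it turns them into SSML, which is one markup many TTS engines accept in some dialect. The `Segment` class, the `to_ssml` function, and the voice names are made up for illustration; real voice names depend on the engine, and standard SSML only covers speed here, since emotion tags are vendor-specific.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Segment:
    speaker: str          # "narrator" or a character name, as tagged by the LLM
    text: str
    rate: str = "medium"  # SSML prosody rate: x-slow, slow, medium, fast, x-fast


# Hypothetical speaker -> voice mapping; actual voice names are engine-specific.
VOICES = {
    "narrator": "voice-narrator",
    "alice": "voice-a",
    "bob": "voice-b",
}


def to_ssml(segments):
    """Wrap each segment in <voice> and <prosody> so the TTS switches voice and speed."""
    parts = ["<speak>"]
    for seg in segments:
        # Unknown speakers fall back to the narrator voice.
        voice = VOICES.get(seg.speaker, VOICES["narrator"])
        parts.append(
            f"<voice name={quoteattr(voice)}>"
            f"<prosody rate={quoteattr(seg.rate)}>{escape(seg.text)}</prosody>"
            "</voice>"
        )
    parts.append("</speak>")
    return "\n".join(parts)


if __name__ == "__main__":
    book = [
        Segment("narrator", "She leaned closer and whispered:"),
        Segment("alice", "Run & hide!", rate="fast"),
    ]
    print(to_ssml(book))
```

The hard part is still the annotation step (who is speaking, in what mood), which is where the LLM would come in; the conversion to markup is the easy bit.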