I think the point is that different parts of the story need different intonation patterns (reading a scary part vs a boring part, etc.).
So in theory, this could be achieved with multiple training sets (one per intonation style), combined with an analysis of the text that decides which passage needs which intonation. You might even be able to blend intonations.
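
To make the idea a bit more concrete, here is a rough sketch of the "analyze the text, then blend styles" part. Everything in it is made up for illustration: the style names, the three-dimensional `STYLE_VECTORS` (stand-ins for whatever a model trained on each style's data set would actually learn), and the keyword lists that play the role of real text analysis. It only shows how per-sentence weights could drive a weighted blend; it is not a working TTS system.

```python
import re

# Hypothetical style vectors. In a real system these would be learned
# from separate training sets, one per intonation style.
STYLE_VECTORS = {
    "neutral": [1.0, 0.0, 0.0],
    "scary": [0.0, 1.0, 0.0],
    "excited": [0.0, 0.0, 1.0],
}

# Toy keyword lists standing in for real text analysis (assumption).
STYLE_KEYWORDS = {
    "scary": {"dark", "scream", "shadow", "creaked"},
    "excited": {"hooray", "won", "amazing"},
}


def style_weights(sentence):
    """Return a weight per style that sums to 1 for one sentence."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    counts = {style: len(words & kws) for style, kws in STYLE_KEYWORDS.items()}
    counts["neutral"] = 1  # always keep a neutral baseline
    total = sum(counts.values())
    return {style: count / total for style, count in counts.items()}


def blend(weights):
    """Blend the style vectors as a weighted average."""
    dim = len(next(iter(STYLE_VECTORS.values())))
    return [
        sum(weights[style] * STYLE_VECTORS[style][i] for style in weights)
        for i in range(dim)
    ]


def plan_intonation(text):
    """Split text into sentences and attach a blended style vector to each."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(s, blend(style_weights(s))) for s in sentences]


if __name__ == "__main__":
    story = "It was a quiet day. Then the door creaked in the dark!"
    for sentence, vector in plan_intonation(story):
        print(sentence, vector)
```

A sentence with no matching keywords stays fully neutral, while one with scary keywords gets a mix of neutral and scary, which is the "blend intonations" idea in its simplest form. A real system would presumably replace the keyword matching with a proper classifier and feed the blended vector into the synthesizer as conditioning.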