Oh yeah, the annotations are lacking compared to images. Again from the academic side, I think one solution could be to recruit theater majors who are just learning about 'verbing their lines' and set up a collaboration between CS and Theater to produce a proof-of-concept dataset (since an acting class won't have more than 20-30 students in it). You'd need significantly more annotations, but you'd now have some labels to ascribe to texts with context, since it's a dialogue involving one or more individuals.
I wonder how theatre students will feel about helping to train an AI to produce theatrical TTS? Artists seem pretty mad about their work being used to automate artwork.