A big problem with generating prosody has always been that our theories of it don't predict people's actual behaviour very well. It's also very expensive to get people to annotate prosody accurately, whichever theory you pick.
Predicting the raw audio directly sidesteps both problems. The "theory" of prosody can be left latent, rather than specified explicitly.