Aren't we talking about the auditory quality of the generated vocals? I don't understand how the textual training data could possibly affect the perceived vocal strain (which is really just an artifact) in the generated vocals.
Don't they have models that do text-to-speech, and maybe even audio/speech-to-text? If so, there must be text in the datasets; otherwise I'm not sure how they'd accomplish something like that.