Aren't we talking about the auditory quality of the generated vocals? I don't understand how the textual training data could possibly affect the perceived vocal strain (which is really just an artifact) in the generated vocals.
Don't they have models that do text-to-speech, and maybe even audio/speech-to-text? If so, there must be text in the datasets; otherwise I'm not sure how they'd accomplish something like that.