But then I tried a paragraph of Japanese text, also from a formal speech, with the language set to Japanese and the narrator set to Yumiko Narrative. The result was a weird mixture of Korean, Chinese, and Japanese readings for the kanji and kana, all with a Korean accent, and numbers read in English with an American accent. I regenerated the output twice, and the results were similar. Completely unusable.
I tried the same paragraph on ElevenLabs. The output was all in Japanese and had natural intonation, but there were two or three misreadings per sentence that would render it unusable for any practical purpose. Examples: 私の生の声 was read as watashi no koe no koe when it should have been watashi no nama no koe. 公開形式 was read as kōkai keiji instead of kōkai keishiki. Neither kanji misreading would be correct in any context. Even weirder, the year 2020 was read as 2021. Such misreadings would confuse and mislead any listeners.
I know that Japanese text-to-speech is especially challenging because kanji can often be read many different ways depending on the context, the specific referent, and other factors. But based on these tests, neither PlayAI nor ElevenLabs should be offering Japanese TTS services commercially yet.
Speech encodes a gigantic amount of emotion via prosody and rhythm -- how the speaker is feeling, how they feel about each noun and verb, what they're trying to communicate with it.
If you try to reproduce all the normal speech prosody, it'll be all over the place, SoUnD bIzArRe, won't make any sense, and will be incredibly distracting, because there's no coherent psychology behind it.
So "reading off a teleprompter" is really the best we can do for now -- not necessarily affectless, but with a kind of "constant affect" that varies with grammatical structures and other language patterns, but no real human psychology.
It's a gigantic difference from text, which encodes vastly less information.
(And this is one of the reasons I don't see AI replacing actors for a looong time, not even voice actors. You can map a voice onto someone else's voice preserving their prosody, but you still need a skilled human being producing the prosody in the first place.)
Then you take that emotional analysis and prompt the model to mark up the text with inflection and pacing cues that reflect it. You feed the marked-up text into the speech model.
It seems like it could definitely do the first part (“based on this text, this character might be feeling X”); the second part (“mark up the dialogue”) seems easier; the third part about speech seems doable already based on another comment.
So we are pretty close already? Whatever actors are doing can be approximated through prompting, including the director iterating with the “actors”.
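The three-stage pipeline discussed above (infer a character's emotion, mark up the dialogue, hand it to a speech model) can be sketched roughly as follows. This is purely illustrative: the `infer_emotion` stub stands in for an LLM call, and the emotion-tag markup format is invented here, not any particular TTS engine's syntax.

```python
# Hypothetical sketch of the pipeline: context -> emotion -> marked-up dialogue.
# infer_emotion is a stub for an LLM call ("based on this text, the character
# might be feeling X"); the <emotion> markup is an invented placeholder format.

def infer_emotion(context: str) -> str:
    """Stub for an LLM inferring affect from surrounding narration."""
    cues = {"shouted": "angry", "whispered": "fearful", "laughed": "amused"}
    for cue, emotion in cues.items():
        if cue in context.lower():
            return emotion
    return "neutral"

def mark_up(dialogue: str, emotion: str) -> str:
    """Wrap dialogue in a simple SSML-like emotion tag for the speech stage."""
    return f'<emotion name="{emotion}">{dialogue}</emotion>'

def render_line(context: str, dialogue: str) -> str:
    """The final string is what you'd pass to a TTS model that honors the tags."""
    return mark_up(dialogue, infer_emotion(context))

print(render_line("He shouted across the room:", "Get out!"))
# <emotion name="angry">Get out!</emotion>
```

The director-iterating-with-the-actors loop would then just be re-running this with edited context or hand-tweaked tags.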
Still a bit teleprompter-ish, but there are tools to go in and adjust pace and style throughout, and you probably hear a lot of stuff from people not using those creative features. 11labs might very well be one of the best bits of software I've used; it's a great deal of fun to play with, and if you're willing to spend the time the results are superb. I don't even have a use case, I just like making them because they're fun to listen to, ha!
In any case, voice is such a thin vertical that I half expect the Chinese to release an open source TTS model that out-performs everything on the market. Tencent probably has one of these cooking right now.
I couldn't get halfway through.
I really like "off a teleprompter"; it accurately characterizes the subtle dissonances where it sounds like someone reading something they haven't read before. At 0:14, "infectious (flat) beatsss (drawn out)" is near diametrically opposed to the snappy "soulful (high / low) vocals (high)" at 0:12.
It’s not going to happen, but the only solution is to just stop developing it.
I would entreat people to consider the net effect of anything they create. Let it at least sway your decisions somewhat. It probably won't be enough to not do it, but I think of it more as the ratio between net positive :: net negative, and paying attention to that ratio should help swing it at least somewhat -- certainly more than giving up and ignoring the benefits :: harms.
but then of course if you have a generative model, you can use it to generate stuff.
- zero shot voice cloning isn't there yet
- gpt-sovits is the best at non-word vocalizations, but the overall quality is bad when just using zero shot, finetuning helps
- F5 and fish-speech are both good as well
- xtts for me has had the best stability (i can rely on it not to hallucinate too much, the others i have to cherrypick more to get good outputs)
- finetuning an xtts model for a few epochs on a particular speaker does wonders, if you have a good utterance library w/ emotions conditioning a finetuned xtts model with that speaker expressing a particular emotion yields something very usable
- you can do speech to speech on the final output of xtts to get to something that (anecdotally) fools most of the people i've tried it on
- non finetuned XTTS zero shot -> seed-vc generates something that's okay also, especially if your conditioning audio is really solid
- really creepy voice clones of arbitrary people, indistinguishable at a casual listen, are possible with as little as 30 minutes of speech; the resulting quality captures mannerisms and pacing eerily well, and it's easy to get clean input data from youtube videos/podcasts using de-noising/vocal-extraction neural nets
TL;DR: use XTTS and pipe it into seed-vc. The end-to-end on that pipeline on my machine is something like 2x realtime and generates very highly controllable, natural-sounding voices; you have to manually condition emotive speech.
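For concreteness, here is one way the two-stage pipeline could be wired up. The XTTS command uses the Coqui TTS CLI's actual flags; the seed-vc invocation is an assumption modeled on its repo's inference script, so check the version you have before relying on the exact flags.

```python
# Sketch of the XTTS -> seed-vc pipeline as two shell commands.
# Stage 1: zero-shot XTTS synthesis conditioned on a reference wav.
# Stage 2: voice conversion of that output toward the same reference.
# The seed-vc flags below are hypothetical; consult the repo's README.

from pathlib import Path

def build_pipeline(text: str, ref_wav: str, out_dir: str = "out") -> list[list[str]]:
    """Return the two commands to run in order."""
    out = Path(out_dir)
    xtts_wav = out / "xtts_raw.wav"
    final_wav = out / "converted.wav"
    xtts_cmd = [
        "tts",  # Coqui TTS CLI
        "--model_name", "tts_models/multilingual/multi-dataset/xtts_v2",
        "--text", text,
        "--speaker_wav", ref_wav,      # zero-shot conditioning audio
        "--language_idx", "en",
        "--out_path", str(xtts_wav),
    ]
    seedvc_cmd = [  # assumed interface, verify against the seed-vc repo
        "python", "inference.py",
        "--source", str(xtts_wav),
        "--target", ref_wav,
        "--output", str(final_wav),
    ]
    return [xtts_cmd, seedvc_cmd]

cmds = build_pipeline("Hello there.", "speaker_ref.wav")
# execute with: for c in cmds: subprocess.run(c, check=True)
```

The cleaner your `speaker_ref.wav` conditioning audio is, the better both stages behave, which matches the note above about de-noised input data.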
At the very least, wouldn't you have to provide 1 sample? Which would make it "few shot" (if that term really even makes sense in this context).
It would be more like training examples if you had to give it specific phrases.
Or do you think they should analyze the text's sentiment and raise a flag if the sentiment is obviously breaking the EULA, e.g. some kind of hate speech?
How would you implement that?
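One naive answer: gate synthesis behind a text classifier and refuse obvious violations before any audio is generated. A real deployment would use a trained toxicity model or a hosted moderation API rather than keywords; this stub (with placeholder terms, not a real blocklist) just illustrates the shape of the gate.

```python
# Minimal sketch of a pre-synthesis content gate. BLOCKLIST holds placeholder
# tokens; a production system would call a moderation classifier instead.

BLOCKLIST = {"slur_a", "slur_b"}  # placeholders, not real terms

def violates_policy(text: str) -> bool:
    """Very crude check: flag if any normalized word hits the blocklist."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return not BLOCKLIST.isdisjoint(words)

def synthesize(text: str) -> str:
    """Refuse flagged text; otherwise hand off to the actual TTS engine."""
    if violates_policy(text):
        raise ValueError("text flagged by content policy; refusing to synthesize")
    return f"[audio for: {text}]"  # stand-in for the real TTS call
```

The hard part, of course, is the classifier itself: keyword lists are trivially evaded, which is presumably why the question is non-trivial in the first place.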
Also I found no way to filter/sort the voice selection modal on language, so I have to visually search the entire list.
Our whole team is on ElevenLabs and a switch is significant work, but I think the results are worth it! Super awesome work!
I wonder which comparison they hope to avoid, quality or price?
In brief I’d like to be able to generate conversations via api choosing voices that should be unique on the order of thousands. Essentially I’m trying to simulate conversations in a small town. Eleven is not set up for this.
Ideally I’d be able to pick a spot in latent space for a voice programmatically. But I’m open to suggestions.
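One way to "pick a spot in latent space" programmatically, assuming your TTS engine accepts a speaker-embedding vector (XTTS-style models carry one internally): derive a deterministic embedding per villager by seeding a RNG from their ID and taking a convex combination of a few real reference-speaker embeddings. Everything here (the embedding dimensionality, the jitter scale) is an illustrative assumption, not any vendor's API.

```python
# Sketch: thousands of unique-but-reproducible voices from a handful of
# reference speaker embeddings. A stable hash of the speaker ID seeds the
# RNG, so "villager_001" always maps to the same point in latent space.

import hashlib
import numpy as np

def voice_embedding(speaker_id: str, refs: np.ndarray) -> np.ndarray:
    """refs: (n_refs, dim) array of reference speaker embeddings."""
    seed = int(hashlib.sha256(speaker_id.encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(np.ones(len(refs)))    # convex combination ...
    emb = weights @ refs                           # ... stays near real voices
    emb += rng.normal(scale=0.01, size=emb.shape)  # small jitter for uniqueness
    return emb
```

Interpolating inside the hull of real voices tends to keep outputs natural-sounding, while the per-ID jitter keeps a town of thousands from collapsing onto a few identical speakers.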