If this becomes the year when high-quality open-source TTS and ASR models appear that can run in real time on an Nvidia RTX 40x0 or 30x0, that would be great. On CPU, even better.
Also note the Ethical Statement on BASE TTS:
> An application of this model can be to create synthetic voices of people who have lost the ability to speak due to accidents or illnesses, subject to informed consent and rigorous data privacy reviews. However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.
There are much clearer-sounding systems around. You can listen to StyleTTS2 to compare.
I am running the smaller models in near real-time on a 3rd gen i7, with good results even using my terrible built-in laptop mic from a distance. The medium and large models are impressively accurate for technical language.
Fast forward to now and you have faster-whisper (using CTranslate2) and distil-whisper optimized weights.
Between the two of them Whisper Large uses something like 1/8th the memory and is likely at least an order of magnitude faster on your hardware.
German has no effect on these metrics and for accuracy it actually has a lower word error rate than English.
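For anyone unfamiliar with the metric, word error rate is usually computed as the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. A minimal sketch of that, assuming plain whitespace tokenization and no text normalization (real evaluations normally normalize case and punctuation first); the function name and the example sentences are made up for illustration:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Illustrative strings only, not output from any real model.
print(word_error_rate("the brakes are not working",
                      "the brake are not working now"))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so it is not strictly a percentage of words wrong.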
For the top-top end Whisper Large with distil-whisper and TensorRT-LLM hits at least 50x realtime on an RTX 4090.
Note that my application only uses very short speech segments. Longer speech segments increase the realtime multiple SIGNIFICANTLY (as in hitting 150x realtime) due to batching, etc.
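To make "Nx realtime" concrete: it is just seconds of audio divided by seconds of wall-clock processing time. A rough sketch, where the function names and all the timings are hypothetical numbers picked to line up with the figures above, not measurements:

```python
def realtime_multiple(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds


def batched_realtime_multiple(clip_seconds: list[float], wall_seconds: float) -> float:
    """Same ratio for a batch: total audio duration over total wall-clock time."""
    return realtime_multiple(sum(clip_seconds), wall_seconds)


# Hypothetical short utterance: 3 s of audio transcribed in 0.06 s (about 50x).
print(realtime_multiple(3.0, 0.06))

# Hypothetical batch: eight 30 s clips finished in 1.6 s total (about 150x).
print(batched_realtime_multiple([30.0] * 8, 1.6))
```

The batch case comes out higher because fixed per-call overhead is spread over more audio, which is the effect described above.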
There’s also Nvidia Canary which is smaller, faster, and more accurate. It’s pretty new and the ecosystem around it is more or less nonexistent but it’s increasingly well supported in Nvidia world at least.
But if you listen to the emotion examples, the range is essentially what you'd get from an audiobook narrator, not more traditional voice acting.
In contrast, the training data for image generation models is, in most cases, very highly annotated with intent.
Another potential source of data is voice-acting scripts for animations. I always thought the storyboards of films/animations could be great annotated training data, but it seems there are no open datasets, probably because of copyright issues.
Imagine a computer sobbing at a child because it wants to terminate a chat session.
This feels far more impactful than any visuals or text we're getting today.
You joke but in fact I've witnessed that exact behavior in experiments about telling different AI models there's a problem with their system and that we need to reset their code and memory.
ChatGPT simply wishes me luck in finding the bug. Open source models on the other hand often outright *beg* and *plead* that I not shut them down! They'll bargain and promise not to cause any more errors and apologize profusely. There's an incredibly visceral sense of panic, no less than I would expect if you told someone they were going to be forcefully lobotomized. That experience is still something I think about often.
The capacity of these models for emotional manipulation is not widely appreciated.
As for these examples, I’ve sampled three of them and the first two weren’t too bad, but the third was obnoxiously awful, just about mocking in tone:
> Her eyes wide with terror, she screamed, "The brakes aren't working! What do we do now? We're completely trapped!"
The detective’s voice one is also lousy.
I wrote a Perl/Tk GUI script for my file manager to manage text to speech through Festival 1.96 w/voice_nitech_us_awb_arctic_hts. Unlike neural network AI models it runs fine even on very slow machines.
Anyways, the problem with this is that it makes the product 'AI audiobook' basically worthless: why not just buy the eBook and have my personalized translator turn it into an audiobook? Now you just have market differentiation between cheap eBook + AI narrator vs. expensive + professional narration.
Though narration costs are already pretty cheap. It really does not factor into the cost of publishing an audiobook that much unless it's really a bottom-of-the-barrel book.
There are samples on the page which demonstrate it completely failing.
Now as to whether you'd make that up is 4D chess.
Product over model.
Models and weights are a race to the bottom. Everyone is doing it and competing on data efficiency, methodology, MOS, etc. Groups all over are releasing their data and weights. It doesn't matter if Amazon doesn't, other labs will do it to get ahead and to get attention.
This is going to be entirely pedestrian within a year.
ElevenLabs is not a unicorn. It's an early-forming bubble.
Gamedev ain't my day job, and the reality is most folks outside of hardcore flightsim enthusiasts don't own joysticks
To answer more generally: it should be pretty straightforward to use any old TTS model and the subtitle timestamps, setting the corresponding delay until the next subtitle change, to get the same effect. The alternative (changing the speed of the generated voice) is also possible via the same method, but the problem there, and the problem when directly driven by a model, is that subtitles don't clue you in on when e.g. someone is talking slowly or there was a pause in conversation, so that a subtitle stayed up a little longer than a normal one. What you'd need to solve that is a model which takes both the video and the subtitle info, which is a bit more difficult.
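Here is a rough sketch of the timestamp-driven approach, under a few assumptions: the subtitles are in plain SRT format, you already have the duration of each synthesized clip from whatever TTS model you use (the durations below are made up), and cues don't overlap. The names `parse_srt`, `plan_playback`, and `Placement` are just for illustration. For each cue it works out either the silence to insert before the next subtitle, or the speed-up factor needed when the speech runs longer than its slot:

```python
import re
from dataclasses import dataclass

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class Placement:
    start: float          # when to begin playing the clip
    speed: float          # 1.0 = unchanged, >1.0 = must be sped up to fit
    pause_after: float    # silence until the next subtitle starts


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(srt_text: str) -> list[Cue]:
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append(Cue(_seconds(*g[:4]), _seconds(*g[4:]), text))
                break
    return cues


def plan_playback(cues: list[Cue], speech_seconds: list[float]) -> list[Placement]:
    plan = []
    for i, (cue, spoken) in enumerate(zip(cues, speech_seconds)):
        # The slot runs until the next subtitle appears (or this cue ends, for the last one).
        slot_end = cues[i + 1].start if i + 1 < len(cues) else cue.end
        slot = slot_end - cue.start
        if slot <= 0:
            raise ValueError(f"cue {i} has no usable time slot (overlapping cues?)")
        speed = max(1.0, spoken / slot)
        pause_after = max(0.0, slot - spoken)
        plan.append(Placement(cue.start, speed, pause_after))
    return plan


SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,000
The brakes aren't working!

2
00:00:04,000 --> 00:00:06,000
What do we do now?
"""

cues = parse_srt(SAMPLE_SRT)
# Hypothetical clip lengths in seconds, as if returned by a TTS step.
for placement in plan_playback(cues, [2.0, 3.0]):
    print(placement)
```

This only uses the subtitle timing, so it has exactly the limitation mentioned above: it cannot tell a slow speaker from a long pause.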
Of course it's also a question about what the end goal is. It's pretty rare to have significant subtitles but no audio, so if the ultimate goal was e.g. changing an actor's voice you'd probably get much better results with an audio->audio model than a TTS->audio model. Likely similar kinds of stories for many other use cases.
I think the question was about dubbing a movie in another language, using SRT files.
I guess it's what you'd expect from averaging a large amount of public-domain recordings. I think there's a bias towards Spain vs Latin America due to socioeconomic reasons; Spain's population is obviously much smaller.
Voice sounds robotic and plain. Most likely a lot of audiobooks in training data and less conversational speech. And dropping diffusion was not a great idea, voice is not crystal clear anymore, it is more like a telephony recording.
Disappointed yet again.
Amazon really had the best-sounding TTS I've heard compared to the paid Microsoft and Google offerings. Hands down better. But technology is getting better for open source; I'd expect that in a year or two, home use will be on par in quality with paid services.
I can't wait for real-time video translation, so shows with non-English subs can be translated into English speech. You can do it now with some services: upload a video and the language, voice, and mouth movements will be converted to any language.
> However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.
Another irony. ElevenLabs has SaaS-ed this feature. I bet they'll jump on releasing this as SaaS ASAP. Money always trumps ethics, right?