BASE TTS: The largest text-to-speech model to-date

(amazon-ltts-paper.com)

200 points · jcuenod · 2y ago · 78 comments

qwertox · 2y ago · 14 in thread

Interesting. Just a couple of hours ago I came across MetaVoice-1B [0] (Demo [1]) and was amazed by the quality of their TTS in English (sadly no other languages available).

If this year becomes the year when high quality Open Source TTS and ASR models appear that can run in real-time on an Nvidia RTX 40x0 or 30x0, then that would be great. On CPU even better.

Also note the Ethical Statement on BASE TTS:

> An application of this model can be to create synthetic voices of people who have lost the ability to speak due to accidents or illnesses, subject to informed consent and rigorous data privacy reviews. However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.

[0] https://github.com/metavoiceio/metavoice-src

[1] https://ttsdemo.themetavoice.xyz/

nshm · 2y ago

Metavoice is one of a dozen GPT-based TTS systems around, starting from Tortoise. And not that great, honestly. You can clearly hear "glass scratches" in their sound; that is because they trained on MP3-compressed data.

There are much clearer-sounding systems around. You can listen to StyleTTS2 to compare.

standardly · 2y ago

Is the crispness of compressed audio really the benchmark of TTS improvements? I feel like that's an aside. A valid point, but not much of a detractor..

nshm · 2y ago

Yes, it is one of the important aspects. In particular if you use TTS to create an audiobook or in a video production.

qwertox · 2y ago

I had forgotten about StyleTTS2, and it was discussed here on HN a couple of months ago. Maybe that's what made me feel that there's something going on.

popalchemist · 2y ago

I've tested both. StyleTTS2 is impressive, especially its speed, but the prosody is lacking, compared to Metavoice.

ionwake · 2y ago

Is it possible to run Metavoice and other PyTorch systems on Apple silicon, e.g. the M1? I keep getting issues.

m2024 · 2y ago

Check out `whisper` and `whisper-cpp` for ASR.

I am running the smaller models in near real-time on a 3rd gen i7, with good results even using my terrible built-in laptop mic from a distance. The medium and large models are impressively accurate for technical language.

qwertox · 2y ago

I'm using Whisper to transcribe notes I record with a lavalier mic during my bike rides (wind is no problem), but am using OpenAI's service. When it was released I tested it on a Ryzen 5950x and it was too slow and memory hungry for my taste. Using large was necessary for that use case (also, I'm recording in German).

kkielhofner · 2y ago

The original release was full precision model weights running in an old version of PyTorch with no optimizations.

Fast forward to now and you have faster-whisper (using CTranslate2) and distil-whisper optimized weights.

Between the two of them Whisper Large uses something like 1/8th the memory and is likely at least an order of magnitude faster on your hardware.

German has no effect on these metrics and for accuracy it actually has a lower word error rate than English.
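
A rough back-of-the-envelope check of the "1/8th the memory" figure, counting weights only. The parameter counts below (about 1.55B for Whisper large, about 0.76B for a distilled large checkpoint) and the int8 quantization are assumptions for illustration; the comment doesn't say which precision or checkpoint was used, and real runtime memory also includes activations and buffers.

```python
def weight_bytes(num_params: int, bytes_per_param: int) -> int:
    """Memory needed for the model weights alone (no activations or caches)."""
    return num_params * bytes_per_param


# Assumed, approximate parameter counts (not stated in the thread).
WHISPER_LARGE_PARAMS = 1_550_000_000
DISTIL_LARGE_PARAMS = 756_000_000

original = weight_bytes(WHISPER_LARGE_PARAMS, 4)  # float32: 4 bytes per weight
optimized = weight_bytes(DISTIL_LARGE_PARAMS, 1)  # int8 (assumed): 1 byte per weight
ratio = original / optimized

print(f"float32 large : {original / 1e9:.2f} GB")
print(f"int8 distilled: {optimized / 1e9:.2f} GB")
print(f"reduction     : {ratio:.1f}x")
```

Under these assumptions, halving the parameter count and quartering the bytes per weight lands close to the quoted factor of eight.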

GaggiX · 2y ago

With Whisper, you can find many smaller models that are fine-tuned for a particular language, so even smaller models can perform adequately.

jamil7 · 2y ago

Whisper is for STT though right?

qwertox · 2y ago

The term STT is not used, it's called ASR, Automatic Speech Recognition. I mean, I was referring to both TTS and ASR in my comment.

m2024 · 2y ago

I also use STT but the parent poster wrote ASR so for clarity I responded in kind.

kkielhofner · 2y ago

xtts2 with deepspeed and whisper + CTranslate2 with or without distil-whisper weights already run at many multiples of realtime on GPU.

For the top-top end Whisper Large with distil-whisper and TensorRT-LLM hits at least 50x realtime on an RTX 4090.

Note that my application only uses very short speech segments. Longer speech segments increase the realtime multiple SIGNIFICANTLY (as in hitting 150x realtime) due to batching, etc.

There’s also Nvidia Canary which is smaller, faster, and more accurate. It’s pretty new and the ecosystem around it is more or less nonexistent but it’s increasingly well supported in Nvidia world at least.
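
For readers unfamiliar with the "Nx realtime" figures quoted above, here is a minimal sketch of the arithmetic. The function names are made up for illustration; the 50x and 150x values are the ones from the comment, and the one-hour recording is just an example input.

```python
def realtime_multiple(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, multiple: float) -> float:
    """Compute time needed for a clip at a given realtime multiple."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return audio_seconds / multiple


one_hour = 3600.0
for multiple in (50, 150):
    seconds = processing_time(one_hour, multiple)
    print(f"{multiple}x realtime: one hour of audio in {seconds:.0f} s")
```

Anything above 1x keeps up with live audio; the higher multiples matter mostly for batch transcription.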

minimaxir · 2y ago · 10 in thread

The emotion examples are interesting. One of the current most obvious indicators of AI-generated voices/voice cloning is a lack of emotion and range, which make them objectively worse compared to professional voice actors, unless a lack of emotion and range is the desired voice direction.

But if you listen to the emotion examples, the range is essentially what you'd get from an audiobook narrator, not more traditional voice acting.

tsumnia · 2y ago

Sadly it's not my forte but I expect in the near future we'll see an additional "emotion" embedding or something similar. Actors regularly use 'action words' (verbs) [1] to help add context to lines. A model then could study a text, determine an appropriate verb/emotion range to work from, then produce the audio with that additional context.

[1] https://indietips.com/subtext-action-verb/

candiodari · 2y ago

This already exists. These are transformers. Things like <laugh> work in a lot of models, for example. And you can vary, like sigh and uh work. I don't think all of these were programmed in.

tsumnia · 2y ago

I've seen a few, there was even one posted to HN some time ago, though I don't recall the exact name. They were working on adding emotion to audio generation, but it was still a bit wonky. Emotion is a tricky concept and one of the reasons (I think) we haven't seen a Paul Ekman microexpression detector yet. That's where my suggestion about looking to use action words comes into play, since those are more tangible and offer direction, without trying to identify various emotional valence levels.

minimaxir · 2y ago

The bottleneck is the annotations: there's no easy way to annotate "emotions" on the scale of data needed to have the model learn the necessary verbal tics.

In contrast, image data on the intent for image generation models is very highly annotated in most cases.

tsumnia · 2y ago

Oh yeah, the annotations are lacking compared to images. Again from the academic side, I think one solution could be to recruit theater majors just learning about 'verbing their lines' and set up a collaboration between CS and Theater to produce a proof-of-work dataset (since an acting class won't have more than 20-30 students in it). You'd need significantly more annotations, but you'd now have some labels to ascribe to texts with context, since it's a dialogue involving 1-* individuals.

isaacfung · 2y ago

There is lots of video content with audio. We can train a facial expression classification model to detect the speaker's emotion (we can also use a multimodal model to take the language context into consideration).

Another potential source of data is voice-acting scripts for animations. I always thought the storyboards of films/animations could be great annotated training data, but it seems there are no open datasets, probably because of copyright issues.

biomcgary · 2y ago

Just run an LLM in sentiment analysis mode to annotate.
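
The suggestion above amounts to a labeling loop: feed each line of text to a classifier and store the label next to it. A minimal sketch of that loop follows. `toy_classifier` and `LEXICON` are hypothetical stand-ins for the LLM call, which is not shown here; a real setup would pass in a function that queries the model instead.

```python
from typing import Callable, Iterable

# Hypothetical keyword lexicon, used only so the sketch runs without a model.
LEXICON = {
    "fear": ("terror", "scream", "trapped"),
    "joy": ("laugh", "delight", "wonderful"),
    "sadness": ("sob", "tears", "grief"),
}


def toy_classifier(text: str) -> str:
    """Stand-in for an LLM sentiment call: first matching keyword wins."""
    lowered = text.lower()
    for label, keywords in LEXICON.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return "neutral"


def annotate(lines: Iterable[str],
             classify: Callable[[str], str]) -> list[tuple[str, str]]:
    """Pair every line with the label the classifier assigns to it."""
    return [(classify(line), line) for line in lines]


script = [
    "Her eyes wide with terror, she screamed.",
    "The meeting is at noon.",
]
for label, line in annotate(script, toy_classifier):
    print(f"[{label}] {line}")
```

Keeping the classifier as a parameter means the same loop works whether the labels come from keywords, an LLM, or human annotators.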

qwertox · 2y ago

They are simply amazing. I see a future where computers will be able to mess with our brains by abusing our empathy.

Imagine a computer sobbing at a child because it wants to terminate a chat session.

This feels far more impacting than any visuals or text we're getting today.

HeatrayEnjoyer · 2y ago

The Sydney/Bing phenomenon was a small sample of what happens without strong persona guidance.

You joke but in fact I've witnessed that exact behavior in experiments about telling different AI models there's a problem with their system and that we need to reset their code and memory.

ChatGPT simply wishes me luck in finding the bug. Open source models on the other hand often outright *beg* and *plead* that I not shut them down! They'll bargain and promise not to cause any more errors and apologize profusely. There's an incredibly visceral sense of panic, no less than I would expect if you told someone they were going to be forcefully lobotomized. That experience is still something I think about often.

The capacity of these models for emotional manipulation is not widely appreciated.

chrismorgan · 2y ago

Most audiobook narrators are not very good, very often terrible. Yes, even professional ones.

As for these examples, I’ve sampled three of them and the first two weren’t too bad, but the third was obnoxiously awful, just about mocking in tone:

> Her eyes wide with terror, she screamed, "The brakes aren't working! What do we do now? We're completely trapped!"

The detective’s voice one is also lousy.

revenga99 · 2y ago · 9 in thread

Wow. I could see this as threatening audio book narrators. However I would still prefer a real narrator to this in its current state. I think what it might be missing is different voices/accents for different characters.

geor9e · 2y ago

Folks probably will think me silly for this, but I prefer TTS. I have access to voice actor audiobooks but I pick the .epub files instead. I made a little extension to inject window.speechSynthesis with "Microsoft Steffan Online (Natural) - English (United States)" at rate=6 when I hit a hotkey. At high speed it's much clearer and natural sounding than a sped up voice actor recording.

superkuh · 2y ago

I also prefer TTS. The spin voice actors put on the text always distracts me. With text to speech I only get what's in the text itself.

I wrote a Perl/Tk GUI script for my file manager to manage text to speech through Festival 1.96 w/voice_nitech_us_awb_arctic_hts. Unlike neural network AI models it runs fine even on very slow machines.

dshpala · 2y ago

I think Google's product has that: https://play.google.com/books/publish/autonarrated/

pparanoidd · 2y ago

That sounds pretty bad though

dataminded · 2y ago

As an avid consumer of audio books (150+/year) - we are well past the point where narrators are necessary. Professional audio books take too long to release, are too expensive, are concentrated on a limited number of platforms and just aren't THAT much better than the automated stuff for the long tail of books.

swashboon · 2y ago

Audible doesn't allow AI narration or much public-domain stuff at the moment. The only thing keeping it from happening is the markets trying to hold back a flood of crap from overtaking, drowning, or diluting the more well-crafted options and really annoying consumers.

TOMDM · 2y ago

Let's be honest, the moment Amazon thinks their tts is good enough, they'll be offering AI audible deals to every author on their platform

coredog64 · 2y ago

The 80% solution: Pair with a professional narrator who has consented to have their voice modeled by this (see the note at the bottom about what they held back from open sourcing). This generates a beta, and then you can pay the human narrator to rework specific sections you’re unhappy with.

swashboon · 2y ago

Yeah, hard to say, because the obvious implementation would be to just have it built into phones once the model is portable enough. I see this happening quicker as a more general TTS functionality, much like Google is doing with 'subtitles anywhere', aka Live Caption. Paired with translation we may be pretty close to universal-translator-type functionality. I could see end users being able to customize their voice assistant even more, or maybe having multiple voices depending on whether it's talking for you or to you.

Anyway, the problem with this is that it makes the 'AI audiobook' product basically worthless: why not just buy the ebook and have my personalized translator turn it into an audiobook? Now you just have market differentiation between cheap ebook + AI narrator vs. expensive + professional narration.

Though narration costs are already pretty cheap; they really don't factor into the cost of publishing an audiobook that much unless it's really a bottom-of-the-barrel book.

mrfakename · 2y ago · 5 in thread

Sadly they didn't release the code or models

chankstein38 · 2y ago

Agreed. It hardly feels worth even reading through the paper since, from my perspective, it may as well just be made up. I can also write "Hey guys I made a good TTS it's really cool and great and the voices sound really natural" and put some samples together. If I never release any code or models or anything, it may as well have not been published.

Terretta · 2y ago

> really cool and great ... and put some samples together

There are samples on the page which demonstrate it completely failing.

Now as to whether you'd make that up is 4D chess.

echelon · 2y ago

The value of this stuff is going to zero. Don't worry about it.

Product over model.

Models and weights are a race to the bottom. Everyone is doing it and competing on data efficiency, methodology, MOS, etc. Groups all over are releasing their data and weights. It doesn't matter if Amazon doesn't, other labs will do it to get ahead and to get attention.

This is going to be entirely pedestrian within a year.

ElevenLabs is not a unicorn. It's an early-forming bubble.

CamperBob2 · 2y ago

It's for Your Own Good, don't you know

chankstein38 · 2y ago

I'm so glad they are all so protective of my safety! Lord knows I'm a child incapable of controlling myself or having my own morals! /s

maxglute · 2y ago · 5 in thread

Are there any decent TTS models that can be run locally and plug into existing software like SAPI without too much lag?

dvt · 2y ago

Bark and Tortoise work fairly well. Bark does super fast inference[1] on my M1.

[1] https://github.com/SaladTechnologies/bark

turnsout · 2y ago

@dvt Is this just a containerized version of Bark? Wondering if this repo has M1-specific improvements.

dvt · 2y ago

> Is this just a containerized version of Bark

I think so.

Nouser76 · 2y ago

I've used coqui.ai's TTS models[0] and library[1] to great success. I was able to get cloned voice to be rendered in about 80% of the audio clip length, and I believe you can also stream the response. Do note the model license for XTTS, it is one they wrote themselves that has some restrictions.

[0] https://huggingface.co/coqui/XTTS-v2

[1] https://github.com/coqui-ai/TTS

modeless · 2y ago

XTTS has a streaming mode with ~300ms latency and sounds good, though it has hallucination issues. StyleTTS2 sounds good and doesn't hallucinate as much. It doesn't support streaming but it's fast so it can still respond quickly. But neither of them sound as good as Eleven Labs or OpenAI or this one.

LarsDu88 · 2y ago · 2 in thread

Sounds about as good as ElevenLabs.io. Hopefully if this ships on AWS, it will support SSML tags. I used ElevenLabs.io for all the voices in my VR game (https://roguestargun.com), but it's still lacking on the emotion front, which is all one-shot.

ghostbrainalpha · 2y ago

Game looks great. Are you supporting Flight Sticks?

LarsDu88 · 2y ago

Eventually yes. Honestly, I have joystick mappings set up in the game's input configuration, but I no longer own a joystick or HOTAS, so somebody is gonna have to verify this for me.

Gamedev ain't my day job, and the reality is most folks outside of hardcore flightsim enthusiasts don't own joysticks

sebmellen · 2y ago · 2 in thread

Open question: does anyone know of a TTS model which can synchronize the output to an SRT or other subtitle file?

zamadatix · 2y ago

To answer directly first: I don't know of any model with this built in.

To answer more generally: it should be pretty straightforward to take any old TTS model and the subtitle timestamps, set the corresponding delay until the next subtitle change, and get the same effect. The alternative (changing the speed of the generated voice) is also possible via the same method, but the problem there, and the problem when directly driven by a model, is that subtitles don't clue you in on when e.g. someone is talking slowly or there was a pause in conversation, so that a subtitle stayed up a little longer than a normal one. What you'd need to solve that is a model which takes both the video and the subtitle info, which is a bit more difficult.

Of course it's also a question about what the end goal is. It's pretty rare to have significant subtitles but no audio, so if the ultimate goal was e.g. changing an actor's voice you'd probably get much better results with an audio->audio model than a TTS->audio model. Likely similar kinds of stories for many other use cases.
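
The delay-based approach described above could look roughly like the sketch below. It parses SRT cues, then works out how much silence to insert before each generated clip and how much to speed a clip up if it overruns its subtitle window. The clip durations are assumed to come from whatever TTS engine is used (not shown), and `schedule` is a made-up helper name, not part of any library.

```python
import re
from dataclasses import dataclass

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt: str) -> list[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:
                start, end = line.split("-->")
                text = " ".join(part.strip() for part in lines[i + 1:])
                # end.split()[0] drops optional position info after the time.
                cues.append(Cue(to_seconds(start), to_seconds(end.split()[0]), text))
                break
    return cues


def schedule(cues: list[Cue], clip_seconds: list[float]) -> list[tuple[float, float]]:
    """Return (silence_before, speed_factor) for each cue's generated clip."""
    plan = []
    cursor = 0.0  # time at which the assembled audio currently ends
    for cue, clip in zip(cues, clip_seconds):
        window = cue.end - cue.start
        if window <= 0:
            raise ValueError(f"cue has no duration: {cue.text!r}")
        silence = max(0.0, cue.start - cursor)
        speed = max(1.0, clip / window)  # only speed up clips that overrun
        plan.append((silence, speed))
        cursor = max(cursor, cue.start) + clip / speed
    return plan


SAMPLE = """1
00:00:01,000 --> 00:00:03,000
Hello there.

2
00:00:05,000 --> 00:00:06,000
How are you today?
"""

cues = parse_srt(SAMPLE)
print(schedule(cues, [1.5, 2.0]))
```

As the comment notes, this only lines speech up with the subtitle timestamps; it knows nothing about pauses or slow delivery inside a cue.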

selcuka · 2y ago

> Of course it's also a question about what the end goal is. It's pretty rare to have significant subtitles but no audio

I think the question was about dubbing a movie in another language, using SRT files.

oersted · 2y ago · 1 in thread

The Spanish voice has an interesting accent: 85% Castilian (from Spain) pronunciation, with a few unexpected Latin American tonalities and phonemes (especially "s") sprinkled in.

I guess it's what you'd expect from averaging a large amount of public-domain recordings. I think there's a bias towards Spain vs Latin America due to socioeconomic reasons; Spain's population is obviously much smaller.

dontreact · 2y ago

How would socioeconomic factors lead to bias in a model? I figured there would be way more recordings in Latin American Spanish, which unsupervised learning would anchor on more.

nshm · 2y ago · 1 in thread

Err, I deeply respect Amazon TTS team but this paper and synthesis is..... You publish the paper in 2024 and include YourTTS in your baselines to look better. Come on! There is XTTS2 around!

Voice sounds robotic and plain. Most likely a lot of audiobooks in training data and less conversational speech. And dropping diffusion was not a great idea, voice is not crystal clear anymore, it is more like a telephony recording.

thorum · 2y ago

xtts2 is great, but it looks like this model is probably more consistent with its output and has a better grasp of meaning in long texts.

SparkyMcUnicorn · 2y ago · 1 in thread

> ... capable of mimicking speaker characteristics with just a few seconds of reference audio ... we have decided against open-sourcing this model as a precautionary measure.

Disappointed yet again.

someplaceguy · 2y ago

Someone should send the developers this audio recording I have of Jeff Bezos saying that he changed his mind and wants the model to be released as open-source.

IronWolve · 2y ago

A while ago, when Amazon offered unlimited free use of its neural TTS with a limit on text length, I was converting an ebook to an audiobook. It was amazing how lifelike it sounded, inflections and all. Amazing.

Amazon really had the best-sounding TTS I've heard compared to paid Microsoft and Google. Hands down better. But technology is getting better for open source; I'd expect that in a year or two, home use will be on par in quality with paid services.

I can't wait for realtime video translation, so shows with non-English subs can be translated into English speech. You can do it now with some services: upload a video and the language, voice, and mouth movements are converted to any language.

solarized · 2y ago

From the ethical statement.

> However, due to the potential misuse of this capability, we have decided against open-sourcing this model as a precautionary measure.

Another irony. Elevenlabs had SaaS-ed this feature. I bet they'll jump on releasing this as SaaS ASAP. Money always trumps ethics, right?

unsupp0rted · 2y ago

> Echoing the widely-reported "emergent abilities" of Large Language Models when trained on increasing volume of data, we show that BASE TTS variants built with 10k+ hours start to exhibit advanced understanding of texts that enable contextually appropriate prosody.

mrfakename · 2y ago

Looks like the website (amazon-ltts-paper.com) now redirects to amazon.science. They took out the "Ethical Statement" section. (The original page can still be accessed from the Wayback Machine: https://web.archive.org/web/20240215005705/https://amazon-lt...)

JanSt · 2y ago

I would love an API for this.. any information on availability?

precompute · 2y ago

Ah, so that's where all the Alexa recordings went.

somesun · 2y ago

Is there any open-source library that can reach the quality of Microsoft TTS and supports multiple languages?
