We’re currently focused on researching and deploying a different approach to speech synthesis, one that can generate nuanced intonation and emotion by understanding the text and taking context into account. Additionally, we provide creators with a way to clone their own voice from very short samples. With the published blog post, we are now deploying a way to help them design entirely new ones!
Anyone will be able to generate that level of quality just with a copy-paste. We are planning to open up Beta later this month. Our goal is to let you convert any written content into high-quality, compelling audio.
To address a few questions that frequently came up:
- Latency for our streaming TTS is <1s at the quality level shown above; latency is the usual problem with existing high-quality TTS models (like tortoise-tts)
- We can clone voices instantly, based just on 5s of speech, without training required
- We are working on adding SSML-like support for better control; speed controls will be coming as part of that too
- The API is directly available as part of the Beta; we are preparing the infrastructure to scale easily for the release!
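To make the latency claim in the first bullet concrete: for streaming TTS the figure that matters is time-to-first-chunk, not total synthesis time. A minimal sketch of how one might measure it, with a fake stream standing in for a real API response (no actual service is called):

```python
import time

def time_to_first_chunk(stream):
    # Streaming-TTS latency is usually measured as the time until the
    # FIRST audio chunk arrives, not until the whole clip is rendered.
    start = time.monotonic()
    first = next(iter(stream))
    return time.monotonic() - start, first

def fake_tts_stream():
    # Stand-in for a real API response: first chunk after ~50 ms.
    time.sleep(0.05)
    yield b"chunk-0"
    yield b"chunk-1"

latency, chunk = time_to_first_chunk(fake_tts_stream())
print(chunk, latency < 1.0)  # -> b'chunk-0' True
```

Because the generator is consumed lazily, the measured delay is only the wait for the first chunk, which is what "<1s latency" refers to here.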
We are hiring researchers, frontend and full-stack developers! If you are interested, send over your GitHub account and a short message to founders[at]elevenlabs.io.
Could the designer share a little about how it was made? Does it represent one of the generated voices, or is it just 'artistic'? (both are cool, I think).
[1] https://blog.elevenlabs.io/content/images/2023/01/Sequence-0...
The only issue is that the actual recordings sound like they have been overcompressed, or poorly recorded - is there any way to improve this? Something like superresolution, but for voice?
We are currently testing the Beta with a range of storytelling and publishing use-cases, tackling relevant feedback, and making sure the infrastructure supports them. We are planning to open the Beta to everyone by the end of this month.
The Voice Design interface is currently a set of sliders and toggles, but we are iterating on what is most accessible.
https://cloud.google.com/text-to-speech
https://azure.microsoft.com/en-us/products/cognitive-service...
And now one is also a service.
I tried using tortoise-tts on my M1. Generating 7 minutes of speech took 3 days and, while better than the 15-year-old text-to-speech built into the OS, it wasn't close to the quality of the services above. Maybe I don't know how to use it, but of course it's not as simple as text-to-speech. Ideally you need the system to understand the text so it can act out parts.
Of course see my username. I want to generate personal adult content so I'd prefer not to upload it to a service.
...you will be disappointed by the answers to that question for the foreseeable future.
As impressive as a lot of these models are, I can't help but feel like they're going to end up making an incredible amount of sterile soulless content that makes everyone's lives worse. We're already drowning in ad dominated cynical soulless computer generated search results. Are all online forums going to end up being drowned out by cynical pumped out super cheap to produce simulacrums of creative content now too?
If I want people to buy more Triscuits next year, what's stopping me from writing a bunch of prompts to insert subtle marketing cues to buy Triscuits, with entire fake ecosystems of users, fan art, radio call-ins, user stories, etc. in like every niche community in existence, and flooding them with soulless fake interaction?
That exists to a certain extent already, but I don't see how this stuff won't make it way easier, way more effective, and way more widespread.
Most of the time, the giveaway is the callers' Indian accent. If you could simply type into a box and speak with an American accent, it would be really hard to get caught.
We're opening a pandora's box here if I'm honest. I'm hardly one for pro-regulation, but good God, we're playing with things here that can really hurt us down the line.
Not really. If they say "kindly do something", they are Indian scammers.
My sentiments exactly. I think it's a bit of column A and a bit of column B. I'm reminded of the quote "everything has its pleasure and its price". The more expensive things are to produce, the less of it there will be, but what is produced will be higher quality across the board. The less expensive it becomes to produce, the more of it there will be, and the aggregate quality will be lower.
It's not always a bad thing, but the downsides are plain to see when you look at the amount of spam and low-effort content out there. That said, we've all massively enjoyed the upsides too, so it's a balancing act. I think where things were at before the recent wave of generative AI tools was perhaps right on the sweet spot of "it's democratized enough that anyone can have a go, but still requires effort and a degree of talent to do well". The knowledge (and entertainment) I've been able to access thanks to randoms on YouTube is pretty incredible, and I sort of always just accepted the avalanche of spam and clickbait that came with it.
These new tools potentially push that effort/reward ratio to the point where the signal/noise ratio simply gets too low. Of course the "make money online" community is all over this stuff, and today I watched a video of a guy showing how you could supposedly clone courses on Udemy using ChatGPT and other tools. The problem is the "course" would literally consist of generic advice: high-level information on a particular topic that suffices only as a very surface-level introduction and isn't enough to help you build any functional skills in that domain, so it's effectively useless. The only person it's not useless to is him, as he would pocket a cool $5-ish per sale. It was somewhat sad and somewhat sick to hear him cackling away about being able to con people out of money while passing himself off as an expert.
And yet, it's entirely what I would expect would happen.
Adblockers are amazing
The most dangerous aspect of this is that each step seems relatively harmless: right now, ChatGPT and DALL-E are amusements, but each small step is building a monstrous and, as you say, soulless machine that overloads us so much that we will forget what it's like to even be human.
I firmly believe (and I have given this a lot of thought) that technology is ultimately evil, and that tech companies are trading short term gain of enormous wealth for the very essence of humanity, preying upon the basic instincts of individuals who are also trading their personal worth for convenience.
If I could have one single wish fulfilled in this world, it would be that every single human being gain a natural and instinctual revulsion for advanced technology. If someone asked me what disease was the worst that ever plagued humanity, it would not be smallpox or the flu or COVID, it would be the tech company.
Nature is SHIT; that is why people created technology. There is nothing preventing you from going to the middle of nowhere and rejecting modernity. No one is forcing you; you are here because you wanted it and liked it. You say people should have an "instinctual revulsion" towards technology, but not even you yourself have that reaction, because it is a stupid idea that not even luddites like you commit to.
If anything, the technology we have nowadays is not even 0.01% of what we should have. We should have the technology to make any movie anyone ever wanted to see in the blink of an eye, all done in the best quality ever imagined. We should have the power to build a Dyson Sphere around the sun to harness its energy. We should be able to construct fully immersive virtual reality, like San Junipero from the Black Mirror episode, and we should have the power to extend human life indefinitely.
Even like, a Bic lighter is so much better quality than flint and steel or fire sticks.
Smart phones are fantastic quality computers that enable cool stuff like meeting up with friends without first having to leave a note at their house some amount of time beforehand
Dishwashers and laundry machines and modern quality clothing let you avoid spending half your waking hours cleaning stuff, keeping us healthier, and enabling feminism
Electricity lets us stay awake at night without smoke inhalation from candles and fireplaces, with less likelihood of burning the house down, and advanced tech in housing standards make sure that when the building does catch fire, you'll be able to get out safely
Advancements in technology are mostly quite good, and improve both quality and convenience
Time to go live in a cabin in the woods and go write your manifesto on a typewriter...
It is perfectly viable in the modern day, to work a job, have passionate hobbies, regularly meet for social events, volunteer, etc and spend minimal to zero time engaging on the internet, besides pragmatic things like map directions
I've more or less come to a pretty similar conclusion. I wouldn't characterize it as evil per se, but it's a fool's errand at best. My line of thinking goes somewhat like this: before the Neolithic revolution, humans had an extremely small set of problems, the main one being "what am I going to eat?", and to a large degree life must have revolved around this problem almost entirely. There weren't that many people, there weren't that many problems, and we somehow persisted in that state for hundreds of thousands of years with literally nothing to write home about. Any advance in technology has literally been trading one problem for at least three more. Now there are loads of problems, loads more people, and the standard approach to solving all the problems is to invent new technologies, which in practice seem to actually exacerbate the problems. So, I just sort of view the current state of things as "somewhere around the turn of the Neolithic Revolution we took a wrong turn, and it has widely been regarded as a bad move."
It's a weird sort of defeatist, nihilistic, melancholy worldview, but to be honest, I don't think we're wrong. I mean... what's the endgame of technology?
Technology evolves. Even if it may start with some low quality aspects, it doesn't need to stay that way.
> People now interact more through technology which removes a lot of body language and other enriching experiences.
Which is just different communication, not better, nor worse in general. Of course this kinda sucks for people who do not know the new communication-code well enough. But people do evolve communication to replace relevant missing parts. Body language for example was mostly replaced with emojis and memes, which can be better, or worse.
> we will forget what it's like to even be human.
You can't forget what you are. You are you every day, every minute, every second of your existence. What you speak about is people having a different culture from the one you know and understand. That's something completely different.
> technology is ultimately evil
Technology is a tool; it can't be evil or good. It's up to the users how they handle it.
I went to the mall today and you can tell malls are dying. I lived in a small town where the mall died and it had a zombie like existence a long time before it finally cratered. The mall here in this larger town has that feeling. I also thought about how nice it is to go to the mall just to be out among people. The same is true of the downtown. If the endgame is for everyone to stay home and shop online that's going to be a very soulless existence.
If we stop pursuing technological progress, we'll never be able to reach humanity's true potential. If we keep pursuing technological progress, those futures are still possible. We need to be wiser and mature about the way we pursue it but we still need to pursue it.
Something like this could be incredible for those people. A natural sounding alternative to text to speech for people who dislike how they sound.
And it could also be used to anonymise people in documentaries about serious topics (like say, organised crime) without actors, letting people bring the atrocities of said folks to light without need to trust others or the risk of being found out.
Other examples could include vTubers, artists creating characters for TV shows, films and video games, etc.
All technology can be abused, and sadly, given how humanity acts, it will be by a small percentage of the population. But for every person abusing it for dubious purposes, there are dozens or hundreds or thousands of others who can make the world better with it.
But more importantly, boredom triggers innovation. As we are consuming ourselves to death, we might lose the ability to truly create. Maybe that's why the last 20 years of content feel quite generic and sterile.
I think there will be need for a greater level of filtering and curation yes, but I see it as an opportunity both for creators and curators.
The barriers to entry for media creation will go down, but with saturation, the already low profit margins will get worse.
Eh. I’ll take a MAYBE over the past 10 years or more of the human driven social media manipulations and scams and poison. We’ve made almost literally fucking nothing of value in a decade. It’s been ads, Ponzi schemes, and a race to the bottom of tolerance.
I’ll take the democratization of content. Knowing that it will allow the good and the bad.
… so how is it different from the radio or TV or “influencers” now? I have limited time to consume media and am not going to be less picky when it gets easier for people to make garbage.
The last year and a half things are starting to pop off. OpenAI, SpaceX, Comma, Helion, many more…that doomer “everything sucks and is collapsing” mentality is on the way out in my opinion. The time for talk is over and it’s time to build, or so they say.
Humans will be different on the other side of this coming wave of simulation indistinguishable from reality, but we’ll be okay.
It’ll suck living through the transition though. Not looking forward to the crap tsunami on the horizon.
But as a species, we’ll adapt and survive as always.
I could do with cutting my screen time and the best way to do that might be to make everything boring.
Worse? We already have 8 billion people, a significant part of whom are pumping out sterile, soulless content.
If anything it will heat up the competition in the creativity field and allow truly creative things to proliferate.
(Especially if you know how well Meryl Streep delivers that monologue in the original: https://youtu.be/Ja2fgquYTCg)
From wikipedia:
Mary Louise "Meryl" Streep [is] often described as "the best actress of her generation." Streep is particularly known for her versatility and accent adaptability. She has received numerous accolades throughout her career spanning over five decades, including a record 21 Academy Award nominations, winning three, and a record 32 Golden Globe Award nominations, winning eight. She has also received two British Academy Film Awards, two Screen Actors Guild Awards, and three Primetime Emmy Awards, in addition to nominations for a Tony Award and six Grammy Awards.
(and also I can't wait for a "real" ChatGPT-era AI to go with it, to put those braindead jokes of an "assistant" Siri, Alexa, and Google Assistant out to pasture)
When I listened to it, my first impression was that it must be the real actor, included for comparison purposes but mislabelled. I thought it was not machine-generated. I couldn't detect the slightest artifact except what sounded like low-bitrate encoding (maybe using a codec geared toward speech). Can you tell anything "off" about it?
As for the encoding artifact such as a tinny sound or low-bitrate sound, that is the type you hear on an MP3 or low bitrate codec for speech. For example, when I record a message on https://vocaroo.com/ the "premier" voice recording service it sounds 10x worse. Here is a sample I just recorded of my own speech: https://voca.ro/18oSJ1sHU5w5
After my first impression that the narrative example might be a real human mislabelled for comparison purposes, I listened to the next two, labelled News and Conversational. I found these very easy to tell as AI-generated.
Thinking back to why I found the narrative example so compelling, I thought perhaps the issue is that the first example is in British English which I'm less used to than American English. I grew up in the United States. Perhaps since the accent doesn't match my own, it is harder for me to perceive it as generated.
-> Can a native speaker of British English tell us whether listening to the first example you can tell in any way that it is a robot? Maybe it is as obvious to you as the next two are to me.
Still, I've listened to a fair amount of British English in my life so perhaps there is an alternative explanation for why the first one was better. For example, it could have been trained on a reader's voice who has narrated thousands of hours in very high studio quality in a fairly consistent way, leaving this type of text much easier to synthesize than the other two examples due to more training data or higher-quality audio.
For me, the first one is really indistinguishable from a narrator's true voice, though it does sound a bit tinny which could also happen as an artifact of the recording process.
In terms of "how confident are you that this is a real person" the second two examples I would put at 0 - it's totally obvious that it is not a real person, whereas the first one sounds like a 10 to me: obviously a real narrator. (With a bit of artifacting that sounds like an mp3.)
[1] The text is here https://www.nytimes.com/2001/11/19/books/chapters/the-lord-o...
Why do seemingly all these text-to-speech programs attempt to produce spoken voice based solely on raw text? Why don't they consume a MIDI-like text-markup language where you can write phonetic pronunciations along with markup about the emotion, volume, speed, etc.? I feel like this is a huge unnecessary roadblock holding back this kind of technology. It'd be like if every music composition program rendered a wave file not by MIDI or VST, but by trying to visually read sheet music. I totally understand why TTS solutions that have to consume arbitrary content, like screen-readers, need to read purely raw text. But content creators don't need to be limited to raw text! Why is everyone doing it that way? Where is the TTS markup language for content creators?
https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Langua...
Amazon Polly, (which seems kind of ancient with all these new solutions showing up) has supported SSML for some time.
AWS Polly SSML docs: https://docs.aws.amazon.com/polly/latest/dg/ssml.html
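For readers who haven't seen it, here is roughly what SSML markup looks like, using standard tags from the W3C spec (`<break>`, `<prosody>`, `<phoneme>`); this snippet only checks that the markup is well-formed XML and does not call any synthesis engine:

```python
import xml.etree.ElementTree as ET

# Minimal SSML document: <break> for explicit pauses, <prosody> for
# rate/pitch control, <phoneme> for an explicit IPA pronunciation.
ssml = """<speak>
  You go to your closet,<break time="300ms"/> and you select
  <prosody rate="slow" pitch="low">that lumpy blue sweater</prosody>.
  It is pronounced
  <phoneme alphabet="ipa" ph="s\u0259\u02c8ru\u02d0li\u0259n">cerulean</phoneme>.
</speak>"""

# Sanity-check well-formedness before handing it to a TTS engine.
root = ET.fromstring(ssml)
print(root.tag)  # -> speak
```

This is exactly the kind of "MIDI for speech" the parent comment asks for; engines like Polly and Azure TTS consume documents in this shape directly.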
First, we want the quality you get out-of-the-box to already be brilliant by taking context into account. Granted, that sometimes only gets you 98% there, and we are working to add manipulation options to get you to 100%; for long texts, though, the quality you get is great.
For the second part: currently, TTS providers give complicated toggles that frequently don't affect the speech in the way you want. Initially we are adding basic SSML-like support, and we have a more robust language-based idea which we hope will ship over the next few months!
I wonder, though, how much training people would need to understand what adjustments need to be made. Experienced actors and narrators should have a good sense of what to fix, but many people might have trouble identifying what sounds strange in the initial TTS output and how it needs to be changed.
There are speech synthesis markup languages, like SSML. And targeting even lower level has always been possible with commercial speech engines.
Think about how tedious and time-consuming it is to mark up a large amount of copy. Unless we’re talking about little hints here and there (which is also doable), it rapidly becomes more cost-effective to just pay for voice talent. For this stuff to be appealing it really must be close to fire-and-forget.
The first is being able to correct a few things that sound off, as another poster pointed out. "Hey, that's not actually how you pronounce 'synecdoche', it should be 'sɪˈnɛk.də.ki'." Or "Less emphasis on the first word, more on the second". Little corrections like that. I imagine a two-stage process where the first stage generates 'best guess' SSML (or whatever markup) based on the text. Then the content creator can modify it as necessary before it goes into the second stage of actual voice synthesis.
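That two-stage idea can be sketched in a few lines. This is a deliberately naive first pass that turns raw text into 'best guess' SSML using punctuation-to-pause heuristics only (a real system would infer far richer markup from context); the creator would then edit the markup before synthesis:

```python
import re

def draft_ssml(text: str) -> str:
    """Toy first stage: insert pause markup at punctuation.
    The output is meant to be hand-edited before synthesis."""
    # Short pause after commas, longer pause between sentences.
    marked = re.sub(r",\s*", ',<break time="200ms"/> ', text)
    marked = re.sub(r"\.\s+", '.<break time="500ms"/> ', marked)
    return f"<speak>{marked}</speak>"

print(draft_ssml("You go to your closet, and you select that sweater. It's cerulean."))
```

The human-in-the-loop step is what makes this workable: the machine does the tedious 98%, and the creator only touches the spots that sound off.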
The second sweet spot is when your text is dynamically generated. Marking up the entire copy might be a lot of work for pre-written text, but it's a great option for dynamically generated text.
Even with the prompts marked up, there were huge differences between products. Some car OEMs would pay higher fees for better voices and some wouldn't. It's fairly tedious work and difficult to scale as the number of sentences grows. We basically built up a catalog over many years, and the sentences were always explicitly stated as part of our requirements docs. Of course the renderers could say anything you wanted, but letting it free-form was too big a risk from a product point of view.
[1] https://en.m.wikipedia.org/wiki/Speech_Synthesis_Markup_Lang...
Tortoise lets you add prompts into the text like [I am angry] which modifies the voice interestingly.
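A toy illustration of how such inline cues could be separated from the spoken text before synthesis (this is just the parsing idea, not Tortoise's actual implementation, which conditions the model on the bracketed prompt):

```python
import re

def split_cues(text: str):
    """Separate bracketed delivery cues (e.g. '[I am angry]') from
    the words that should actually be spoken."""
    cues = re.findall(r"\[([^\]]+)\]", text)
    spoken = re.sub(r"\s*\[[^\]]+\]\s*", " ", text).strip()
    return cues, spoken

print(split_cues("[I am angry] Get out of my office!"))
# -> (['I am angry'], 'Get out of my office!')
```

The appeal of this style over full SSML is that the cue reads like stage direction, so marking up copy costs almost nothing.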
Going back further, there was also a prosody markup for Sound Blaster speech synthesis (Dr. Sbaitso, anyone?).
It's called SSML.
We are working on long-form speech synthesis too; needless to say, the audio reading the article has also been synthesized, by a voice that does not exist.
It's starting to very much feel like we're entering the age of information manipulation outlined in the Ghost in the Shell TV series. Except it isn't a 90's/00's depiction of the future; it's just with far fewer robots and prosthetics, and a lot more mundane.
I just keep coming back to the scene where they have satellite video footage of a nuclear submarine preparing for a nuclear attack and the discussion lamenting that it's just video, nobody will believe it as evidence.
Doesn't have to be prerecorded, just trained
Damn that would do a ton for immersion!
https://kotaku.com/neuro-sama-twitch-vtuber-ban-holocaust-mi...
Budget clients are suspicious of AI voices and feel "cheated" if they think someone they hired is using one. This will change fastest.
It would also be cool if celebrities / existing voice talent could somehow license the synthesis of their voice. I read something about James Earl Jones doing this with Disney for future Star Wars projects. I'm sure there are people out there who would love to have every work they listen to be in the voice of their favorite narrator/celebrity.
"The first AI that can laugh" - https://blog.elevenlabs.io/the_first_ai_that_can_laugh/
They don’t have to work anymore; just selling their voice and sitting at home collecting royalty payments is the future, according to TFA.
And they’ve been making progress on the roboticness with every new model that comes out. Just a matter of time (and data) for the AIs to figure out how words string together naturally.
The conversational one wouldn't, although it could pass for a bad (human) voice actor.
It does mimic the ups and downs of voice, but they don't add up. They don't make sense. They don't really have any connection with what is being spoken.
But since it can do expressions, it probably only needs special markers in text to tell it how to really read a sentence.
In any case, I'm hoping this can be expanded to other languages as it would be an amazing tool for language learning.
We do support Polish already, and the quality is actually better IMO than English, as we use a newer-generation model: https://www.youtube.com/watch?v=ra8xFG3keSs Some people think it is fake and that we hired a real voice actor to read it.
This seems to me where The Big Guys are going to dominate, because it comes down to a big data problem. For example, Whisper (admittedly speech-to-text) was trained on 680,000 hours of speech data scraped from the web. The next ‘contender’ used something like 48,000 hours. Who can compete with that without owning a whole cloud?
These are really impressive results! For anyone interested, my team’s singing work: https://youtu.be/LPy20zSWhZA
Also don't put gumi and English in the same search query on YouTube. I don't know how they did it but the voices from six years ago sound better than SOTA TTS based on deep learning today...
Obviously that could come with some serious security risks, but it would also make content presentation much easier for many people. Gone are the days of doing voiceover recordings for videos.
wow
If you want to clone a voice and have a shitton of compute to fine-tune, it's a good one.
If you just want your computer to tell you you need to be out the door in 30 seconds or you’ll miss the bus then not so much.
> On a K80, expect to generate a medium sized sentence every 2 minutes.
Are you aware of others available?
Is that even a thing? You can't copyright a voice. There can be a personality right under state law, but the main case on that was someone hired to sound like Bette Midler for a commercial.
Unlike Stable Diffusion trampling over the copyright of artists without their permission, and OpenAI doing the same for code mangled with incompatible licenses, monetizing it, and outputting the training data verbatim, whilst opening a pandora's box and then attempting to write detectors and watermarks afterwards. I'm skeptical of Eleven Labs' statement on adding their detectors before release, but we'll see.
Should there eventually be an open-source version of a competing model, it should be trained on public domain sources. This was the case with Dance Diffusion: Stability AI would have been sued into the ground by the RIAA had they trained on copyrighted music. [0] [1]
It will only be a matter of time before the legal system catches up with AI-generated content, with scrutiny over training on copyrighted content without permission and over how it was trained. Any output generated by an AI is automatically public domain and un-copyrightable. [2]
This AI hype is another VC scam to unload their investments in AI startups onto big tech once again, and then pretend AI is making the world better when they know it is actually doing the opposite, with far-reaching consequences. Of course it can't be stopped, but it also cannot go unchecked and unregulated forever.
[0] https://www.musicbusinessworldwide.com/record-industry-clamp...
[1] https://techcrunch.com/2022/10/07/ai-music-generator-dance-d...
[2] https://www.copyright.gov/rulings-filings/review-board/docs/...
The crypto hype was a vehicle for VC backed companies to sell unregulated financial products to retail investors.
The "gig economy" was a vehicle for VC backed companies to skirt labor protections and zoning laws.
And now AI is a vehicle for VC backed companies to skirt copyright laws.
"Disruption" is often just about finding edge cases of existing laws and regulations and exploiting them for profit until legislation catches up.
Hardly. It is no different to my observations years ago. [0] [1]
[0] https://news.ycombinator.com/item?id=21738233
[1] https://news.ycombinator.com/item?id=27493369
Just like the clamping down of cryptocurrency markets and enforcement of regulations, a similar set of rules and regulations will be set for AI companies for complying with existing copyright laws.
The VCs know it is a scam and they are also smart enough to know that this won't go unchecked forever and they will have to unload their investment at the peak of the hype cycle.
The male narrative voice is silky smooth. In fact, I prefer it to the classic YouTube male mystery voice that sounds like the narrator had a lobotomy.
It would have been interesting to see how they perform in comparison. The fact that you are able to adjust the voices is even one of their selling points. I really wonder why they haven't done that.
These two words made it all sound like they are just trying to ride the AI wave instead of actually solving a real world problem
Disclaimer: I use tons of audiobooks so that might not be what people need in general
The current real-time options I've found are... lacking. They are mostly fake/toys (not actually using voice cloning, just old-school pitch shifting) or tech-demo videos, with a scattering of research papers that are highly variable in terms of "how easily can I reproduce this", ranging from "sure, if I want to waste money on a Google Colab instance" to "only works with a specific model of video card due to reasons".
If you know of any real-time (audio stream in -> audio stream out) voice cloning/transform/replacement tools, feel free to post about them in a reply, this is an area of tech I'm trying to keep on top of and I'm only human so I have no idea what new company or research I might miss.
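The shape of the tool being asked for (audio stream in -> audio stream out) can be sketched as a chunked pipeline, where end-to-end latency stays at roughly one chunk. The transform below is a placeholder gain stage standing in for an actual voice-conversion model:

```python
def stream_transform(chunks, transform):
    # Process fixed-size chunks as they arrive instead of buffering
    # the whole recording; this is what keeps the pipeline real-time.
    for chunk in chunks:
        yield transform(chunk)

# Toy demo: 'audio' as lists of integer samples, transform = gain of 2.
audio_in = [[1, -2], [3, 0]]
audio_out = list(stream_transform(audio_in, lambda c: [2 * s for s in c]))
print(audio_out)  # -> [[2, -4], [6, 0]]
```

The hard part the comment alludes to is not this plumbing but fitting the model inference inside one chunk's duration on commodity hardware, which is exactly where the research-paper options fall down.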
If I didn't know better I would have thought it was recorded by a person who was uncomfortable having their voice recorded.
Still insanely impressive though.
And yes this was a snarky answer because even if you don't realize it, it was a snarky question.
When you listen to the first example labelled "Narrative" you can tell where a human speaker would have inhaled (which is something the AI could have picked up on from copious training data) though the inhale itself could be muted in post-editing, e.g. after the long 24-word first phrase[1] ending in "special magnificence", and then again at the end of the sentence. It could just be the way the AI reads the comma but it is very convincing.
The "News" and "Conversational" examples don't include that pause effect. In the cerulean monologue, there is no pause after "for instance" despite it being in the monologue.
However, the robot takes a deep dramatic breath after the word "I see"[2]. " Oh, okay. I see, [DEEP LOUD DRAMATIC BREATH BY ROBOT], you think this has nothing to do with you. [LOUD DRAMATIC HALF BREATH BY ROBOT] You go to your closet and you select I don't know that lumpy blue sweater for instance because you're trying to tell the world that you take yourself". There is no pause on the comma around "for instance" though the script has one. I decided to check whether the robot is just copying the original film exactly and that's not it either.[3]
Comparison:
Robot: "Oh, okay. I see, [DEEP LOUD DRAMATIC BREATH BY ROBOT], you think this has nothing to do with you. [LOUD DRAMATIC HALF BREATH BY ROBOT] You go to your closet [no breath] and you select I don't know that lumpy blue sweater for instance [QUICK HALF BREATH BY ROBOT] because you're trying to tell the world [no breath] that you take yourself too seriously to care about what you put on your back but [no breath] what you don't know is that sweater is not just blue it's not turquoise it's not lapis it's actually cerulean."
Original: "Oh, okay. I see [no breath] you think this has nothing to do with you. [loud long breath] You go to your closet [breath] and you select I don't know that lumpy blue sweater for instance [no breath] because you're trying to tell the world that you [breath] take yourself too seriously to care about what you put on your back but [breath] what you don't know is that sweater is not just blue it's not turquoise it's not lapis it's actually cerulean."
Text:
"Oh, okay. I see, you think this has nothing to do with you. You… go to your closet, and you select… I don’t know, that lumpy blue sweater for instance, because you’re trying to tell the world that you take yourself too seriously to care about what you put on your back, but what you don’t know is that that sweater is not just blue, it’s not turquoise, it’s not lapis, it’s actually cerulean. "
I've annotated the breaths in the "conversational" robot sample vs the original film:
Phrase          | Robot               | Original           | Same/different?
I see...        | [loud breath]       | [no breath]        | Different
with you...     | [loud quick breath] | [loud long breath] | Similar
your closet...  | [no breath]         | [breath]           | Different
for instance... | [quick half breath] | [no breath]        | Different
that you...     | [no breath]         | [breath]           | Different
back but...     | [no breath]         | [breath]           | Different
The robot's loud dramatic breath is unmistakable, but it's clear it's not copying the source exactly, since it occurs at different places.[1]
[1] The text is here: https://artdepartmental.com/blog/devil-wears-prada-cerulean-...
From the users so far: they found it actually enjoyable to listen to, and the breathing and pauses are accurate!
As a developer, can you tell the difference between "Narration" and the human speaker? What can we listen for or what gives it away? For me I listened to the "Narration" clip many times and as a native British English speaker also confirms in another comment, it seems very difficult/impossible to tell the first clip is generated. Congratulations on such an achievement!
Later when I went back and listened carefully for why the first clip felt so "real" I noticed it had pauses. (No breaths per se but they are sometimes removed from edited audio.) However, I then noticed that the conversational clip, which felt unnatural to me, had very obvious breaths. The entire effect of the conversational clip didn't sound like a human at all. It sounded like an AI.
Did you find the whole conversational clip "convincing"? (Did it sound like a human to you?) How about the narration clip?
That feels dishonest. Even if this AI is just as good at speaking as a professional voice actor (which I'm not sold on), a voice actor does more than just read the line. In ideal circumstances, they have a lot of context for what their character is doing and feeling.
Is this potentially a good option for saving money on video game voices? Quite possibly yes. Is there no compromise on quality? No, not yet.
Past that, the whole "Ethical AI" section's arguments seem ridiculous. Of COURSE it puts the livelihoods of voice actors at risk. Your product's whole point is that fewer man hours are needed for voice work. Just accept that you're making those jobs obsolete. There's a perfectly good argument that it's okay to do that. Throwing bullshit at us to convince us that "no, the voice actors will still have lots of work, and they won't even have to talk!" just makes you sound like snake oil salesmen.
How long do you think that advantage will last? Years? Months? Weeks?
I think the 'narration' example was voice actor quality. The other two were slightly off.