But then I tried a paragraph of Japanese text, also from a formal speech, with the language set to Japanese and the narrator set to Yumiko Narrative. The result was a weird mixture of Korean, Chinese, and Japanese readings for the kanji and kana, all with a Korean accent, and numbers read in English with an American accent. I regenerated the output twice, and the results were similar. Completely unusable.
I tried the same paragraph on ElevenLabs. The output was all in Japanese and had natural intonation, but there were two or three misreadings per sentence that would render it unusable for any practical purpose. Examples: 私の生の声 was read as watashi no koe no koe when it should have been watashi no nama no koe. 公開形式 was read as kōkai keiji instead of kōkai keishiki. Neither kanji misreading would be correct in any context. Even weirder, the year 2020 was read as 2021. Such misreadings would confuse and mislead any listeners.
I know that Japanese text-to-speech is especially challenging because kanji can often be read many different ways depending on the context, the specific referent, and other factors. But based on these tests, neither PlayAI nor ElevenLabs should be offering Japanese TTS services commercially yet.
Speech encodes a gigantic amount of emotion via prosody and rhythm -- how the speaker is feeling, how they feel about each noun and verb, what they're trying to communicate with it.
If you try to reproduce all the normal speech prosody, it'll be all over the place, SoUnD bIzArRe, won't make any sense, and will be incredibly distracting, because there's no coherent psychology behind it.
So "reading off a teleprompter" is really the best we can do for now -- not necessarily affectless, but with a kind of "constant affect" that varies with grammatical structures and other language patterns, but no real human psychology.
It's a gigantic difference from text, which encodes vastly less information.
(And this is one of the reasons I don't see AI replacing actors for a looong time, not even voice actors. You can map a voice onto someone else's voice preserving their prosody, but you still need a skilled human being producing the prosody in the first place.)
Then you take that emotional analysis and prompt the model to mark up the text with inflection and pacing cues that reflect it. You feed the marked-up text into the speech model.
It seems like it could definitely do the first part (“based on this text, this character might be feeling X”); the second part (“mark up the dialogue”) seems easier; the third part about speech seems doable already based on another comment.
So we are pretty close already? Whatever actors are doing can be approximated through prompting, including the director iterating with the “actors”.
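The three-stage pipeline discussed above (infer a character's emotion, mark up the dialogue, hand it to a speech model) can be sketched roughly as follows. This is purely illustrative: the `infer_emotion` stub stands in for an LLM call, and the emotion-tag markup format is invented here, not any particular TTS engine's syntax.

```python
# Hypothetical sketch of the pipeline: context -> emotion -> marked-up dialogue.
# infer_emotion is a stub for an LLM call ("based on this text, the character
# might be feeling X"); the <emotion> markup is an invented placeholder format.

def infer_emotion(context: str) -> str:
    """Stub for an LLM inferring affect from surrounding narration."""
    cues = {"shouted": "angry", "whispered": "fearful", "laughed": "amused"}
    for cue, emotion in cues.items():
        if cue in context.lower():
            return emotion
    return "neutral"

def mark_up(dialogue: str, emotion: str) -> str:
    """Wrap dialogue in a simple SSML-like emotion tag for the speech stage."""
    return f'<emotion name="{emotion}">{dialogue}</emotion>'

def render_line(context: str, dialogue: str) -> str:
    """The final string is what you'd pass to a TTS model that honors the tags."""
    return mark_up(dialogue, infer_emotion(context))

print(render_line("He shouted across the room:", "Get out!"))
# <emotion name="angry">Get out!</emotion>
```

The director-iterating-with-the-actors loop would then just be re-running this with edited context or hand-tweaked tags.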
Still a bit teleprompter-ish, but there are tools to go in and adjust pace and style throughout, and you probably hear a lot of stuff from people not using those creative features. 11labs might very well be one of the best bits of software I've used; it's a great deal of fun to play with, and if you're willing to spend the time the results are superb. I don't even have a use case, I just like making them because they're fun to listen to, ha!
In any case, voice is such a thin vertical that I half expect the Chinese to release an open source TTS model that out-performs everything on the market. Tencent probably has one of these cooking right now.
I couldn't get halfway through.
I really like "off a teleprompter"; it accurately characterizes the subtle dissonances where it sounds like someone reading something they haven't read before. At 0:14, "infectious (flat) beatsss (drawn out)" is near diametrically opposed to the snappy "soulful (high / low) vocals (high)" at 0:12.
It’s not going to happen, but the only solution is to just stop developing it.
I would entreat people to consider the net effect of anything they create. Let it at least sway your decisions somewhat. It probably won't be enough to not do it, but I think of it more as the ratio between net positive :: net negative, and paying attention to that ratio should help swing it at least somewhat -- certainly more than giving up and ignoring the benefits :: harms.
but then of course if you have a generative model, you can use it to generate stuff.
- zero shot voice cloning isn't there yet
- gpt-sovits is the best at non-word vocalizations, but the overall quality is bad when just using zero shot, finetuning helps
- F5 and fish-speech are both good as well
- xtts for me has had the best stability (i can rely on it not to hallucinate too much, the others i have to cherrypick more to get good outputs)
- finetuning an xtts model for a few epochs on a particular speaker does wonders, if you have a good utterance library w/ emotions conditioning a finetuned xtts model with that speaker expressing a particular emotion yields something very usable
- you can do speech to speech on the final output of xtts to get to something that (anecdotally) fools most of the people i've tried it on
- non finetuned XTTS zero shot -> seed-vc generates something that's okay also, especially if your conditioning audio is really solid
- really creepy voice clones of arbitrary people, indistinguishable at a casual listen, are possible with as little as 30 minutes of speech; the resulting quality captures mannerisms and pacing eerily well, and it's easy to get clean input data from youtube videos/podcasts using de-noising/vocal-extraction neural nets
TL;DR: use XTTS and pipe it into seed-vc. The end-to-end on that pipeline on my machine is something like 2x realtime and generates very highly controllable, natural-sounding voices; you have to manually condition emotive speech.
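For concreteness, here is one way the two-stage pipeline could be wired up. The XTTS command uses the Coqui TTS CLI's actual flags; the seed-vc invocation is an assumption modeled on its repo's inference script, so check the version you have before relying on the exact flags.

```python
# Sketch of the XTTS -> seed-vc pipeline as two shell commands.
# Stage 1: zero-shot XTTS synthesis conditioned on a reference wav.
# Stage 2: voice conversion of that output toward the same reference.
# The seed-vc flags below are hypothetical; consult the repo's README.

from pathlib import Path

def build_pipeline(text: str, ref_wav: str, out_dir: str = "out") -> list[list[str]]:
    """Return the two commands to run in order."""
    out = Path(out_dir)
    xtts_wav = out / "xtts_raw.wav"
    final_wav = out / "converted.wav"
    xtts_cmd = [
        "tts",  # Coqui TTS CLI
        "--model_name", "tts_models/multilingual/multi-dataset/xtts_v2",
        "--text", text,
        "--speaker_wav", ref_wav,      # zero-shot conditioning audio
        "--language_idx", "en",
        "--out_path", str(xtts_wav),
    ]
    seedvc_cmd = [  # assumed interface, verify against the seed-vc repo
        "python", "inference.py",
        "--source", str(xtts_wav),
        "--target", ref_wav,
        "--output", str(final_wav),
    ]
    return [xtts_cmd, seedvc_cmd]

cmds = build_pipeline("Hello there.", "speaker_ref.wav")
# execute with: for c in cmds: subprocess.run(c, check=True)
```

The cleaner your `speaker_ref.wav` conditioning audio is, the better both stages behave, which matches the note above about de-noised input data.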
At the very least, wouldn't you have to provide 1 sample? Which would make it "few shot" (if that term really even makes sense in this context).
It would be more like training examples if you had to give it specific phrases.
Or do you think they should analyze the text's sentiment and raise a flag if the sentiment is obviously breaking the EULA, e.g. some kind of hate speech?
How would you implement that?
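One naive answer: gate synthesis behind a text classifier and refuse obvious violations before any audio is generated. A real deployment would use a trained toxicity model or a hosted moderation API rather than keywords; this stub (with placeholder terms, not a real blocklist) just illustrates the shape of the gate.

```python
# Minimal sketch of a pre-synthesis content gate. BLOCKLIST holds placeholder
# tokens; a production system would call a moderation classifier instead.

BLOCKLIST = {"slur_a", "slur_b"}  # placeholders, not real terms

def violates_policy(text: str) -> bool:
    """Very crude check: flag if any normalized word hits the blocklist."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return not BLOCKLIST.isdisjoint(words)

def synthesize(text: str) -> str:
    """Refuse flagged text; otherwise hand off to the actual TTS engine."""
    if violates_policy(text):
        raise ValueError("text flagged by content policy; refusing to synthesize")
    return f"[audio for: {text}]"  # stand-in for the real TTS call
```

The hard part, of course, is the classifier itself: keyword lists are trivially evaded, which is presumably why the question is non-trivial in the first place.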
Also I found no way to filter/sort the voice selection modal on language, so I have to visually search the entire list.
Our whole team is on ElevenLabs and a switch is significant work, but I think the results are worth it! Super awesome work!
I wonder which comparison they hope to avoid, quality or price?
In brief I’d like to be able to generate conversations via api choosing voices that should be unique on the order of thousands. Essentially I’m trying to simulate conversations in a small town. Eleven is not set up for this.
Ideally I’d be able to pick a spot in latent space for a voice programmatically. But I’m open to suggestions.
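One way to "pick a spot in latent space" programmatically, assuming your TTS engine accepts a speaker-embedding vector (XTTS-style models carry one internally): derive a deterministic embedding per villager by seeding a RNG from their ID and taking a convex combination of a few real reference-speaker embeddings. Everything here (the embedding dimensionality, the jitter scale) is an illustrative assumption, not any vendor's API.

```python
# Sketch: thousands of unique-but-reproducible voices from a handful of
# reference speaker embeddings. A stable hash of the speaker ID seeds the
# RNG, so "villager_001" always maps to the same point in latent space.

import hashlib
import numpy as np

def voice_embedding(speaker_id: str, refs: np.ndarray) -> np.ndarray:
    """refs: (n_refs, dim) array of reference speaker embeddings."""
    seed = int(hashlib.sha256(speaker_id.encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(np.ones(len(refs)))    # convex combination ...
    emb = weights @ refs                           # ... stays near real voices
    emb += rng.normal(scale=0.01, size=emb.shape)  # small jitter for uniqueness
    return emb
```

Interpolating inside the hull of real voices tends to keep outputs natural-sounding, while the per-ID jitter keeps a town of thousands from collapsing onto a few identical speakers.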