
oezi 5mo ago

> generated from text via a text-to-speech

Yes, frustratingly we don't have good speech-to-text (STT/ASR) to transcribe such differences.

I recently finetuned a TTS* to be able to emit laughter, and hunting for transcriptions that include non-verbal sounds was the hardest part of it. Whisper and other popular transcription systems ignore sighs, sniffs, laughs, etc., and can't detect mispronunciations.

* = https://github.com/coezbek/PlayDiffusion
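
For illustration only, here is a rough sketch of what "hunting" for such transcriptions could look like. It assumes the transcripts mark non-verbal sounds with bracketed or parenthesized tags such as `[laughs]` or `(sighs)`. That tag convention, the word list, and the `find_nonverbal` helper are all hypothetical and not taken from PlayDiffusion or Whisper.

```python
import re

# Hypothetical convention: non-verbal events appear as tags like [laughs],
# (sighs) or <sniff>. Real datasets may use a different markup entirely.
NONVERBAL = re.compile(
    r"[\[(<]\s*(laugh|sigh|sniff|cough|gasp)\w*\s*[\])>]",
    re.IGNORECASE,
)

def find_nonverbal(transcripts):
    """Return (text, tags) pairs for transcripts with at least one non-verbal tag."""
    hits = []
    for text in transcripts:
        # Collect the normalized stem of every tag found in this transcript.
        tags = [m.group(1).lower() for m in NONVERBAL.finditer(text)]
        if tags:
            hits.append((text, tags))
    return hits

samples = [
    "Well, that went well. [laughs]",
    "I don't know what to say.",
    "(sighs) Fine, let's try again.",
]

for text, tags in find_nonverbal(samples):
    print(tags, "|", text)
```

A filter like this only finds sounds that a human transcriber already wrote down, which is why it does not help with plain ASR output that drops them.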


jasonjayr 5mo ago

IIRC, the 15.ai dev was training on fan-made "My Little Pony" transcriptions, specifically because they included more emotive cues in the transcription and supported a syntax to control the emotive aspect of the speech.

dotancohen 5mo ago

Where can I read about this?

jasonjayr 5mo ago

> During this phase, 15 discovered the Pony Preservation Project, a collaborative project started by /mlp/, the My Little Pony board on 4chan.[47] Contributors of the project had manually trimmed, denoised, transcribed, and emotion-tagged thousands of voice lines from My Little Pony: Friendship Is Magic and had compiled them into a dataset that provided ideal training material for 15.ai.[48]

From https://en.wikipedia.org/wiki/15.ai#2016%E2%80%932020:_Conce...
