
0 points · jpcl · 2y ago · 0 comments

For Polish I have around 700 hours. I suspect we will need fewer hours if we add more languages, since they overlap to some extent.

Fixed transcripts would be nice, although we need to align them with the audio very precisely (we cut the audio into 30-second chunks and pretty much need the exact text for every chunk). It seems this can be solved with forced alignment algorithms, but I have not dug into that yet.
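To make the chunking requirement concrete, here is a minimal sketch of how word-level forced-alignment output could be turned into per-chunk text. The `Word` record and the `chunk_transcript` function are hypothetical names for illustration, not part of any existing pipeline, and the sketch assumes an aligner has already produced a start and end time for each word. Each word is assigned to the chunk that contains its start time, which is a simplification: a word that straddles a 30-second boundary would need extra handling.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One aligned word (hypothetical format): text plus times in seconds."""
    text: str
    start: float
    end: float


def chunk_transcript(words, chunk_seconds=30.0):
    """Group aligned words into fixed-length chunks by their start time.

    Returns a dict mapping chunk index to the text spoken in that chunk.
    Chunks without any words are simply absent from the result.
    """
    chunks = {}
    for word in words:
        # Index of the fixed-length window that contains the word's start.
        index = int(word.start // chunk_seconds)
        chunks.setdefault(index, []).append(word.text)
    return {i: " ".join(texts) for i, texts in sorted(chunks.items())}
```

With this layout, chunk `i` covers the audio from `i * 30` to `(i + 1) * 30` seconds, so the returned text can be paired directly with the cut audio files.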


yorwba · 2y ago

I have forced alignments, too.

E.g. for The True Story of Ah Q: https://github.com/Yorwba/LiteratureForEyesAndEars/tree/mast... There, .align.json is my homegrown alignment format, .srt files are standard subtitles, and .txt is the text. Note that in some places I have [[original text||what it is pronounced as]] annotations to make the forced alignment work better (e.g. the "." in LibriVox.org, pronounced as 點 "diǎn" in Mandarin). Oh, and cmn-Hans is the same thing transliterated into Simplified Chinese.

The corresponding LibriVox URL is predictably https://librivox.org/the-true-story-of-ah-q-by-xun-lu/
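For anyone consuming the .txt files, the [[original text||what it is pronounced as]] annotations described above could be separated roughly like this. This is only a sketch based on the syntax as described in the comment: the `split_annotations` helper is a hypothetical name, not code from the repository, and it assumes annotations never nest.

```python
import re

# Matches [[original text||what it is pronounced as]]; assumes no nesting.
ANNOTATION = re.compile(r"\[\[(.*?)\|\|(.*?)\]\]")


def split_annotations(text):
    """Return (display_text, spoken_text) for an annotated string.

    display_text keeps the original written form; spoken_text substitutes
    the pronounced form, which is what an aligner would be matched against.
    """
    display = ANNOTATION.sub(lambda m: m.group(1), text)
    spoken = ANNOTATION.sub(lambda m: m.group(2), text)
    return display, spoken
```

For example, `split_annotations("LibriVox[[.||點]]org")` yields `"LibriVox.org"` for display and `"LibriVox點org"` for alignment.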

jpcl (OP) · 2y ago

Thanks, I'll check it out. I don't know any Chinese so I'll probably reach out to you for some help :)


thorum · 2y ago

You might check out this list from espnet. It lists the different corpora they use to train their models, sorted by language and task (ASR, TTS, etc.):

https://github.com/espnet/espnet/blob/master/egs2/README.md
