Initially, I just used the standard en-us acoustic model, the generic US English language model, and its associated phonetic dictionary. This was the baseline for judging accuracy. It was OK, but neither fast nor very accurate (likely due to my accent and speech defects). I'd say it was about 70% accurate.
Simply reducing the size of the vocabulary boosts accuracy, because there is that much less chance of a mistake; it also improves recognition speed. For each of my use cases (home and desktop automation), I created a plain-text file with the relevant command words, then used their online tool [1] to generate a language model and phonetic dictionary from it.
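As a concrete sketch (the file name and phrases below are made-up examples, not my actual command list), the corpus you feed the lmtool is just one phrase per line:

```shell
# Write one command phrase per line; the lmtool accepts a plain sentence corpus.
cat > commands.txt <<'EOF'
open browser
open editor
open email
shutdown
suspend
EOF

# Upload commands.txt at the lmtool page [1]; it hands back a language model
# (e.g. 1234.lm) and a phonetic dictionary (1234.dic) that you can pass to
# pocketsphinx via its -lm and -dict options.
```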
For the acoustic model, there are two approaches: "adapting" and "training". Training builds a model from scratch, while adapting tweaks a standard acoustic model to better match a personal accent, dialect, or speech defect.
I found training as described in [2] rather intimidating, and never tried it out. It is likely to take a lot of time (a couple of days at least, I think, based on my adaptation experience).
Instead, I "adapted" the en-us acoustic model [3]. It took about an hour to come up with some grammatically correct text that included all the command words and phrases I wanted. Then I read it aloud while recording with Audacity. I attempted this multiple times, fiddling with microphone volume and gain, trying to block ambient noise (I live in a rather noisy environment), and redoing takes; around 8 hours altogether, with breaks. Finally, generating the adapted acoustic model took about another hour.
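For the curious, that final step looks roughly like this. This is abridged from the adaptation tutorial [3], not my exact invocation: file names like adapt.fileids and adapt.transcription are placeholders, and the bw flags must match the model's feat.params, so treat it as a shape, not a recipe to paste.

```shell
# 1. Extract acoustic features from the recorded WAV files
sphinx_fe -argfile en-us/feat.params -samprate 16000 \
    -c adapt.fileids -di . -do . -ei wav -eo mfc -mswav yes

# 2. Accumulate observation counts against the baseline model
bw -hmmdir en-us -moddeffn en-us/mdef.txt \
    -ts2cbfn .ptm. -feat 1s_c_d_dd -cmn current -agc none \
    -dictfn cmudict-en-us.dict \
    -ctlfn adapt.fileids -lsnfn adapt.transcription -accumdir .

# 3. Copy the baseline model, then update the copy with MAP adaptation
cp -a en-us en-us-adapt
map_adapt -meanfn en-us/means -varfn en-us/variances \
    -mixwfn en-us/mixture_weights -tmatfn en-us/transition_matrices \
    -accumdir . \
    -mapmeanfn en-us-adapt/means -mapvarfn en-us-adapt/variances \
    -mapmixwfn en-us-adapt/mixture_weights \
    -maptmatfn en-us-adapt/transition_matrices
```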
About 95% of the time it understands what I say; the other 5% of the time I have to repeat myself, especially with phrases.
I did this on both a desktop and a Raspberry Pi; the Pi is the one managing home automation. I'm happy with it :)
[1]: http://www.speech.cs.cmu.edu/tools/lmtool-new.html
[2]: http://cmusphinx.sourceforge.net/wiki/tutorialam
[3]: http://cmusphinx.sourceforge.net/wiki/tutorialadapt
PS: Reading their documentation and searching for downloads takes more time than the actual task. They really need to improve those.
I was interested in automatically transcribing my own audio reminders and other such recordings, taken on the PC or on a portable voice recorder, which is what prompted my earlier trials. But at the time nothing worked out well enough, IIRC.
My current desktop automation is doing command recognition. Commands like "open editor / email / browser", "shutdown", "suspend"... about 20 commands in all. 'pocketsphinx_continuous' is started as a daemon at startup and keeps listening in the background (I'm on Ubuntu).
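A minimal sketch of how such a setup can be wired together. The command names, launched programs, and LM/dictionary file names are made-up examples (not my actual 20-command set), and I'm assuming the recognizer prints lower-case hypotheses; match whatever case your dictionary actually produces:

```shell
# Map each recognized phrase to the program it should launch.
# Echoing the action (instead of running it here) keeps the mapping testable.
dispatch() {
    case "$1" in
        "open browser") echo "firefox" ;;
        "open editor")  echo "gedit" ;;
        "open email")   echo "thunderbird" ;;
        "suspend")      echo "systemctl suspend" ;;
        "shutdown")     echo "systemctl poweroff" ;;
        *)              echo "" ;;   # anything else: do nothing
    esac
}

# Wire it to the recognizer at login, e.g.:
# pocketsphinx_continuous -inmic yes -lm commands.lm -dict commands.dic 2>/dev/null \
#     | while read -r hyp; do
#           cmd="$(dispatch "$hyp")"
#           [ -n "$cmd" ] && $cmd &
#       done
```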
I think that, from a speech recognition internals point of view, transcription is more complex than recognizing these short command phrases. The training or adaptation corpus would have to be much larger than what I used.