Nerd-dictation, hackable speech to text on Linux

(github.com)

202 points · ideasman42 · 4y ago · 62 comments

abetusk · 4y ago

I've never even heard of VOSK-API [0], the underlying offline speech to text engine that this project uses.

Does anyone have experience using it? Is it any good?

[0] https://github.com/alphacep/vosk-api

commoner · 4y ago

Vosk powers Dicio, a free and open source voice assistant for Android. If you have an Android device, this app is another way to try out Vosk:

- F-Droid: https://f-droid.org/packages/org.dicio.dicio_android/

- Source: https://github.com/Stypox/dicio-android

- HN: https://news.ycombinator.com/item?id=29762526

The accuracy of the English language recognition is not bad. I'm glad to see an implementation of Vosk for desktop Linux.

flas9sd · 4y ago

I can second Dicio as a way to give Vosk a try. For a local model it worked surprisingly well. But you can't yet mix languages mid-sentence, which is difficult when searching for restaurants that have English names but are not located in an English-speaking country.

woodson · 4y ago

Vosk-api isn't an STT engine itself; it is built using the Kaldi speech recognition toolkit (https://github.com/kaldi-asr/kaldi) and nicely implements and packages an API for Kaldi chain/LF-MMI models.

Arnavion · 4y ago

I use it to transcribe English robocalls. Vosk gets all the words right as long as I use the "Accurate generic US English" model. PocketSphinx (with the default en-us.lm.bin model in the distro package, no idea what it is) didn't get a single word right IIRC. I didn't try anything else.

follower · 4y ago

Yeah, I was really impressed with the project when I encountered it last year when trying out a bunch of FLOSS Speech-To-Text options.

It was significantly better than the other FLOSS options I looked at--both in terms of getting it going initially & the quality of the speech to text results.

I tested it with a lightly modified version of this example script: https://github.com/alphacep/vosk-api/blob/master/python/exam...

What I found particularly interesting was when you have the "partial" recognition output shown in real-time you get to see how--at the end of a sentence--it may change a word earlier in the sentence in the final recognition output based on (I guess) the additional context of the full sentence.

(I just did a quick test again (with the installs from my testing last year) using an internal laptop microphone & the test script recognized a significant chunk of my speech (using a headset definitely improves things though) whereas with the same environment a test with `mic_vad_streaming` (from `DeepSpeech-examples-r0.9` with `deepspeech-0.9.0-models.pbmm`) failed to recognize any words at all.)
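
The partial-versus-final behaviour described above can be observed with a few lines against the Vosk Python bindings. This is a minimal sketch rather than the example script linked above: it assumes `pip install vosk`, an unpacked model directory, and a 16-bit mono PCM WAV file. The function names `parse_result` and `stream_results` are hypothetical.

```python
import json
import wave


def parse_result(raw_json):
    """Return (kind, text) from a Vosk JSON result string.

    Partial results carry a "partial" key; finished utterances carry "text".
    """
    data = json.loads(raw_json)
    if "partial" in data:
        return "partial", data["partial"]
    return "final", data.get("text", "")


def stream_results(wav_path, model_dir, chunk_frames=4000):
    """Yield (kind, text) tuples while feeding a WAV file to Vosk."""
    # Imported lazily so parse_result() works without Vosk installed.
    from vosk import KaldiRecognizer, Model

    with wave.open(wav_path, "rb") as wav:
        rec = KaldiRecognizer(Model(model_dir), wav.getframerate())
        while True:
            data = wav.readframes(chunk_frames)
            if not data:
                break
            if rec.AcceptWaveform(data):
                # End of an utterance: this text may differ from the
                # partials that were shown while it was being spoken.
                yield parse_result(rec.Result())
            else:
                yield parse_result(rec.PartialResult())
        # Flush whatever is still buffered in the recognizer.
        yield parse_result(rec.FinalResult())
```

Printing each tuple from `stream_results("speech.wav", "model")` (both paths are placeholders) should show a run of growing "partial" lines followed by a "final" line, which is where an earlier word can be seen to change.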

Nimitz14 · 4y ago

It's very well known among ppl who know the field. It's quite good, the lead has a nice blog too.

foothebar · 4y ago

Results depend heavily on which speech files you use. You can even guess which it was, looking at the errors it makes.

yjftsjthsd-h · 4y ago

This is better than any other speech-to-text setup I've ever encountered, for one simple reason: I followed the dead-simple install steps in the readme, started the program, and it worked. Bonus points for the install being a git clone and pip install away. I don't know why this is a hard bar to clear, but bravo. (I suspect that it's because a lot of FOSS speech recognition is from academia where "follow the following 13 steps, including hand-crafting recognition parameters" is more normal and acceptable because everyone involved is already a domain expert, whereas I, as a user, just want "plug in a mic, run this thing, and get text on stdout".)

melony · 4y ago

Most TTS and speech synthesis can be easy to install if you get rid of the GPU requirement. Both AMD and Nvidia have horrible workflows for installing their drivers and neural network/linear algebra kernels. Real-time speech recognition/synthesis on generic consumer-grade Intel/AMD cores is very, very difficult to do well, which is why most providers are cloud based. (The alternative is targeting Mac only, as they have standardized hardware everywhere.)

yjftsjthsd-h · 4y ago

Yeah, text-to-speech is and has been easy for ages; I'm pretty sure I used espeak like a decade ago. On the other hand, I have tried... pretty much all the big names in speech-to-text, without success, or at best "I kind of got the demo to work but couldn't figure out how to do anything useful with it". Kaldi, sphinx, julius, a handful of tiny PoC things I found online... maybe I'm just bad at following instructions, or I'm trying to do something that they're not trying to optimize for, but I have not had a good time.

MisterTea · 4y ago

You used to be able to configure, make and then install software from source quite easily. Maybe you were missing a devel lib or two, so you installed them. Now it's grown into a dependency hell that's so bad we now employ a myriad of package managers and container platforms to manage the disaster. I blame the package manager fetish.

johnisgood · 4y ago

> I followed the dead-simple install steps in the readme, started the program, and it worked. Bonus points for the install being a git clone and pip install away. I don't know why this is a hard bar to clear, but bravo.

Woah, we really do write 2022.

yjftsjthsd-h · 4y ago

What?

johnisgood · 4y ago

What you said used to be the standard. In fact, it used to be using your Linux distribution's package manager, which is even more convenient. I cannot even imagine a piece of software that you cannot get working as easily as cloning the git repository and then following the instructions, instructions that are typically pretty easy to follow and are supposed to work, or using a programming language's package manager, or your Linux distribution's or BSD's package manager.

At any rate, what I am trying to say is that if poorly documented software (i.e. usually with untested documentation) is this common, then we are definitely doing something wrong. You should be able to follow the installation instructions and it should work, i.e. just read INSTALL or README and follow the instructions, like good old times!

You said it yourself: "I don't know why this is a hard bar to clear, but bravo.". It should not be, it should be expected, and it should be done. It should not be a magical or surprising thing.

zelphirkalt · 4y ago

Has anyone used this somehow inside Emacs or knows how to make Emacs take its output and put it into a buffer?

dotancohen · 4y ago

Just open emacs.

This program outputs like a keyboard. And, in English at least, it works really well. I cannot believe it.

tmalsburg2 · 4y ago

I’ve used it for a while in German and in English and I was impressed, too, with its recognition performance. Even the small and therefore fast language models perform decently. However, a major downside is that it doesn’t do punctuation, new lines, capitalization, and similar. This means that you have to edit a lot of the recognized text by hand, which obviously spoils the fun. Having said that, the front end code is in python and you can easily hack it. With a small handful of lines of code I was able to address some of these issues somewhat.
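
As an illustration of the kind of small hack described above (this is not tmalsburg2's actual code), here is a sketch of a post-processing function that turns a few spoken punctuation words into symbols and capitalizes sentence starts. The spoken-word table and the function name `punctuate` are assumptions made for the example; how to wire it into nerd-dictation's front end is left to the README.

```python
import re

# Hypothetical spoken-word table; extend it to taste.
SPOKEN_PUNCTUATION = {
    "question mark": "?",
    "exclamation mark": "!",
    "full stop": ".",
    "new line": "\n",
    "comma": ",",
    "period": ".",
}


def punctuate(text):
    """Replace spoken punctuation words and capitalize sentence starts."""
    # Longest phrases first so multi-word entries are handled before
    # any shorter entry could interfere.
    for phrase in sorted(SPOKEN_PUNCTUATION, key=len, reverse=True):
        symbol = SPOKEN_PUNCTUATION[phrase]
        # Swallow the whitespace before the phrase so the symbol
        # attaches to the preceding word.
        pattern = r"\s*\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, lambda m: symbol, text)
    # Capitalize the first letter of the text and of every sentence.
    text = re.sub(
        r"(^|[.?!]\s+|\n\s*)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    # The recognizer emits a lowercase "i" for the pronoun.
    return re.sub(r"\bi\b", "I", text)
```

The obvious trade-off is that the words in the table can no longer be dictated literally: saying "a long period of time" would produce a full stop, so a real setup would want an escape word or less common trigger phrases.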

nshm · 4y ago

We are working on punctuation/capitalization. Models are ready but not yet integrated. German BERT-based model is available at https://alphacephei.com/vosk/models#punctuation-models for example. Should be ready soon.

dotancohen · 4y ago

Reading my comment back, right here in the browser:

  > just open a max this program output like a keyboard and in
  > english at least it works really well i cannot believe it

The only problems that I see are:

1. Capitalization and punctuation.

2. Doesn't know what emacs is, so it got that wrong. A user-installed dictionary might help here.

3. "outputs" came out as "output". I just tried a few more times, and I got the same results. I suspect that like "emacs", the word "outputs" is not in the dictionary.
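
Short of a real user dictionary, one workaround for problem 2 is a table of known mishearings applied to the recognized text. This is only a sketch: the table entries and the name `apply_corrections` are hypothetical, and it does nothing for problem 3, where the recognizer drops an inflection rather than substituting a fixed phrase.

```python
import re

# Hypothetical mishearings mapped to the intended words.
CORRECTIONS = {
    "a max": "emacs",
    "e max": "emacs",
}


def apply_corrections(text, corrections=CORRECTIONS):
    """Replace whole-word occurrences of known mishearings."""
    for heard, intended in corrections.items():
        pattern = r"\b" + re.escape(heard) + r"\b"
        text = re.sub(pattern, lambda m: intended, text)
    return text
```

The word boundaries keep "a max" from matching inside "a maximum", but the table still rewrites any genuine use of the phrase, so it only suits words you rarely say on purpose.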

ideasman42 (OP) · 4y ago

Regarding 1) I have 2 key bindings: one that starts a sentence and another that doesn't. While punctuation remains an issue (commas, brackets and question marks for example), since I'm mostly using this to save typing longer passages of text, having to manually deal with punctuation isn't all that much of a hassle. But I can understand anyone attempting to go completely hands free would need something to support entering literal characters and punctuation.

plafl · 4y ago

It is a matter of time before someone puts a dictation mode together. Given the design of Emacs key combinations you will be able (required) to also use rhythm and tone.

kristopolous · 4y ago

I'm throwing another hat in the ring as this technology totally working most of the time. I used it to write this comment.

This should make my life a lot easier because I find myself going to my phone and using the dictation feature a lot recently. It's not as good as the one on my android, but it's 95% of the way there.

dotancohen · 4y ago

Reading your comment with nerd-dictation returns this for me:

  > i'm throwing another hat in the ring as this technology totally working most
  > of the time i used it to write this comment this should make my life ah lot
  > easier because i find myself going to my phone and using third dictation
  > feature a lot recently it's not as good as the one on my android for it's
  > ninety five percent of the way fair

For use with no training that looks great. I'm sure that as I learn to speak more clearly, your 95% estimate is achievable.
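
The "95%" figure can be made concrete with a word error rate: the word-level edit distance between what was said and what was recognized, divided by the number of words said. The sketch below is a plain Levenshtein implementation, not something nerd-dictation or Vosk provides, and the name `word_error_rate` is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

To score the transcript above against the original comment, punctuation would first have to be stripped and forms like "95%" spelled out as "ninety five percent"; otherwise formatting differences are counted as recognition errors.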

ideasman42 (OP) · 4y ago

Interesting, are you using the full model? I found with a good microphone and the full 1 gigabyte language model that the quality is quite good compared to other people's phones I have used from time to time.

dotancohen · 4y ago

How did you add the punctuation and capitalization?

kristopolous · 4y ago

That was by hand but it's a small task.

dotancohen · 4y ago

I'm imagining remapping some VIM shortcuts for even easier capitalizing words and adding common punctuation.

This remap makes the "." key add a period after the previous word, capitalize the current word, then move on to the next word:

  :noremap . bea.<esc>w~w

2Gkashmiri · 4y ago

You know... I have an idea. How about we use vosk and this tech to integrate with ffmpeg somehow so that peertube videos can get subtitles while being transcoded. Once we get English SRT, we could use libretranslate to translate that English SRT to multiple languages.

This could be similar to what YouTube does with its automatic subtitles. What do you guys say?

follower · 4y ago

The package that this project is built on (vosk-api) mentions & includes some examples to demonstrate exactly that type of use case with ffmpeg:

"...continuous large vocabulary transcription, zero-latency response with streaming API, reconfigurable vocabulary and speaker identification ... can also create subtitles for movies, transcription for lectures and interviews."

* https://github.com/alphacep/vosk-api/blob/master/python/exam...

(Edit: Also, thanks for introducing me to "libretranslate", looks like an interesting project.)
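
For the subtitle idea, the missing piece between recognition and an SRT file is mostly formatting. The sketch below is not the linked vosk-api example: it assumes word entries shaped like the ones Vosk returns when word timings are enabled with `SetWords(True)` (dicts with "word", "start" and "end" in seconds), and the names `format_timestamp` and `words_to_srt` are hypothetical.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group word dicts with "word", "start", "end" keys into SRT cues."""
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        start = format_timestamp(chunk[0]["start"])
        end = format_timestamp(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        # Each cue is: sequence number, time range, text, blank line.
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Splitting on a fixed word count is the crudest possible choice; breaking cues at long pauses between one word's "end" and the next word's "start" would likely read better.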

2Gkashmiri · 4y ago

Great. Someone should link the peertube GitHub with this. I am sure the great people will do it much faster and more elegantly :-)

harryvederci · 4y ago

Sounds great! Do it!

sundarurfriend · 4y ago

I was wondering how well it dealt with accents, then I saw that the Vosk API page specifically mentions "English, Indian English, German, French, ..." :D I don't know the story behind "Indian English" specifically being listed as a separate language, but I'm glad to see it's supported.

bruce343434 · 4y ago

Well, I'm not Indian but I can see it being a separate dialect, much like American English. For instance, an Indian might say "I have a doubt" instead of "I have a question". And as you mentioned, there is an accent, just like with American English.

dotancohen · 4y ago

Adding Indian English was definitely doing the needful.

sundarurfriend · 4y ago

That's true, but that's why I mentioned "specifically listed" in my comment - the same things apply to Australian English, for example, and possibly to Filipino English and South African English and many others. Indian English is the only _dialect_ mentioned in the list, the rest are languages, which made me curious.

The reason is probably something pragmatic, like perhaps a large enough corpus was available for that specifically.

suifbwish · 4y ago

Very cool. Does it have an erotic voice? Asking for a friend.

commoner · 4y ago

Nerd Dictation does speech-to-text (voice recognition), not text-to-speech. If you want to speak to your computer in an erotic voice, nobody's stopping you.

jancsika · 4y ago

There is actually an API for this:

evStart: start speaking in an erotic voice

evStop: stop speaking in an erotic voice

evQuery: query whether you are speaking in an erotic voice

evLinuxClassic: enable the inability to speak until Firefox is closed (experimental)

dvh · 4y ago

Use Festival and install a Scottish or French voice (whatever floats your boat)

allanrbo · 4y ago

Nice. Another notable mention in this space is Talon. Useful for automating all OS tasks with voice commands, as well as just dictation: https://talonvoice.com/

yencabulator · 4y ago

Talon has an EULA that is enough to send me scrambling for the hills: https://talonvoice.com/EULA.txt

Meanwhile, the meat of this speech recognition is Vosk, which is just Apache-2: https://github.com/alphacep/vosk-api/blob/master/COPYING

dotancohen · 4y ago

Running it as a user prompts:

  > $ ./run.sh 
  > [+] Prompting for admin to set up Tobii udev rule
  > [sudo] password for dotancohen:

That does not build trust. I would prefer an instruction on how to set up a udev rule, or better yet, I would prefer that requirement to be relaxed. What does it need more than standard microphone access that e.g. nerd-dictation or even Telegram need?

deknos · 4y ago

Is there a good offline program for text to speech for German, French, Spanish, and English? And no, Festival and espeak are not what I would consider good.

The AT&T website with text to speech as an audio file, which was used in these anonymous publications, is good, but not espeak. If I had something like this for European (and Russian and Arabic) languages as an open source standalone, I would be happy :(

follower · 4y ago

Yes!

The project is called Larynx, and it is amazing: https://github.com/rhasspy/larynx/

I waxed lyrical about it recently in this thread about private alternatives to Alexa: https://news.ycombinator.com/item?id=29562526

I can only vouch for the quality/variety in English but it does note support for 50 voices over 9 languages, including all the first group of languages you mentioned, and also Russian. (I've "played" with all those languages to test them but can't really vouch for how a native speaker/listener might find it. :D )

It is miles ahead of any of the other Free/Open Source TTS solutions I've tried, including the ones you mentioned.

(It's still synthesized speech but the output quality is so good and the project is still extremely early days.)

And there's a range of options in accent & gender--which are in general sorely lacking in other FLOSS TTS options. (In terms of licensing, some voices are licensed more freely than others but the majority are without significant restriction.)

I like Larynx so much that I've been working on an editor for it to assist in "auditioning" & recording speech in a narrative context, e.g. game/film pre-viz.

deknos · 4y ago

Thanks, I will look it up! Thank you! :)

phantom_oracle · 4y ago

This is such an amazing technology for the many tech people who are having to deal with hand/finger/elbow issues after extensive usage for years on their keyboards.

I was looking for this type of tech for at least 2 years and I am glad it now exists.

FOSS is amazing!

zoomablemind · 4y ago

Vosk, is it "wax" in Russian ("воск")?

I think of wax recording rolls - old days CDs, aka Phonograph cylinder:

https://en.m.wikipedia.org/wiki/Phonograph_cylinder
