https://github.com/dannguyen/watson-word-watcher
One of the great things about it is the word-level timestamp and confidence data it returns... here are a few supercuts I've made from the presidential primary debates:
https://www.youtube.com/watch?v=VbXUUSFat9w&list=PLLrlUAN-Lo...
It's not perfect by any means, but the granular results give you a place to start from... here's a supercut of cuss words from a well-known episode of The Wire. Watson heard only 59 such words, even though one scene alone contains 30+ F-bombs:
https://www.youtube.com/watch?v=muP5aH1aWUw&feature=youtu.be
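To give a sense of how word-level timestamps enable this kind of cutting, here's a minimal sketch: given (word, start, end) tuples like the ones Watson returns, emit one ffmpeg trim command per hit. The target set, padding, and filenames are invented for illustration; this isn't the repo's actual code.

```python
# Sketch: turn word-level timestamps into ffmpeg cut commands for a supercut.
# The (word, start, end) tuples mimic Watson's timestamp output; TARGETS,
# PAD, and the filenames are made up for this example.
TARGETS = {"bomb", "terrorist"}
PAD = 0.15  # seconds of context kept around each word

def cut_commands(words, source="debate.mp4"):
    cmds = []
    for i, (word, start, end) in enumerate(words):
        if word.lower() in TARGETS:
            ss = max(0.0, start - PAD)          # clip start, clamped at 0
            dur = (end - start) + 2 * PAD       # word length plus padding
            cmds.append(
                f"ffmpeg -ss {ss:.2f} -i {source} -t {dur:.2f} "
                f"-c copy clip{i:03d}.mp4")
    return cmds

words = [("the", 1.0, 1.2), ("bomb", 1.2, 1.7), ("went", 1.7, 1.9)]
for cmd in cut_commands(words):
    print(cmd)
```

Concatenating the resulting clips (e.g. with ffmpeg's concat demuxer) would give the supercut.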
The service is free for the first 1000 minutes each month.
But you are saying you performed speech recognition on the full video and then edited it according to where the words you targeted were found? I liked the bomb/terrorist one; the others didn't seem to be "saying" anything.
The important takeaway is that the Watson API parses a stream of spoken audio (other services, such as Microsoft's Oxford, work only on 10-second chunks, i.e. they're optimized for user commands) and tokenizes it... what you get is a timestamp for when each recognized word appears, as well as a confidence level and alternatives if you so specify. Other speech-transcription options don't always provide this; I don't think PocketSphinx does, for example, nor does sending your audio to an mTurk-based transcription service.
Here's a little more detail about The Wire transcription, along with the JSON that Watson returns, and a simplified CSV version of it:
https://github.com/dannguyen/watson-word-watcher/tree/master...
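As a rough illustration of what flattening that JSON into a CSV looks like: the structure below follows Watson's documented response when timestamps and word confidence are requested (parallel `timestamps` and `word_confidence` lists under each alternative), but treat the exact field names as an assumption if your API version differs, and this isn't the repo's own converter.

```python
import csv, io, json

def watson_to_csv(raw_json):
    """Flatten Watson-style JSON into word,start,end,confidence rows."""
    data = json.loads(raw_json)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["word", "start", "end", "confidence"])
    for result in data.get("results", []):
        alt = result["alternatives"][0]  # the top-ranked hypothesis
        # timestamps and word_confidence are parallel lists of
        # [word, start, end] and [word, confidence]
        for (word, start, end), (_, conf) in zip(
                alt.get("timestamps", []), alt.get("word_confidence", [])):
            writer.writerow([word, start, end, conf])
    return out.getvalue()

sample = json.dumps({"results": [{"alternatives": [{
    "timestamps": [["shit", 12.3, 12.6]],
    "word_confidence": [["shit", 0.87]]}]}]})
print(watson_to_csv(sample))
```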
http://developer.att.com/apis/speech
Twilio has one that also requires payment:
https://www.twilio.com/docs/api/rest/transcription
It limits input audio to 2 minutes. And I would have to guess that its model is tuned specifically to phone messages, i.e. one speaker, relatively clear and focused audio, and certain probabilities of phrases.
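If you did want to push longer recordings through a service capped at 2 minutes, you'd have to split the audio first. A trivial sketch of just the boundary arithmetic (the actual cutting would need ffmpeg or similar); the overlap value is an invented fudge factor so words straddling a boundary land whole in at least one chunk:

```python
# Hypothetical helper: compute (start, end) spans for a service that caps
# input at `limit` seconds, with a small overlap between adjacent chunks.
def chunk_spans(total_seconds, limit=120.0, overlap=2.0):
    spans, start = [], 0.0
    while start < total_seconds:
        end = min(start + limit, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start = end - overlap   # back up so boundary words aren't lost
    return spans

print(chunk_spans(300.0))
# → [(0.0, 120.0), (118.0, 238.0), (236.0, 300.0)]
```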
For simple use cases like home automation or desktop automation, I think it's a more practical approach than depending on a cloud API.
[1] https://github.com/kastnerkyle/ez-phones
[2] https://www.reddit.com/r/MachineLearning/comments/3pr4v4/are...
It's the whole "memory is a process, not a hard drive" thing: voice recognition as it exists today is a slowly evolving graph built from input data. You could in theory compress the graph and make it available offline, but it would be hard to chop it up in a way that doesn't completely bust the recognition.
Well, I guess at some point this functionality will become part of the OS. Once OS X and Windows offer it, Linux can't stay behind, and we'll see open source speech recognition libraries.
Are there any academic groups working on this topic, and do they have prototype implementations?
HOWEVER:
The only continuous-dictation models available for Julius are Japanese, as it is a Japanese project. This is mainly an issue of training data. The VoxForge project is working towards releasing an English model once it collects 140 hours of training data (last time I checked it was around 130); but even so, the quality is likely to be far below commercial speech recognition products, which generally train on thousands of hours.
Apparently Kaldi is a lot better, but good luck setting it up!
[0] https://jasperproject.github.io/
[1] https://hn.algolia.com/?query=Jasper%20Project&sort=byPopula...
Aside from circumventing lag, I can also give it some personality. I want to name it Marvin, after the robot from H2G2, so that I can say:
"Marvin, turn the TV off"
"Here I am, brain the size of a planet, and you ask me to turn off the tv. Call that job satisfaction, 'cause I don't."
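A toy sketch of that personality layer: match the recognized transcript against known commands and pick a canned snarky reply. The command strings and responses here are invented, and the recognizer itself is out of scope:

```python
import random

# Hypothetical "Marvin" layer: map command phrases to canned replies.
# A real setup would feed this whatever transcript your recognizer returns.
RESPONSES = {
    "turn the tv off": [
        "Here I am, brain the size of a planet, and you ask me to "
        "turn off the tv. Call that job satisfaction, 'cause I don't.",
        "I suppose you want the TV off now. Fine.",
    ],
}
FALLBACK = "Life. Don't talk to me about life."

def marvin(transcript):
    for command, lines in RESPONSES.items():
        if command in transcript.lower():
            return random.choice(lines)   # vary the grumbling a little
    return FALLBACK

print(marvin("Marvin, turn the TV off"))
```

The actual device control would hang off the same match, before (or in spite of) the complaining.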
I also had to 'brew install portaudio flac swig' and a bunch of other python libs. By the time it ran, 'pip freeze' returned:
altgraph==0.12
macholib==1.7
modulegraph==0.12.1
py2app==0.9
PyAudio==0.2.9
pyobjc==3.0.4
pyttsx==1.1
SpeechRecognition==3.3.0
pocketsphinx==0.0.9
My fork of the gist is here: https://gist.github.com/ivanistheone/b988d3de542c1bdd6a90

Recognizing speech (speech-to-text) with the Python speech module
https://code.activestate.com/recipes/579115-recognizing-spee...
and
Python text-to-speech with pyttsx
https://code.activestate.com/recipes/578839-python-text-to-s...
Good stuff. I like this area.
It is good enough quality and a good start for those who cannot afford to pay for Google's API.
The link to the VLC library is pretty handy.
All of those libraries have Python 2.7 versions. Actually, for all of them you `pip install` the same library name; for pyttsx, `pip install pyttsx` and ignore jpercent's update.
I'm not sure what you mean about pricing and testing for development. Are you referring to Google's services? They offer 50 reqs/day for voice recognition on a free developer API key (https://www.chromium.org/developers/how-tos/api-keys). Google Translate can also be used by gTTS; it will rate limit or block you if you send too many reqs/min or per day without an appropriately registered API key, but you could play around with it for sure.
If voice recognition is important, it might be worth investigating Sphinx more and putting in the time to tweak their English language model files. Synthesis is more difficult, though I think the Windows SAPI, OS X NSSS, and eSpeak on *nix are all "good enough." There are also a range of commercial libraries.
After trying to tweak the threshold parameters without success, I just figured I'd add a custom key command to break the listening loop in my project.
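One way to wire up that escape hatch is a `threading.Event` the mic loop polls on each pass, with the key listener setting it from another thread. In this self-contained sketch a timer stands in for the real hotkey watcher, and a lambda stands in for the recognizer call:

```python
import threading
import time

stop = threading.Event()

def listen_loop(listen_once, poll=0.01):
    """Keep 'listening' until another thread sets the stop event."""
    heard = []
    while not stop.is_set():
        heard.append(listen_once())   # stand-in for recognizer.listen(...)
        time.sleep(poll)
    return heard

# Stand-in for a hotkey watcher: sets the event after ~50 ms.
threading.Timer(0.05, stop.set).start()
heard = listen_loop(lambda: "utterance")
print(f"captured {len(heard)} utterances before the break")
```

The nice property is that the loop exits regardless of what the silence/threshold heuristics are doing, since the break signal comes from outside the audio path entirely.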