So I have turned to a complex and highly unreliable software stack that provides both voice-to-text, and clumsy but limited control of Microsoft Windows, Chrome, etc. This includes Dragon Voice-to-text, Voice Computer, and Talon, plus a browser extension and heavy customization.
Users of Dragon will acknowledge that: (a) the software is a creaky dumpster fire built on archaic code, and (b) there is no viable alternative on the market.
My question is: *how is it that no one has built something better?* The market is huge, and the Natural Language Processing of "OK Google" and Siri are quite refined at this point.
References:
Dragon: https://www.nuance.com/dragon.html
Voice Computer: https://voicecomputer.com/
Talon: https://talonvoice.com/
Personally, I'll take a restricted set of actions if they work very reliably.
This is what I had back in the 80's with a Covox Voicemaster plugged into the joystick port of my Commodore 64. It could only understand a few phrases, but I could define those phrases, and it almost always worked.
If you define "high quality" as being able to respond to a seemingly infinite number of queries, but only understanding and replying correctly occasionally, then Siri is closer to what you want.
I have a handful of Echo Dots and Shows in places I don't mind the security risk, and they are maddeningly incompetent at doing anything in the real world other than telling the weather and acting as a voice-controlled radio (their main use...)
It would be interesting to go back to the Covox approach and rebuild it for today's tech from the ground up (shouldn't need the hardware anymore...), as it worked surprisingly well on computers that had fewer resources than many (most?) of today's microcontrollers...
"Send a slack to my wife" -> "Sorry, who do you want to text?"
Multi-fail.

I also have a friend, a gifted programmer, who lost his ability to type about a decade ago; he has put together an open-source software stack to help: http://www.cs.columbia.edu/~dwk/
Of course this doesn't really answer your question. But it's a hard problem, and you're basically forced to become a power user to reliably interact with your PC.
This reminds me of easy motion for vim or ace-jump for emacs.
Do you think it would be possible to have an on-demand contextual hat decoration?
Like, you say "show hats words" and only words get decorated with hats, and you pick one. That would let you show hats only on square brackets, or only on function arguments, etc. I find the number of colored hats a little hard to distinguish; if they were contextual, they would need fewer colors, or none at all.
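For what it's worth, the filtering idea is easy to sketch. This is a hypothetical illustration, not how Cursorless actually assigns hats; `assign_hats` and its predicate argument are invented names:

```python
import string

def assign_hats(tokens, predicate):
    """Assign single-letter hint labels to only the tokens matching a filter.

    With a contextual filter the candidate set is usually small enough
    that plain letters suffice and no color channel is needed.
    """
    targets = [i for i, tok in enumerate(tokens) if predicate(tok)]
    letters = string.ascii_lowercase
    return {letters[n]: i for n, i in enumerate(targets[:len(letters)])}

tokens = ["def", "area", "(", "width", ",", "height", ")", ":"]
hats = assign_hats(tokens, str.isidentifier)  # hats only on word-like tokens
```

With a filter like "only function arguments" the label set shrinks further, so you could even reuse the same few letters every time.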
Do you map voice commands to keyboard shortcuts available in VS Code, or directly via the APIs? (Not sure if there is a difference in the end.)
Now I wish for a Cursorless plugin on the IntelliJ platform.
I think it's a lot faster than keyboard / mouse, mostly because of how little moving of the cursor you have to do.
Could be I was slow to begin with, not super efficient with vim or emacs.
Also, "editing" is the fastest part for me, due to "bring" and "change". So little movement.
The big problem I see with voice interaction is that a human being will ask you questions to clarify what you said if they don't understand, and current systems don't even try. (Actually, the search paradigm lets you do some refinement; "OK Google" works amazingly well on Android TV.)
Superhuman accuracy at dictation doesn't translate to a useful ability to understand text. You're doing great if you only garble 1 out of 20 words. Some errors are inconsequential, but if it garbles every other sentence then you are going to feel 0% understood.
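To put rough numbers on that (simplistically assuming independent word errors):

```python
# Back-of-envelope: with a 5% per-word error rate and errors treated as
# independent, the chance a sentence comes out clean drops fast with length.
word_accuracy = 0.95
for length in (5, 10, 20):
    clean = word_accuracy ** length
    print(f"{length:2d} words: {clean:.0%} chance of an error-free sentence")
```

At 20 words per sentence, only about a third of sentences come out clean, so "garbles every other sentence" is, if anything, optimistic.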
It's interesting you mention that Google isn't good at dictation, as I've found it excellent on my Pixel 6 (maybe the quality varies depending on what hardware you're running on?). If I need to write out anything over a sentence or two on my phone, I'll almost always dictate it, and as long as I have a reasonable idea what I want to say beforehand, it works well.
What I personally find a little jarring is that I need to compose what I want in my head further in advance than I would if I'm typing, since correcting mistakes is more awkward.
I tend to over-enunciate, so I don't get many bad bugs in the parsing... but that doesn't stop the Google Assistant from delivering completely the wrong response to the words that it's showing me it has correctly recognized, or simply spinning endlessly and locking up my phone.
As an industry, we suck at everything. We've solved the hard problem but failed the easy part of "once the command has been parsed, either execute the action or show the user an error and then close the dialog".
The only thing I find really awful about speech-to-text on Google is that it can't seem to detect punctuation.
On the other hand, throwing an additional language model like GPT and BERT into the mix can help if you don't have a ton of voice data. In my attempt to do this, a large portion of the improvement came from letting the language model read the previous sentences in the conversation[2]. AFAIK most commercial systems are blissfully unaware of your previous sentences, leading to conversations like "set an alarm"/"sure when?"/"eightam"/"your nearest ATM is...".
A word of caution though: letting BERT/GPT edit the outputs also gives a (potentially) much more dangerous failure mode: if the speech signal is difficult to understand, the resulting transcript will be difficult for humans to identify as transcription failures.
For example, "yeah, I dunno I haven't..." (read on a noisy phone line in an obscure dialect) was transcribed as "yeah yeah not that is I I am then" by the baseline speech system. After we let BERT edit the outputs, the transcript became "yeah that's not what I was saying...". Which, ironically, was definitely not what the person was saying.
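The mechanism behind that failure mode can be sketched in a few lines. This is a toy stand-in, not the actual system from [2]: the "language model" here just penalizes repeated words as a crude fluency proxy, and the scores and weight are made up:

```python
def rescore(nbest, lm_score, lm_weight=0.5):
    """Rerank ASR n-best hypotheses by mixing acoustic and LM scores.

    nbest: list of (transcript, acoustic_logprob) pairs.
    """
    best = max(nbest,
               key=lambda h: (1 - lm_weight) * h[1] + lm_weight * lm_score(h[0]))
    return best[0]

def toy_lm(text):
    # Crude fluency proxy: penalize repeated words.
    words = text.split()
    return -2.0 * (len(words) - len(set(words)))

nbest = [
    ("yeah yeah not that is I I am then", -4.0),  # closer to the audio
    ("yeah that's not what I was saying", -6.0),  # fluent, but not what was said
]
# With the LM weighted in, the fluent hypothesis wins: it looks plausible,
# and is therefore harder for a human reader to flag as a transcription failure.
```

With `lm_weight=0.0` the acoustic best wins; raise the weight and the fluent-but-wrong transcript takes over, which is exactly the danger described above.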
[1] https://arxiv.org/abs/1911.08460, page 9
[2] https://arxiv.org/abs/2110.02267
edit: clarify why previous sentences matter
I think there are multiple reasons:
* The obvious market is dictation of natural language, but this isn't what you want for voice control. If you try to use long descriptive phrases as your command language, everything takes forever. So instead you end up making your own mini command language where all of your common actions are a single syllable, but now it's no longer the English or other natural language that users already know. So now your product has a substantial learning curve, just like learning a new keyboard layout.
* Everything other than talon has terrible latency. Most existing speech recognition engines were not designed with the kind of latency you want for quick one syllable commands.
* In order for it to be really effective you need the cooperation of applications (this is why I've written extensive emacs integration). Some tools like Windows Speech Recognition try to hook in at the UI layer in order to figure out what text is in dialog boxes and such, but in practice they seem to do a pretty terrible job. Windows Speech Recognition has a very hard time consistently understanding what links you are trying to get it to click on, for example. There's also a long tail of applications that just do their own custom UI rendering inside a blank canvas, where no hook is possible.
* Good speech recognition, even if not specifically targeting computer voice control, is a genuinely hard research problem, and standard benchmarks for accuracy are misleading. You see "95% accuracy" and think, wow, that's a high percentage, computers almost have this speech recognition thing solved. Then you think about it harder: wait a minute, that's one mistake every 20 words! Maybe you are still impressed, but then you have to take into account that when the computer does the wrong thing, you'll need to issue more commands in order to correct it, and those are also likely to be misinterpreted. When you make a typo with a keyboard the mistakes rarely cascade; you just hit backspace.
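The "mini command language" tradeoff from the first bullet is easy to see in miniature. A hedged sketch: the spoken tokens and action strings below are invented for illustration, whereas real setups define grammars like this in Talon or Dragonfly:

```python
# Map short, phonetically distinct spoken tokens to editor actions.
# Anything not in the command vocabulary falls through to free dictation.
COMMANDS = {
    "dell": "key:backspace",
    "slap": "key:enter",
    "troll": "key:ctrl-z",
    "disk": "key:ctrl-s",
}

def dispatch(utterance):
    """Turn a recognized utterance into a list of editor actions."""
    actions = []
    words = utterance.lower().split()
    for i, word in enumerate(words):
        if word in COMMANDS:
            actions.append(COMMANDS[word])
        else:
            # First unknown word: treat the rest of the utterance as dictation.
            actions.append("type:" + " ".join(words[i:]))
            break
    return actions
```

The cost is exactly the learning curve described above: tokens like "dell" and "slap" are fast to say and easy to recognize, but they are vocabulary you have to learn, like a new keyboard layout.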
So, moral of the story: if you do too good a job of making a fast speech engine, especially for multi-turn dialogues, add some delays so it resembles human dialogue more.
At some point when I have enough free time I will have to take a look at this! Thanks for putting time into this kind of thing!
So to set a kitchen timer: "wake me up in 11 minutes"
Windows Speech Recognition has been around forever (out of the box since XP); its UI is "serviceable" but not great. (It was slightly better when Cortana was briefly "out of the box" in Windows 10, but has reverted some since.) But I don't think you need to pay for Dragon (or its high memory consumption) if you don't mind taking the time to learn the quirks of Windows Speech Recognition directly. Most of Dragon's quirks are Windows' quirks anyway, papered over with a UI that makes it seem like they are adding value.
Also yeah, one of the answers to "how is it that no one has built something better?" is: Well, Microsoft tried with Cortana, got a huge blowback that "no one" wanted Cortana on their PCs, and gave up.
It works very well for some people; many have written books with it.
This is fairly insulting, as RSIs are very much a real thing.
Does this community also think that wheelchair ramps should never be invested in because stairs are clearly superior?
I’d rather see the brain power in this community focused on solutions. Keyboard + mouse have lasted so long because they work surprisingly well, but I hope there is a day that we dream up something better that does not require slowly giving ourselves carpal tunnel.
Regarding the software packages you referenced: Yes, Dragon is trash that I want nothing to do with, because of its inefficient interface, its complete inability to accurately understand my voice, and its generally shoddy software quality. Voice Computer (which I hadn't seen before) is therefore eliminated as well, though it doesn't look terrible as a front end to Dragon to better use the OS GUI-accessibility info. Many people like Talon, but I demand something open, which I can modify to suit my needs.
Background: I develop kaldi-active-grammar (https://github.com/daanzu/kaldi-active-grammar), a free and open source speech recognition backend usable by Dragonfly, itself entirely by voice. There's also a community of voice coders using Dragonfly and other tools that build on top of it, such as Caster (https://github.com/dictation-toolbox/Caster).
I bootstrapped writing it initially using the Dragonfly WSR (windows speech recognition) backend, because that gave me the best accuracy out of the available options at the time. All of my development of it since the initial working version has been done using each previous version, so now it is basically bootstrapped itself. My productivity skyrocketed once I switched to Kaldi, due to being able to use my custom trained speech model just for my voice for orders of magnitude better accuracy, plus dramatically lower latency. (And it freed me from being dependent on closed software out of my control.)
I bootstrapped my personal speech model by retaining the commands from me using WSR. My voice is quite abnormal, and it took only 10 hours of speech data to train a model dramatically more accurate than any generic model I've ever used. And of course, I retain much of my usage now with Kaldi, so my model improves more and more over time. A virtuous flywheel!
Simple dictation could be done at the DE level, where the voice-to-text stream would be diverted to the keyboard input of the active app. It could also be done at the app level, but this is one feature I think belongs a level up, so it can be used by apps that aren't voice-enabled.
Compared to a keyboard, you lose positional logic (WASD in games). You lose shortcuts. You lose control over capitalization and formatting. You lose punctuation. You lose non-text input (code; dictating code sounds like a horrible pain). You lose function keys. And, of course, you lose speed (think of the instant things you do with shortcut keys, like alt-tab). Not to mention that you lose the ability to work in silence.
Make the recognition quality gorgeous, and it will still be a less flexible product than what we use today. It has value for accessibility, but people will likely choose keyboards over dictation based on UX alone.
Dictation, for instance, is an easy-win for voice input. Clicking buttons can be more convenient with voice when we're talking to Smart TVs or, perhaps, if our hands have pizza grease all over them and we don't want to touch the keyboard.
For detailed work though the more direct method of translating movements is far more efficient.
When you can describe an abstract end goal, voice is great. When you have to actually do all the individual steps towards some high-level goal, then it's like talking a newbie programmer through some high-level database optimization. You only use voice there because your main goal is to teach someone. If the PC could be taught that way, then voice would be in demand for such tasks too.
I also have such a throwaway-but-real account in my "About" under my user name here, just added it. Should have done that anyway, you just reminded me that I should.
Any input method where you frequently have to repeat yourself and undo things won't get mainstream. I'd bet people's mainstream tolerance for errors would have to be like one per five to ten minutes before you could get them to really adopt something like this (barring disability reasons, like RSI). Until then, the tech and market don't match.
This comment speaks to a perception problem for aural methods. The state of the mainstream art doesn't seem much past Forstall's demo of 10 years ago. [0] Are generations of people accustomed to WIMP UI able to wrap their heads around a much smaller interaction set? [1]
Gentner and Nielsen's work described in "The Anti-Mac Interface" [2] speaks to some of the differences people will have to mentally bridge such as:
Mac | Anti-Mac
Direct Manipulation | Delegation
See and Point | Describe and Command
WYSIWYG | Represent Meaning
User Control | Shared Control
Feedback and Dialog | System Handles Details
Forgiveness | Model User Actions
0. https://www.youtube.com/watch?v=SpGJNPShzRc
1. https://en.wikipedia.org/wiki/Post-WIMP
2. https://web.archive.org/web/20120904231532/http://www.useit....
Seems fairly reasonable. It need not be the only way, but not having to use my mouse to do stupidly simple tasks wouldn't break my heart.
Take “Open Hacker News” for example. One user might Click Browser > Open bookmarks tab > “Hacker News”.
Another, having set up a series of hotkeys, will go (on a windows machine, taskbar set for Browser pinned in position 1):
Win+1 > Ctrl+3
That is incredibly fast, much faster than saying it.
My guess is that much of the software engineering world is either users who can do the first very quickly or don’t find it cumbersome, or users who set up hotkeys like the latter and will outrace the speed of human speech on any given day. Thus the problem gets little attention.
Here's some examples where bandwidth and latency wins with speech:
1. "Play here comes the sun" vs. opening spotify, waiting, clicking the search box, typing here comes the sun, pressing enter, waiting, scanning the page and clicking the right song.
2. "Send email to John asking him if he would like to Play golf" vs. opening Gmail, waiting, clicking compose, start typing john, click the right email, tab to subject... etc.
There are cases where keyboard and mouse input is better... e.g. editing text, graphics production and editing, etc.. But certainly not in "almost all tasks" as you say. I think speech is the 3rd big computer interface that complements the mouse and keyboard and will make computers more productive and convenient for everyone regardless if you have a disability.
Which John? Which of that John's contact points you have saved?
..and why don't you have the keyboard shortcuts for those actions committed to muscle memory by now?
Even shortcuts (which peer comments are relying upon) aren't all that fast - they require additional selection movement with the keyboard or mouse before they can be used.
Vs properly enunciating "Kah-Pee f-i-l-e-1 to f-i-l-e-2"
You need dedicated software built on a hypothetical V(oice)UI to get anything decent.
Otherwise your best bet is to find a mouse/trackball/trackpad/pointerstick/touch-screen/pen that doesn't injure you and use text-to-speech in simple text editors.
My finding, for text dictation (not code), is that even halfway decent dictation, such as is available on iPhone, still needs much post-dictation editing. I feel that the biggest impact to be made in this area is superior capabilities for this editing phase.
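One primitive for that editing phase can be sketched as a spoken find-and-replace. The "correct X to Y" phrasing is invented for illustration, not taken from any shipping product:

```python
import re

def apply_correction(transcript, utterance):
    """Apply a spoken correction like "correct here to hear" to a transcript.

    Replaces the last occurrence, which is usually the word just dictated.
    """
    m = re.fullmatch(r"correct (.+) to (.+)", utterance.strip(), re.IGNORECASE)
    if not m:
        return transcript  # not a correction command; leave text untouched
    wrong, right = m.group(1), m.group(2)
    idx = transcript.rfind(wrong)
    if idx == -1:
        return transcript
    return transcript[:idx] + right + transcript[idx + len(wrong):]
```

A real editing layer would need much more (scoping to the last utterance, homophone menus, undo), but even this one operation covers a lot of the post-dictation cleanup.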
I summarized and wrote up my thoughts as a grant proposal for Scott Alexander's recent "micro grants" project. Get in touch (email in my profile) if you want to read that, or if you'd like to talk about dictation, voice control, voice coding, and editing operations -- or just get some moral support.
This is a well-known pitfall in statistics, I would think, given there are extremely famous stories about this exact issue causing deaths. In the 1950s, the US Air Force was trying to figure out why its pilots were dying, and determined it was because cockpit designs based on the "average" pilot were a poor fit for almost every real-world pilot.[1]
1: https://www.thestar.com/news/insight/2016/01/16/when-us-air-...
That said, I too have wondered why we don't have speech control for computers or at least appliances.
You don't need to parse all language. Just a standard set of primitives like you'd find on a remote should be way easier to recognize and can even be selected for their ease of parsing. Simple things like on, off, next, back, louder, etc.
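A modern software-only version of that Covox-style approach is small. A sketch, assuming the recognizer hands you a (possibly garbled) transcript to match; real systems would more likely constrain the decoder's vocabulary directly rather than fuzzy-match after the fact:

```python
import difflib

# A fixed, deliberately small command vocabulary, Covox-style.
PRIMITIVES = ["on", "off", "next", "back", "louder", "quieter", "stop"]

def match_primitive(heard, cutoff=0.6):
    """Return the closest known command, or None if nothing is close enough."""
    hits = difflib.get_close_matches(heard.lower(), PRIMITIVES, n=1, cutoff=cutoff)
    return hits[0] if hits else None
```

Because the vocabulary is tiny and the words are chosen to be distinct, even a sloppy transcript like "lowder" still resolves, and anything that doesn't match is rejected rather than guessed at.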
The complexity of doing that is IMO a good explanation of why commercial audio recognition is worthless to someone who programs a computer rather than interacting with humans through one.
http://plover.stenoknight.com/2013/03/using-plover-for-pytho...
You can DL it here: https://chrome.google.com/webstore/detail/lipsurf-voice-cont...
People have built the tools you're talking about. They're Talon and Cursorless.
I think you'd be shocked if you saw how productive some people in the Talon community are. Be sure to join the community Slack.
You have just been fired and as the security boys are escorting you to the door, you call out, loud enough to be heard in all the cubicles -
"Computer! Format all drives!"
OR MAYBE THIS OTHER SCENARIO:
The guy in the next cubicle has a loud voice and while he is commanding his own computer to "Exit the file without saving" you find that the work you have carefully constructed over the last four hours is suddenly thrown away too.
But, it seems like all voice control development keeps getting bought up by the Big 3, so it's not likely to have any significant breakthroughs independent of what Apple, Google and Amazon think voice control is good for.
It’s an alternative input method. Might be worth giving a try.
Voice Finger by Cozendy [$9.99]
Lenovo Voice Control from msstore [free]
Amazon Alexa from msstore [free]
"Win Key + h" for the inbuilt text box dictation [inbuilt]
serenade.ai [$$]
I don't have an exact answer for you, OP, but I hope someone builds something helpful for you.
There is a voice assistant app for Android called Dicio (on F-Droid) that uses Vosk. Storage is cheap and easy. Processing power is there even in cheap third-world phones. I personally detest typing and would love to talk to my devices without any 3rd-party nonsense requirements. Truly, there is no such option, because the powers that be do not want everyone thinking they are in control of, essentially, anything.
With that precondition, any voice-to-control layer on the desktop is in the tough situation of translating between voice input and a piece of software that was designed without voice input in mind.
Google and Siri, etc., aren't as beholden to the desktop/browser interface paradigm, so they don't have to perform this interface translation.
"Open this program"
"Minimize"
"Focus on this text input"
..dictate..
"switch to command mode"
"save and close"
I'd rather just: "click click tab type ctrl-S"
"move mouse to this 100x100 pixel square, click twice within 100ms"
"move mouse to this 20x20 pixel square, click once"
"Move mouse to this 100x1000 pixel rectangle, click once"
"Type text at a rate 1/5 (1/2 if you're particularly fast) speaking rate"
"Move your pinkie (your weakest finger) to 'ctrl' and click, move your index to 's' and click, release both, verify it worked with a visual cue then either another 20x20 mouse maneuver, or "move your thumb to 'alt', and your index finger to 'f4' (assuming you have access to the function keys), click and release"
Moving a mouse to a very specific spot on a screen is a relatively slow - and hard if you have any motor control issues - task.
I presumed they meant more than the extreme edge of RSI sufferers. So I ran the thought experiment.
I've had a mild RSI. The solution was to get a fancy ergo mouse/keyboard/desk/chair, and retrain myself. I've even seen a guy use a joystick instead of a mouse.
It's not
It's tiny (at best)
Tiny markets don't tend to get much attention
There are two ways that new software gets built: either the market is big enough and accessible enough that commercial software gets built, or the software is easy enough to build that hobbyists enter the space and solve their own problems. For example, the commercial market for keyboard-driven interfaces is also quite small, but we still have stuff like Sway. But a good keyboard-driven interface is easier to build than speech recognition.
I've been curious about this area for a while, but my understanding is voice-to-text Open Source solutions are still kind of primitive for general text transcribing. The libraries aren't very fun to work with, they're often embedded Python/Java "stuff", and the accuracy isn't great if you advance past the level of text transcription. Additionally, controlling computers and hooking into X or Wayland feels a bit hacky.
That being said, I'll push back on people who are saying that no one would want to control an interface this way. The success of systems like Alexa/Siri/Google are pretty definitive proof to me that (all their weaknesses side) there is a market for voice interfaces. But the ties between that market and the desktop are not strong, and the ecosystem isn't open enough to really build on in that direction.
I suspect that until efforts like Mozilla's open speech datasets pick up more steam and become competitive (if they ever do), it's going to be kind of laggy to find solutions because it's not immediately obvious how to enter the market, either as a commercial company or as an Open Source dev. But maybe I'm wrong and I just haven't researched it enough and the area is totally ripe for disruption. Maybe for people with RSI they'd tolerate something like clipping a bluetooth mic to their lapel or something and that would boost accuracy. Maybe there's another way to approach entering code that isn't just straight text recognition, possibly combining it with some kind of AST or code analysis that made it easier to guess what people were saying.
In any case, I don't think the problem is that people don't want to talk to their computers. Personally I don't like using voice assistants, but they are very popular, in no small part because of the voice part. So maybe there is an evolution of desktop UI controls that could become really popular, or at least competitive with entrenched solutions for people with limited mobility or RSI. But it would require someone to introduce some kind of actual UX innovation into the space, or to find a way of getting over the moat around good recognition and OS integration.
Apparently ... it's not
Or, rather, it's not YET "huge"
Sure - half the planet is online, but they're speaking myriad languages in more combinations of enunciation, dialect, and accent than is probably even calculable
>the Natural Language Processing of "OK Google" and Siri are quite refined at this point
Totally different to ask for today's weather and to tell a computer what to do - just like it's totally different to hit your favorite search engine and type "what is Pluto's orbit" and to write the search engine that goes off and does what you asked (and even when it does go off and do it, it still returns multiple (often conflicting) results - which leads to the whole problem of identifying authority online (something I wrote about 15+ years ago https://antipaucity.com/2006/10/23/authority-issues-online/#...))
It's also worlds different to be able to respond to variations on a theme of maybe a couple hundred search keywords (is it even that many?) and the literally unlimited number of commands people issue to their computing devices every day. Let's even say Siri is That Good™ - you've got a MacBook, iPhone, and iPad on your desk ...which one should respond when you say, "Hey, Siri"? Why that one vs this one? Do you have to start every command with the name of the device? Maybe that's not so hard at home (maybe), but get into corporate environments with naming conventions like H5GG71WLD? ... or dozens/scores/hundreds of people within listening distance of everyone's microphones getting triggered by other conversations in the room, conference calls, your cubemates' inability to attenuate their voices and aim only at their laptop when talking ...
It's a nightmare to think about - practically, let alone computationally
Most people look at the example of, say, Star Trek for voice commands to "the computer". Ever notice the computer only responds when the script demands it? Geordi shouting commands to his team in Engineering, or panicked messages to the bridge, is never misinterpreted by the computer as commands to it.
That's mighty convenient - and not at all representative of anything resembling a reality we can create [yet]
Maybe in another few decades or centuries ... but I'd wager probably not
Another consideration: speaking is very slow compared to a click, tap, or typing a few characters at a prompt. Why would you want to intentionally make your human-to-device interactions more clumsy and error-prone?
It's true that computer control currently requires a lot of customization, but I see no practical reason why we can't at least make simple commands fast and accurate, i.e., 'create new html document in VS Code'.
The thing about voice is how weak it is. Even if you've trained it well and you speak well (which I don't), it won't be as good as a keyboard.
Putting work into voice like this for productivity is pointless. Any effort is best placed in brain-computer interfaces. Hopefully not surgically implanted, like Neuralink is doing; more of a headset, like Valve and OpenBCI are doing.
Let's just wear a headset and work; keyboards can still be there in case you need them.
It's also super weird to speak to a computer. Typing, touching, or thinking are all fine, but somehow sitting in a room talking to a machine is a bit weird, even though it's not weird if I'm on a call. I can't explain it. Do others have a similar experience?