User speaks and speech to text starts streaming text while the user is still speaking. That text stream is piped into a LLM, which also streams its output text. That output text is streamed to text-to-speech, which also generates audio in a streaming manner.
The speech recognition part needs work for sure, but when it works you can see the potential. It's very different from the way it feels to talk to Siri or even ChatGPT's voice mode. It won't be long before we are having real conversations with our computers.
It wasn't really streamed, though. Audio input was buffered, fully evaluated to a string, then fed into the LLM and the full text was converted back to audio.
The Whisper speech-to-text was pretty real-time, the LLM was not. I was barely scraping by on hardware specs, though.
> Computer application software that may be downloaded via global computer networks and electronic communication networks for use in connection with mobile computers, mobile phones, and tablet computers, namely, software for use as a voice controlled personal digital assistant
Um what about Jeeves? oh yeah nope.
Ok we need more butler names.
What about Smithers? Or Jeffrey?
Among local solutions, Whisper SDK doesn't support streaming, I haven't seen any good workarounds or successfully implemented it. VOSK, DeepSpeech, Kaldi, et al were good once upon a time... Picovoice seems to be doing well.
I was planning to work on this: https://picovoice.ai/blog/chatgpt-ai-virtual-assistant-in-py... using Eleven Labs and Cheetah. Hope I can crave some time
Do you happen to know anything about any open source voice identification software?
I’ve noticed with ChatGPT voice and any other voice driven assistant that a massive problem is the background voices and noise. One solution could be advanced pre-processing to ID your voice only.
Another idea I’ve had is using something professional with PTT:
https://sheepdogmics.com/products/quick-disconnect-mic-tubel...
That live broadcast created a lot of buzz and numerous other use cases have popped up across the company. I'm working on a tech blog showcase next week to show it off on HN hopefully!
https://github.com/KoljaB/LocalAIVoiceChat
This one looks similar to this but also does internet search, mail, music, smart home etc.. Hope that there is a standard interface that gets plugged into it. so anyone can develop addons.
And of course Ultron is an asshole, it was trained on input from Tony Stark!
Back in 2008/9 I wondered just what would be required to run JARVIS, something you could converse with naturally, would understand what you meant, and be able to take care of complex mechanical tasks. The Iron Man suits have always been mostly Do-What-I-Mean (DWIM) managed by JARVIS or other AI agents, and now all of that seems to be attainable.
It's going to be an interesting time discovering just how well a human and AI agent can work together. I could see a military personal spotter, keeping track of enemy combatants, managing larger awareness of the battlefield, etc. I wonder how much a soldier could safely offload?
Yeah I actually considered making a spotter AI using computer vision in a game like ARMA 3 or Squad but kind of difficult. I made a spotter for ground vehicles on aerial imagery using YOLOv5 here: https://github.com/AlexandreSajus/Military-Vehicles-Image-Re...
There's a French defense company, Preligens, that actually does this currently
More of a framework to perform the general purpose task of "recognize things in 30 (60? 120?) frames per second video and act on events in the video"
Not sure if anyone has noticed but OpenAI now has a mobile app (I've been using the PWA all this time) and the voice assistant on there is really strong. Sounds good, fast, and seems to even run a pass on my voice before it submits the query.
It doesn’t synthesize voice back (yet) but open source and runs all offline on ESP32-based hardware and works with HomeAssistant!