I'm convinced that voice is going to be a bigger and bigger part of how we all interact with generative AI. But one thing that's hard, today, is building voice bots that respond as quickly as humans do in conversation. A 500ms voice-to-voice response time is just barely possible with today's AI models.
You can get down to 500ms if you: host transcription, LLM inference, and voice generation all together in one place; are careful about how you route and pipeline all the data; and the gods of both wifi and vram caching smile on you.
Here's a demo of a 500ms-capable voice bot, plus a container you can deploy to run it yourself on an A10/A100/H100 if you want to:
https://fastvoiceagent.cerebrium.ai/
We've been collecting lots of metrics. Here are typical numbers (in milliseconds) for all the easily measurable parts of the voice-to-voice response cycle.
macOS mic input 40
opus encoding 30
network stack and transit 10
packet handling 2
jitter buffer 40
opus decoding 30
transcription and endpointing 200
llm ttfb 100
sentence aggregation 100
tts ttfb 80
opus encoding 30
packet handling 2
network stack and transit 10
jitter buffer 40
opus decoding 30
macOS speaker output 15
----------------------------------
total ms 759
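For anyone who wants to play with the budget, here is a minimal Python sketch that re-adds the numbers from the table above. The split into "AI stages" versus "audio and transport" is my own grouping for illustration, not something the original measurements define.

```python
# Stage latencies in milliseconds, copied from the table above.
STAGES = [
    ("macOS mic input", 40),
    ("opus encoding", 30),
    ("network stack and transit", 10),
    ("packet handling", 2),
    ("jitter buffer", 40),
    ("opus decoding", 30),
    ("transcription and endpointing", 200),
    ("llm ttfb", 100),
    ("sentence aggregation", 100),
    ("tts ttfb", 80),
    ("opus encoding", 30),
    ("packet handling", 2),
    ("network stack and transit", 10),
    ("jitter buffer", 40),
    ("opus decoding", 30),
    ("macOS speaker output", 15),
]

# Assumption: which rows count as "AI" work is an illustrative grouping.
AI_STAGES = {
    "transcription and endpointing",
    "llm ttfb",
    "sentence aggregation",
    "tts ttfb",
}


def total_ms(stages):
    """Sum every stage to get the voice-to-voice total."""
    return sum(ms for _, ms in stages)


def split_budget(stages):
    """Return (ai_ms, audio_and_transport_ms) for the given stages."""
    ai = sum(ms for name, ms in stages if name in AI_STAGES)
    return ai, total_ms(stages) - ai


if __name__ == "__main__":
    ai, rest = split_budget(STAGES)
    print(f"total: {total_ms(STAGES)} ms (AI: {ai} ms, audio/transport: {rest} ms)")
```

With this grouping, roughly a third of the total comes from capture, codec, network and playback rather than from the models.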
Everything in AI is changing all the time. LLMs with native audio input and output capabilities will likely make it easier to build fast-responding voice bots soon. But for the moment, I think this is the fastest possible approach/tech stack.
We needed a way to measure voice-to-voice latency from the end-user's perspective, and found Silero voice activity detection (https://github.com/snakers4/silero-vad) to be the most reliable at detecting when the user has stopped speaking, so we can start the timer (and stop it again when audio is received from the bot).
Silero runs via onnx-runtime (with wasm). Whilst it sort-of-kinda works in Firefox, the VAD seems to misfire more than it should, causing the latency numbers to be somewhat absurd. I really want to get it working though! I'm still trying.
The code for the UI VAD is here: https://github.com/pipecat-ai/web-client-ui/tree/main/src/va...
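The timing logic described above (start the clock when the VAD says the user stopped speaking, stop it when the first bot audio arrives) can be sketched roughly like this. This is not the code from the linked repo, which runs in the browser. The class and method names are hypothetical, and the VAD itself is assumed to be something that calls these hooks.

```python
import time
from typing import List, Optional


class VoiceToVoiceTimer:
    """Hypothetical helper: measures user-speech-end to first-bot-audio latency."""

    def __init__(self) -> None:
        self._speech_end: Optional[float] = None
        self.samples_ms: List[float] = []

    def on_user_speech_end(self, now: Optional[float] = None) -> None:
        # Called by the VAD when it decides the user has stopped speaking.
        # A later call overwrites an earlier one, so only the last
        # end-of-speech before the bot answers is measured.
        self._speech_end = time.monotonic() if now is None else now

    def on_bot_audio(self, now: Optional[float] = None) -> Optional[float]:
        # Called when the first bot audio is received. Returns the latency
        # in milliseconds, or None if no end-of-speech is pending.
        if self._speech_end is None:
            return None
        now = time.monotonic() if now is None else now
        latency_ms = (now - self._speech_end) * 1000.0
        self._speech_end = None
        self.samples_ms.append(latency_ms)
        return latency_ms
```

A VAD misfire shows up here as a bogus `on_user_speech_end` call, which is why an unreliable VAD produces absurd latency numbers.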
https://mozilla.github.io/standards-positions/
That, and their shitty management shakes my faith in Firefox
I worked on an AI for customer service. Our agent cut the average response time from 24/48 hours to mere seconds.
One of the messages that went to a customer was "Hello Bitch, your package will be picked up by USPS today, here is the tracking number..."
The customer responded "thank you so much" and gave us a perfect score in CSAT rating. Speed trumps everything, even when you make such a horrible mistake.
I don't think everyone would react the same way. For some, calling each other bitch is normal talk (which is likely why it got into the training data in the first place). For others, not so much.
I'm more pissed if I'm waiting days for a response.
Now our most prolific sales engineer could no longer run demos to potential clients. He had many embarrassing calls where the AI would just not respond. His last name was Dick.
(You can imagine the instruct layer to be like the skin on a peach. It’s tiny in influence compared to what’s inside. Even more so than, in humans, the cortex vs. the mammalian brain. Whoever tried to tell their kids not to touch the cookies while putting them in front of them and then leaving the room knows that relying on high level instructions is a bad idea.)
One thing this speed makes me think is that for some chat workflows you’ll need/get to have kind of a multi-step approach: essentially, a quick response, during which time a longer data / info / RAG query can be farmed out, then the informative result picks up.
Humans work like this; we use lots of filler words as we sort of get going responding to things.
Right now, most workflows seem to be just one-shot prompting, or in the background, parse -> query -> generate. The better workflow once you have low latency response is probably something like: [3s of Llama 8b in your ears] -> query -> [55s of Llama 70b/GPT4/whatever you want, informed by query].
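A rough sketch of that workflow with asyncio, in case it helps make the idea concrete. Everything here is a stand-in: `quick_reply`, `run_query` and `informed_reply` are hypothetical placeholders for a small fast model, a retrieval step and a larger model, and the sleeps only simulate latency.

```python
import asyncio
from typing import Callable


async def quick_reply(user_text: str) -> str:
    # Placeholder for a small, fast model producing filler speech.
    await asyncio.sleep(0.01)
    return "Sure, let me look that up."


async def run_query(user_text: str) -> str:
    # Placeholder for the slower data / RAG lookup.
    await asyncio.sleep(0.05)
    return f"results for {user_text!r}"


async def informed_reply(user_text: str, filler: str, query_result: str) -> str:
    # Placeholder for the larger model, which sees the filler that was
    # already spoken so it can continue from it naturally.
    await asyncio.sleep(0.01)
    return f"Here's what I found: {query_result}"


async def respond(user_text: str, speak: Callable[[str], None]) -> None:
    # Start the slow query first so it runs while the filler is generated.
    query_task = asyncio.create_task(run_query(user_text))
    filler = await quick_reply(user_text)
    speak(filler)
    result = await query_task
    speak(await informed_reply(user_text, filler, result))


if __name__ == "__main__":
    asyncio.run(respond("where is my package", print))
```

The point of the design is only the ordering: the query is kicked off before the filler is produced, so the user hears something while the slow path is still running.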
Very cool, thank you for sharing this.
From Cerebrium here. Really appreciate the feedback - glad you had a good experience!
This application is easy to extend/implement, meaning you can edit it however you like:
- Swap in different LLMs, STT and TTS models
- Change prompts as well as implement RAG etc.
In partnership with Daily, we really wanted to focus on the engineer here. So we made it extremely flexible for them to edit the application to suit their use case/preference, while at the same time taking away the mundane infrastructure setup.
You can read more about how to extend it here: https://docs.cerebrium.ai/v4/examples/realtime-voice-agents
Scoring complexity on a gradient would let you know you need to send a "Sure, one second let me look that up for you." instead of waiting for a long round trip.
I almost think you could do like a check my work style response: ‘I’m pretty sure xx, .. wait, actually y.’ Or if you were right, ‘yep that’s correct. I just checked.’
There’s time in there to do the check and to get the large model to bridge the first sentence with the final response.
There are browser text-to-speech engines too, starting to get faster and higher quality. It would be great if browsers shipped with great TTS.
GPT-4o has Automatic Speech Recognition, `understanding`, and speech response generation in a single model for low latency, which seems quite a good idea to me. As they've not shipped it yet, I assume they have scaling or quality issues of some kind.
I assume people are working on similar open integrated multimodal large language models that have audio input and output (visual input too)!
I do wonder how needed or optimal a single combined model is for latency and cost optimisation.
The breakdown provided is interesting.
I think having a lot more of the model on-device is a good idea if possible, like speech generation, and possibly speech transcription or speech understanding, at least right at the start. Who wants to wait for STUN?
IMHO the desktop environment should provide voice to text as a service with a standard interface to applications - like stdin or similar but distinct for voice. Apps would ignore it by default since they aren't listening, but the transcriber could be swapped out and would be available to all apps.
Logically where you need to be is thinking in phonemes: you want the output of the LLM to have caught up with the last phoneme quickly enough that it can respond "instantly" when the endpoint is detected, and that means the whole chain needs to have 200ms latency end-to-end, or thereabouts. I suspect the only way to get anywhere close to that is with a different architecture, which would work somewhat more like human speech processing, in that it's front-running the audio stream by basing its output on phonemes predicted before they arrive, and only using the actual received audio as a lightweight confirmation signal to decide whether to flush the current output buffer or to reprocess. You can get part-way there with speculative decoding, but I don't think you can do it with a mixed audio/text pipeline. Much better never to have to convert from audio to text and back again.
"host transcription, LLM inference, and voice generation all together in one place"
I think there are some benefits to going through text rather than using a voice-to-voice model. It creates a 100% reliable paper trail of what the model heard and said in the conversation. This can be extremely important in some applications where you need to review and validate what was said.
I acknowledge there are multiple viable patterns of social interaction: some people talk over each other and find that fun and engaging, while others think that's just the worst, wait for a clear signal for their turn to speak, and expect the same. I am of the latter.
I am curious about total cost to run this thing, though. I assume that on top of whatever you're paying Cerebrium for GPU hosting you're also having to pay for Deepgram Enterprise in order to self-host it.
To get the latency reduction of several hundred milliseconds how much more would it be for "average" usage?
So our costs are based on the infra you use to run your application and we charge per millisecond of compute.
Some things to note that we might do differently to other providers:
1. You can specify your EXACT requirements and we charge you only for that. E.g. if you want 2 vCPU, 12GB memory and 1 A10 GPU, we charge you for exactly that, which is 35% less than if you rented a whole A10.
2. We have over 10 varieties of GPU chips, so you can choose the price/performance trade-off.
3. While you can extend this on the Cerebrium platform, it cannot be used commercially. We are speaking to Deepgram to see how we can offer it to customers. Hopefully I can provide more updates on this soon.
> jitter buffer [40ms]
Why do you need a jitter buffer on the listening side? The speech-to-text model has neither ears nor a sense of rhythm — couldn’t you feed in the audio frames as you receive them? I don’t see why you need to delay processing a frame by 40ms just because the next one might be 40ms late.
"Oh I think I figured out your secret!"
"Please tell me"
"You achieve the short response times by keeping a short context"
"You're absolutely right"
Tangentially related, I remember years ago when Stadia and other cloud gaming products were being released doing such calculations and showing a buddy of mine that even in the best case scenario, you'd always have high enough input latency to make even casual multiplayer FPS games over cloud gaming services not feasible, or rather, comfortable, to play. Other slower-paced games might work, but nothing requiring serious twitch gameplay reaction times.
The same math holds up today because of a combination of fundamental limits and state of the art limits.
Google also said that the controller would send the input straight to the server.
And a fast stadia server should have good fps combined with a little bit of brain prediction
It's possible to tweak the Opus settings to reduce that encode/decode latency substantially. Which might actually be worth doing for this use case. But there isn't quite a free lunch, here. The default Opus frame size is 20ms. Smaller frames lower the encoding/decoding latency, but increase the bitrate. The implementation in libwebrtc is very well tested and optimized for the default 20ms frame sizes and maybe not so much at other frame sizes. Experience has made me leery of taking the less-trodden-paths without a lot of manual testing.
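To put a number on one part of that trade-off, here is a small Python sketch of the per-packet header cost at different Opus frame sizes. It assumes one Opus frame per packet over IPv4 + UDP + RTP with no IP options, RTP header extensions or SRTP auth tag, so real WebRTC traffic will carry somewhat more. It also only covers packet overhead, not how the codec's own efficiency changes with shorter frames.

```python
# Assumed header sizes in bytes: IPv4 (20) + UDP (8) + RTP (12).
HEADER_BYTES = 20 + 8 + 12

# Frame durations Opus supports, in milliseconds.
OPUS_FRAME_SIZES_MS = (2.5, 5, 10, 20, 40, 60)


def packets_per_second(frame_ms: float) -> float:
    """One packet per frame, so shorter frames mean more packets."""
    return 1000.0 / frame_ms


def header_overhead_kbps(frame_ms: float) -> float:
    """Bitrate spent purely on packet headers at this frame size."""
    return packets_per_second(frame_ms) * HEADER_BYTES * 8 / 1000.0


if __name__ == "__main__":
    for frame_ms in OPUS_FRAME_SIZES_MS:
        print(f"{frame_ms:>4} ms frames: "
              f"{packets_per_second(frame_ms):6.1f} packets/s, "
              f"{header_overhead_kbps(frame_ms):6.1f} kbps of headers")
```

Under these assumptions, going from 20ms to 10ms frames doubles header overhead from 16 kbps to 32 kbps, which is comparable to a typical speech bitrate for the Opus payload itself.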
Apple's Siri still can't allow me to have a conversation in which we aren't tripping over each other and pausing and flunking and the whole thing degrades into me hoping to get the barest minimum from it.
https://www.youtube.com/live/hm2IJSKcYvo
hn discussion here: https://news.ycombinator.com/item?id=40866569
I think you hit a very important nail on the head here; I feel like I'm in that scene in “I, Robot” where the protagonist talks to the hologram, or in the movie “A.I.” where the protagonist talks to an encyclopaedia called “Dr. Know”.
Feels pretty wild/cool to say it might almost be too fast (in terms of feeling natural).
And this was from a mobile connection in Europe, with a shown latency of just over 1s.
Perfect comprehension and no problem even with bad accents.