Show HN: Voice bots with 500ms response times

(fastvoiceagent.cerebrium.ai)

315 points · kwindla · 1y ago · 99 comments

Last year when GPT-4 was released I started making lots of little voice + LLM experiments. Voice interfaces are fun; there are several interesting new problem spaces to explore.

I'm convinced that voice is going to be a bigger and bigger part of how we all interact with generative AI. But one thing that's hard, today, is building voice bots that respond as quickly as humans do in conversation. A 500ms voice-to-voice response time is just barely possible with today's AI models.

You can get down to 500ms if you: host transcription, LLM inference, and voice generation all together in one place; are careful about how you route and pipeline all the data; and the gods of both wifi and vram caching smile on you.
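
One piece of that pipelining, listed in the metrics as "sentence aggregation", is turning streamed LLM tokens into complete sentences that can be handed to TTS as early as possible. The following is a minimal sketch of the idea, not the demo's actual code; the function name and the simple punctuation rule are assumptions for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# Abbreviations such as "Dr. Smith" would be split incorrectly by this rule.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def aggregate_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hi", " there", ".", " How", " are", " you", "?", " Good"]
    for sentence in aggregate_sentences(stream):
        print(sentence)
```

Note that a sentence is only released once the token after its punctuation arrives, which is one reason this step costs some latency of its own.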

Here's a demo of a 500ms-capable voice bot, plus a container you can deploy to run it yourself on an A10/A100/H100 if you want to:

https://fastvoiceagent.cerebrium.ai/

We've been collecting lots of metrics. Here are typical numbers (in milliseconds) for all the easily measurable parts of the voice-to-voice response cycle.

  macOS mic input                 40
  opus encoding                   30
  network stack and transit       10
  packet handling                  2
  jitter buffer                   40
  opus decoding                   30
  transcription and endpointing  200
  llm ttfb                       100
  sentence aggregation          100
  tts ttfb                        80
  opus encoding                   30
  packet handling                  2
  network stack and transit       10
  jitter buffer                   40
  opus decoding                   30
  macOS speaker output           15
  ----------------------------------
  total ms                       759
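
The table can be sanity-checked by summing it in code. This is a small sketch using the rows exactly as listed; the split into "model" stages and everything else is an illustrative grouping, not something the post defines.

```python
# Rows copied from the table above: (stage, milliseconds).
BUDGET = [
    ("macOS mic input", 40),
    ("opus encoding", 30),
    ("network stack and transit", 10),
    ("packet handling", 2),
    ("jitter buffer", 40),
    ("opus decoding", 30),
    ("transcription and endpointing", 200),
    ("llm ttfb", 100),
    ("sentence aggregation", 100),
    ("tts ttfb", 80),
    ("opus encoding", 30),
    ("packet handling", 2),
    ("network stack and transit", 10),
    ("jitter buffer", 40),
    ("opus decoding", 30),
    ("macOS speaker output", 15),
]

# Assumed grouping: the stages that run speech and language models.
MODEL_STAGES = {
    "transcription and endpointing",
    "llm ttfb",
    "sentence aggregation",
    "tts ttfb",
}

total_ms = sum(ms for _, ms in BUDGET)
model_ms = sum(ms for stage, ms in BUDGET if stage in MODEL_STAGES)
overhead_ms = total_ms - model_ms

print(f"total:    {total_ms} ms")     # 759
print(f"models:   {model_ms} ms")     # 480
print(f"overhead: {overhead_ms} ms")  # 279
```

Under this grouping the model stages account for 480 ms, with capture, codec, network, and playback making up the remaining 279 ms.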

Everything in AI is changing all the time. LLMs with native audio input and output capabilities will likely make it easier to build fast-responding voice bots soon. But for the moment, I think this is the fastest possible approach/tech stack.

99 comments

83 comments · 36 top-level

geofffox · 1y ago · 9 in thread

I use Firefox... still.

makeitmore · 1y ago

Hi, I built the client UI for this and... yea, I really wanted to get Firefox working :(

We needed a way to measure voice-to-voice latency from the end-user's perspective, and found Silero voice activity detection (https://github.com/snakers4/silero-vad) to be the most reliable at detecting when the user has stopped speaking, so we can start the timer (and stop it again when audio is received from the bot.)

Silero runs via onnx-runtime (with wasm). Whilst it sort-of-kinda works in Firefox, the VAD seems to misfire more than it should, causing the latency numbers to be somewhat absurd. I really want to get it working though! I'm still trying.
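
The measurement described here (start a timer when the VAD reports the user has stopped speaking, stop it when bot audio arrives) could be sketched as below. The class and event names are hypothetical; they are not the API of Silero VAD or of the demo's client code.

```python
import time
from typing import Callable, List, Optional


class VoiceToVoiceTimer:
    """Measure the time from end of user speech to first bot audio.

    Hypothetical helper for illustration only.
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._speech_end: Optional[float] = None
        self.samples_ms: List[float] = []

    def on_user_speech_end(self) -> None:
        # The VAD decided the user stopped talking: start the timer.
        self._speech_end = self._clock()

    def on_user_speech_start(self) -> None:
        # The user resumed talking (or the VAD misfired): drop the measurement.
        self._speech_end = None

    def on_bot_audio(self) -> Optional[float]:
        # First bot audio after a speech end: stop the timer and record it.
        if self._speech_end is None:
            return None
        elapsed_ms = (self._clock() - self._speech_end) * 1000.0
        self._speech_end = None
        self.samples_ms.append(elapsed_ms)
        return elapsed_ms
```

A VAD that misfires will call `on_user_speech_end` at the wrong moment, which is how the reported numbers can become absurd even when the bot itself is fast.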

The code for the UI VAD is here: https://github.com/pipecat-ai/web-client-ui/tree/main/src/va...

stavros · 1y ago

Do you know why there's a difference in the performance of the algorithm in another browser? I would expect that all browsers run the code exactly the same way.

4mitkumar · 1y ago

Do not go by the warning message. It does work just fine on the latest Firefox. Cool demo, btw!

panja · 1y ago

I hate that everyone just develops for chromium only

darren_ · 1y ago

This site works fine in safari/mobile safari, it is not ‘chromium only’

RockRobotRock · 1y ago

Mozilla refuses to implement some really cool standards.

https://mozilla.github.io/standards-positions/

That, and their shitty management shakes my faith in Firefox

sa-code · 1y ago

Likely a lot of people on HN use Firefox

chungus · 1y ago

It is working perfectly for me on Firefox (version 127).

makeitmore · 1y ago

Thanks for sharing. I did make some changes that seem to have improved things, although I do still see the occasional misfire. Perhaps good enough to remove that ugly red banner though!

firefoxd · 1y ago · 7 in thread

Well that was fast. Kudos, really neat. Speed trumps everything else. I only noticed the robotic voice after I read the comments.

I worked on an AI for customer service. Our agent took the average response time from 24-48 hours down to mere seconds.

One of the messages that went to a customer was "Hello Bitch, your package will be picked up by USPS today, here is the tracking number..."

The customer responded "thank you so much" and gave us a perfect score in CSAT rating. Speed trumps everything, even when you make such a horrible mistake.

lukan · 1y ago

"The customer responded "thank you so much" and gave us a perfect score in CSAT rating. Speed trumps everything, even when you make such a horrible mistake."

I think not everyone would react the same way. For some, calling each other bitch is normal talk (which is likely why it got into the training data in the first place). For others, not so much.

999900000999 · 1y ago

If I'm used to waiting 2 days, and you get it down to 30 seconds, you can call me whatever you want.

I'm more pissed if I'm waiting days for a response.

jstanley · 1y ago

It's also possible that it's such an unlikely thing to hear that she actually misheard it and thought it said something nicer.

firefoxd · 1y ago

Fun fact, we fixed this issue by adding a #profanity tag and dropping the message to the next human agent.

Now our most prolific sales engineer could no longer run demos to potential clients. He had many embarrassing calls where the AI would just not respond. His last name was Dick.
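
The failure described here is the classic false positive of a word-list filter. The sketch below reproduces it and shows one possible mitigation (exempting known participant names); the blocklist, function names, and the mitigation itself are assumptions for illustration, not the system the comment describes.

```python
import re
from typing import List, Set

# Illustrative blocklist only.
BLOCKLIST: Set[str] = {"bitch", "dick"}


def _words(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def naive_flag(message: str) -> bool:
    """Flag a message if any word is on the blocklist."""
    return any(word in BLOCKLIST for word in _words(message))


def flag_ignoring_names(message: str, known_names: Set[str]) -> bool:
    """Same check, but never flag a word that is a known participant name."""
    allowed = {name.lower() for name in known_names}
    return any(
        word in BLOCKLIST and word not in allowed for word in _words(message)
    )


if __name__ == "__main__":
    intro = "Hi, this is Tom Dick from sales"
    print(naive_flag(intro))                            # flagged: false positive
    print(flag_ignoring_names(intro, {"Tom", "Dick"}))  # not flagged
```

The trade-off is that an exempted name also lets the same word through when it is used as an insult, so this only narrows the problem.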

leobg · 1y ago

I find it odd that your engineer would make the system rely on instructions (“Do this. Never do that.”). This exposes your system to inconsistencies from the instruct tuning and future changes thereof by OpenAI or whoever. System prompts and instructions are maybe great for demos. But for a prod system where you have to cover all the bases I would never rely on such a thin layer of control.

(You can imagine the instruct layer to be like the skin on a peach. It’s tiny in influence compared to what’s inside. Even more so than, in humans, the cortex vs. the mammalian brain. Whoever tried to tell their kids not to touch the cookies while putting them in front of them and then leaving the room knows that relying on high level instructions is a bad idea.)

bedel23 · 1y ago

I wonder if the solution is to run the message through another LLM to make the message as polite as possible removing any profanities. Cost >2x as much to run though.

asjir · 1y ago

Maybe that was their first name, at least the one they put in lol

vessenes · 1y ago · 4 in thread

This is so, so good. I like that it seems to be a teaser app for cerebrium, if I understand it. It has good killer app potential. My tests from iPad ranged from 1400ms to 400ms reported latency; in the low end, it felt very fluid.

One thing this speed makes me think is that for some chat workflows you’ll need/get to have kind of a multi-step approach: essentially, a quick response, during which time a longer data / info / RAG query can be farmed out, then the informative result picks up.

Humans work like this; we use lots of filler words as we sort of get going responding to things.

Right now, most workflows seem to be just one shot prompting, or in the background, parse -> query -> generate. The better workflow once you have low latency response is probably something like: [3s of Llama 8b in your ears] -> query -> [55s of Llama 70b/GPT4/whatever you want, informed by query].
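
The quick-response-then-informed-answer workflow can be sketched with asyncio: start the slow path first, speak the fast reply while it runs, then speak the result. Both model functions below are hypothetical stand-ins that only sleep; nothing here is taken from the demo.

```python
import asyncio
from typing import Callable


async def quick_ack(question: str) -> str:
    """Stand-in for a small, fast model (hypothetical)."""
    await asyncio.sleep(0.01)
    return "Good question, let me check."


async def slow_answer(question: str) -> str:
    """Stand-in for a retrieval query plus a larger model (hypothetical)."""
    await asyncio.sleep(0.05)
    return f"Here is what I found about {question!r}."


async def respond(question: str, speak: Callable[[str], None]) -> None:
    # Start the slow path first so it runs while the quick reply is spoken.
    slow_task = asyncio.create_task(slow_answer(question))
    speak(await quick_ack(question))
    speak(await slow_task)


if __name__ == "__main__":
    asyncio.run(respond("opus frame sizes", print))
```

The total wall-clock time is roughly that of the slow path alone, since the quick reply overlaps with it.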

Very cool, thank you for sharing this.

za_mike157 · 1y ago

Hi Vessenes

From Cerebrium here. Really appreciate the feedback - glad you had a good experience!

This application is easy to extend, meaning you can edit it however you like:
- Swap in different LLMs, STT and TTS models
- Change prompts as well as implement RAG etc.

In partnership with Daily, we really wanted to focus on the engineer here: make it extremely flexible for them to edit the application to suit their use case/preference, while at the same time taking away the mundane infrastructure setup.

You can read more about how to extend it here: https://docs.cerebrium.ai/v4/examples/realtime-voice-agents

vessenes · 1y ago

Thanks for this reply. Yep, as an engineer, this is awesome, the docs look simple and I’ll give it a whirl. As a product guy, it seems like it would be dead simple to start a company on this tech by just putting up a web page that lets people pick a couple choices and gives them a custom domain. Very cool!

c0brac0bra · 1y ago

I've wondered about this as well. Is there a way to have a small, efficient LLM model that can estimate general task complexity without actually running the full task workload?

Scoring complexity on a gradient would let you know you need to send a "Sure, one second let me look that up for you." instead of waiting for a long round trip.
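
As a rough illustration of scoring on a gradient, the sketch below uses a keyword-and-length heuristic in place of the small model the comment asks about. The hint list, weights, and threshold are arbitrary assumptions; a real system would presumably swap `complexity_score` for a small classifier.

```python
# Assumed wording that suggests a lookup or long-running task.
LOOKUP_HINTS = ("look up", "find", "search", "latest", "how many", "price")

HOLDING_PHRASE = "Sure, one second let me look that up for you."


def complexity_score(utterance: str) -> float:
    """Crude 0..1 score: longer requests and lookup-style wording score higher."""
    text = utterance.lower()
    length_part = min(len(text.split()) / 30.0, 1.0)
    hint_part = 1.0 if any(hint in text for hint in LOOKUP_HINTS) else 0.0
    return 0.5 * length_part + 0.5 * hint_part


def needs_holding_phrase(utterance: str, threshold: float = 0.5) -> bool:
    """Decide whether to say the holding phrase before doing the real work."""
    return complexity_score(utterance) >= threshold


if __name__ == "__main__":
    for utterance in ("Hello there", "Can you look up the latest train times?"):
        if needs_holding_phrase(utterance):
            print(f"{utterance!r} -> {HOLDING_PHRASE}")
        else:
            print(f"{utterance!r} -> answer directly")
```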

vessenes · 1y ago

For sure: in fact MoE models train such a router directly, and the routers are not super large. But it would also be easy to run phi-3 against a request.

I almost think you could do like a check my work style response: ‘I’m pretty sure xx, .. wait, actually y.’ Or if you were right, ‘yep that’s correct. I just checked.’

There’s time in there to do the check and to get the large model to bridge the first sentence with the final response.

az226 · 1y ago · 4 in thread

Your marketing says 500 but your math says 759.

dietr1ch · 1y ago

That's called marketing

vessenes · 1y ago

My tests had one outlier at 1400ms, and ten or so between 400-500ms. I think the marketing numbers were fair.

whizzter · 1y ago

500 are the transcription/llm/tts steps (i.e. the response time from data arriving on the server to sending back), the rest seems to be various non-AI "overheads" such as encoding, network traffic, etc.

vr000m · 1y ago

The latencies in the table are based on heuristics or averages that we’ve observed. However, in reality, based on the conversation, some of the larger latency components can be much lower.

aussieguy1234 · 1y ago · 4 in thread

Fast yes, but the voice sounds robotic.

bombela · 1y ago

I prefer a slightly robotic voice. This way I know I am talking to a bot, and this sets expectations.

lofties · 1y ago

Typical HN comment. Absolutely incredible tech is displayed that honestly, one year ago nobody could've imagined. Yet people still find something to moan about. I'm sure the authors of the project, who should be very proud, are fully aware the voice is robotic.

kwindla (OP) · 1y ago

Voice models are getting both faster and more natural at a, well, a fast clip.

cloudking · 1y ago

It's literally a robot

luke-stanley · 1y ago · 3 in thread

A cross-platform browser VAD module is: https://github.com/ricky0123/vad. This is an ONNX port of Silero's VAD network. By cross-platform, I mean it works in Firefox too. It doesn't need a WebRTC session to work, just microphone access, so it's simpler. I'm curious about the browser providing this as a native option too.

There are browser text-to-speech engines too, starting to get faster and higher quality. It would be great if browsers shipped with great TTS.

GPT-4o has Automatic Speech Recognition, `understanding`, and speech response generation in a single model for low latency, which seems quite a good idea to me. As they've not shipped it yet, I assume they have scaling or quality issues of some kind.

I assume people are working on similar open integrated multimodal large language models that have audio input and output (visual input too)!

I do wonder how needed or optimal a single combined model is for latency and cost optimisation.

The breakdown provided is interesting.

I think having a lot more of the model on-device is a good idea if possible, like speech generation, and possibly speech transcription or speech understanding, at least right at the start. Who wants to wait for STUN?

phkahler · 1y ago

>> I'm curious about the browser providing this as a native option too.

IMHO the desktop environment should provide voice to text as a service with a standard interface to applications - like stdin or similar but distinct for voice. Apps would ignore it by default since they aren't listening, but the transcriber could be swapped out and would be available to all apps.

regularfry · 1y ago

If you do stt and tts on the device but everything else remains the same, according to these numbers that saves you 120ms. The remaining 639ms is hardware and network latency, and shuffling data into and out of the LLM. That's still slower than you want.

Logically where you need to be is thinking in phonemes: you want the output of the LLM to have caught up with the last phoneme quickly enough that it can respond "instantly" when the endpoint is detected, and that means the whole chain needs to have 200ms latency end-to-end, or thereabouts. I suspect the only way to get anywhere close to that is with a different architecture, which would work somewhat more like human speech processing, in that it's front-running the audio stream by basing its output on phonemes predicted before they arrive, and only using the actual received audio as a lightweight confirmation signal to decide whether to flush the current output buffer or to reprocess. You can get part-way there with speculative decoding, but I don't think you can do it with a mixed audio/text pipeline. Much better never to have to convert from audio to text and back again.

charlesyu108 · 1y ago

Lol, this announcement blows what I've been working on out of the water, but I have a simple assistant implementation with ricky0123/vad + WebSockets.

https://github.com/charlesyu108/voiceai-js-starter

spuz · 1y ago · 3 in thread

It's not exactly clear: is this a voice-to-voice model or a voice-to-text-to-voice model? OpenAI claims that its GPT-4o audio model, when it is finally released, will be a lot faster at conversations because there's no delay converting from audio to text and back to audio again. I'm also looking forward to using voice models for language learning.

kwindla (OP) · 1y ago

Full technical write-up here: https://www.daily.co/blog/the-worlds-fastest-voice-bot/

pavlov · 1y ago

It's a voice-to-text-to-voice approach, as implied by this description:

"host transcription, LLM inference, and voice generation all together in one place"

I think there are some benefits to going through text rather than using a voice-to-voice model. It creates a 100% reliable paper trail of what the model heard and said in the conversation. This can be extremely important in some applications where you need to review and validate what was said.

isaacfung · 1y ago

There is way more text training data than voice data. It also allows you to use all the benchmarks and tool integrations that have already been developed for LLMs.

SubiculumCode · 1y ago · 3 in thread

A chatbot that interrupts me even faster. Sorry for the sarcasm. Maybe I'm just slow, but when I'm trying to formulate a question on the spot, I pause a lot. Having the chatbot jump in and interrupt is frustrating. Humans recognize the difference between someone still planning on saying something, and when they've finished. I even tried to give it a rule where it shouldn't respond until I said "The End", and of course it couldn't follow that instruction.

makeitmore · 1y ago

Very true. I think we are a bit aggressive with the VAD timeout. The demo was intended to showcase speed, but the bot can be a bit eager! You can tinker with the VAD settings, it could definitely use a bit more air (but that will impact latency in the event the user has indeed finished talking.) As others say below, the magic will be figuring out the pace and style in which the user talks and adapting to that on the fly.
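
The trade-off in the VAD timeout can be made concrete with a toy endpointing function: declare end of turn once a run of non-speech frames reaches the timeout. The 20 ms frame length and the function itself are assumptions for illustration, not the demo's settings.

```python
from typing import Iterable, Optional

FRAME_MS = 20  # assumed duration of one VAD decision


def find_endpoint(
    speech_flags: Iterable[bool], silence_timeout_ms: int
) -> Optional[int]:
    """Return the time in ms at which end of turn is declared, or None.

    speech_flags holds one VAD decision per frame (True means speech).
    """
    heard_speech = False
    silence_ms = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            silence_ms += FRAME_MS
            if silence_ms >= silence_timeout_ms:
                return (index + 1) * FRAME_MS
    return None


if __name__ == "__main__":
    # 200 ms of speech, a 300 ms thinking pause, 200 ms more speech, then silence.
    frames = [True] * 10 + [False] * 15 + [True] * 10 + [False] * 50
    print(find_endpoint(frames, silence_timeout_ms=200))  # 400: cuts in mid-pause
    print(find_endpoint(frames, silence_timeout_ms=400))  # 1100: waits for the end
```

With the 200 ms timeout the bot jumps in during the pause; with 400 ms it waits for the user to finish (speech ends at 700 ms) but adds 400 ms to every response.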

SubiculumCode · 1y ago

ps. The speed is impressive, but the key to a useful voice chatbot (which I've never seen) is one that adapts to your speaking style, identifies and employs turn-taking signals.

I acknowledge there are multiple viable patterns of social interaction, some talk over each other, and find that fun and engaging, while others think that's just the worst, and wait for a clear signal for their turn to speak and expect the same. I am of the latter.

SubiculumCode · 1y ago

I'm sure that, with an annotated dataset, a model could learn to pick up on the right cues.

c0brac0bra · 1y ago · 2 in thread

I've been developing with Deepgram for a while, and this is one of the coolest demos I've seen with it!

I am curious about total cost to run this thing, though. I assume that on top of whatever you're paying Cerebrium for GPU hosting you're also having to pay for Deepgram Enterprise in order to self-host it.

To get the latency reduction of several hundred milliseconds how much more would it be for "average" usage?

za_mike157 · 1y ago

Hey! From the Cerebrium team here!

So our costs are based on the infra you use to run your application and we charge per millisecond of compute.

Some things to note that we might do differently to other providers:
1. You can specify your EXACT requirements and we charge you only for that. E.g. if you want 2 vCPU, 12GB memory and 1 A10 GPU, we charge you for exactly that, which is 35% less than if you rented a whole A10.
2. We have more than 10 varieties of GPU chips, so you can choose the price/performance trade-off.
3. While you can extend this on the Cerebrium platform, it cannot be used commercially. We are speaking to Deepgram to see how we can offer it to customers. Hopefully I can provide more updates on this soon.

c0brac0bra · 1y ago

Excellent; thanks for the info.

amluto · 1y ago · 2 in thread

Maybe silly question:

> jitter buffer [40ms]

Why do you need a jitter buffer on the listening side? The speech-to-text model has neither ears nor a sense of rhythm — couldn’t you feed in the audio frames as you receive them? I don’t see why you need to delay processing a frame by 40ms just because the next one might be 40ms late.

Olreich · 1y ago

Almost any gap in audio is detectable and sounds really bad. 40ms is a lot, but sending 40ms of silence is probably worse

amluto · 1y ago

Sounds bad to whom? I’m talking about the direction from user to AI, not the direction from AI to user. If some of the audio gets delayed on the way to the AI, the AI can be paused. If some of the audio gets delayed on the way to a human, the human can’t be paused, so some buffering is needed to reduce the risk of gaps.
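
The suggestion for the user-to-AI direction amounts to a reorder buffer that passes frames through as they arrive and only waits when a packet is missing, instead of delaying every frame by a fixed amount. The sketch below is a hypothetical illustration of that idea; it is not how the demo or libwebrtc works, and it ignores sequence-number wraparound.

```python
from typing import Dict, List, Optional


class ReorderBuffer:
    """Release packets in sequence order, waiting only when there is a gap."""

    def __init__(self, depth: int = 2) -> None:
        # How many later packets may pile up before a missing one is skipped.
        self.depth = depth
        self._pending: Dict[int, bytes] = {}
        self._next_seq: Optional[int] = None

    def push(self, seq: int, payload: bytes) -> List[bytes]:
        """Add one packet and return the payloads that are now ready, in order."""
        if self._next_seq is None:
            self._next_seq = seq
        if seq < self._next_seq:
            return []  # Arrived after we gave up on it.
        self._pending[seq] = payload
        ready: List[bytes] = []
        while self._pending:
            if self._next_seq in self._pending:
                ready.append(self._pending.pop(self._next_seq))
                self._next_seq += 1
            elif len(self._pending) > self.depth:
                # Too much is waiting behind the gap: skip the missing packet.
                self._next_seq = min(self._pending)
            else:
                break
        return ready


if __name__ == "__main__":
    buf = ReorderBuffer(depth=2)
    print(buf.push(0, b"a"))  # in order: released immediately
    print(buf.push(2, b"c"))  # gap: held back
    print(buf.push(1, b"b"))  # gap filled: both released
```

In-order audio incurs no added delay here; the cost is only paid when packets actually arrive late or out of order, which a consumer that can pause, such as a transcription model, can tolerate.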

_def · 1y ago · 1 in thread

This was fun to try out. Earlier this week I tried june-va and the long response time kind of killed the usefulness. It's a great feature to get fast responses, this feels much more like a conversation. Funny enough, I asked it to tell me a story and then it only answered with one sentence at a time, requiring me to say "yes", "aha", "please continue" to get the next line. Then we had the following funny conversation:

"Oh I think I figured out your secret!"

"Please tell me"

"You achieve the short response times by keeping a short context"

"You're absolutely right"

danielbln · 1y ago

That works for me, to be honest. Not the short context, but definitely the short replies. Contrast that with the current implementation of ChatGPT's voice mode, where you ask something and then get a minute's worth of GPT bla bla.

andrewmcwatters · 1y ago · 1 in thread

I love it when engineers worth their salt actually do the back-of-the-envelope calculations for latency, etc.

Tangentially related, I remember years ago when Stadia and other cloud gaming products were being released doing such calculations and showing a buddy of mine that even in the best case scenario, you'd always have high enough input latency to make even casual multiplayer FPS games over cloud gaming services not feasible, or rather, comfortable, to play. Other slower-paced games might work, but nothing requiring serious twitch gameplay reaction times.

The same math holds up today because of a combination of fundamental limits and state of the art limits.

Flumio · 1y ago

The calculations I was reading at the time suggested it would work for casual play due to the gaming PC being very close to the game servers and running inside the best network available (Google's).

Google also said that the controller would send the input straight to the server.

And a fast stadia server should have good fps combined with a little bit of brain prediction

hackerbob · 1y ago · 1 in thread

This is indeed fast! Also seems to be no issue interrupting it while speaking. Is this using WebRTC echo cancellation to avoid microphone and speaker audio mix ups?

makeitmore · 1y ago

Yes, echo cancellation via the browser (and maybe also at OS-level too, if you're on a Mac with Sonoma.) The accuracy of speech detection vs. noise is largely thanks to Silero, which runs on the client via WASM. I'm surprised at how well it works, even in noisy environments (and a reminder that I should experiment more with AudioWorklet stuff in the future!)

yjftsjthsd-h · 1y ago · 1 in thread

Dumb question - I see 2 opus encodes and decodes for a total around 120ms; is opus the fastest option?

kwindla (OP) · 1y ago

Yes, Opus is the fastest and best option for real-time audio. It was designed to be flexible and to encode/decode at fairly low latencies. It sounds good for narrow-band (speech) at low bitrates but also works well at higher bitrates for music. And forward error correction is part of the codec standard.

It's possible to tweak the Opus settings to reduce that encode/decode latency substantially. Which might actually be worth doing for this use case. But there isn't quite a free lunch, here. The default Opus frame size is 20ms. Smaller frames lower the encoding/decoding latency, but increase the bitrate. The implementation in libwebrtc is very well tested and optimized for the default 20ms frame sizes and maybe not so much at other frame sizes. Experience has made me leery of taking the less-trodden-paths without a lot of manual testing.
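
One contributor to the bitrate cost of smaller frames is per-packet header overhead: halving the frame size doubles the packet rate. The arithmetic below assumes plain RTP over UDP over IPv4 with no header extensions and no SRTP authentication tag, so real WebRTC traffic would carry somewhat more per packet. The frame sizes are ones Opus supports.

```python
# Header bytes per packet: IPv4 (20) + UDP (8) + RTP (12).
# Assumption: no RTP header extensions and no SRTP auth tag.
HEADER_BYTES = 20 + 8 + 12


def packets_per_second(frame_ms: float) -> float:
    """One packet is sent per audio frame."""
    return 1000.0 / frame_ms


def header_overhead_kbps(frame_ms: float) -> float:
    """Bitrate spent on headers alone, in kilobits per second."""
    return packets_per_second(frame_ms) * HEADER_BYTES * 8 / 1000.0


if __name__ == "__main__":
    for frame_ms in (20, 10, 5, 2.5):
        print(f"{frame_ms:>4} ms frames: {header_overhead_kbps(frame_ms):6.1f} kbps of headers")
```

At the default 20 ms frame size this is 16 kbps of headers; at 5 ms it is 64 kbps, which can exceed the speech payload itself at low Opus bitrates.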

yalok · 1y ago · 1 in thread

You may be double counting the Opus encoding/decoding delay. Usually you can run it with a 20ms frame, and both encoder and decoder take less than 1ms of real time for their operation, so it should be ~21ms, instead of 30+30ms, for one direction.

kwindla (OP) · 1y ago

You are right! Thank you. I went back and looked at actual benchmark numbers from a couple of years ago and the numbers I got were ~26ms one-way. I rounded up to 30 to be conservative, but then double-counted in the table above. Will fix in the technical write-up. I don't think I can edit the Show HN.

spark_chicken · 1y ago · 1 in thread

I have tried it. It is really fast! I know making a real-time voice bot with latency this low is not easy. Which LLM did you use? How large does the LLM need to be to keep the conversation efficient?

makeitmore · 1y ago

This particular demo is using Llama3 8B. We initially started with 70B, but it was a touch slower and needed much more VRAM. We found 8B good enough for general chit-chat like in this demo. Most real-world use-cases will likely have their own fine-tuned models.

mdbackman · 1y ago

Very, very impressive! It's incredibly fast, maybe too fast, but I think that's the point. What's most impressive though is how the VAD and interruptions are tuned. That was, by far, the most natural sounding conversation I've had with an agent. Really excited to try this out once it's available.

trueforma · 1y ago

I too am excited about voice inferencing. I wrote my own WebSocket faster-whisper implementation before OpenAI's GPT-4o release. They steamrolled my interview coach concept https://intervu.trueforma.ai and https://sales.trueforma.ai (sales pitch coach implementations). I defaulted to a push-to-talk implementation as I couldn't get VAD to work reliably. I run it all on a LattePanda :) I was looking to implement Groq's hosted Whisper. I love the idea of having Llama3 uncensored on Groq as the LLM, as I'm tired of the boring corporate conversations. I hope to reduce my latency and learn from your examples. Kudos to your efforts. I wish I could try the demo; it seems to be oversubscribed, as I can't get in to talk to the bot. I'm sure my LattePanda would melt if just 3 people tried to run inference at the same time :)

asjir · 1y ago

Personally, I use https://github.com/foges/whisper-dictation with llama-70b on Groq. I start talking, navigate to the website, and by the time it has loaded and I've picked llama-70b, I've finished talking, so 0 overhead. I read much faster than I listen, so it works perfectly for me.

andrewstuart · 1y ago

Damned impressive.

Apple's Siri still can't allow me to have a conversation in which we aren't tripping over each other and pausing and flunking and the whole thing degrades into me hoping to get the barest minimum from it.

realyashnag · 1y ago

This was scary fast. Neat interface and (almost) indistinguishable from a human over the phone / internet. Kudos @cerebrium.ai.

etherealG · 1y ago

Moshi by Kyutai seems to have beaten your approach by about 500ms, and they're going to release it as open source.

https://www.youtube.com/live/hm2IJSKcYvo

hn discussion here: https://news.ycombinator.com/item?id=40866569

dijit · 1y ago

I’m genuinely shocked by how conversational this is.

I think you hit a very important nail on the head here; I feel like that scene in I, Robot where the protagonist talks to the hologram, or in the movie “AI” where the protagonist talks to an encyclopaedia called “Dr Know”.

anonzzzies · 1y ago

This is pretty amazing; it’s very fast indeed. I don’t really care about the voice responding sounding robotic; low latency is more important for whatever I do. And you can interrupt it too. Lovely.

mmcclure · 1y ago

Wow, Kwin, you’ve outdone yourself! The speed makes an even bigger difference than I expected going in.

Feels pretty wild/cool to say it might almost be too fast (in terms of feeling natural).

_DeadFred_ · 1y ago

This is super cool. Thanks for sharing. And I'm excited that it encourages others to share. I'm excited to spend some time this weekend looking at the different ways people in this thread implemented solutions.

jaybrendansmith · 1y ago

This thing is incredible. It finished a sentence I was saying.

gsjbjt · 1y ago

That's awesome - can you say anything about what datasets this was trained on? I assume something specifically conversational?

tamimio · 1y ago

Or we can say the latency is a good listening skill!! It was fast but occasionally interrupted me to answer.

andruby · 1y ago

This is really good. I'm blown away by how important the speed is.

And this was from a mobile connection in Europe, with a shown latency of just over 1s.

p_frank · 1y ago

Amazing to see the metrics of each part that is involved! I've wondered why you couldn't introduce a small sound to cover the waiting time. Like an "hmm" to hide a few hundred ms of the response time? It could be pregenerated (say, 500 different versions) and played 200ms after the user's last input.
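
A timeout-triggered filler along these lines could look like the sketch below: wait up to 200 ms for the real reply, and play a pregenerated clip only if it has not arrived by then. The filler list stands in for pregenerated audio clips, and `make_reply` and `play` are hypothetical callables supplied by the caller.

```python
import asyncio
import random
from typing import Awaitable, Callable

# Stand-ins for pregenerated audio clips.
FILLERS = ["hmm", "uh-huh", "let me think"]


async def reply_with_filler(
    make_reply: Callable[[], Awaitable[str]],
    play: Callable[[str], None],
    filler_after_s: float = 0.2,
) -> None:
    """Play a filler only if the real reply is not ready in time."""
    task = asyncio.ensure_future(make_reply())
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        play(random.choice(FILLERS))
    play(await task)


async def _slow_reply() -> str:
    """Hypothetical reply that takes longer than the filler threshold."""
    await asyncio.sleep(0.3)
    return "Here is the answer."


if __name__ == "__main__":
    asyncio.run(reply_with_filler(_slow_reply, print))
```

When the reply arrives within the threshold no filler is played, so fast responses are not delayed by it.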

sumedh · 1y ago

This is very impressive, my kid and I had fun talking about space.

ftth_finland · 1y ago

This is excellent!

Perfect comprehension and no problem even with bad accents.

isoprophlex · 1y ago

Jesus fuck that's fast, and I had no idea speed mattered that much. Incredible. Feels like an entirely different experience than the 5+ seconds latency with openai.

preciousoo · 1y ago

This is so cool!

Borborygymus · 1y ago

It /was/ nice and quick. Thanks for putting the demo online. It was quick to tell me complete nonsense. Apparently 7122 is the atomic number of Barium.

j / k navigate · click thread line to collapse

99 comments

83 comments · 36 top-level

geofffox1y ago· 9 in thread

I use Firefox... still.

makeitmore1y ago

Hi, I built the client UI for this and... yea, I really wanted to get Firefox working :(

The code for the UI VAD is here: https://github.com/pipecat-ai/web-client-ui/tree/main/src/va...

stavros1y ago

Do you know why there's a difference in the performance of the algorithm in another browser? I would expect that all browsers run the code exactly the same way.

4mitkumar1y ago

Do not go by the warning message. It does work just fine on Firefox latest. Cool, demo, btw!

panja1y ago

I hate that everyone just develops for chromium only

darren_1y ago

This site works fine in safari/mobile safari, it is not ‘chromium only’

1 more reply

RockRobotRock1y ago

Mozilla refuses to implement some really cool standards.

https://mozilla.github.io/standards-positions/

That, and their shitty management shakes my faith in Firefox

3 more replies

sa-code1y ago

Likely a lot of people on HN use Firefox

chungus1y ago

It is working perfectly for me on Firefox (version 127).

makeitmore1y ago

Thanks for sharing. I did make some changes that seems to have improved things, although I do still see the occasional misfire. Perhaps good enough to remove that ugly red banner though!

firefoxd1y ago· 7 in thread

Well that was fast. Kudos, really neat. Speed trumps everything else. I only noticed the robotic voice after I read the comments.

I worked on an Ai for customer service. Our agent took the average response time of 24/48 hours to merely seconds.

One of the messages that went to a customer was "Hello Bitch, your package will be picked up by USPS today, here is the tracking number..."

The customer responded "thank you so much" and gave us a perfect score in CSAT rating. Speed trumps everything, even when you make such a horrible mistake.

lukan1y ago

"The customer responded "thank you so much" and gave us a perfect score in CSAT rating. Speed trumps everything, even when you make such a horrible mistake."

I think not everyone would react the same way. For some calling each other bitch is normal talk (which is likely, why I it got into the training data in the first place). For others, not so much.

9999000009991y ago

If I'm used to waiting 2 days, and you get it down to 30 seconds you can call me what ever you want.

I'm more pissed if I'm waiting days for a response.

1 more reply

jstanley1y ago

It's also possible that it's such an unlikely thing to hear that she actually misheard it and thought it said something nicer.

1 more reply

firefoxd1y ago

Fun fact, we fixed this issue by adding a #profanity tag and dropping the message to the next human agent.

Now our most prolific sales engineer could no longer run demos to potential clients. He had many embarrassing calls where the Ai would just not respond. His last name was Dick.

leobg1y ago

bedel231y ago

I wonder if the solution is to run the message through another LLM to make the message as polite as possible removing any profanities. Cost >2x as much to run though.

asjir1y ago

Maybe that was their first name, at least the one they put in lol

vessenes1y ago· 4 in thread

Humans work like this; we use lots of filler words as we sort of get going responding to things.

Very cool, thank you for sharing this.

za_mike1571y ago

Hi Vessenes

From Cerebrium here. Really appreciate the feedback - glad you had a good experience!

This application is easy to extend, meaning you can edit it however you like:
- Swap in different LLMs, STT and TTS models
- Change prompts, implement RAG, etc.

You can read more about how to extend it here: https://docs.cerebrium.ai/v4/examples/realtime-voice-agents

vessenes1y ago

c0brac0bra1y ago

I've wondered about this as well. Is there a way to have a small, efficient LLM model that can estimate general task complexity without actually running the full task workload?

Scoring complexity on a gradient would let you know you need to send a "Sure, one second let me look that up for you." instead of waiting for a long round trip.
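
A rough sketch of what I mean, with a keyword heuristic standing in for the small model. Everything here (`estimate_complexity`, `SLOW_CUES`, the weights and the 0.5 threshold) is made up for illustration; a real version would call an actual classifier.

```python
FILLER = "Sure, one second, let me look that up for you."

# Hypothetical cue phrases suggesting the request needs a lookup or long reasoning.
SLOW_CUES = ("look up", "search", "compare", "calculate", "summarize")

def estimate_complexity(request):
    """Stand-in for a small classifier model: return a score in [0, 1]."""
    text = request.lower()
    cue_hits = sum(cue in text for cue in SLOW_CUES)
    length_score = min(len(text.split()) / 40, 1.0)
    return min(1.0, 0.5 * cue_hits + 0.5 * length_score)

def first_utterance(request, threshold=0.5):
    """Return a filler phrase to speak immediately, or None to answer directly."""
    return FILLER if estimate_complexity(request) >= threshold else None

print(first_utterance("What time is it?"))
print(first_utterance("Can you look up the tracking number for my order?"))
```

The point is only that the scoring step has to be much cheaper than the full round trip, otherwise the filler arrives too late to help.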

vessenes1y ago

For sure: in fact MoE models train such a router directly, and the routers are not super large. But it would also be easy to run phi-3 against a request.

I almost think you could do like a check my work style response: ‘I’m pretty sure xx, .. wait, actually y.’ Or if you were right, ‘yep that’s correct. I just checked.’

There’s time in there to do the check and to get the large model to bridge the first sentence with the final response.

az2261y ago· 4 in thread

Your marketing says 500 but your math says 759.

dietr1ch1y ago

That's called marketing

vessenes1y ago

My tests had one outlier at 1400ms, and ten or so between 400-500ms. I think the marketing numbers were fair.

whizzter1y ago

The 500 covers the transcription/LLM/TTS steps (i.e. the response time from data arriving on the server to the reply being sent back); the rest seems to be various non-AI "overheads" such as encoding, network traffic, etc.
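
To make the arithmetic concrete, here is a quick check using the numbers from the table in the post. Which rows count as "AI steps" is my own reading, not something the post states.

```python
# Typical latencies in milliseconds, copied from the table in the original post.
STAGES = [
    ("macOS mic input", 40),
    ("opus encoding", 30),
    ("network stack and transit", 10),
    ("packet handling", 2),
    ("jitter buffer", 40),
    ("opus decoding", 30),
    ("transcription and endpointing", 200),
    ("llm ttfb", 100),
    ("sentence aggregation", 100),
    ("tts ttfb", 80),
    ("opus encoding", 30),
    ("packet handling", 2),
    ("network stack and transit", 10),
    ("jitter buffer", 40),
    ("opus decoding", 30),
    ("macOS speaker output", 15),
]

# Assumption: these four rows are the transcription/LLM/TTS portion.
AI_STEPS = {
    "transcription and endpointing",
    "llm ttfb",
    "sentence aggregation",
    "tts ttfb",
}

total = sum(ms for _, ms in STAGES)
ai_ms = sum(ms for name, ms in STAGES if name in AI_STEPS)
overhead = total - ai_ms

print(total, ai_ms, overhead)  # 759 480 279
```

So the AI portion comes to roughly 480ms, and the remaining 279ms or so is audio capture, codec, network and playback.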

vr000m1y ago

The latencies in the table are based on heuristics or averages that we’ve observed. However, in reality, based on the conversation, some of the larger latency components can be much lower.

aussieguy12341y ago· 4 in thread

Fast yes, but the voice sounds robotic.

bombela1y ago

I prefer a slightly robotic voice. This way I know I am talking to a bot, and this sets expectations.

lofties1y ago

kwindlaOP1y ago

Voice models are getting both faster and more natural at a, well, a fast clip.

cloudking1y ago

It's literally a robot

luke-stanley1y ago· 3 in thread

There are browser text-to-speech engines too, starting to get faster and higher quality. It would be great if browsers shipped with great TTS.

I assume people are working on similar open integrated multimodal large language models that have audio input and output (visual input too)!

I do wonder how needed or optimal a single combined model is for latency and cost optimisation.

The breakdown provided is interesting.

phkahler1y ago

>> I'm curious about the browser providing this as a native option too.

regularfry1y ago

charlesyu1081y ago

Lol, this announcement blows what I've been working on out of the water, but I have a simple assistant implementation with rick0123/VAD + Websockets.

https://github.com/charlesyu108/voiceai-js-starter

spuz1y ago· 3 in thread

kwindlaOP1y ago

Full technical write-up here: https://www.daily.co/blog/the-worlds-fastest-voice-bot/

pavlov1y ago

It's a voice-to-text-to-voice approach, as implied by this description:

"host transcription, LLM inference, and voice generation all together in one place"

isaacfung1y ago

There is way more text training data than voice data. It also allows you to use all the benchmarks and tool integrations that have already been developed for LLMs.

SubiculumCode1y ago· 3 in thread

makeitmore1y ago

SubiculumCode1y ago

ps. The speed is impressive, but the key to a useful voice chatbot (which I've never seen) is one that adapts to your speaking style, identifies and employs turn-taking signals.

SubiculumCode1y ago

I'm sure that, with an annotated dataset, a model could learn to pick up on the right cues.

c0brac0bra1y ago· 2 in thread

I've been developing with Deepgram for a while, and this is one of the coolest demos I've seen with it!

To get the latency reduction of several hundred milliseconds, how much more would it cost for "average" usage?

za_mike1571y ago

Hey! From the Cerebrium team here!

So our costs are based on the infra you use to run your application and we charge per millisecond of compute.

c0brac0bra1y ago

Excellent; thanks for the info.

amluto1y ago· 2 in thread

Maybe silly question:

> jitter buffer [40ms]

Olreich1y ago

Almost any gap in audio is detectable and sounds really bad. 40ms is a lot, but sending 40ms of silence is probably worse
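
As a toy illustration of the idea (not how any particular WebRTC stack implements it): hold each packet for a fixed delay so that a late or reordered packet can still be slotted in before playout. The `JitterBuffer` class and its fixed 40ms hold are assumptions for the sketch; real jitter buffers are adaptive and schedule playout by media timestamp.

```python
import heapq

class JitterBuffer:
    """Hold packets for a fixed delay so reordered packets play in sequence."""

    def __init__(self, delay_ms=40):
        self.delay_ms = delay_ms
        self._heap = []  # entries are (sequence number, arrival time in ms, payload)

    def push(self, seq, arrival_ms, payload):
        heapq.heappush(self._heap, (seq, arrival_ms, payload))

    def pop_ready(self, now_ms):
        """Return payloads, in sequence order, whose hold time has elapsed."""
        ready = []
        while self._heap and now_ms - self._heap[0][1] >= self.delay_ms:
            ready.append(heapq.heappop(self._heap)[2])
        return ready

buf = JitterBuffer(delay_ms=40)
buf.push(2, arrival_ms=0, payload=b"second")   # arrived early, out of order
buf.push(1, arrival_ms=10, payload=b"first")   # arrived late
print(buf.pop_ready(now_ms=40))  # packet 1 has only been held 30 ms, nothing released
print(buf.pop_ready(now_ms=50))  # both released, in sequence order
```

The 40ms is the price of being able to do that reordering instead of playing a gap.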

amluto1y ago

_def1y ago· 1 in thread

"Oh I think I figured out your secret!"

"Please tell me"

"You achieve the short response times by keeping a short context"

"You're absolutely right"

danielbln1y ago

andrewmcwatters1y ago· 1 in thread

I love it when engineers worth their salt actually do the back-of-the-envelope calculations for latency, etc.

The same math holds up today because of a combination of fundamental limits and state of the art limits.

Flumio1y ago

The calculations I was reading at the time suggested it would work for casual gaming due to the gaming PC being very close to the game servers and running inside the best network available (Google's).

Google also said that the controller would send the input straight to the server.

And a fast stadia server should have good fps combined with a little bit of brain prediction

hackerbob1y ago· 1 in thread

This is indeed fast! Also seems to be no issue interrupting it while speaking. Is this using WebRTC echo cancellation to avoid microphone and speaker audio mix ups?

makeitmore1y ago

yjftsjthsd-h1y ago· 1 in thread

Dumb question - I see 2 opus encodes and decodes for a total around 120ms; is opus the fastest option?

kwindlaOP1y ago

yalok1y ago· 1 in thread

kwindlaOP1y ago

spark_chicken1y ago· 1 in thread

I have tried it. It is really fast! I know making a real-time voice bot with latency this low is not easy. Which LLM did you use? How large does the LLM need to be to keep the conversation efficient?

makeitmore1y ago

mdbackman1y ago

trueforma1y ago

asjir1y ago

andrewstuart1y ago

Damned impressive.

realyashnag1y ago

This was scary fast. Neat interface and (almost) indistinguishable from a human over the phone / internet. Kudos @cerebrium.ai.

etherealG1y ago

moshi by Kyutai seems to have beaten your approach by about 500ms, and they're going to release open source.

https://www.youtube.com/live/hm2IJSKcYvo

hn discussion here: https://news.ycombinator.com/item?id=40866569

dijit1y ago

I’m genuinely shocked by how conversational this is.

anonzzzies1y ago

mmcclure1y ago

Wow, Kwin, you’ve outdone yourself! The speed makes an even bigger difference than I expected going in.

Feels pretty wild/cool to say it might almost be too fast (in terms of feeling natural).

_DeadFred_1y ago

jaybrendansmith1y ago

This thing is incredible. It finished a sentence I was saying.

gsjbjt1y ago

That's awesome - can you say anything about what datasets this was trained on? I assume something specifically conversational?

tamimio1y ago

Or we can say the latency is a good listening skill!! It was fast but occasionally interrupted me to answer.

andruby1y ago

This is really good. I'm blown away by how important the speed is.

And this was from a mobile connection in Europe, with a shown latency of just over 1s.

p_frank1y ago

sumedh1y ago

This is very impressive, me and my kid had fun talking about space.

ftth_finland1y ago

This is excellent!

Perfect comprehension and no problem even with bad accents.

isoprophlex1y ago

Jesus fuck that's fast, and I had no idea speed mattered that much. Incredible. Feels like an entirely different experience than the 5+ seconds latency with openai.

preciousoo1y ago

This is so cool!

Borborygymus1y ago

It /was/ nice and quick. Thanks for putting the demo online. It was quick to tell me complete nonsense. Apparently 7122 is the atomic number of Barium.
