I also struggle to find the word I want as I've gotten older and slower, and this fast-voice-gpt just ends up frustrating me more than helping. I have to sit there and think out the whole sentence in my head before I say anything -- not very natural.
Suppose you have 100ms audio latency and no wait time. Then a natural pause will trigger a response immediately, but you won't notice it has started until ~200ms later (the round-trip time). Twice as annoying.
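To make that arithmetic concrete, here is a minimal sketch. The function name and the simple model (perceived delay = one trip up + configured wait + one trip back, ignoring inference time) are assumptions for illustration, not measurements from any real system.

```python
def perceived_delay_ms(one_way_latency_ms: float, wait_ms: float = 0.0) -> float:
    """Time from the end of the user's speech until a reply can be heard.

    Assumed model: the pause has to reach the server (one way), the server
    waits `wait_ms` before deciding the turn is over, and the first audio
    has to travel back (one way). Inference time is ignored.
    """
    return one_way_latency_ms + wait_ms + one_way_latency_ms


if __name__ == "__main__":
    print(perceived_delay_ms(100))        # 100 ms each way, no wait
    print(perceived_delay_ms(100, 1500))  # same link, 1.5 s end-of-turn wait
```

With no wait, the only thing left in this model is the round trip, which is why a reply triggered by a pause is already underway before you can hear it.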
I think the solution is to handle pauses more intelligently rather than having a higher latency protocol. With low latency you can interrupt and the bot can immediately stop rambling.
I often use it while I’m walking and tell it to not respond until I initiate a conversation.
The actual implementation is at fault. I had some luck with instructing the model to only respond with "Mhm" until I've explicitly finished my thought and asked it a question. Makes this much less of an issue.
But I've decided that their voice mode is completely unusable for a different reason: the model feels incredibly dumb to interact with, keeps repeating and re-phrasing what I said, ends every single answer with a "hook" making the entire interaction idiotically robotic, completely ignores instructions when you ask it to stop that, and - most importantly - doesn't feel helpful for brainstorming. I was completely surprised how bad it is in practice; this should be their killer app but the model feels incredibly badly tuned.
Knowing when to respond requires semantic understanding, which probably only the model itself is capable of.
Maybe it’s hard for them to train it to only respond once it seems appropriate to do so?
They tried to make it mimic the way Japanese is full of really quick acknowledgement sounds and it seems to allow it to handle those pauses and interruptions really well.
https://en.nagoya-u.ac.jp/news/articles/say-hello-to-j-moshi... (english)
https://nu-dialogue.github.io/j-moshi/ (japanese and english)
I must admit it's a bit weird when LLMs laugh. I don't really know how I feel about that, but it seems to laugh at the right times. Very tangential, but cockatoos have been known to mimic the right time to laugh, presumably based on tonal cues that a joke was just made (I have experienced this first hand with rescue birds who live amongst humans).
I also think it spends most of its iq on sounding good rather than thinking about the problem. “Yeah absolutely I can see why you’d like to…” etc. This is likely because it’s on a timer and maybe voice is more expensive to process? Text responses spend more time on the task.
I don’t think it even has reasoning tokens, so it’s no surprise that it’s at most as smart as the “instant” models (i.e., not very).
I've tried to convey this to OpenAI through various available channels (dev forums, app feedback, etc.).
Grok solves this by having an optional push-to-talk mode, but this is not hands-free and thus more cumbersome than just having a user-configurable variable like seconds_delay_before_sending_voice_input.
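A rough sketch of what such a setting could look like on the client side. Everything here is hypothetical: `SilenceEndpointer`, `feed`, and the per-frame `is_speech` flag (which would come from whatever VAD you use) are made up for illustration; only the variable name comes from the comment above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SilenceEndpointer:
    """Decides when to send buffered voice input, based on trailing silence."""

    # User-configurable: how long the speaker must stay silent before sending.
    seconds_delay_before_sending_voice_input: float = 2.5
    _last_speech_at: Optional[float] = None
    _sent: bool = False

    def feed(self, timestamp: float, is_speech: bool) -> bool:
        """Feed one VAD frame; return True exactly once per utterance,
        at the moment the input should be sent."""
        if is_speech:
            # Any speech restarts the silence timer and re-arms sending.
            self._last_speech_at = timestamp
            self._sent = False
            return False
        if self._last_speech_at is None or self._sent:
            # Nothing spoken yet, or this utterance was already sent.
            return False
        silence = timestamp - self._last_speech_at
        if silence >= self.seconds_delay_before_sending_voice_input:
            self._sent = True
            return True
        return False


if __name__ == "__main__":
    ep = SilenceEndpointer(seconds_delay_before_sending_voice_input=2.0)
    frames = [(0.0, True), (1.0, False), (1.5, True), (3.0, False), (3.5, False)]
    for t, speech in frames:
        print(t, ep.feed(t, speech))
```

A fixed timer like this is the dumb version; it only trades latency for fewer interruptions and does nothing semantic about whether the thought is actually finished.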
1-2s replies feel natural and, like you pointed out, pausing for 2-3s mid-sentence is super normal.
Curious if you thought their approach was necessary, it seemed like a ton of complexity to reduce one of the faster parts of a voice AI setup. Having a fast model and accurate VAD seems way more important than fine tuning WebRTC transit times.
I think it’s a case of improving what you own. The owners of the WebRTC servers were aggressively improving their part. They don’t own the inference servers.
I read parts of it a while ago when I had an idea on using webRTC data channels to pass data from databases to browser clients via a CLI. Your book made me understand that it's probably not a great fit for my use case. I just used a centralized control plane and websockets instead.
I still feel like there is something fun that we can do with webRTC data channels + zero copy Apache Arrow arraybuffers + duckdb WASM, but haven't figured it out yet
You can't beat Websockets :) Especially since you have so much tooling/existing stuff that works with HTTP.
I have been trying to get a website off the ground that does Datachannels + SQLite in the browser and then users sync between each other. I have gotten distracted so many times though.
> WebRTC is a standardized protocol for P2P communication. It allows two peers to exchange media and data. It is encrypted by default, and handles connectivity establishment in many different network conditions. It is supported in browsers, and has multiple out of browser implementations.[0]
  import ("github.com/go-sql-driver/mysql")

so it's standard to have the library files in the root directory.

  ├── cmd
  │   └── binary-name
  │       └── main.go (may subpackage for things like CLI porcelain, etc)
  ├── go.mod
  └── internal
      └── app.go (and subpackages, etc)

To me Go code looks like somebody vomited stuff in the root dir and I have to wade through that every time. No namespacing. Nothing.
Still, it’s worth keeping in mind that these are no longer frontier models, unlike when they were released.
(Please Sam, if you read this, release the new realtime audio models)
Google’s Gemini flash live 3.1 is better, especially used via the API - it can do tool calling (including to other, even smarter LLMs if you set it up yourself), you can set the reasoning level (even high is still close enough to realtime) and it can ground answers in google search. I love bidirectional voice and right now it’s probably the best option. You can try it in AI studio
Just give me an option to have a slower response but a better model…
But personally I've settled on just speaking to the slower models over a custom tts app. I find that it being instant was not actually that important, and in the silence I find myself marinating in the discussion more anyway.
Pipecat's smart turn model is really good for VAD - https://huggingface.co/pipecat-ai/smart-turn-v3
https://github.com/zarldev/zarl & https://www.zarl.dev/posts/hal-by-any-other-name
And it's fully OSS, like n8n for voice AI, and you can use it with OpenClaw or Claude Code; recently launched MCPs. GitHub: https://github.com/dograh-hq/dograh and YouTube: https://www.youtube.com/watch?v=sxiSp4JXqws&list=PLDqzGuN7B1...
- openai is wrong. almost all of the issues they described are issues with libwebrtc, not with webrtc, kubernetes, network architecture, etc. the clue was when they said "the conventional one-port-per-session WebRTC model."
- there are no alternatives worth trying. everything else open source in the ecosystem, like pion, coturn, stunner, is too immature.
- libwebrtc is the only game in town.
- they haven't discovered libwebrtc feature flags or how it works with candidates, which directly fix a bunch of the latency issues they are discovering. a correct feature flag can instantly reduce latency for free, compared to paying for twilio network traversal style solutions
- 99% of low latency voice END USERS will be in a network situation that can eliminate relays, transceivers, etc. it is totally first class on kubernetes. but you have to know something :)
this is the first time i'm experiencing gell mann amnesia with openai! look those guys are brilliant, but there is hardly anyone in the world who is doing this stuff correctly.
Even for clients you have things like libpeer that libwebrtc can't hit.
i think the challenge is that pion is an excellent product today. it would benefit me if its innovations were subsumed into libwebrtc, because eventually those innovations will show up in the iOS stack, which is one of the customers that matter to me. it is subjective if it is the MOST important customer, that is my belief and it is probably true of openai, at least until they get their own device out the door.
there can be many, many use cases though! not everything has to be, try to make the thing for 1b people that has to interact with all the most powerful and meanest businesses on the planet.
Yet another reason to not consider anything else like that for low-latency networking. Golang (or even Rust and C++) is unmatched for this use-case.
Node.js's initial release was May 27, 2009
Golang's initial release was November 10, 2009
They're different, yes, but it's not like
Surely the number refers to the total users of ChatGPT overall, and the fraction of those who use voice features is considerably smaller, is it not?
That’s the kind of thing that influences business decisions like knowing how much hardware and software optimization to throw at a problem.
I wonder if they run the STT model's output through the current model (that we're chatting with) as a final pass - since the text seems to be well aligned to the current conversation context.
For long prompts, I often speak to OAI web/app and copy-paste the text to Claude / Gemini :)
As someone used to podcasts at 3x speed and SAPI text-to-speech at a much higher rate, listening to AI at human speech rate is a chore.
lol, definitely didn't need to know there's 900M weekly users for this post. I mean yeah, there's a lot of users and they serve globally, that's relevant. But this is just pulling out your biggest stat because you can. How many voice users you have would actually be relevant and interesting but, to baselessly speculate on motivation here, might be a number that doesn't add as much fuel to an upcoming IPO as reminding people that you're almost at a billion users does.
WebRTC + Kubernetes