How OpenAI delivers low-latency voice AI at scale

(openai.com)

510 points · Sean-Der · 1mo ago · 146 comments

108 comments · 30 top-level

legohead1mo ago· 26 in thread

The low latency is more of a pain point than a good thing, the way they have it implemented. Trying to have a casual conversation with it, as humans we naturally pause, and GPT will take this as you are "done" and start blabbing away.

I also suffer from finding the appropriate word I want as I've gotten older and slower, and this fast-voice-gpt just ends up frustrating me more than helping. I have to sit there and think out the whole sentence in my head before I say anything -- not very natural.

zamadatix1mo ago

I think these are 2 different layers of "latency". The latency in the article is referring to the transport of the audio stream itself while the latency in your scenario is about how quickly to start responding inside the audio stream.

hun31mo ago

They are orthogonal.

Suppose you have 100ms audio latency and no wait time. Then, natural pause will trigger response immediately but you won't notice it has started until after ~200ms (round-trip time). Twice as annoying.

ericmcer1mo ago

I think he’s saying they are doing an insane level of complexity to shave ~100ms off response times in a scenario where that isn’t important and might even be a negative

janalsncm1mo ago

I’ve also experienced this and it’s really annoying. There is this pressure to keep talking if I’m not done with my thought that feels pretty unnatural at least for me. If I’m searching for the right word, I want the opportunity to find it.

I think the solution is to handle pauses more intelligently rather than having a higher latency protocol. With low latency you can interrupt and the bot can immediately stop rambling.

wnmurphy1mo ago

100%. I have to hold the floor by filling the space with "ummmmmmmm.... uhhhh...." which inevitably distracts me from my point altogether. Poor user experience.

taneq1mo ago

I find this is a problem even with human conversations. Some people just aren’t very good at telegraphing when they’ve finished ‘their turn’ talking. Or worse yet, aren’t willing to take turns in the first place.

discordance1mo ago

Have you tried telling it to pause to let you think?

I often use it while I’m walking and tell it to not respond until I initiate a conversation.

throwuxiytayq1mo ago

With higher latency this would be even more of an issue. When you pause and start talking again, the model wouldn't catch that until it has already interrupted you.

The actual implementation is at fault. I had some luck with instructing the model to only respond with "Mhm" until I've explicitly finished my thought and asked it a question. Makes this much less of an issue.

But I've decided that their voice mode is completely unusable for a different reason: the model feels incredibly dumb to interact with, keeps repeating and re-phrasing what I said, ends every single answer with a "hook" making the entire interaction idiotically robotic, completely ignores instructions when you ask it to stop that, and - most importantly - doesn't feel helpful for brainstorming. I was completely surprised how bad it is in practice; this should be their killer app but the model feels incredibly badly tuned.

dtran1mo ago

This has more to do with Voice Activity Detection (VAD) than the latency described in the article

lxgr1mo ago

That seems to be the issue: VAD is insufficient here.

Knowing when to respond requires semantic understanding, which probably only the model itself is capable of.

Maybe it’s hard for them to train it to only respond once it seems appropriate to do so?

wnmurphy1mo ago

Exactly. It's a tangent, but clearly a pain point for enough users.

ehnto1mo ago

There's a really interesting project in Japanese natural language processing called J-Moshi that had a novel approach and in my opinion good results.

They tried to make it mimic the way Japanese is full of really quick acknowledgement sounds and it seems to allow it to handle those pauses and interruptions really well.

https://en.nagoya-u.ac.jp/news/articles/say-hello-to-j-moshi... (english)

https://nu-dialogue.github.io/j-moshi/ (japanese and english)

I must admit it's a bit weird when LLMs laugh, I don't really know how I feel about that but it seems to laugh at the right times. Very tangential, but cockatoos have been known to mimic the right time to laugh, presumably based on tonal cues that a joke was just made (I have experienced this first hand with rescue birds who live amongst humans).

saturdaysaint1mo ago

In voice conversations I tell it not to reply at all or only say “Understood” until I use some kind of code word. Not perfect, but less intrusive.

jdironman1mo ago

Roger that, over.

futureshock1mo ago

Reducing the network latency helps with this exactly. OpenAI can make better timed decisions when to begin responding so it'll feel less like an interruption. I've also seen some research on full duplex voice models that handle interruption more like an organic conversation and low latency will help there as well

richardw1mo ago

Hard problem. I find myself adding in filler to stop the thing from jabbering.

I also think it spends most of its iq on sounding good rather than thinking about the problem. “Yeah absolutely I can see why you’d like to…” etc. This is likely because it’s on a timer and maybe voice is more expensive to process? Text responses spend more time on the task.

lxgr1mo ago

Their voice capable model is several generations behind the state of the art text-only one, as far as I know.

I don’t think it even has reasoning tokens, so it’s no surprise that it’s at most as smart as the “instant” models (i.e., not very).

asdfman1231mo ago

Fwiw you can prompt it to respond differently to you.

jameshush1mo ago

This is more of a VAD/turn detection issue. It's gotten a lot better over the last few years, but it's a hard problem. The extra ~100ms of latency makes a huge difference otherwise, especially when you have use cases that require tool calling that can easily add 500ms+ of latency.

angry_octet1mo ago

It seems that tool calling shouldn't be 500ms of latency?

miki_oomiri1mo ago

People are migrating to the "End Of Thought" triggers. Deepgram does that wonderfully.

christophilus1mo ago

Agreed. It’s stressful. I think they need to have an option to adopt a suffix, so they don’t start babbling until there is an “over” followed by a pause like in the old army walkie talkie days.

wnmurphy1mo ago

Strongly agree, some of us like to choose our words more carefully when interacting with an LLM.

I've tried to convey this to OpenAI through various available channels (dev forums, app feedback, etc.).

Grok solves this by having an optional push-to-talk mode, but this is not hands-free and thus more cumbersome than just having a user-configurable variable like seconds_delay_before_sending_voice_input.

ericmcer1mo ago

yeh exactly, you cannot get a strong signal that a user is done speaking without some amount of “wait for 500ms of silence”. You could kick off processing and abandon it if they continued talking, but that seems over-optimized.

1-2s replies feel natural and like you pointed out pausing for 2-3s mid sentence is super normal.

charcircuit1mo ago

The AI should be able to model a probability for when is a natural moment to start talking.

MagicMoonlight1mo ago

It’s possible to change the amount of time it waits if you’re using the API

Sean-DerOP1mo ago· 16 in thread

Very grateful that OpenAI published the article and publicized their usage of Pion [0], a library I work on. If you aren't familiar with WebRTC, it's a super fun space. I work on a book, WebRTC for the Curious [1], that details how it works.

[0] https://github.com/pion/webrtc

[1] https://webrtcforthecurious.com

ericmcer1mo ago

I use pion thanks for making it!

Curious if you thought their approach was necessary, it seemed like a ton of complexity to reduce one of the faster parts of a voice AI setup. Having a fast model and accurate VAD seems way more important than fine tuning WebRTC transit times.

Sean-DerOP1mo ago

Thanks for using it :)

I think it’s a case of you improve what you own. The owners of WebRTC servers were aggressively improving their part. They don’t own the inference servers.

aleda1451mo ago

Appreciate you putting the entire book online!

I read parts of it a while ago when I had an idea on using webRTC data channels to pass data from databases to browser clients via a CLI. Your book made me understand that it's probably not a great fit for my use case. I just used a centralized control plane and websockets instead.

I still feel like there is something fun that we can do with webRTC data channels + zero copy Apache Arrow arraybuffers + duckdb WASM, but haven't figured it out yet

Sean-DerOP1mo ago

Thanks for reading it!

You can't beat Websockets :) Especially since you have so much tooling/existing stuff that works with HTTP.

I have been trying to get a website off the ground that does Datachannels + SQlite in the browser and then users sync between each other. I have gotten distracted so many times though.

oezi1mo ago

What is preventing the fun is that even though we now have IPv6 widely enough available we still can't have p2p connections in the browser without a cumbersome control plane of servers. If you could join a federation in the browser from some bootstrap IPs then I think we could have some real distributed fun.

dtran1mo ago

Thanks for WebRTC for the Curious and for Pion! Not using the latter directly, but have used both to better understand WebRTC

ssgodderidge1mo ago

For those unfamiliar with WebRTC, the Pion FAQ page has a good description:

> WebRTC is a standardized protocol for P2P communication. It allows two peers to exchange media and data. It is encrypted by default, and handles connectivity establishment in many different network conditions. It is supported in browsers, and has multiple out of browser implementations.[0]

[0]: https://github.com/pion/webrtc/wiki/FAQ#what-is-webrtc

thatxliner1mo ago

slightly unrelated but what’s with storing the entire codebase in the root directory instead of a nested src folder? It makes getting to the README a lot more difficult

nemothekid1mo ago

That's the default for Go projects. Go imports are repository strings (e.g.):

     import ("github.com/go-sql-driver/mysql")

so it's standard to have the library files in the root directory.

dgunay1mo ago

FWIW I usually don't structure my Go projects this way unless they're very very small. This is what I usually do for anything larger than 2-3 files:

  ├── cmd
  │   └── binary-name
  │       └── main.go (may subpackage for things like CLI porcelain, etc)
  ├── go.mod
  └── internal
      └── app.go (and subpackages, etc)

saagarjha1mo ago

I assume this is why GitHub has the annoying #readme-ov-file slug

a4564631mo ago

This is valid criticism. Go fanbois don't like listening to any go criticism. They were all like who needs templates in go. and now go has templates.

To me go code looks like somebody vomited stuff in the root dir and i have to wade through that every time. No namespacing. nothing

willmeyers1mo ago

WebRTC is great and so is Pion, thanks for help making and maintaining it! I loved learning about WebRTC from WebRTC for the Curious!!!

ryanar1mo ago

I used pion and it was fantastic. Most of the article seems pretty standard webrtc techniques for performant voice.

haaz1mo ago

Only a software dev would start their referencing at 0 lol

vorticalbox1mo ago

I do this too I never made the connection.

Lucasoato1mo ago· 9 in thread

Wait a minute... I’m genuinely happy that they are sharing this, but keep in mind that the realtime audio models from OpenAI are still stuck with the 4o family in terms of capabilities, sadly. I still find them so useful, such a pity that there’s no real competitor in this segment; having the experience of a real conversation has helped me so much in expressing ideas and concepts.

Still, it’s worth keeping in mind that these are not frontier models, unlike when they were released.

(Please Sam, if you read this, release the new realtime audio models)

dharma11mo ago

Yes the voice part of OpenAI realtime/voice mode is great but it’s pretty dumb compared to newer models and often gets stuck repeating itself.

Google’s Gemini flash live 3.1 is better, especially used via the API - it can do tool calling (including to other, even smarter LLMs if you set it up yourself), you can set the reasoning level (even high is still close enough to realtime) and it can ground answers in google search. I love bidirectional voice and right now it’s probably the best option. You can try it in AI studio

Lucasoato1mo ago

Thanks, I’ll try it, even if my experience wasn’t that great with Google models lately (503s)

modeless1mo ago

Grok voice is surprisingly good, actually. It's still a dumber model than the thinking modes of frontier models, but it's less dumb than the voice modes of other providers.

artdigital1mo ago

Grok voice model is also a thinking model. I agree that it’s far better than the other voice models

Just give me an option to have a slower response but a better model…

artdigital1mo ago

This is what makes their voice mode unusable to me. I can’t stand the way 4o replies and it’s such a big jump in quality from text mode

radicality1mo ago

Yeah I was quite surprised that the advanced chat gpt voice mode can’t itself go and message the frontier model underneath to retrieve data and then speak it. I basically tried asking it for that (something like “can you go and ask gpt5.5 to research this more in depth, and while we wait, tell me about XYZ”), but apparently that’s not a thing.

TomGarden1mo ago

Claude voice mode has come a long way! I'd say it's smarter than CGPT AVM last time i tried it.

But personally I've settled on just speaking to the slower models over a custom tts app, I find it being instant was not actually that important, and in the silence I find myself marinating in the discussion more anyway

sails1mo ago

You can feel what is possible using Gemini speech to speech model, it can do tool calls and is very fast. It lacks somewhat in thinking capability but you can setup a tool call to a smarter model and it acts as a relay. I’ve been very impressed.

ddp261mo ago

Yeah, the question in the title can be answered: "by using gpt-4o, a model 2 years behind the frontier, to serve audio responses"

Aeroi1mo ago· 5 in thread

if anyone is looking to get into this. pipecat is a great open-source repo and community. https://github.com/pipecat-ai/pipecat

pncnmnp1mo ago

I wish I had known about Pipecat a lot sooner. I found out about it a few weeks back, and since Gemma 4 launched, I've been building my own entirely local voice assistant using Gemma 4 + Kokoro TTS + Whisper from scratch - https://github.com/pncnmnp/strawberry.

Pipecat's smart turn model is really good for VAD - https://huggingface.co/pipecat-ai/smart-turn-v3

zarldev1mo ago

Yeah Gemma4 was and is great fun to do this with - I too am building pretty much the same as yourself in Go.

https://github.com/zarldev/zarl & https://www.zarl.dev/posts/hal-by-any-other-name

AnthOlei1mo ago

What do you have going on the hardware side? I want to plug this into hass but don’t know what hardware I need for reasonable latency

pk-voice1mo ago

If you like Pipecat’s focus on speed, you might also try out our open source, which comes with all the batteries included (knowledge base, telephony/SIP, variables, BYOK any LLM STT TTS, Speech to Speech, etc )

And it's fully OSS, like n8n for voice AI, and you can use it with OpenClaw or Claude Code; we recently launched MCPs. GitHub: https://github.com/dograh-hq/dograh, YouTube: https://www.youtube.com/watch?v=sxiSp4JXqws&list=PLDqzGuN7B1...

BoxedEmpathy1mo ago

I've been looking at this! Great project.

doctorpangloss1mo ago· 5 in thread

what i learned from making a webrtc+kubernetes game streaming product:

- openai is wrong. almost all of the issues they described are issues with libwebrtc, not with webrtc, kubernetes, network architecture, etc. the clue was when they said "the conventional one-port-per-session WebRTC model."

- there are no alternatives worth trying. everything else open source in the ecosystem, like pion, coturn, stunner, are too immature.

- libwebrtc is the only game in town.

- they haven't discovered libwebrtc feature flags or how it works with candidates, which directly fix a bunch of the latency issues they are discovering. a correct feature flag can instantly reduce latency for free, compared to paid twilio network traversal style solutions

- 99% of low latency voice END USERS will be in a network situation that can eliminate relays, transceivers, etc. it is totally first class on kubernetes. but you have to know something :)

this is the first time i'm experiencing gell mann amnesia with openai! look those guys are brilliant, but there is hardly anyone in the world who is doing this stuff correctly.

Sean-DerOP1mo ago

Did you use libwebrtc on the backend? When you say `libwebrtc` is the only game in town are you talking about clients or servers?

Even for clients you have things like libpeer that libwebrtc can't hit.

doctorpangloss1mo ago

yes - i used libwebrtc on the backend and, pre-LLM, patched it to work around a lot of the things i discovered that were directly related to low latency AV streaming. pion didn't exist then.

i think the challenge is that pion is an excellent product today. it would benefit me if its innovations were subsumed into libwebrtc, because eventually those innovations will show up in the iOS stack, which is one of the customers that matter to me. it is subjective if it is the MOST important customer, that is my belief and it is probably true of openai, at least until they get their own device out the door.

there can be many, many use cases though! not everything has to be, try to make the thing for 1b people that has to interact with all the most powerful and meanest businesses on the planet.

jiggawatts1mo ago

Something I noticed is that companies that are vibe-coding their products miss out on the intelligence that (still) only humans can bring to bear. Just the knowledge cutoff alone puts AI at a serious disadvantage in any rapidly changing field.

fragmede1mo ago

GPT 5.5's knowledge cutoff is August 2025. Which aspect of WebRTC has meaningfully changed since then?

chevman1mo ago

When you have hard problems with unclear optimal solutions, taking this approach of a public show & tell will often (always?) solicit lots of interesting ideas the team may have not yet considered :)

rvz1mo ago· 5 in thread

OpenAI uses Go for the networking implementation for the relays and the services, which makes a ton of sense, instead of something as immature as TypeScript / Node or whatever.

Yet another reason to not consider anything else like that for low-latency networking. Golang (or even Rust and C++) is unmatched for this use-case.

bananamogul1mo ago

"something immature as TypeScript / Node or whatever"

Node.js's initial release was May 27, 2009

Golang's initial release was November 10, 2009

They're different, yes, but it's not like

mghackerlady1mo ago

okay, sure, but one is by microsoft, the other by a 25 year old, and another by rob pike. The one by rob pike is going to be infinitely more mature and thought out than a hacky type system on JS because it isn't his first rodeo

nvarsj1mo ago

Can golang do zero copy networking nowadays? In the past golang was terrible at this kind of thing due to allocations and copies of all relayed data.

troxhi1mo ago

Even Java together with Netty supports zero copy networking... if Go misses that feature it wouldn't be very hard to implement it yourself

fragmede1mo ago

And the GC!

qrush1mo ago· 4 in thread

Am I reading this right that OpenAI is not using Livekit for WebRTC/audio anymore?

fidotron1mo ago

It does appear that way. The LiveKit server is not what you would want for this architecture anyway (as they basically say with the SFU discussion), although it does have a lot of useful stuff in the client SDKs.

fuddle1mo ago

They do link to the Livekit docs in the footnotes: https://docs.livekit.io/transport/self-hosting/kubernetes/

zuzululu1mo ago

whats wrong with livekit ?

geekin231mo ago

they are not and haven’t been from what I hear since last Jan… I also have some friends who work in real-time divisions at companies listed on their website, and they haven’t used livekit, only signed up. It’s kinda shady tbh

anzerarkin1mo ago· 3 in thread

I hate the voice ai though, it's so much dumber

brett-jackson1mo ago

I used to use it all the time until about a year ago or so. Its responses are full of filler and the safeguards are really overbearing. It often will just give wrong answers in a way that GPT-5.x does not. I once asked it why a particular celebrity was canceled and it refused to tell me because it may harm me to know what they said!

NikolaNovak1mo ago

Fwiw - I found the advanced AI voice feature to be actually detrimental. It's good if you just want a single sentence answer. I've turned it off though when I want a more detailed, structured, considered answer.

drusepth1mo ago

Interestingly, that kind of parallels the real world too: if you want a quick and high level answer, talk to someone in person; if you want something detailed and info-dense, get them to write it down.

thimabi1mo ago· 2 in thread

> Voice AI only feels natural if conversation moves at the speed of speech […] At OpenAI’s scale, that translates into three concrete requirements: Global reach for more than 900 million weekly active users

Surely the number refers to the total users of ChatGPT overall, and the fraction of those who use voice features is considerably smaller, is it not?

That’s the kind of thing that influences business decisions like knowing how much hardware and software optimization to throw at a problem.

stuartmemo1mo ago

Yeah, that's why they've used "reach" - the total number of users who could be exposed to the feature regardless of engagement.

janalsncm1mo ago

To defend them a little: voice is a little rough around the edges now, so there’s a chicken and egg problem of whether to prioritize improving voice if usage isn’t high partially because it’s clunky.

flakiness1mo ago· 2 in thread

Should I or shouldn't I be glad to see zero mention of Codex?

mock-possum1mo ago

Shouldn’t, I think - advanced voice is a surprisingly slick feature, and if you’re someone who feels that they can think and speak more naturally than when they think and type, AI voice transcription is kind of huge.

gyanchawdhary1mo ago

100% .. as a product designer/developer, i use it heavily for early feature ideation .. i’ll do a loose, exploratory back and forth on a long walk .. then pass the transcript to claude to validate and turn into a spec ..

charisma1231mo ago· 1 in thread

If a transceiver crashes during a stream, how is the active session recovered? Does the system automatically re-establish the context in a new WebRTC session?

Sean-DerOP1mo ago

It doesn't today, but you could with something like this [0]. You can save/suspend all WebRTC state and bring it back with the next process.

[0] https://github.com/pion/webrtc-zero-downtime-restart

didibus1mo ago

I wouldn't mind waiting longer for answers that would go through a better model with more thinking. As long as it has good support for interrupting and also it doesn't start answering as soon as I pause for 1 second and it's smart about knowing I'm done speaking.

amirathi1mo ago

I find OpenAI's speech-to-text model the best of the lot. It can handle my & my 5-year old daughter's Indian accent pretty well.

I wonder if they run the STT model's output through the current model (that we're chatting with) as a final pass - since the text seems to be well aligned to the current conversation context.

For long prompts, I often speak to OAI web/app and copy-paste the text to Claude / Gemini :)

shevy-java1mo ago

I don't like AI in general, and on youtube there are soooooo many horrible videos with voice AI. Having said that, I did notice AI has actually worked for some hobbyist-maintained games for the most part. Example: BG2EE (Baldur's Gate 2 Enhanced Edition). Yes, this is a forgotten game; and I actually have background music as audio rather than listen to the dialogue, save for testing it, but for the most part it worked here. So for resource-poor hobbyists, AI is actually not totally useless. For youtube I find only horribly crap examples. I don't watch any AI-involved videos (if I can spot it; so much fake on youtube these days, Google does not realise how AI is killing many old users and visitors here).

logickkk11mo ago

IMO this probably isn't just about latency. keeping people in voice gives them training data text never will. is that why they were fine going transceiver over sfu and mostly ignoring multi-party?

vjay151mo ago

This is such a good write up, WebRTC is one of the coolest things ever! It's kinda genius to use the VIP approach, SFU is also pretty scalable but now they don't even have to do that

tracyhenry1mo ago

After all these, I still feel their voice AI interrupts quite a lot, especially when I pause just for 0.5 sec. Interestingly, when I tell it to interrupt less, it seems to be better.

maxglute1mo ago

>feels natural if conversation moves at the speed of speech

As someone used to podcasts at 3x speed and SAPI text-to-speech at a much higher rate, listening to AI at human speech rate is a chore.

Saline95151mo ago

I never use the voice mode in the phone app, it's stuck with 4o for some reason. Same with Claude, that uses Haiku. Why can't they use a better model with thinking disabled?

zerop1mo ago

I have used voice mode on chatgpt, Gemini, Grok as I use it while driving. Best is from openAI. Natural conversation, smarter and meaningful replies.

furyofantares1mo ago

> Global reach for more than 900 million weekly active users

lol, definitely didn't need to know there's 900M weekly users for this post. I mean yeah, there's a lot of users and they serve globally, that's relevant. But this is just pulling out your biggest stat because you can. How many voice users you have would actually be relevant and interesting but, to baselessly speculate on motivation here, might be a number that doesn't add as much fuel to an upcoming IPO as reminding people that you're almost at a billion users does.

hnav1mo ago

RFC 9297 support can't come quick enough in browsers. Would obviate having to deal with WebRTC in a client-server scenario.

hiroakiaizawa1mo ago

Interesting. What are the main latency bottlenecks in practice?

CrzyLngPwd1mo ago

It's bad enough having to speed-read the waffle of its written answers; even when told to be concise, the thought of having to listen to it waffle on in its smarmy, sycophantic fashion makes me want to reach for the sick bag.

whateveracct1mo ago

why is the "How" included here? it is often removed

tom1IIIl1iIL1mo ago

I think it's better to join some kind of club if you want to make friends?

AIorNot1mo ago

so is the answer

WebRTC + Kubernetes

devopsengine1mo ago

Inspired

jonahs1971mo ago

Who cares? Their company is dying.

cdrnsf1mo ago

It's missing the part where they explain how they obtained the training data for their voice AI.

j / k navigate · click thread line to collapse

146 comments

108 comments · 30 top-level

legohead1mo ago· 26 in thread

zamadatix1mo ago

hun31mo ago

They are orthogonal.

ericmcer1mo ago

I think he’s saying they are doing an insane level of complexity to shave ~100ms off response times in a scenario where that isn’t important and might even be a negative

4 more replies

janalsncm1mo ago

I think the solution is to handle pauses more intelligently rather than having a higher latency protocol. With low latency you can interrupt and the bot can immediately stop rambling.

wnmurphy1mo ago

100%. I have to hold the floor by filling the space with "ummmmmmmm.... uhhhh...." which inevitably distracts me from my point altogether. Poor user experience.

1 more reply

taneq1mo ago

discordance1mo ago

Have you tried telling it to pause to let you think?

I often use it while I’m walking and tell it to not respond until I initiate a conversation.

1 more reply

throwuxiytayq1mo ago

With higher latency this would be even more of an issue. When you pause and start talking again, the model wouldn't catch that until it has already interrupted you.

dtran1mo ago

This has more to do with Voice Activity Detection (VAD) than the latency described in the article

lxgr1mo ago

That seems to be the issue: VAD is insufficient here.

Knowing when to respond requires semantic understanding, which probably only the model itself is capable enough.

Maybe it’s hard for them to train it to only respond once it seems appropriate to do so?

1 more reply

wnmurphy1mo ago

Exactly. It's a tangent, but clearly a pain point for enough users.

ehnto1mo ago

There's a really interesting project in Japanese natural language processing called J-Moshi that had a novel approach and in my opinion good results.

They tried to make it mimic the way Japanese is full of really quick acknowledgement sounds and it seems to allow it to handle those pauses and interruptions really well.

https://en.nagoya-u.ac.jp/news/articles/say-hello-to-j-moshi... (english)

https://nu-dialogue.github.io/j-moshi/ (japanese and english)

saturdaysaint1mo ago

In voice conversations I tell it not to reply at all or only say “Understood” until I use some kind of code word. Not perfect, but less intrusive.

jdironman1mo ago

Roger that, over.

futureshock1mo ago

richardw1mo ago

Hard problem. I find myself adding in filler to stop the thing from jabbering.

lxgr1mo ago

Their voice capable model is several generations behind the state of the art text-only one, as far as I know.

I don’t think it even has reasoning tokens, so it’s no surprise that it’s as most as smart as the “instant” models (i.e., not very).

asdfman1231mo ago

Fwiw you can prompt it to respond differently to you.

jameshush1mo ago

angry_octet1mo ago

It seems that tool calling shouldn't be 500ms of latency?

1 more reply

miki_oomiri1mo ago

People are migrating to the "End Of Thought" triggers. Deepgram does that wonderfully.

christophilus1mo ago

wnmurphy1mo ago

Strongly agree, some of us like to choose our words more carefully when interacting with an LLM.

I've tried to convey this to OpenAI through various available channels (dev forums, app feedback, etc.).

ericmcer1mo ago

1-2s replies feel natural and like you pointed out pausing for 2-3s mid sentence is super normal.

charcircuit1mo ago

The AI should be able to model a probability for when is a natural moment to start talking.

MagicMoonlight1mo ago

It’s possible to change the amount of time it waits if you’re using the API

Sean-DerOP1mo ago· 16 in thread

[0] https://github.com/pion/webrtc

[1] https://webrtcforthecurious.com

ericmcer1mo ago

I use pion thanks for making it!

Sean-DerOP1mo ago

Thanks for using it :)

I think It’s a case of you improve what you own. The owners of WebRTC servers were aggressively improving their part. They don’t own the inference servers.

aleda1451mo ago

Appreciate you putting the entire book online!

I still feel like there is something fun that we can do with webRTC data channels + zero copy Apache Arrow arraybuffers + duckdb WASM, but haven't figured it out yet

Sean-DerOP1mo ago

Thanks for reading it!

You can't beat Websockets :) Especially since you have so much tooling/existing stuff that works with HTTP.

I have been trying to get a website off the ground that does DataChannels + SQLite in the browser and then lets users sync between each other. I have gotten distracted so many times though.

oezi1mo ago

dtran1mo ago

Thanks for WebRTC for the Curious and for Pion! Not using the latter directly, but have used both to better understand WebRTC

ssgodderidge1mo ago

For those unfamiliar with WebRTC, the Pion FAQ page has a good description:

[0]: https://github.com/pion/webrtc/wiki/FAQ#what-is-webrtc

thatxliner1mo ago

slightly unrelated but what’s with storing the entire codebase in the root directory instead of a nested src folder? It makes getting to the README a lot more difficult

nemothekid1mo ago

That's the default for Go projects. Go import paths are repository paths, e.g.:

     import ("github.com/go-sql-driver/mysql")

so it's standard to have the library files in the root directory.

dgunay1mo ago

FWIW I usually don't structure my Go projects this way unless they're very very small. This is what I usually do for anything larger than 2-3 files:

  ├── cmd
  │   └── binary-name
  │       └── main.go (may subpackage for things like CLI porcelain, etc)
  ├── go.mod
  └── internal
      └── app.go (and subpackages, etc)

saagarjha1mo ago

I assume this is why GitHub has the annoying #readme-ov-file slug

a4564631mo ago

This is valid criticism. Go fanbois don't like listening to any Go criticism. They were all like "who needs templates in Go", and now Go has templates.

To me Go code looks like somebody vomited stuff in the root dir and I have to wade through that every time. No namespacing, nothing.

willmeyers1mo ago

WebRTC is great and so is Pion, thanks for helping make and maintain it! I loved learning about WebRTC from WebRTC for the Curious!!!

ryanar1mo ago

I used pion and it was fantastic. Most of the article seems pretty standard webrtc techniques for performant voice.

haaz1mo ago

Only a software dev would start their referencing at 0 lol

vorticalbox1mo ago

I do this too, I never made the connection.

Lucasoato1mo ago· 9 in thread

Still, it’s worth keeping in mind that these are no longer frontier models, unlike when they were released.

(Please Sam, if you read this, release the new realtime audio models)

dharma11mo ago

Yes the voice part of OpenAI realtime/voice mode is great but it’s pretty dumb compared to newer models and often gets stuck repeating itself.

Lucasoato1mo ago

Thanks, I’ll try it, even if my experience wasn’t that great with Google models lately (503s)

modeless1mo ago

Grok voice is surprisingly good, actually. It's still a dumber model than the thinking modes of frontier models, but it's less dumb than the voice modes of other providers.

artdigital1mo ago

Grok voice model is also a thinking model. I agree that it’s far better than the other voice models

Just give me an option to have a slower response but a better model…

artdigital1mo ago

This is what makes their voice mode unusable to me. I can’t stand the way 4o replies and it’s such a big jump in quality from text mode

radicality1mo ago

TomGarden1mo ago

Claude voice mode has come a long way! I'd say it's smarter than CGPT AVM last time i tried it.

sails1mo ago

ddp261mo ago

Yeah, the question in the title can be answered: "by using gpt-4o, a model 2 years behind the frontier, to serve audio responses"

Aeroi1mo ago· 5 in thread

If anyone is looking to get into this, Pipecat is a great open-source repo and community: https://github.com/pipecat-ai/pipecat

pncnmnp1mo ago

Pipecat's smart turn model is really good for VAD - https://huggingface.co/pipecat-ai/smart-turn-v3

zarldev1mo ago

Yeah Gemma4 was and is great fun to do this with - I too am building pretty much the same as yourself in Go.

https://github.com/zarldev/zarl & https://www.zarl.dev/posts/hal-by-any-other-name

AnthOlei1mo ago

What do you have going on the hardware side? I want to plug this into hass but don’t know what hardware I need for reasonable latency

pk-voice1mo ago

BoxedEmpathy1mo ago

I've been looking at this! Great project.

doctorpangloss1mo ago· 5 in thread

what i learned from making a webrtc+kubernetes game streaming product:

- there are no alternatives worth trying. everything else open source in the ecosystem, like pion, coturn, stunner, are too immature.

- libwebrtc is the only game in town.

- 99% of low latency voice END USERS will be in a network situation that can eliminate relays, transceivers, etc. it is totally first class on kubernetes. but you have to know something :)

this is the first time i'm experiencing gell mann amnesia with openai! look those guys are brilliant, but there is hardly anyone in the world who is doing this stuff correctly.

Sean-DerOP1mo ago

Did you use libwebrtc on the backend? When you say `libwebrtc` is the only game in town are you talking about clients or servers?

Even for clients you have things like libpeer that libwebrtc can't hit.

doctorpangloss1mo ago

yes - i used libwebrtc on the backend and, pre-LLM, patched it to work around a lot of the things i discovered that were directly related to low latency AV streaming. pion didn't exist then.

there can be many, many use cases though! not everything has to be an attempt to make the thing for 1b people that has to interact with all the most powerful and meanest businesses on the planet.

jiggawatts1mo ago

fragmede1mo ago

GPT 5.5's knowledge cutoff is August 2025. Which aspect of WebRTC has meaningfully changed since then?

chevman1mo ago

When you have hard problems with unclear optimal solutions, taking this approach of a public show & tell will often (always?) solicit lots of interesting ideas the team may have not yet considered :)

rvz1mo ago· 5 in thread

OpenAI uses Go for the networking implementation for the relays and the services, which makes a ton of sense, instead of something immature as TypeScript / Node or whatever.

Yet another reason to not consider anything else like that for low-latency networking. Golang (or even Rust and C++) is unmatched for this use-case.

bananamogul1mo ago

"something immature as TypeScript / Node or whatever"

Node.js's initial release was May 27, 2009

Golang 's initial release was November 10, 2009

They're different, yes, but it's not like

mghackerlady1mo ago

nvarsj1mo ago

Can golang do zero copy networking nowadays? In the past golang was terrible at this kind of thing due to allocations and copies of all relayed data.

troxhi1mo ago

Even Java together with Netty supports zero copy networking... if Go lacks that feature, it wouldn't be very hard to implement it yourself.

fragmede1mo ago

And the GC!

qrush1mo ago· 4 in thread

Am I reading this right that OpenAI is not using Livekit for WebRTC/audio anymore?

fidotron1mo ago

fuddle1mo ago

They do link to the Livekit docs in the footnotes: https://docs.livekit.io/transport/self-hosting/kubernetes/

zuzululu1mo ago

What's wrong with LiveKit?

geekin231mo ago

anzerarkin1mo ago· 3 in thread

I hate the voice ai though, it's so much dumber

brett-jackson1mo ago

NikolaNovak1mo ago

drusepth1mo ago

thimabi1mo ago· 2 in thread

Surely the number refers to the total users of ChatGPT overall, and the fraction of those who use voice features is considerably smaller, is it not?

That’s the kind of thing that influences business decisions like knowing how much hardware and software optimization to throw at a problem.

stuartmemo1mo ago

Yeah, that's why they've used "reach" - the total number of users who could be exposed to the feature regardless of engagement.

janalsncm1mo ago

flakiness1mo ago· 2 in thread

Should I or shouldn't I be glad to see zero mention on Codex.

mock-possum1mo ago

gyanchawdhary1mo ago

charisma1231mo ago· 1 in thread

If a transceiver crashes during a stream, how is the active session recovered? Does the system automatically re-establish the context in a new WebRTC session?

Sean-DerOP1mo ago

It doesn't today, but you could with something like this [0]. You can save/suspend all WebRTC state and bring it back with the next process.

[0] https://github.com/pion/webrtc-zero-downtime-restart

didibus1mo ago

amirathi1mo ago

I find OpenAI's speech-to-text model the best of the lot. It can handle my and my 5-year-old daughter's Indian accents pretty well.

I wonder if they run the STT model's output through the current model (that we're chatting with) as a final pass, since the text seems to be well aligned to the current conversation context.

For long prompts, I often speak to OAI web/app and copy-paste the text to Claude / Gemini :)

shevy-java1mo ago

logickkk11mo ago

IMO this probably isn't just about latency. keeping people in voice gives them training data text never will. is that why they were fine going transceiver over sfu and mostly ignoring multi-party?

vjay151mo ago

This is such a good write-up, WebRTC is one of the coolest things ever! It's kinda genius to use the VIP approach. SFU is also pretty scalable, but now they don't even have to do that.

tracyhenry1mo ago

After all these, I still feel their voice AI interrupts quite a lot, especially when I pause just for 0.5 sec. Interestingly, when I tell it to interrupt less, it seems to be better.

maxglute1mo ago

>feels natural if conversation moves at the speed of speech

As someone used to podcasts at 3x speed and SAPI text-to-speech at a much higher rate, listening to AI at human speech rate is a chore.

Saline95151mo ago

I never use the voice mode in the phone app, it's stuck with 4o for some reason. Same with Claude, that uses Haiku. Why can't they use a better model with thinking disabled?

zerop1mo ago

I have used voice mode on ChatGPT, Gemini, and Grok, as I use it while driving. The best is from OpenAI: natural conversation, smarter and more meaningful replies.

furyofantares1mo ago

> Global reach for more than 900 million weekly active users

hnav1mo ago

RFC 9297 support can't come quick enough in browsers. Would obviate having to deal with WebRTC in a client-server scenario.

hiroakiaizawa1mo ago

Interesting. What are the main latency bottlenecks in practice?

CrzyLngPwd1mo ago

whateveracct1mo ago

why is the "How" included here? it is often removed

tom1IIIl1iIL1mo ago