undefined | Better HN

0 pointsSean-Der1d ago0 comments

Responding to some technical points first, but then after that I do see a future that isn't WebRTC. I don't think it matches where WebTransport+WebCodecs etc is going though.

> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate

This is the opposite of the feedback I get. Users want instant responses. If you have delay in generating responses/interruptions it kills the magic. You also don't want to send faster than real-time. If the user interrupts the model you just wasted a bunch of bandwidth sending 3 minutes of audio (but only played 10 seconds)

> TTS is faster than real-time

https://research.nvidia.com/labs/adlr/personaplex/ Voice AI for the latest/aspirational is moving away from what the author describes. It is trickled in/out at 20ms

> We really hope the user’s source IP/port never changes, because we broke that functionality.

That is supported. When new IP for ufrag comes in its supported

> It takes a minimum of 8* round trips (RTT)

That's wrong. https://datatracker.ietf.org/doc/draft-hancke-webrtc-sped/

> I’d just stream audio over WebSockets

You lose stuff like AEC. You also push complexity on clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily. Lots of developers struggled with Realtime API + web sockets (lots of code and having to do stuff by hand)

----

I think if I had my choice I would pick Offer/Answer model and then doing QUIC instead of DTLS+SCTP. Maybe do RTP over QUIC? I personally don't feel strongly about the protocol itself. I don't know how to ship code to multiple clients (and customers clients) with a much large code footprint.

0 comments

kixelated1d ago

HELLO MR SEAN,

1. Of course users want lower latency, but they also want fewer instances where the LLM "misheard" them. It would be amazing to run A/B experiments on the trade-off between latency vs quality, but WebRTC makes that knob difficult to turn.

2. I'm obviously not an TTS expert, but what benefit is there to trickling out the result? The silicon doesn't care how quickly the time number increments?

3. Yeah, sometimes the client is aware when their IP changes and can do an ICE renegotiation. But often they aren't aware, and normally would rely on the server detecting the change, but that's not possible with your LB setup. It's not a big deal, just unfortunate given how many hoops you have to jump through already.

4. Okay, that draft means 7 RTTs instead of 8 RTTs? Again some can be pipelined so the real number is a bit lower. But like the real issue is the mandatory signaling server which causes a double TLS handshake just in case P2P is being used.

5. Of course WebRTC is easier for a new developer because it's a black box conferencing app. But for a large company like OpenAI, that black box starts to cause problems that really could be fixed with lower level primitives.

I absolutely think you should mess around with RTP over QUIC and would love to help. If you're worried about code size, the browser (and one day the OS) provides the QUIC library. And if you switch to something closer to MoQ, QUIC handles fragmentation, retransmissions, congestion control, etc. Your application ends up being surprisingly small.

The main shortcoming with RoQ/MoQ is that we can't implement GCC because QUIC is congestion controlled (including datagrams). We're stuck with cubic/BBR when sending from the browser for now.

Sean-DerOP18h ago

1.) Latency vs quality doesn't come up enough to make people want to A/B test it unfortunately. At work I would say ~5 people care about WebRTC vs QUIC vs X. All effort is around the models (how can I provide tools to be support those doing that work)

2.) The model isn't processing just text anymore. Also taking into account breathing/emotion etc... not just spitting out big responses anymore. As it generates them it is taking into account the users response.

3.) It works with the LB setup today. Clients are sending ICE traffic, if it roams we lookup the ufrag and route appropriately.

4.) With DTLS 1.3 it is 1 RTT with SNAP[0] for WebRTC session. SCTP info goes in Offer/Answer, DTLS is packed into ICE. You are totally right about signaling though! [1] was my answer for doing WebRTC without signaling, couldn't get anyone to care though.

5.) I don't have anything that I need to tune. If I want to increase (or decrease) latency [3] is something I put into Transceiver. Otherwise I can't think of any 'change this WebRTC behavior' that has been asked by users/developers.

[0] https://datatracker.ietf.org/doc/draft-hancke-tsvwg-snap/

[1] https://github.com/pion/offline-browser-communication

[3] https://webrtc.googlesource.com/src/+/refs/heads/main/docs/n...

pocksuppet1d ago

Latency versus reliability is a false dichotomy anyway. The alternative to WebRTC isn't to wait for the user to finish speaking before you send any of the audio. Open a websocket and send the coded audio packets as they're generated. Now you're still sending audio packets immediately, but if one is dropped, TCP retransmits it until it makes it through. If the connection is really slow, packets queue up, and the user has to wait, but it still works. You get the low latency in the best case and the robustness in the worst case.

coredog6410h ago

This is how Nova Sonic works. Having done some implementations it’s trickier than you might like (e.g. the Python library for Sonic had problems with echoes and we had to use the Java library)

vlovich1231d ago

You ultimately still need a jitter buffer large enough to absorb retransmisiones. Otherwise you’ve got stuttering audio. And dynamically adjusting this jitter buffer is hard

2 more replies

rfv67231d ago

Human spoken conversation doesn’t really work like file buffering.

People can tolerate missing words surprisingly well. If a phrase is slightly clipped, masked by noise, or dropped, the listener can often infer it from context. That happens constantly in real speech.

But pauses and stalls are much more damaging. A sudden freeze in the middle of speech breaks turn-taking, timing, and attention. It feels like the speaker stopped thinking, the connection died, or the system got stuck.

For voice UX, a tiny omission is often less harmful than a perfectly complete sentence that freezes halfway.

Aurornis19h ago

> People can tolerate missing words surprisingly well. If a phrase is slightly clipped, masked by noise, or dropped, the listener can often infer it from context. That happens constantly in real speech.

LLMs are surprisingly good at this, too.

This entire blog post is based on assumptions that

1) WebRTC garbling is common

2) LLMs fall apart if there are any audio glitches

I would bet money that OpenAI explored and has statistics on both of those and how it impacts service. More than this blogger heaping snark upon snark to avoid having a realistic conversation about pros and cons

cduzz19h ago

I think this is mixing domains quite a bit;

If I'm talking to a friend or peer and I'm on a crappy link, we can probably work it out. If I'm calling my lawyer from prison with my "one call" I really want my lawyer to get my instructions clearly and correctly, ideally the first time without a lot of coaching.

Where on this scale does "person talking to LLM" fit?

I believe there's a ton of research into the shannon limit and human speech. You can trivially observe how much redundancy there is by listening to a podcast at 1x, 1.2x, 1.5x, 2x, etc, and when you can't follow what's going on, you've found the "redundancy" built into that language. This number falls way off when you're listening to a person with an accent or when the recording is noisy or whatever.

You'll also find that your tolerance for lossy media is radically different based on latency and echos and jitter in the audio (which I believe is the point of the original "don't use webrtc" article...)

Finally, people may tolerate this, but the "phonem to token" thinger may be less tolerant, and will certainly not be able to magic correct meaning from lost packets, and if the resulting exchange is extremely expensive or important (from the lawyer and the "I'm in jail in poughkeepsie; I need bail!" exchange) you really want to take the time to get it right, not make things guess.

8eye20h ago

The misunderstanding the user comes down to understanding how the user prompts and what type of responses the user gets in return. I’m wondering if for anything code the llm could have an interrupter that it would first read what the user wrote and translate it to proper sentence structure and the return a truer value. I think the llm is having an understanding issue because everyone has a unique signature in how they explain something. That signature operates like a personal language of the user as to which most of us will run through different scenarios to come up with a conclusion from our personal signature/language in which we conduct ourselves. And since llms are gamed to get to the answer faster using less tokens it probably picks the average high level signature that can be used for multiple users.

lelanthran1d ago

> > …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate

> This is the opposite of the feedback I get. Users want instant responses.

I am skeptical that you are getting feedback that users prefer instant wrong results to 200ms-lag correct results.

Deeply skeptical!

Aurornis19h ago

Deeply false dichotomy!

The blog post glosses over the details and implies that 200ms of latency would be a magic solution. They do admit that WebRTC already has provisions for up to 200ms, so I guess they’re really implying that 400ms would be the happy case path for their alternative buffering, which is starting to get in the range where users would probably be annoyed.

Have you tried having conversational speech over a link with almost half a second of delay? It’s bad. You have to work hard to establish a turn taking routine with the other party and do extra mental work to identify your slot to talk.

The other half of this problem requires acknowledging that LLMs are actually pretty decent at interpreting input with gaps. You can drop words or even letters from LLM input and still get surprisingly decent results back. This post acts like a dropped packet means your response is going to send the LLM off on a wrong response or something.

regularfry23h ago

Oh, I can absolutely believe it. Humans are deeply irrational, especially about things that mess about in time frames too short for our conscious thought processes to kick in. Instant but confident sounding (and confident sounding because it's instant) will beat slower every time. You don't know which is correct until a long time after you've made a decision to trust it, or whether you like it.

lelanthran23h ago

> Instant but confident sounding (and confident sounding because it's instant) will beat slower every time.

Sure, but I am skeptical that users are actually saying "I prefer wrong answers over lag", which is what the post I responded to implied.

This is different to user's saying "I prefer quick answers to laggy answers", which is what I presume they may have said.

To actually settle this, the feedback must answer the question "Do you want wrong answers quickly or correct answers with an added 0.2 second delay?" because, well, those are the only two options right now.

2 more replies

mft_1d ago

100% agree. Sounds like they're either asking the wrong questions, or quoting answers selectively to suit this argument.

sporkland19h ago

I think as a user I have 2 modes: 1. Q&A mode where it's basically Google search by voice. 2. I'm trying to process an idea I have with an LLM buddy.

My desires are pretty different in the two scenarios. Q&A mode if it's not quick to respond I'll think something is wrong with my phone.

Deep think mode I'm honestly kind of pissed off at how fast it tries to respond. I want it to slow down and give me a chance to process and use extra compute on its side (including newer models) so it doesn't just spew low thought bullshit at me.

It seems like the system could detect which of these two modes was happening and adapt, including protocol.

I haven't tried the voice mode since the new model updates, maybe it's gotten better.

Counter to everything I just said though and germain to the topic at hand, when I'm in q&a mode that's probably the worst time for it to drop audio as it changes the query significantly. vs when I'm talking at it for 2 minutes it could probably throw half away.

moffkalast1d ago

Especially when 200ms is the rule of thumb for things still feeling "instant" to users in terms of UX, this is like a rounding error in terms of latency when I regularly wait for actual minutes for an LLM to finish its bloody thinking and have to refresh through several "we're experiencing heavy load" errors.

jasonlotito20h ago

> I am skeptical that you are getting feedback that users prefer instant wrong results to 200ms-lag correct results.

You are skeptical that people would prefer instant responses with 99.99% accuracy to waiting noticeably longer for a higher-accuracy rate?

The Internet, over its entire history, suggests otherwise.

lelanthran17h ago

Who claimed 99.99% accuracy?

A single dropped or missed word in a sentence can reverse the meaning.

I am skeptical that people would rather have wrong answers than lag. I am not claiming what the percentage is and neither are you, because no one measured it at the low lag.

shunia_huang1d ago

I would be punching my phone if the stupid network causing a wrong prompt and the LLM sends me unrelated answers. Correctness should be foundational no matter what, then improve the latency as best as possible. We all understand that if the network is bad then the latency can not be guaranteed but correctness should be.

toast01d ago

> . You also push complexity on clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily.

WebRTC is complex, even if it's a library (even if it's a library built into the browser they're already using). For a client/server voice interaction, I don't see why you would willingly use it. Ship voice samples over something else; maybe borrow some jitter buffer logic for playback.

My job currently involves voice and video conferencing and 1:1 calls, and WebRTC is so much complexity... it got our product going quickly, but when it does unreasonable things, it's a challenge to fix it; even though we fork it for our clients.

I could write an enormous rant about TURN [1]. But all of the webrtc protocol suite is designed for an internet that doesn't exist.

[1] Turn should allocate a rendesvous id rather than an ephemeral port when the turn client requests an allocation. Then their peer would connect to the turn server on the service port and request a connection to the rendesvous id, without needing the client to know the peer address and add a permission. It would require less communication to get to an end to end relayed connection. Advanced clusters could encode stuff in the id so the client and peer could each contact a turn server local to them and the servers could hook things up; less advanced clusters would need to share the turn server ip and service port(s) with the id.

fidotron19h ago

> I could write an enormous rant about TURN [1]. But all of the webrtc protocol suite is designed for an internet that doesn't exist.

This is closer to being the real problem with WebRTC than the whole "it's making decisions about latency that I disagree with".

If you had a way to setup the tracks/channels over UDP connections that didn't involve P2P/STUN/TURN etc. but got to keep all the codec negotiation and things like AEC that would be awesome. MoQ isn't that though, because it's by people that don't actually see the whole problem end-to-end; just their little piece of it in the middle.

adgjlsfhk11d ago

> You also don't want to send faster than real-time. If the user interrupts the model you just wasted a bunch of bandwidth sending 3 minutes of audio (but only played 10 seconds)

You only need to send ~1 second at a time. There's no reason to send 20ms or 10 min at a time. Both are stupid.

sbrother1d ago

> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate

I disagree with this SO strongly. I find the conversational voice mode to be a game changer because you can actually have an almost normal conversation with it. I'd be thrilled if they could shave off another 50-100ms of latency, and I might stop using it if they added 200ms. If I want deep research I'll use text and carefully compose my prompt; when I'm out and about I want to have a conversation with the Star Trek computer.

Interestingly I'm involved with a related effort at a different tech company and when I voiced this opinion it was clear that there was plenty of disagreement. This still surprises me since it seems so obvious to me that conversational fluidity is the number one most important feature.

kixelated1d ago

To clarify, I meant waiting an extra 200ms if the alternative was dropping part of the prompt. During periods of zero congestion, the latency would be the same.

bredren1d ago

It is very important, the low latency.

I prompt orchestrations most of the day, and am very particular about the fidelity of my context stack.

Yet I’ve used advanced voice mode on ChatGPT via the iOS app a lot. And I have not had a problem with it understanding my requests or my side of the conversation.

I have looked at the dictation of my side and seen it has blatant mistakes, but I think the models have overcome that the same way they do conference audio stt transcripts.

I have had times where the ~sandbox of those conversations and their far more limited ability to build useful corpus of context via web searches or by accessing prior conversation content.

The biggest problem I have had with adv voice was when I accidentally set the personality to some kind of non emotional setting. (The current config seems much more nuanced)

The AI who normally speaks with relative warmth and easy going nature turned into an emotionless and detached entity.

It was unable to explain why it was acting this way. I suspect the low latency did disservice there because when it is paired with something adversarial it was deeply troubling.

latexr1d ago

> when I'm out and about I want to have a conversation with the Star Trek computer.

But you’re not. And you won’t. You’ll never have a conversation with the Star Trek computer while you continue to place anything else above accuracy. Every time I see someone comparing LLMs to the Star Trek computers, it seems to be someone who doesn’t understand that correctness was their most important feature. I’m starting to get the feeling people making that comparison never actually watched or understood Star Trek.

A computer which gives you constant bullshit is something only the lowest of the Ferengi would try to sell.

> This still surprises me since it seems so obvious to me that conversational fluidity is the number one most important feature.

It’s not. It absolutely is not and will never be. Not unless all you’re looking for is affirmation, companionship, titillation. I suggest looking for that outside chat bots.

regularfry23h ago

Delivery of first phoneme and delivery of the important information don't have to be coupled. Politicians on TV get very good at this particular trick, they've got a set of stock phrases which basically fill time while their brain gets in gear. We just need something to fill the gap so our System 1 doesn't lose confidence in the interaction.

HarHarVeryFunny20h ago

So you could just locally generate the "You're absolutely right! ..." prefix without even waiting for the response to stream in!

651018h ago

Do speech to text on the client and send the text/subtitles along with the audio.

If the connection is truly bad, upload your voice and quantify emotional payload.

2 more replies

DaleCurtis17h ago

FWIW, the getUserMedia() portion of such a setup remains the same, so you don't lose AEC or anything else coupled there.

croes1d ago

> This is the opposite of the feedback I get. Users want instant responses.

Did they really say they prefer fast response over accurate repsonse?

dolmen19h ago

This is assuming the LLM can produce an accurate response.

croes17h ago

Unlikely if the task gets inaccurately transmitted.

j / k navigate · click thread line to collapse

0 comments

kixelated1d ago

HELLO MR SEAN,

2. I'm obviously not an TTS expert, but what benefit is there to trickling out the result? The silicon doesn't care how quickly the time number increments?

The main shortcoming with RoQ/MoQ is that we can't implement GCC because QUIC is congestion controlled (including datagrams). We're stuck with cubic/BBR when sending from the browser for now.

Sean-DerOP18h ago

3.) It works with the LB setup today. Clients are sending ICE traffic, if it roams we lookup the ufrag and route appropriately.

[0] https://datatracker.ietf.org/doc/draft-hancke-tsvwg-snap/

[1] https://github.com/pion/offline-browser-communication

[3] https://webrtc.googlesource.com/src/+/refs/heads/main/docs/n...

pocksuppet1d ago

coredog6410h ago

This is how Nova Sonic works. Having done some implementations it’s trickier than you might like (e.g. the Python library for Sonic had problems with echoes and we had to use the Java library)

vlovich1231d ago

You ultimately still need a jitter buffer large enough to absorb retransmisiones. Otherwise you’ve got stuttering audio. And dynamically adjusting this jitter buffer is hard

2 more replies

rfv67231d ago

Human spoken conversation doesn’t really work like file buffering.

For voice UX, a tiny omission is often less harmful than a perfectly complete sentence that freezes halfway.

Aurornis19h ago

LLMs are surprisingly good at this, too.

This entire blog post is based on assumptions that

1) WebRTC garbling is common

2) LLMs fall apart if there are any audio glitches

cduzz19h ago

I think this is mixing domains quite a bit;

Where on this scale does "person talking to LLM" fit?

8eye20h ago

lelanthran1d ago

> > …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate

> This is the opposite of the feedback I get. Users want instant responses.

I am skeptical that you are getting feedback that users prefer instant wrong results to 200ms-lag correct results.

Deeply skeptical!

Aurornis19h ago

Deeply false dichotomy!

regularfry23h ago

lelanthran23h ago

> Instant but confident sounding (and confident sounding because it's instant) will beat slower every time.

Sure, but I am skeptical that users are actually saying "I prefer wrong answers over lag", which is what the post I responded to implied.

This is different to user's saying "I prefer quick answers to laggy answers", which is what I presume they may have said.

2 more replies

mft_1d ago

100% agree. Sounds like they're either asking the wrong questions, or quoting answers selectively to suit this argument.

sporkland19h ago

I think as a user I have 2 modes: 1. Q&A mode where it's basically Google search by voice. 2. I'm trying to process an idea I have with an LLM buddy.

My desires are pretty different in the two scenarios. Q&A mode if it's not quick to respond I'll think something is wrong with my phone.

It seems like the system could detect which of these two modes was happening and adapt, including protocol.

I haven't tried the voice mode since the new model updates, maybe it's gotten better.

moffkalast1d ago

jasonlotito20h ago

> I am skeptical that you are getting feedback that users prefer instant wrong results to 200ms-lag correct results.

You are skeptical that people would prefer instant responses with 99.99% accuracy to waiting noticeably longer for a higher-accuracy rate?

The Internet, over its entire history, suggests otherwise.

lelanthran17h ago

Who claimed 99.99% accuracy?

A single dropped or missed word in a sentence can reverse the meaning.

I am skeptical that people would rather have wrong answers than lag. I am not claiming what the percentage is and neither are you, because no one measured it at the low lag.

shunia_huang1d ago

toast01d ago

> . You also push complexity on clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily.

I could write an enormous rant about TURN [1]. But all of the webrtc protocol suite is designed for an internet that doesn't exist.

fidotron19h ago

> I could write an enormous rant about TURN [1]. But all of the webrtc protocol suite is designed for an internet that doesn't exist.

This is closer to being the real problem with WebRTC than the whole "it's making decisions about latency that I disagree with".

adgjlsfhk11d ago

> You also don't want to send faster than real-time. If the user interrupts the model you just wasted a bunch of bandwidth sending 3 minutes of audio (but only played 10 seconds)

You only need to send ~1 second at a time. There's no reason to send 20ms or 10 min at a time. Both are stupid.

sbrother1d ago

> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate

kixelated1d ago

To clarify, I meant waiting an extra 200ms if the alternative was dropping part of the prompt. During periods of zero congestion, the latency would be the same.

bredren1d ago

It is very important, the low latency.

I prompt orchestrations most of the day, and am very particular about the fidelity of my context stack.

Yet I’ve used advanced voice mode on ChatGPT via the iOS app a lot. And I have not had a problem with it understanding my requests or my side of the conversation.

I have looked at the dictation of my side and seen it has blatant mistakes, but I think the models have overcome that the same way they do conference audio stt transcripts.

I have had times where the ~sandbox of those conversations and their far more limited ability to build useful corpus of context via web searches or by accessing prior conversation content.

The biggest problem I have had with adv voice was when I accidentally set the personality to some kind of non emotional setting. (The current config seems much more nuanced)

The AI who normally speaks with relative warmth and easy going nature turned into an emotionless and detached entity.

It was unable to explain why it was acting this way. I suspect the low latency did disservice there because when it is paired with something adversarial it was deeply troubling.

latexr1d ago

> when I'm out and about I want to have a conversation with the Star Trek computer.

A computer which gives you constant bullshit is something only the lowest of the Ferengi would try to sell.

> This still surprises me since it seems so obvious to me that conversational fluidity is the number one most important feature.

It’s not. It absolutely is not and will never be. Not unless all you’re looking for is affirmation, companionship, titillation. I suggest looking for that outside chat bots.

regularfry23h ago

HarHarVeryFunny20h ago

So you could just locally generate the "You're absolutely right! ..." prefix without even waiting for the response to stream in!

651018h ago

Do speech to text on the client and send the text/subtitles along with the audio.

If the connection is truly bad, upload your voice and quantify emotional payload.

2 more replies

DaleCurtis17h ago

FWIW, the getUserMedia() portion of such a setup remains the same, so you don't lose AEC or anything else coupled there.

croes1d ago

> This is the opposite of the feedback I get. Users want instant responses.

Did they really say they prefer fast response over accurate repsonse?

dolmen19h ago

This is assuming the LLM can produce an accurate response.

croes17h ago

Unlikely if the task gets inaccurately transmitted.

j / k navigate · click thread line to collapse