> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate
This is the opposite of the feedback I get. Users want instant responses; any delay in generating responses or handling interruptions kills the magic. You also don't want to send faster than real time: if the user interrupts the model, you've just wasted a bunch of bandwidth sending 3 minutes of audio of which only 10 seconds ever played.
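Concretely, a paced sender only ever has a frame or two in flight, so a barge-in costs almost nothing. A minimal sketch, where `frames`, `send`, and `interrupted` are hypothetical stand-ins for the surrounding pipeline:

```typescript
// Pace TTS output at real time: emit one 20ms frame per 20ms of wall
// clock, so a user interruption (barge-in) stops the stream having
// wasted at most a frame or two of bandwidth.
// `frames`, `send`, and `interrupted` are hypothetical stand-ins.
async function paceAudio(
  frames: Uint8Array[],             // pre-encoded 20ms audio frames
  send: (frame: Uint8Array) => void,
  interrupted: () => boolean,       // true once the user barges in
  frameMs = 20,
): Promise<void> {
  const start = performance.now();
  for (let i = 0; i < frames.length; i++) {
    if (interrupted()) return;      // stop immediately; nothing queued remotely
    // Sleep until this frame's real-time deadline.
    const deadline = start + i * frameMs;
    const wait = deadline - performance.now();
    if (wait > 0) await new Promise((r) => setTimeout(r, wait));
    send(frames[i]);
  }
}
```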
> TTS is faster than real-time
https://research.nvidia.com/labs/adlr/personaplex/ The latest/aspirational voice AI is moving away from what the author describes: audio is trickled in and out in 20ms frames.
> We really hope the user’s source IP/port never changes, because we broke that functionality.
That is supported. When packets arrive from a new IP for the same ufrag, it's handled.
> It takes a minimum of 8* round trips (RTT)
That's wrong. https://datatracker.ietf.org/doc/draft-hancke-webrtc-sped/
> I’d just stream audio over WebSockets
You lose stuff like AEC. You also push complexity onto clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily; lots of developers struggled with the Realtime API over WebSockets (lots of code, and having to do everything by hand).
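For context, the onboarding path being described is roughly this. The `/session` endpoint is a hypothetical signaling URL, but the shape (createOffer, POST the SDP over HTTPS, setRemoteDescription) matches how HTTP-based offer/answer APIs work:

```typescript
// Minimal WebRTC client: capture mic, do one offer/answer round trip
// over plain HTTPS, and play whatever audio track comes back.
// "/session" is a hypothetical signaling endpoint, not a real API.
async function connect(): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((track) => pc.addTrack(track, mic));

  // Play remote audio as soon as it arrives.
  pc.ontrack = (e) => {
    const audio = new Audio();
    audio.srcObject = e.streams[0];
    audio.play();
  };

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // One HTTPS round trip carries the SDP answer back.
  const res = await fetch("/session", {
    method: "POST",
    headers: { "Content-Type": "application/sdp" },
    body: offer.sdp,
  });
  await pc.setRemoteDescription({ type: "answer", sdp: await res.text() });
  return pc;
}
```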
----
I think if I had my choice I would pick the offer/answer model and then do QUIC instead of DTLS+SCTP. Maybe RTP over QUIC? I personally don't feel strongly about the protocol itself. I don't know how to ship code to multiple clients (and customers' clients) with a much larger code footprint.
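For what it's worth, the closest browser-available approximation of "offer/answer semantics, QUIC underneath" today is probably WebTransport. A sketch of pushing audio frames as QUIC datagrams (the URL is hypothetical, and since datagrams are unreliable and unordered you still need your own sequence numbers, i.e. the job RTP does in WebRTC):

```typescript
// Sketch: ship 20ms audio frames as QUIC datagrams over WebTransport.
// The URL is hypothetical; datagrams are unreliable and unordered,
// so we prepend a sequence number the receiver can use to reorder
// or drop late frames.
async function streamOverQuic(frames: AsyncIterable<Uint8Array>) {
  const wt = new WebTransport("https://example.com/audio"); // hypothetical
  await wt.ready;
  const writer = wt.datagrams.writable.getWriter();

  let seq = 0;
  for await (const frame of frames) {
    const packet = new Uint8Array(4 + frame.length);
    new DataView(packet.buffer).setUint32(0, seq++); // 32-bit seq header
    packet.set(frame, 4);
    await writer.write(packet);
  }
  await writer.close();
  wt.close();
}
```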
The answer came back over the same connection.
In the case of OpenAI, they can't exactly keep a persistent connection open like Alexa does, but they can use HTTP/2 from the phone, and both iOS and Android will pretty much take care of that connection magically.
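As a sketch of what that looks like client-side: a streaming upload rides whatever pooled HTTP/2 connection the platform already keeps warm, which is the "magic" in question. The endpoint is hypothetical, and streaming request bodies need `duplex: "half"` plus HTTP/2 or later (currently Chromium-only in browsers):

```typescript
// Sketch: stream mic audio up a single HTTP/2 request; the answer can
// stream back on the same connection. Endpoint is hypothetical.
async function uploadAudio(chunks: AsyncIterable<Uint8Array>): Promise<Response> {
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const chunk of chunks) controller.enqueue(chunk);
      controller.close();
    },
  });
  return fetch("https://example.com/voice", {
    method: "POST",
    body,
    duplex: "half", // required for streaming uploads
  } as RequestInit & { duplex: "half" });
}
```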
The author is absolutely right, a real-time protocol isn't necessary. It's more important to get all the data. The user won't even notice a delay until you get over 500ms. Especially in the age of mobile phones, where most people are used to their real-time human-to-human communications having a delay.
(If you work at OpenAI or Anthropic, give me a shout, I'm happy to get into more details with you)
Maybe it's a comprehension issue on my end, but he seems to treat things like STUN and DTLS as related, compounding issues (particularly in round-trip time), when they are really orthogonal.
Also, he spends too much time talking about how you can't resend packets, and reiterates that point by stating they tried really hard (at Discord?). That's where he lost the plot, imo.
The RTC in WebRTC is about real-time communication. Humans will naturally prefer the auditory experience of an occasional dropped packet over backed-up audio or audio that plays at an uneven rate. To clarify, I'm talking about human speech here.
If you can't tolerate packet loss, use a protocol based on TCP instead of UDP. But you know what happens when you send audio over poor network conditions with TCP? There will be pauses on the receiving end as it waits for the next in-order packet. Let's say the delay is multiple seconds. What should the receiving end do when packets start flowing again? Play the backed-up audio at its natural clock rate? Attempt to play the audio back at a higher rate to "catch up" with any other channels? People do not generally prefer either experience.
Forget about WebRTC for a minute and instead think about TCP vs UDP for voice. VoIP has been based on UDP since the '90s for a reason.
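The UDP alternative is roughly a jitter buffer that plays what arrived on time and substitutes silence for what didn't. A toy sketch (the frame size and the drop-late policy are my own simplifications):

```typescript
// Toy jitter buffer: play each 20ms slot on schedule. If the packet
// for that slot hasn't arrived, emit silence and move on; if it shows
// up later, it's discarded as late. This is the UDP tradeoff: steady
// playback at the cost of the occasional dropped frame.
class JitterBuffer {
  private slots = new Map<number, Uint8Array>(); // seq -> frame
  private next = 0;                              // next seq to play
  private readonly silence: Uint8Array;

  constructor(frameBytes: number) {
    this.silence = new Uint8Array(frameBytes);   // PCM zeros
  }

  push(seq: number, frame: Uint8Array): void {
    if (seq >= this.next) this.slots.set(seq, frame); // drop late packets
  }

  // Called once per 20ms tick by the playout clock.
  pop(): Uint8Array {
    const frame = this.slots.get(this.next) ?? this.silence;
    this.slots.delete(this.next);
    this.next++;
    return frame;
  }
}
```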
I felt that comment in my bones. Why would anyone possibly need to know the actual presentation timestamp and how it corresponds to actual real time? Evidently, no one working on WebRTC has had to synchronise data streams from varying sources with millisecond accuracy before.
I was doing a demo of video stabilisation using a webcam and an IMU module in the browser. It turns out the latencies of video->rtc->browser and sensor->websocket->browser are wildly different and not constant. The obvious solution would be to send UTC timestamps with the sensor data and synchronise in the browser. Not possible: the video has no UTC timestamp reference. When you control both sides of the WebRTC pipe, you can do fun things like send the UTC timestamp of the start of the stream, but this won't solve browser jitter. It worked well enough for a POC, but the entire solution had to be reengineered.
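For the curious, that workaround is about this much code. `streamStartUtcMs` is whatever anchor you ship over your own signaling channel (hypothetical), and as noted it does nothing about playout jitter:

```typescript
// Sketch of the "send the UTC timestamp of the start of the stream"
// workaround. streamStartUtcMs arrives over your own signaling
// channel (hypothetical); mediaTimeSec is the media element's
// currentTime. This aligns streams to within network/clock skew but
// does nothing about browser playout jitter.
function mediaFrameUtcMs(streamStartUtcMs: number, mediaTimeSec: number): number {
  return streamStartUtcMs + mediaTimeSec * 1000;
}

// Map an IMU sample (already UTC-stamped) onto the video's timeline.
function sensorToMediaTimeSec(sensorUtcMs: number, streamStartUtcMs: number): number {
  return (sensorUtcMs - streamStartUtcMs) / 1000;
}
```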
Also, networking is inherently stateful. NAT traversal, jitter buffers, congestion control, packet loss, codec state, encryption, and session routing do not disappear because you put audio over TCP or WebSocket. Pretending otherwise is not architectural clarity. It is just moving the complexity somewhere less visible.
If you want real time, that's what you're going to deal with. If you don't want real time and instead imagine everything as STT -> Prompt -> TTS, then maybe you shouldn't even be sending audio on the wire at all.
Most of the glitches I heard with OpenAI's Voice were not WebRTC-related. To my ear, they sounded more like realtime issues with their inference, which is a very different component to optimize.
I never would have imagined that OpenAI is sending the full audio of a request to their servers. I had always assumed the audio was transcribed locally and then sent to the server.
The only reason I can think they'd want the full audio is for later model training, which, OK, fair enough, but that can still likely be done without the limitations of WebRTC.
I've experienced super deranged behavior out of 1-800-CHATGPT too. When I was just bored and called to ask how she's doing and what her day was like, she spiraled into laughing maniacally. It was unsettling. That was just before the service became unreliable, so I'm really curious what changed about the architecture.
This blog was super insightful for understanding the root problems in the current implementation, though.
Had a nice chuckle.
webrtc is a bad protocol, without a doubt. I do like websockets as an easy alternative, but you do need to reinvent decent portions of webrtc as a result
I like the idea of MoQ but it's not widely used. probably worth experimenting with, especially as video enters the chat
> and then a GPU pretends to talk to you via text-to-speech
OpenAI's voice mode is speech-to-speech; there is no TTS step.
> It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection
signalling can be done long ahead of time, though I don't see this mentioned in the OpenAI blog. I also saw some new webrtc extensions that should reduce setup time further
ultimately though, it comes down to
> It’s not like LLMs are particularly responsive anyway
I expect to see a shift in how S2S models work toward lower latency, like the new voice API models that OpenAI announced
to be fair, the new models were released the day after this MoQ blog was published
Having just had to tackle this again for my own startup, I'm reminded of what you would lose by ditching WebRTC: the audio DSP pipeline, transmit-side VAD, echo cancellation, noise suppression, NAT traversal maturity, codec integration, browser ubiquity, etc.
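Much of the capture-side DSP on that list is literally a constraint you switch on. A minimal sketch using standard MediaTrackConstraints:

```typescript
// Echo cancellation, noise suppression, and auto gain are standard
// MediaTrackConstraints: the browser's WebRTC media stack does the
// DSP for you. Stream raw audio over a WebSocket instead and you
// rebuild (or ship natively) each of these yourself.
async function captureMic(): Promise<MediaStream> {
  return navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
    },
  });
}
```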
IMO, tech standards should be simple and minimal and people should be able to implement whatever they want on top. I tend to stay away from complex web standards.