Where's the future? A likely candidate may come out of the "Media over QUIC" work in the IETF, which already has several straw man protocols in real world use (Meta's RUSH, Twitch's WARP). It'll be a few more years before we see a real successor, but whatever it is will likely be able to supersede both WebRTC and HLS/DASH where QUIC and/or WebTransport is available.
> It's incredibly complex as a specification
What is complex about it? I can go and read the IETF drafts, webrtcforthecurious.com, https://github.com/adalkiran/webrtc-nuts-and-bolts and multiple implementations.
QUIC/WebTransport seems simple because it doesn't address all the things WebRTC does.
> has limitations and numerous issues that set limits in how scalable it can be
https://phenixrts.com/en-us/ does 500k viewers. I don't think anything about WebRTC makes it unscalable.
-----
IMO the future is WebRTC.
* Diverse users make the ecosystem rich. WebRTC supports conferencing, embedded devices, P2P/NAT traversal, remote control... Every group of users has made the ecosystem a little better.
* Client code is minimal. Most users just need to exchange Session Descriptions and they're done, with additional APIs available if you need to change behaviors. Other streaming protocols expect you to put lots of code client-side, which is a pretty big burden if you want to target lots of platforms.
* Lots of implementations: C, C++, Python, Go, TypeScript
* The new thing needs to be substantially better. I don't know exactly how much better, but it isn't enough to just be a little better than WebRTC to replace it.
Partially agree here, but the design of QUIC(/WebTransport/TCPLS) make some of the features in WebRTC unnecessary:
1. No need for STUN/TURN/ICE. With QUIC you can have the NATed party make an outbound request to a non-NATed party, then use QUIC channels to send/receive RTP between the sender and receiver.
2. QUIC comes with encryption so you don't need to mess with DTLS/SRTP
3. Scaling QUIC channels is much more similar to scaling a stateless service than scaling something heavily stateful like a videobridge and should be easier to manage with modern orchestration tools.
4. For simple, 1:1 cases, QUIC needs a lot less signaling overhead than a WebRTC implementation. For other VC configurations, a streaming layer on QUIC will probably need to implement some form of signaling and will end up looking just like WebRTC signaling.
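To make point 1 concrete, here's a minimal sketch of the dial-out pattern: the NATed party opens the connection outbound, and media then flows both ways over that one path. Plain asyncio TCP streams stand in for QUIC streams here (a real implementation would use an actual QUIC library), and the addresses and payloads are made up for illustration:

```python
import asyncio

async def handle(reader, writer):
    # Non-NATed side: wait for the NATed peer to dial in, then push
    # media back down the same connection it opened.
    await reader.readline()                       # peer announces itself
    writer.write(b"media-chunk-from-server\n")    # placeholder "media"
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    # "Non-NATed" listener on an ephemeral port.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # "NATed" peer: one outbound connection, no STUN/TURN/ICE needed,
    # because the NAT already permits the reply traffic on this flow.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"hello-from-natted-peer\n")
    await writer.drain()
    chunk = await reader.readline()               # media received inbound
    writer.close()
    await writer.wait_closed()

    server.close()
    await server.wait_closed()
    return chunk

if __name__ == "__main__":
    print(asyncio.run(main()))
```

The point isn't the transport (TCP here, QUIC in practice); it's that a single outbound flow from behind the NAT is enough to carry traffic in both directions.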
---
I just wish WebRTC weren't so prescriptive about DTLS/SRTP. I'm often fiddling around with VC and video feeds on private networks (for example over IPSec, or an encrypted VPN like Zerotier), and having to opt into the whole CA system there makes it a bit of a pain. There's also the fact that having the browser read from a video or voice source isn't always very low-latency, even when the DTLS/SRTP traffic is moving as fast as the network allows, which leads to higher glass-to-glass latency (though there are non-browser ways to use WebRTC, and many language frameworks, as you indicated).
All-in-all small complaints for a good technology stack though.
1) Lack of client-side buffering. This is a benefit in real-time communication, but it limits your maximum bitrate to your maximum download speed. It's also incredibly sensitive to network blips.
2) Extremely expensive. To keep bitrate down, video codecs only send key frames every so often. When a new client starts consuming a video stream they need to notify the sender that a new key frame is needed. For a video call, this is fine because the sender is already transcoding their stream so inserting a key frame isn’t a big deal. For a static video, needing to transcode the entire thing in real time with dynamic key frames is expensive and unnecessary.
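The join-time cost of sparse key frames can be sketched with back-of-the-envelope arithmetic (the intervals below are illustrative assumptions, not measurements from any real system):

```python
# A decoder can only start rendering at a key frame (a full reference
# picture), so a viewer joining at a random moment waits for the next one.

def worst_case_join_wait(keyframe_interval_s: float) -> float:
    """Worst case: the viewer joins just after a key frame and must
    wait one full key frame interval for the next one."""
    return keyframe_interval_s

def average_join_wait(keyframe_interval_s: float) -> float:
    """On average, a random join point lands halfway into the interval."""
    return keyframe_interval_s / 2

# A conferencing stream might send key frames every ~2s (cheap, since the
# sender is encoding live anyway); a pre-encoded static video might use a
# 6s+ interval to save bits, which is why joining it "live" is slow
# without expensive re-encoding.
print(worst_case_join_wait(2.0))   # 2.0
print(worst_case_join_wait(6.0))   # 6.0
```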
There are some webrtc solutions for getting those streams into Home Assistant with low latency, but they are... I don't know the word. They aren't difficult to set up, because the instructions are very simple; however, they don't work when I follow them and, from reading forums, that's not uncommon. I have _no_ idea why they don't work.
I don't really understand why I can't spin up a docker container that will take my rtmp streams and convert them to webrtc then hook that into home assistant.
I've gathered that webrtc just doesn't work that way but why can't it?
https://phenixrts.com/en-us/faqs.html
> The scalability of Phenix’s platform does not come from the protocol itself, but from the systems built and deployed to accept WebRTC connections and deliver content through them. Our platform is built to scale out horizontally. In order to serve millions of concurrent users subscribing to the same stream in a short period of time, resources need to be provisioned timely or be available upfront.
> With WebRTC, you can add real-time communication capabilities to your application that works on top of an open standard. It supports video, voice, and generic data to be sent between peers...
It's worth noting that QUIC is also a very complex specification and is only going to get more complex as it continues through the standardization process. In parallel, there's ongoing work on the next generation of the WebRTC spec. [2] (WebRTC-NV also adds complexity. Nothing ever gets simpler.)
My guess is that we're at least three years away from being able to use anything other than HLS and WebRTC in production. And -- pessimistically, because I've worked on video for a long time and seen over and over that new stuff always takes _forever_ to bake and get adoption -- maybe that's going to be more like 10 years.
[1] https://github.com/mengelbart/rtp-over-quic-draft [2] https://www.w3.org/TR/webrtc-nv-use-cases/
QUIC and WebTransport can definitely already do DASH/HLS without some of the protocol complexity by using QUIC streams (but to use QUIC's underlying features, DASH/HLS need to change as well).
Some of us wrote a position statement in 2017, see https://datatracker.ietf.org/doc/html/draft-rtpfolks-quic-rt.... There are new documents around media ingest being proposed currently.
Live streaming was a motivating example for both of those, as you can tell from the video. And both of them grew out of our efforts to make WebRTC better for live streaming.
The "future" is going to be a goddamned UDP socket sending compressed media streams across the web. We've reached peak abstraction. We need to come back to first principles, instead of piling on more crap on-top of the browser.
I am a bit skeptical about "down to about 1 second" being achievable with DASH or LL-HLS reliably. Of course, I could be wrong. And a lot depends on the definitions of "about" and "reliably," as well as your user cohort (where they are in the world, etc). :-)
The reason I didn't write much about DASH is that the basic concepts are the same for both DASH and HLS. And my sense for the last couple of years has been that most of the momentum in the ecosystem was taking place around HLS/LL-HLS. But I could be wrong about that, too.
HLS and DASH are similar indeed, but I think the main reason that HLS is still used a lot is that it's the lowest common denominator; you can get it to work everywhere. Perhaps in the future that will be LL-HLS, or something else entirely, but for now most really low latency broadcasting that I'm aware of is using DASH with HLS as a fallback (i.e: CMAF). But I could also be wrong about that of course :-)
Interesting, since it was Apple who designed & built both the original HLS and the LL extensions. What's currently preventing LL from being usable there?
The apples-to-apples comparison here is 0ms (in webrtc, no send-side buffering) vs 200ms (in low latency HLS) or 6s (in standard HLS). This is independent of the endpoint's latency to the CDN or source.
Another distinction is playback wait time, i.e., how quickly upon joining can it start rendering video.
I’m assuming the full reference picture (typically, an IFrame or a golden frame depending on the codec) in low latency HLS is only available at the start of each 6s segment and not in partial segments. So joining a live stream, the receiving endpoint would have to wait at most 6s before rendering.
Similarly, in webrtc it's up to the system to generate a reference frame at regular intervals, as often as every second. Or it can be done reactively: a receiving endpoint can ask the sender to send a new reference picture. This is done via a Full Intra Request, and the wait time can be as quick as 1.5 times the round-trip time (as new codecs can generate a new I-frame instantaneously upon receiving a request). There's a slight CPU penalty for this, which means a sender getting too many Full Intra Requests may throttle its responses to one per second.
So the apples-to-apples comparison for wait time would be up to 1s for webrtc vs 6s for HLS.
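The comparison above can be written out as a small calculation, using the illustrative figures from this thread (a 6s HLS segment, ~1.5x RTT for a Full Intra Request response, and a ~1s FIR throttle on the sender):

```python
def hls_join_wait(segment_s: float = 6.0) -> float:
    # Worst case: the reference frame is only at the start of each
    # segment, so a joining viewer may wait up to one full segment.
    return segment_s

def webrtc_join_wait(rtt_s: float, fir_throttle_s: float = 1.0) -> float:
    # Worst case: a fresh key frame arrives ~1.5x RTT after the FIR,
    # unless the sender is throttling FIR responses (to roughly one
    # per second), in which case the throttle dominates.
    return max(1.5 * rtt_s, fir_throttle_s)

print(hls_join_wait())            # 6.0
print(webrtc_join_wait(0.05))     # 1.0 (FIR throttle dominates)
print(webrtc_join_wait(1.0))      # 1.5 (very slow link: RTT dominates)
```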
In addition to what vr000m said above, I'll just add that when you make HLS chunks smaller, you're reducing the leverage you get from HLS's core design decisions. I tried to cover some of this in the post.
One way to think about this intuitively is that HLS and WebRTC are opposite ends of one important trade-off axis.
HLS is about delivering media streams in a way that scales as cost-effectively as possible.
WebRTC is about delivering media frames at the lowest possible latency.
These are very different goals, and given current infrastructure and standards it's not possible to have your cake and eat it too, here. That may change in the future as low-latency video becomes more and more important. QUIC, for example, is a new approach to building out a full stack that works around some of the fundamental tradeoffs that exist today.
The result is that pushing HLS segments down to 200ms is not at all a clear win. We'll see what happens as HLS implementations improve. And I should say that my brain has been warped by working on UDP/RTP stuff for a long time. But my bet is that using 200ms HLS segments is, for most real-world users, going to make HLS worse in every way than WebRTC would be, for the same use cases. (That's definitely true today with the early implementations of LLHLS.)
1) A playlist of multiple different video files in VLC as a source for ...
2) OBS Studio, which produces RTMP to be consumed by ...
3) nginx, which calls ffmpeg to produce multiple bitrates and resolutions to be rebundled as HLS to be sent to ...
4) a "live TV" channel of my own specification as an input to JellyFin, which can be read by ...
5) various clients on Roku, Apple TV, Firestick, Chromecast "apps"
At this point, I don't think the industry will ever really settle down to something manageable.
The reason I didn't talk about it in the post is because the basic ideas that go into reducing the latency of any HLS/DASH implementation are the same. Smaller segment sizes, chunked transport, and smart encoding and prefetch/buffering implementations.
But ... while there's lots of terrific engineering underpinning the Twitch approach, there's no way around the real-world limits to that approach. As a result, on good network connections you'll usually get ~2s latency. But not so much on the long tail of network connections. If your use case can gracefully accommodate a range of latencies for the same shared session across different clients, that's fine. If it can't, it's not fine. Plus, I don't think you can get down very much below 2s.
(I don't have access to any Twitch/IVS internal data, so I don't know what latencies they see globally across their user base. But I've done a lot of testing of this kind of stuff in general.)
And, yeah, Apple. Agreed.
For me, DASH and HLS are just manifests or playlist definitions: descriptions of how best to find the resource. It is not dissimilar to webrtc signalling, i.e., go to a resource to discover where to get other parts of the resource.
Webrtc uses a few protocols. RTP is very central to it; ICE and SDP are also very important protocols for NATs/firewalls and capability selection. We use webrtc as a shortcut to refer to the suite of protocols, which is why we didn't call out RTP specifically.
I assume WebRTC includes STUN/TURN/ICE (negotiated over SIP?) then for traversing NATs? The last time I was really into networking was 2001-ish so that stuff was still around the corner, but I kept up with my reading for a few years after that. I also had some of these acronyms refreshed when setting up Jingle, which uses XMPP instead of SIP, but establishes an RTP connection much like traditional VOIP would use.
If it's the user's last mile connection (between e.g. their home and their ISP), then the big HLS/DASH/etc buffer translates into a lot of time to react. So clients have the option to shift quite low - and do so quickly - if there are some very low bandwidth options, in theory even switching down to an audio-only or nearly-audio-only stream if one is provided, and can also choose to be optimistic/aggressive to resume playback as soon as one full chunk is downloaded - and some implementations will even resume playback when less than a full chunk is downloaded. The client side logic has a lot of latitude here to balance fast start/resume times vs sustaining playback.
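That client-side latitude can be sketched as a toy rendition-selection function. The bitrate ladder and safety margin below are made-up illustrative values, not from any real player:

```python
# An illustrative ABR ladder, highest first; the 64 kbps rung stands in
# for a (nearly) audio-only rendition the client can fall back to.
RENDITIONS_KBPS = [6000, 3000, 1200, 400, 64]

def pick_rendition(measured_kbps: float, safety: float = 0.8) -> int:
    """Return the highest bitrate we can sustain, with some headroom
    (the big buffer gives the client time to make this choice before
    playback stalls)."""
    budget = measured_kbps * safety
    for kbps in RENDITIONS_KBPS:
        if kbps <= budget:
            return kbps
    # Worst case: shift all the way down rather than stop playing.
    return RENDITIONS_KBPS[-1]

print(pick_rendition(8000))   # 6000
print(pick_rendition(1000))   # 400
print(pick_rendition(50))     # 64
```

Real players layer a lot more on top (buffer occupancy, throughput smoothing, startup heuristics), but this is the core of the "shift quite low, and quickly" behavior.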
When the bottleneck or failure is elsewhere, HLS can be incredibly durable. For extremely high profile events, for example, there are typically multiple CDNs involved, multiple sources going to independent encoders, etc. So an HLS/DASH client might talk to many different servers on a given CDN, as well as servers on alternate CDNs, and even grab what amount to being different copies of the stream spit out by different encoders. It's not uncommon for a client to be testing different CDN endpoints throughout playback to migrate away from congestion automatically.
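The multi-CDN failover behavior can be sketched like this. The hostnames are invented, and the fetch function is injected so the example stays self-contained:

```python
# Try each CDN endpoint in order for a given segment, moving on when a
# fetch fails; real clients also probe alternates during playback.
CDN_HOSTS = ["cdn-a.example.com", "cdn-b.example.com", "cdn-c.example.com"]

def fetch_segment(path: str, fetch, hosts=CDN_HOSTS):
    """Return (host, data) from the first CDN that serves the segment."""
    last_err = None
    for host in hosts:
        try:
            return host, fetch(host, path)
        except OSError as err:        # timeout, refused connection, etc.
            last_err = err
    raise last_err                    # every endpoint failed

# Simulate cdn-a being down:
def fake_fetch(host, path):
    if host == "cdn-a.example.com":
        raise OSError("timed out")
    return b"segment-bytes"

print(fetch_segment("/live/seg104.ts", fake_fetch))
```

Because HLS segments are plain HTTP objects, the same bytes can come from any CDN, any encoder copy, at any time, which is where a lot of that durability comes from.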
Partly because big parts of WebRTC are not standardized (session setup signaling, of course, but also in practice lots of necessary state management) it's a little bit hard to imagine how to build an equivalent for WebRTC.
Relatively recently, I would have said that our experience running large-scale WebRTC stuff in production made "core" infrastructure failure relatively low on our list of concerns. The two components of the last mile connection, on the other hand, are always a huge pain point because of the long tail of bad ISPs and bad Wifi setups.
However ... many of us who try to deliver always-available video services got something of a wakeup call in November and December last year when AWS had two pretty big outages two months in a row.
In practice, there is a little more to it. For example, when DRM is enabled, you need some way to preserve the decryption keys. And for live content, the manifest file usually just tells the client about a sliding window of files, so you need a tiny bit of additional client side logic to pay attention to this fact.
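The sliding-window logic can be sketched as a tiny playlist differ: the client compares `#EXT-X-MEDIA-SEQUENCE` against what it has already fetched. The playlist text is a hand-written example, and real players track more state than this:

```python
def new_segments(playlist: str, last_seq_seen: int):
    """Return (uri, sequence_number) pairs the client hasn't fetched yet."""
    lines = [ln.strip() for ln in playlist.splitlines() if ln.strip()]
    seq = 0
    out = []
    for line in lines:
        if line.startswith("#EXT-X-MEDIA-SEQUENCE:"):
            # Sequence number of the first segment still in the window.
            seq = int(line.split(":", 1)[1])
        elif not line.startswith("#"):
            # A segment URI; skip it if we've already played it.
            if seq > last_seq_seen:
                out.append((line, seq))
            seq += 1
    return out

playlist = """#EXTM3U
#EXT-X-MEDIA-SEQUENCE:102
#EXTINF:6.0,
seg102.ts
#EXTINF:6.0,
seg103.ts
#EXTINF:6.0,
seg104.ts
"""
print(new_segments(playlist, last_seq_seen=102))
# [('seg103.ts', 103), ('seg104.ts', 104)]
```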
One cool thing about DASH/HLS is that you can do some pretty complex mixing of content - you can build a traditional TV-channel like experience that mixes live and prerecorded content, you can replace and inject ads, you can make live content immediately available for on-demand playback, etc.
(The original post is written by Daily's CEO.)
On your own S3, this would be a multipart upload.
Internet video conferencing: we used to have multicast and realtime videoconferencing in the 90s: https://ee.lbl.gov/vic/
Before that we had analog videoconferencing, eg AT&T Picturephone service went GA in 1970.