Aside from latency (which isn't much of a problem unless you are competing with TV or some other distribution system), it seems easier than on-demand, since you send the same data to everyone and don't need to keep a potentially huge library in every datacenter (you have to distribute the data, but that's just like having an extra few users per server).
My guess is that the problem was simply that the number of people viewing Netflix at once in the US was much larger than usual and higher than what they could scale to, or alternatively that a software bug was triggered.
Live content is harder because it can't really be cached, nor, due to TLS, can you really serve everyone the same stream. I think the hardest problem to solve is provisioning. If you are expecting 1 million users, and 700,000 of them get routed to a single server, that server will begin to struggle. This can happen in a couple of different ways - for example, an ISP that normally isn't a large consumer suddenly overloads its edge server. Even though your DC can handle the traffic just fine, the links between your DC and the ISP begin to suffer, and since the event is live, it's not like you can just wait until the cache is filled downstream.
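To make that hotspot problem concrete, here's a minimal sketch of load-aware viewer steering. The names, capacities, and the 90% spill threshold are all made up for illustration; this is not how Netflix's Open Connect actually assigns clients to servers.

```python
# Minimal sketch: prefer the viewer's nearest edge PoP, but spill over once it
# runs out of headroom, so one ISP's surge doesn't melt a single PoP while the
# rest of the fleet idles. All names and numbers are invented.

from dataclasses import dataclass

@dataclass
class EdgePop:
    name: str
    capacity: int          # concurrent viewers this PoP (and its ISP links) can handle
    current_viewers: int = 0

    @property
    def utilization(self) -> float:
        return self.current_viewers / self.capacity

def assign_viewer(pops: list[EdgePop], preferred: str) -> EdgePop:
    """Prefer the viewer's nearest PoP, but spill over once it is ~90% full."""
    by_name = {p.name: p for p in pops}
    chosen = by_name[preferred]
    if chosen.utilization >= 0.9:
        # Spill to the least-loaded PoP instead of overloading the preferred one.
        chosen = min(pops, key=lambda p: p.utilization)
    chosen.current_viewers += 1
    return chosen

pops = [EdgePop("isp-a-edge", capacity=50_000),
        EdgePop("isp-b-edge", capacity=200_000),
        EdgePop("regional-dc", capacity=700_000)]

# A surge of viewers from one normally-quiet ISP.
for _ in range(120_000):
    assign_viewer(pops, preferred="isp-a-edge")

for p in pops:
    print(f"{p.name}: {p.current_viewers} viewers ({p.utilization:.0%})")
```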
Isn't it a tree of cache servers? As the origin sends the frames, they're cached.
And as load grows the tree has to grow too, and when it can't, you resort to degrading bitrate, and ultimately to load shedding, to keep the remaining viewers happy?
And it seems Netflix opted to forego the last one to avoid the bad PR of a "we are over capacity" error message, and instead went with "let it burn", no?
When I say "cached", I mean that the PoP server can serve content without contacting the origin server. (The PoP can't serve content it does not have.)
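Roughly, as a toy sketch of that tree-of-caches picture (the two-level hierarchy, segment names, and sizes are all illustrative, not Netflix's actual Open Connect design), each layer serves what it already has and only goes up the tree on a miss:

```python
# Toy sketch: each layer keeps an LRU cache of video segments and only asks
# its parent (mid-tier or origin) on a miss. Names and sizes are illustrative.

from collections import OrderedDict

class SegmentCache:
    def __init__(self, parent_fetch, max_segments=1000):
        self._cache = OrderedDict()        # segment_id -> bytes, in LRU order
        self._parent_fetch = parent_fetch  # callable: segment_id -> bytes
        self._max = max_segments

    def get(self, segment_id):
        if segment_id in self._cache:              # hit: serve without touching origin
            self._cache.move_to_end(segment_id)
            return self._cache[segment_id]
        data = self._parent_fetch(segment_id)      # miss: go up the tree
        self._cache[segment_id] = data
        if len(self._cache) > self._max:           # evict the least-recently-used segment
            self._cache.popitem(last=False)
        return data

origin = lambda seg_id: b"\x00" * 2_000_000   # pretend: ~2 MB segment from the encoder
mid_tier = SegmentCache(parent_fetch=origin)
pop = SegmentCache(parent_fetch=mid_tier.get)

pop.get("live/1080p/seg_0042.ts")   # first viewer: misses all the way to the origin
pop.get("live/1080p/seg_0042.ts")   # later viewers: served straight from the PoP
```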
>And it seems Netflix opted to forego the last one to avoid the bad PR of a "we are over capacity" error message, and instead went with "let it burn", no?
Anything other than 100% uptime is bad PR for Netflix.
With on-demand you can push the episodes out through your entire CDN at your leisure. It doesn't matter if some bottleneck means it takes 2 hours to distribute a 1 hour show worldwide, if you're distributing it the day before. And if you want to test, or find something that needs fixing? You've got plenty of time.
And on-demand viewers can trickle in gradually - so if clients have to contact your DRM servers for a new key every 15 minutes, they won't all be doing it at the same moment.
And if you did have a brief hiccup with your DRM servers - could you rely on the code quality of abandonware Smart TV clients to save you?
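As a toy illustration of why that staggering matters: the 15-minute key lifetime is the interval mentioned above, while the client counts, jitter values, and the idea of a single license endpoint are invented for illustration.

```python
# Toy illustration of staggered vs. simultaneous DRM key refreshes.

import random

KEY_LIFETIME_S = 15 * 60

def refresh_times(start_offsets_s, jitter_s=0.0, horizon_s=3 * 3600):
    """Seconds at which each client asks the license server for a new key."""
    times = []
    for start in start_offsets_s:
        t = start
        while t < horizon_s:
            t += KEY_LIFETIME_S + random.uniform(-jitter_s, jitter_s)
            times.append(t)
    return times

def peak_requests_per_second(times):
    buckets = {}
    for t in times:
        buckets[int(t)] = buckets.get(int(t), 0) + 1
    return max(buckets.values())

n = 100_000
on_demand = [random.uniform(0, 3600) for _ in range(n)]  # viewers trickle in over an hour
live = [0.0] * n                                          # everyone joins at kickoff

print("on-demand peak req/s:     ", peak_requests_per_second(refresh_times(on_demand)))
print("live, no jitter peak req/s:", peak_requests_per_second(refresh_times(live)))
print("live, +/-60s jitter peak:  ", peak_requests_per_second(refresh_times(live, jitter_s=60)))
```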
People using over-the-air antennas get it “live”. Getting it from cable or a streaming service meant anywhere between a few seconds and over a minute of delay.
It was absolutely common to have a friend text you about something that just happened when you haven’t even seen it yet.
You can’t even say that $some_service is fast; some of them vary by over 60 seconds just between their own users.
I'd imagine with on-demand services you already have the full content and therefore can use algorithms to compress frames and perform all kinds of neat tricks too.
With live streaming I'd imagine a lot of these algorithms are useless as there isn't enough delay & time to properly use them, so they're required to stream every single pixel, with maybe some JIT algorithms at best.
But in either case, you can put that stuff on your CDN days ahead of time. You can choose to preload it in the cache because you know a bunch of people are gonna want it. You also know that not every single individual is going to start at the exact same time.
For live, every single person wants every single byte at the same time and you can’t preload anything. Brutal.
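Back-of-the-envelope, with deliberately round made-up numbers rather than Netflix's real figures:

```python
# When every viewer pulls the stream at once, aggregate egress is simply
# viewers * bitrate, and none of it can be soaked up by a pre-warmed cache.
# Both numbers below are hypothetical.

concurrent_viewers = 60_000_000   # hypothetical peak for a big live event
bitrate_bps = 5_000_000           # ~5 Mbps per stream

aggregate_tbps = concurrent_viewers * bitrate_bps / 1e12
print(f"{aggregate_tbps:.0f} Tbps of simultaneous egress")   # -> 300 Tbps
```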
* Encoding - low-latency encoders are quite different from storage encoders. There is a tradeoff to be made in terms of the frequency of key frames vs. overall encoding efficiency. More key frames means that anyone can tune in or recover from a loss more quickly, but it is much less efficient, reducing quality. The encoder and infrastructure should emit transport streams, which are also less efficient but more reliable than container formats like mp4.
* Adaptation - Netflix normally encodes their content as a ladder of various codecs and bitrates. This ensures that people get roughly the maximum quality that their bandwidth will allow without buffering. For a live event, you need the same ladder, and the clients need to switch between rungs invisibly (see the rung-selection sketch after this list).
* Buffering - for static content, you can easily buffer 30 seconds to a minute of video. This means that small latency or packet loss spikes are handled invisibly at the transport/buffering layer. You can't do this for a live event, since that level of delay would usually be unacceptable for a sporting event. You may only be able to buffer 5-10 seconds. If the stream starts to falter, the client has only a few seconds to detect and shift to a lower rung.
* Transport - Prerecorded media can use a reliable transport like TCP (usually HLS). In contrast, live video would ideally use an unreliable transport like UDP, but with FEC (forward error correction). TCP's reaction to packet loss is to halve the congestion window, which halves throughput, so a loss spike effectively forces the client down to a lower-bandwidth rung.
* Serving - pre-recorded media can be synchronized to global DCs. Live events have to be streamed reliably and redundantly to a tree of servers. Those servers need to be load balanced, and the clients must implement exponential backoff or you can have cascading failures.
* Timing - Unlike pre-recorded media, any client that has a slightly fast clock will run out of frames and either need to repeat frames and stretch audio, or suffer glitches. If you resolve this on the server side by stretching the media, you will add complication and your stream will slowly get behind the live event.
* DVR - If you allow the users to pause, rewind, catch up, etc., you now have a parallel pre-recorded infrastructure and the client needs to transition between the two.
* DRM - I have no idea how/if this works on a live stream. It would not be ideal if all clients used the same decryption keys and received identical streams with identical metadata, since that would make tracing the source of a pirate stream very difficult. Differentiation/watermarking adds substantial complexity, however.
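To make the adaptation and buffering points above concrete, here is a minimal client-side rung-selection sketch. The bitrates, thresholds, and buffer targets are invented, and real HLS/DASH players use far more sophisticated heuristics:

```python
# Minimal sketch of live ABR: pick a ladder rung from measured throughput,
# and drop rungs aggressively when the (small) live buffer starts to drain.
# All bitrates, thresholds, and buffer targets are invented for illustration.

LADDER_BPS = [800_000, 1_800_000, 3_500_000, 6_000_000]   # hypothetical rungs

def choose_rung(throughput_bps: float, buffer_s: float, current: int) -> int:
    """Return an index into LADDER_BPS."""
    # With only ~5-10 s of buffer, a draining buffer is an emergency:
    # step down immediately rather than waiting for throughput estimates.
    if buffer_s < 3.0 and current > 0:
        return current - 1
    # Otherwise pick the highest rung sustainable with ~25% headroom.
    affordable = [i for i, bps in enumerate(LADDER_BPS) if bps * 1.25 <= throughput_bps]
    target = max(affordable) if affordable else 0
    # Step up only one rung at a time to avoid oscillating.
    return min(target, current + 1)

# Example: a throughput dip eats the buffer, the client steps down, then recovers.
rung = 2
for throughput, buffer_s in [(7e6, 8.0), (2e6, 6.0), (2e6, 2.5), (2e6, 4.0), (7e6, 7.0)]:
    rung = choose_rung(throughput, buffer_s, rung)
    print(f"throughput={throughput/1e6:.0f} Mbps, buffer={buffer_s:.1f}s -> "
          f"{LADDER_BPS[rung]/1e6:.1f} Mbps rung")
```

The asymmetry is deliberate: with only a few seconds of buffer, stepping down late means a visible stall, while stepping up too eagerly just causes oscillation between rungs.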