Thank you to the engineers and developers!
Granted, we had specific QoS/traffic shaping to improve reliability without gobbling up all the bandwidth (streaming Netflix was an advertised feature of the wifi service), but it still seemed like magic.
I'm amazed that service allowed streaming though...
Which is why TCP is a horrible choice for any streaming service and for lossy connections generally, and I would be quite surprised if Netflix relied on it. UDP is the perfect choice for streaming, since video decoders can handle packet loss pretty well. The rest you can achieve with a good tradeoff between Reed-Solomon codes and key framing.
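A toy sketch (all names made up) of the forward-error-correction idea: a single XOR parity packet per group lets the receiver rebuild one lost packet without any retransmission. Real Reed-Solomon codes generalize this to recover multiple losses; XOR parity is just the simplest special case.

```python
def make_parity(packets: list) -> bytes:
    """XOR all packets together (padded to the longest packet)."""
    size = max(len(p) for p in packets)
    parity = bytearray(size)
    for p in packets:
        for i, b in enumerate(p):
            parity[i] ^= b
    return bytes(parity)

def recover(received: list, parity: bytes) -> list:
    """Rebuild a single missing packet (marked None) from the parity packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    assert len(missing) <= 1, "XOR parity can only repair one loss"
    if missing:
        # XOR of all surviving packets plus the parity cancels everything
        # except the lost packet.
        received[missing[0]] = make_parity(
            [p for p in received if p is not None] + [parity]
        )
    return received

group = [b"frame-1", b"frame-2", b"frame-3"]
parity = make_parity(group)
lost = [group[0], None, group[2]]    # packet 2 dropped in transit
print(recover(lost, parity)[1])      # b'frame-2'
```

The sender pays a small bandwidth overhead (one extra packet per group) to avoid the round trip a TCP retransmission would cost.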
And if this is true, then how could it be that Amazon works without problems and Netflix doesn’t?
I'd imagine this is largely due to MSS clamping rather than actual MTU-caused packet loss.
I assume the browse screen is based entirely on TCP?
I'm struggling to understand why packet loss would prevent it from loading -- it should be slower but TCP should handle re-transmission, no?
Or is Netflix doing something tricky with UDP even in their browsing UX?
Back in the day we used to have timeouts based on individual reads/writes, which often better answer the question "is this HTTP request making progress?". The problem with these sorts of timeouts is that they don't compose well, so most people end up with an end-to-end deadline instead.
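A minimal sketch (function names assumed) contrasting the two styles: a per-read timeout fires only when no bytes arrive for a while, so it tracks progress, but N sequential reads with a t-second timeout can legally take N*t seconds in total - which is exactly why they don't compose and people fall back to one end-to-end deadline.

```python
import socket
import time

def fetch_with_read_timeout(sock: socket.socket, per_read_timeout: float) -> bytes:
    """Fail only if an individual read stalls -- total time is unbounded."""
    sock.settimeout(per_read_timeout)   # applies to each recv() separately
    chunks = []
    while chunk := sock.recv(4096):     # socket.timeout raised on a stall
        chunks.append(chunk)
    return b"".join(chunks)

def fetch_with_deadline(sock: socket.socket, deadline_s: float) -> bytes:
    """Fail once the overall budget is spent, even if bytes still trickle in."""
    deadline = time.monotonic() + deadline_s
    chunks = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("end-to-end deadline exceeded")
        sock.settimeout(remaining)      # shrink the per-read budget each pass
        chunk = sock.recv(4096)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)
```

The second form is trivial to compose: pass the remaining budget down through every layer of the call stack, and the whole request is bounded no matter how many reads it performs.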
QUIC doesn't count because it's not tricky.
I'd love to see a source for this but seeing as YouTube works great over regular HTTP and TCP, I doubt anyone else is out in the weeds trying some custom UDP solution and reinventing wheels.
Used to have similar problems with an ADSL line, but found if I limited the line (both up and down) I could find a magic number where the packet loss went away. (Well, most of the time :))
Though it did need to be tuned for different times of the day, i.e. high-congestion times needed it to be lower.
Though technically it shouldn't be your problem :(
You may employ techniques more complex than a simple bucketing mechanism, such as closely observing the degree to which clients are exceeding their baseline. However, these techniques aren’t free. Even simply throwing away a request has a cost that can overwhelm your server - and the more steps you add before the shedding part, the lower the maximum load you can tolerate before going to 0 availability. It’s important to understand at what point this happens when designing a system that relies on this technique.
For example, if you do it at the OS level, it is a lot cheaper than leaving it to the server process. If you choose to do it in your application logic, think carefully about how much work is done for a request before it gets thrown away. Are you validating a token before making your decision?
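A minimal sketch (all names invented) of shedding before any expensive work: the bucket check is a few float operations with no allocation and no crypto, so a rejected request costs almost nothing - whereas validating a token first would make every shed request pay the verification cost anyway.

```python
import time

class CheapShedder:
    """Token bucket consulted before any real work happens."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def validate_token(request) -> bool:
    return True  # stand-in for real (expensive) signature verification

def serve(request) -> int:
    return 200   # stand-in for the real handler

def handle(request, shedder: CheapShedder) -> int:
    if not shedder.allow():          # O(1): shed before any expensive step
        return 503                   # rejection costs ~ a few float ops
    if not validate_token(request):  # crypto cost paid only by admitted requests
        return 401
    return serve(request)
```

Doing the same check even earlier - in the kernel or a front proxy - is cheaper still, which is the OS-level option mentioned above.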
It is becoming the approach du jour to quell 99th-percentile latency spikes (i.e. 1 in 100 requests takes substantially longer) by terminating the offending requests, which may not always be in the best interest of the user, even if it is convenient for the devops teams and their promotion packets.
Looks like the arrow goes the wrong direction.
Seems like a pretty bad Medium bug.
Edit: it is a bad link and I can see why this would happen if you had the Medium app installed. It’s a “branded” Medium post (i.e. appears on the Netflix-owned domain) but clicking the link redirects you to medium.com then redirects you back to the cname.
"Load Shedding".
Shout-out to my fellow South Africans.
Some of the things they mentioned were also user impacting, like not being able to select a video's language, but less critical. You obviously still want that feature, but it's less important than being able to watch at all.
"Clearly nobody cares about" - what? The whole point here is "people care most about video streaming" and less about the metadata etc that they lower in priority.
https://www.latimes.com/california/story/2020-09-21/online-l...