The real failure rate of EBS

(planetscale.com)

113 points · QuinnyPig · 1y ago · 30 comments

29 comments · 7 top-level

samlambert · 1y ago · 9 in thread

we have a lot more content like this on the way. if anyone has feedback or questions let us know.

swyx · 1y ago

LOVE this stuff sam, it's highly educational but also establishes a ton of trust in PS. please keep it up!

JoshTriplett · 1y ago

How often do you boot up instances? Do you measure detailed metrics for the time from the RunInstances call to the earliest possible timestamp you can easily get from the user code, to quantify the amount of time spent in AWS before any instance code gets control?

If so, I'd love to see your measured distribution of boot times. Because I've observed results similar to your observations on EBS, with some long-tail outliers.

Thanks for the analysis and article!

miller_joe · 1y ago

Instances are constantly booting up because most instances live <30d. Boot time, in terms of how soon a node is fully booted, joined to the EKS apiserver, and ready for workloads, is approx 2.5-3min. There are a lot of parts involved in getting to this point though, some of which would not matter if you're not using EKS. Also, this is not something we measure super closely, as from a user perspective it is generally imperceptible.

A possibly better metric for your particular case (assuming you're interested in the fastest boot-up achievable) is from our self-managed github-actions runners. Those boot times are in the 40-50s range. This is consistent with what others see, as far as I know. A good blog on this topic from the depot.dev folks that you might be interested in, including how they got boot-to-ready times down to 5s: https://depot.dev/blog/github-actions-breaking-five-second-b...

bigfatfrock · 1y ago

Great deep dive, I've been actively curious about some of the results you found that present themselves similarly in infra setups I run or have run previously.

This kind of miffs me also:

> AWS doesn’t describe how failure is distributed for gp3 volumes

I wonder why? Because it affects their number of 9s? Rep?

samlambert · 1y ago

it's hard to know for sure. it might be that or it might just present a number that is confusing to most.

ta988 · 1y ago

Thanks! This is extremely useful and I'll be waiting for the next ones.

flaminHotSpeedo · 1y ago

Do you listen for volume degradation EventBridge notifications? I'm curious if, or how often, AWS flags these failed volumes for you.

nickvanw · 1y ago

Our experience has been that they do fire, but not reliably enough or soon enough to be worth anything other than validating the problem after the fact.

kingnaldo · 1y ago

Love how educational it is. I'd love even more if formulas were included for the statistics calculations.

reedf1 · 1y ago · 7 in thread

If you can detect EBS failure better than Amazon - I'd be selling this to them tomorrow.

tpetry · 1y ago

They probably detect this. That's why the problem is solved after one to ten minutes, according to the article. There's probably nothing they can do which wouldn't stress the disks more.

diggan · 1y ago

Probably sometimes, at least if we trust the article:

> In our experience, the documentation is accurate: sometimes volumes pass in and out of their provisioned performance in small time windows:

What AWS considers "small degradation" is sometimes "100% down" for their users though. Look at any previous "AWS is down/having problems" HN comment thread and you'll see there tends to be a huge mismatch between what AWS considers "not working" and what users of AWS consider "not working".

Doesn't surprise me people want better tooling than what AWS themselves offer.

nickvanw · 1y ago

Author here - it's not that we're detecting failure better than they are (though certainly, we might be able to do it as fast as anyone else) - it's what you do afterwards that matters.

Being able to fail over to another database instance backed by a different volume in a different zone minimizes the impact. This is well in line with AWS best practices; it's just arduous to do quickly and at scale.

sougou · 1y ago

It's not just failure detection. A write to EBS is at least two additional network hops. The first one is to get to the machine for the initial write, and the second is for that write to be propagated to another machine for durability. Multiply this by the number of IOPS required to complete a database transaction.

dijit · 1y ago

Why? They wouldn't buy it.

No offence to anyone who has drunk the kool-aid with AWS, but honestly they're making a product, *not* foundational infrastructure.

This might feel like a jarring point.

When you think of foundational infrastructure in the real world you think bridges and plumbing and the costs of building such things; which is stupidly high.

Yet when those things get grossly privatised they end up like Lagos, Nigeria[0].

Because there is a difference between delivering something that works most of the time, and something that works all of the time -- Major point being: one of them is obscenely profitable, and the other one might not even break even, which is why governments usually take on the cost of foundational infrastructure: They never expect to even break even.

[0]: https://ourworld.unu.edu/en/water-privatisation-a-worldwide-...

flaminHotSpeedo · 1y ago

I think the more interesting part here (besides the fact that AWS SLAs sneakily screw you over and make it hard to guarantee static stability) is the remediation aspect.

This is a consistent letdown across most AWS products; they build the undifferentiated 90% of a thing, but some PM refuses to admit their product isn't complete, so instead of having optional feature flags or CDK samples or something to help with that last 10%, they bury it deep in the docs and try not to draw attention to it. Then when you open a support case they tell you to pound sand, or maybe suggest rearchitecting to avoid the foot-gun they didn't tell you about.

bddicken · 1y ago

Or in this case, to spend far more $$ on io2.

mstaoru · 1y ago · 3 in thread

"What makes PlanetScale Metal performance so much better? With your storage and compute on the same server, you avoid the I/O network hops that traditional cloud databases require [...] Every PlanetScale database requires at least 2 replicas in addition to the primary. Semi-synchronous replication is always enabled. This ensures every write has reached stable storage in two availability zones before it’s acknowledged to the client."

Isn't there a contradiction between these two statements?

My personal experience with EBS analogs in China (Aliyun, Tencent, Huawei clouds) is that every disk will experience a fatal failure or a disconnection at least once a month, at any provisioned IOPS. I don't know what makes them so bad, but I gave up running any kind of DB workload on them, using node-local storage instead. If there are durability constraints, I would spin up Longhorn or Rook on top of local storage. I can see replicas degrade from time to time, but overall the systems work (nothing too large, maybe ~50K QPS).

samlambert · 1y ago

it's not a contradiction but there is nuance. local disks mean we can do a significant amount of the operations involved in a write locally without every block going over the network. It's true that a replica has to acknowledge it received the write but that's a single operation vs hundreds over a network.
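To put rough shape on the "single operation vs hundreds over a network" point, here is a toy serial latency model in Python. Every number in it (100 IOs per transaction, 100 µs device latency, 250 µs per network hop, 500 µs for one replica acknowledgement) is a made-up placeholder rather than a measurement from PlanetScale or AWS; only the "two hops per write" figure comes from sougou's comment earlier in the thread. Real databases overlap and batch IOs, so treat this as an upper-bound intuition, not a prediction.

```python
def network_attached_us(ios_per_txn, device_us, hop_us, hops_per_io=2):
    """Every IO pays the device latency plus its network hops (serial model)."""
    return ios_per_txn * (device_us + hops_per_io * hop_us)


def local_with_replica_ack_us(ios_per_txn, device_us, replica_ack_us):
    """IOs stay on the local disk; the network is crossed once for the replica ack."""
    return ios_per_txn * device_us + replica_ack_us


# Hypothetical placeholder values, chosen only to make the arithmetic visible.
IOS_PER_TXN = 100
DEVICE_US = 100
HOP_US = 250
REPLICA_ACK_US = 500

ebs_like = network_attached_us(IOS_PER_TXN, DEVICE_US, HOP_US)
local = local_with_replica_ack_us(IOS_PER_TXN, DEVICE_US, REPLICA_ACK_US)

print(f"network-attached: {ebs_like} us per transaction")
print(f"local + one ack:  {local} us per transaction")
```

Under these assumed inputs the network term is paid once instead of once per IO, which is the nuance being described; with different placeholder values the gap changes but the structure of the two formulas does not.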

rfoo · 1y ago

> that every disk will experience a fatal failure or a disconnection at least once a month

When? I vaguely remember that it used to be like that, but I haven't seen nearly as many failures on Aliyun for the last few years.

mstaoru · 1y ago

Admittedly, I moved everything off around 2020. We do have a few smaller "on-prem" style installs for a few customers, with much less traffic than our main installation. One of the Aliyun installs does experience block device issues from time to time, but since I effectively RAID them, it goes unnoticed, though it doubles the price. But the install is small (<1TB data) so it's not a problem.

jewel · 1y ago · 2 in thread

I wonder if you could work around this problem by having two EBS volumes on each host and writing to both. You'd have the OS report the write as successful as soon as either drive reported success. With reads you could alternate between drives for double the read performance during happy times, but quickly detect when one drive is slow and reroute those reads to the other drive.

We could call this RAID -1.

You'd need some accounting to ensure that the drives are eventually consistent, but based on the graphs of the issue it seems like you could keep the queue of pending writes in RAM for the duration of the slowdown.

Of course, it's quite likely that there will be correlated failures, as the two EBS volumes might end up on the same SAN and set of physical drives. Also it doesn't seem worth paying double for this.
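A minimal sketch of the "RAID -1" idea in Python with `asyncio`: acknowledge a write as soon as either volume succeeds, keep the slower write pending so the other volume catches up, and hedge a slow read to the other volume. `FakeVolume`, `write_first_ack`, `hedged_read`, and all latencies are hypothetical names and values invented for illustration; nothing here talks to EBS, and it assumes exactly two volumes with no crash recovery for the pending-write queue.

```python
import asyncio


class FakeVolume:
    """In-memory stand-in for a block device with a fixed simulated latency."""

    def __init__(self, latency_s):
        self.latency_s = latency_s
        self.blocks = {}

    async def write(self, block_id, data):
        await asyncio.sleep(self.latency_s)
        self.blocks[block_id] = data

    async def read(self, block_id):
        await asyncio.sleep(self.latency_s)
        return self.blocks[block_id]


async def write_first_ack(volumes, block_id, data):
    """Write to every volume; return (winner_index, pending) on first success.

    `pending` holds the writes still in flight. The caller must keep them
    alive until they finish, which is the in-RAM queue from the comment.
    """
    tasks = {asyncio.create_task(v.write(block_id, data)): i
             for i, v in enumerate(volumes)}
    pending = set(tasks)
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.exception() is None:
                return tasks[task], pending
    raise IOError("write failed on every volume")


async def hedged_read(volumes, block_id, primary, hedge_after):
    """Read from `primary`; if it is slow, race it against the other volume."""
    first = asyncio.create_task(volumes[primary].read(block_id))
    done, _ = await asyncio.wait({first}, timeout=hedge_after)
    if done:
        return first.result()
    second = asyncio.create_task(volumes[1 - primary].read(block_id))
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the loser is no longer needed
    return done.pop().result()


async def main():
    degraded, healthy = FakeVolume(0.2), FakeVolume(0.001)
    volumes = [degraded, healthy]
    winner, pending = await write_first_ack(volumes, 7, b"row")
    acked_before_slow = 7 not in degraded.blocks
    await asyncio.gather(*pending)  # drain the queue so both volumes converge
    data = await hedged_read(volumes, 7, primary=0, hedge_after=0.01)
    return winner, acked_before_slow, data


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The sketch deliberately leaves out the hard parts raised in the comment: the pending writes live only in memory, so a host crash during a slowdown would leave the volumes inconsistent, and two volumes behind the same backend could degrade together.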

maherbeg · 1y ago

The blog post mentioned correlated failures in an availability zone. You likely could reduce this a bit, but you'd still run into it frequently enough.

samlambert · 1y ago

it's a lot of complexity and cost for a service that is already replicating 3 ways. 6x replication for a single node's disks seems excessive.

c4wrd · 1y ago · 1 in thread

> When attached to an EBS–optimized instance, General Purpose SSD (gp2 and gp3) volumes are designed to deliver at least 90 percent of their provisioned IOPS performance 99 percent of the time in a given year. This means a volume is expected to experience under 90% of its provisioned performance 1% of the time. That’s 14 minutes of every day or 86 hours out of the year of potential impact. This rate of degradation far exceeds that of a single disk drive or SSD.

> This is not a secret, it's from the documentation. AWS doesn’t describe how failure is distributed for gp3 volumes, but in our experience it tends to last 1-10 minutes at a time. This is likely the time needed for a failover in a network or compute component. Let's assume the following: Each degradation event is random, meaning the level of reduced performance is somewhere between 1% and 89% of provisioned, and your application is designed to withstand losing 50% of its expected throughput before erroring. If each individual failure event lasts 10 minutes, every volume would experience about 43 events per month, with at least 21 of them causing downtime!

These are some seriously heavy-handed assumptions being made, completely disregarding the data they collect. First, the author assumes that these failure events are distributed randomly and expected to happen on a daily basis, ignoring Amazon's failure rate statement throughout a year ("99% of the time annually"). Second, they argue that in practice, they see failures lasting between 1 and 10 minutes. However, they assert that we should assume each failure will last 10 minutes, completely ignoring the severity range they introduced.

Imagine your favorite pizza company claiming to deliver on time "99% of the time throughout a year." The author's logic is like saying, "The delivery driver knocks precisely 14 minutes late every day -- and each delay is 10 minutes exactly, no exceptions!". It completely ignores reality: sometimes your pizza is delivered a minute late, sometimes 10 minutes late, sometimes exactly on time for four months.

As a company with useful real-world data, I expect them not to make arguments based on exaggerations but rather show cold, hard data to back up their claims. For transparency, my organization has seen 51 degraded EBS volume events in the past 3 years across ~10,000 EBS volumes. Of those events, 41 had a duration of less than one minute, nine had a duration of two minutes, and one had a duration of three minutes.
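For anyone who wants to check the arithmetic on both sides, a back-of-the-envelope calculation in Python follows. The first half restates the quoted worst case (1% of the time, 10-minute events, and an assumed 30-day month); the second half uses the fleet numbers from this comment, rounding each event up to a whole minute as an upper bound (that rounding is an assumption, as is a 365-day year). It only covers events that were actually recorded as degraded.

```python
MIN_PER_DAY = 24 * 60
HOURS_PER_YEAR = 365 * 24

# Documentation wording: at least 90% of provisioned IOPS, 99% of the time.
allowed_fraction = 0.01

minutes_per_day = allowed_fraction * MIN_PER_DAY      # about 14.4
hours_per_year = allowed_fraction * HOURS_PER_YEAR    # about 87.6

# Article's worst case: every event lasts 10 minutes, 30-day month assumed.
events_per_month = allowed_fraction * 30 * MIN_PER_DAY / 10   # about 43.2

# Fleet numbers from the comment: 51 events, 3 years, ~10,000 volumes.
# Upper bound on duration: 41 events x 1 min, 9 x 2 min, 1 x 3 min.
degraded_minutes = 41 * 1 + 9 * 2 + 1 * 3
volume_minutes = 10_000 * 3 * 365 * MIN_PER_DAY
observed_fraction = degraded_minutes / volume_minutes

print(f"allowed by the wording: {minutes_per_day:.1f} min/day, "
      f"{hours_per_year:.1f} h/year, {events_per_month:.1f} events/month")
print(f"recorded in this fleet: {degraded_minutes} degraded minutes over "
      f"{volume_minutes:,} volume-minutes ({observed_fraction:.2e})")
```

The two halves answer different questions: the first is what the stated design target permits, the second is what one fleet recorded, which is several orders of magnitude below that ceiling.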

remram · 1y ago

They are expanding on what the guarantee from AWS means; their statement is correct. They did not say the pizza place does this, they said the pizza place's guarantee allows for this. I don't see a problem.

QuinnyPig · OP · 1y ago

I'm a sucker for deep dive cloud nerd content like this.

semi-extrinsic · 1y ago

Funny to see the plots with "No unit" on the y-axis label and then the actual units in parentheses in the title.
