If so, I'd love to see your measured distribution of boot times. Because I've observed results similar to your observations on EBS, with some long-tail outliers.
Thanks for the analysis and article!
A possibly better metric for your particular case (assuming you're interested in fastest bootup possibly achievable) is from our self-managed github-actions runners. Those boot times are in the 40-50s range. This is consistent with what others see, as far as I know. A good blog on this topic - including how they got boot-to-ready times down to 5s - that you might be interested in from the depot.dev folks: https://depot.dev/blog/github-actions-breaking-five-second-b...
This kind of miffs also:
> AWS doesn’t describe how failure is distributed for gp3 volumes
I wonder why? Because it affects their number of 9s? Rep?
> In our experience, the documentation is accurate: sometimes volumes pass in and out of their provisioned performance in small time windows:
What AWS consider "small degradation" is sometimes "100% down" for their users though, look at any previous "AWS is down/having problems" HN comment threads and you'll see there tends to be a huge mismatch between what AWS considers "not working" and what users of AWS considers "not working".
Doesn't surprise me people want better tooling than what AWS themselves offer.
Being able to fail over to another database instance backed by a different volume in a different zone allows for a minimization of impact. This is well inline with AWS best practices, it's just arduous to do quickly and at-scale.
No offence to anyone who has drank the kool-aid with AWS, but honestly they're making a product *not* foundational infrastructure.
This might feel like a jarring point.
When you think of foundational infrastructure in the real world you think bridges and plumbing and the costs of building such things; which is stupidly high.
Yet when those things get grossly privatised they end up like Lagos, Nigeria[0].
Because there is a difference between delivering something that works most of the time, and something that works all of the time -- Major point being: one of them is obscenely profitable, and the other one might not even break even, which is why governments usually take on the cost of foundational infrastructure: They never expect to even break-even.
[0]: https://ourworld.unu.edu/en/water-privatisation-a-worldwide-...
This is a consistent letdown across most AWS products; they build the undifferentiated 90% of a thing, but some PM refuses to admit their product isn't complete, so instead of having optional features flags or cdk samples or something to help with that last 10%, they bury it deep in the docs and try not to draw attention to it. Then when you open a support case they tell you to pound sand, or maybe suggest rearchitecting to avoid their foot-gun they didn't tell you about.
Isn't there a contradiction between these two statements?
My personal experience with EBS analogs in China (Aliyun, Tencent, Huawei clouds) is that every disk will experience a fatal failure or a disconnection at least once a month, at any provisioned IOPS. I don't know what makes them so bad, but I gave up running any kinds of DB workloads on them, using node local storage instead. If there are durability constrains, I would spin up Longhorn or Rook on top of local storage. I can see replicas degrade from time to time, but overall systems work (nothing too large, maybe ~50K QPS).
When? I vaguely remember that it used to be like that, but I haven't seen nearly as many failures on Aliyun for the last few years.
We could call this RAID -1.
You'd need some accounting to ensure that the drives are eventually consistent, but based on the graphs of the issue it seems like you could keep the queue of pending writes in RAM for the duration of the slowdown.
Of course, it's quite likely that there will be correlated failures, as the two EBS volumes might end up on the same SAN and set of physical drives. Also it doesn't seem worth paying double for this.
These are some seriously heavy-handed assumptions being made, completely disregarding the data they collect. First, the author assumes that these failure events are distributed randomly and expected to happen on a daily basis, ignoring Amazon's failure rate statement throughout a year ("99% of the time annually"). Second, they argue that in practice, they see failures lasting between 1 and 10 minutes. However, they assert that we should assume each failure will last 10 minutes, completely ignoring the severity range they introduced.
Imagine your favorite pizza company claiming to deliver on time "99% of the time throughout a year." The author's logic is like saying, "The delivery driver knocks precisely 14 minutes late every day -- and each delay is 10 minutes exactly, no exceptions!". It completely ignores reality: sometimes your pizza is delivered a minute late, sometimes 10 minutes late, sometimes exactly on time for four months.
As a company with useful real-world data, I expect them not to make arguments based on exaggerations but rather show cold, hard data to back up their claims. For transparency, my organization has seen 51 degraded EBS volume events in the past 3 years across ~10,000 EBS volumes. Of those events, 41 had a duration of less than one minute, nine had a duration of two minutes, and one had a duration of three minutes.