
0 points · Karrot_Kream · 4y ago

> If you're unable to be civil about this, maybe you should avoid the threads.

I didn't read my tone as uncivil, just harsh. I guess it came across harsher than intended. I'll try to cool it a bit more next time, but I have to say it's not like the rest of HN is taking this advice to heart when they're criticizing AWS. I realize that this isn't a defense (whataboutism), but I guess it's fine to "speak truth to power" or something? Anyway, point noted and I'll try to keep my snark down.

> Amazon seeks to avoid common-mode failures between AZs (and thus regions). This doesn't mean that Amazon attains this goal. And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.

Right, so which common mode failures are occurring here? What I'm seeing in this thread and previous AWS threads is a lot of hearsay. Stuff like "the AWS console isn't loading" or "I don't have that problem on Linode!" or "the McDonalds app isn't working so everything is broken thanks to AWS!" I'd love to see a postmortem document, like this, actually uncover one of these common mode failures. Not because I doubt they exist (any real system has bugs and I have no doubt a real distributed system has real limitations); I just haven't seen it borne out in real world experience at my current company and other companies I've worked at which used AWS pretty heavily.

While I don't work at AWS, my company also publishes an SLA and we refund our customers when we dip below that SLA. When an outage, SLA-impacting or not, occurs, we spend a _lot_ of time getting to the bottom of what happened and documenting what went wrong. Frequently it's multiple things that go wrong which cause a sort of cascading failure that we didn't catch or couldn't reproduce in chaos testing. Part of the process of architecting solutions for high scale (~ billions/trillions of weekly requests) is to work through the AWS docs and make sure we select the right architecture to get the guarantees we seek. I'd like to see evidence of common-mode failures and the defensive guarantees that failed in order to show proof of them, or proof positive through a dashboard or something, before I'm willing to malign AWS so easily.

> And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.

Sure, if you're not operating high-reliability services at high scale, it's true: you don't need cross-AZ or cross-region failover. But if you chose, through balance sheet or ignorance, not to take advantage of AWS's reliability features, then you shouldn't get to complain that AWS is unreliable. Their guarantees are written on their SLA pages.


mlyle · 4y ago

> I realize that this isn't a defense (whataboutism), but I guess it's fine to "speak truth to power" or something?

... I still don't think your overall starting assertion about the other people not understanding regions vs. AZs is correct, and it triggered you to repeatedly assert that the people you were talking to are unskilled.

I could very easily use the same words as them, and I have decade-old spreadsheets where I was playing with different combinations of latencies for commits and correlation coefficients for failures to try and estimate availability.

> Right, so which common mode failures are occurring here? What I'm seeing in this thread and previous AWS threads is a lot of hearsay. Stuff like "the AWS console isn't loading" or "I don't have that problem on Linode!" or "the McDonalds app isn't working so everything is broken thanks to AWS!" I'd love to see a postmortem document, like this, actually uncover one of these common mode failures. Not because I doubt they exist (any real system has bugs and I have no doubt a real distributed system has real limitations); I just haven't seen it borne out in real world experience at my current company and other companies I've worked at which used AWS pretty heavily.

I remember 2011, where EBS broke across all US-EAST AZs and lots of control plane services were impacted and you couldn't launch instances across all AZs in all regions for 12 hours.

Now maybe you'll be like "pfft, a decade ago!". I do think Amazon has significantly improved architecture. At the same time, AZs and regions being engineered to be independent doesn't mean they really are. We don't attain independent, uncorrelated failures on passenger aircraft, let alone these more complicated, larger, and less-engineered systems.

Further, even if AWS gets it right, going multi-AZ introduces new failure modes. Depending on the complexity of the data model and the operations on it, this stuff can be really hard to get right. Building a geographically distributed system with current tools is very expensive, and there's no guarantee that your actual operational experience will be better than in a single site until you've spent quite some time climbing the maturity curve.

> Their guarantees are written on their SLA pages.

Yup, and it's interesting to note that their thresholds don't really assume independence of failures. E.g. .995/.990/.950 are the uptime thresholds for instances and .999/.990/.950 are the uptime thresholds for regions.

If Amazon's internal costing/reliability engineering model assumed failures would be independent, they could offer much better SLAs for regions safely (e.g. back of the envelope, 1 - 3C2 * (.005 * .005) ≈ .999925). Instead, they imply that they expect multi-AZ has a failure distribution that's about 5x better for short outages and about the same for long outages.
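
For anyone who wants to check that back-of-the-envelope number, here's a rough sketch. Assumptions that are mine and not from the SLA text: three AZs, each independently down with probability .005, and "region down" meaning at least two AZs down at the same time. The function names are made up for illustration.

```python
from math import comb

def pairwise_outage_estimate(p_down: float, n_zones: int) -> float:
    """Back-of-envelope estimate: number of zone pairs times p^2."""
    return comb(n_zones, 2) * p_down ** 2

def at_least_k_down(p_down: float, n_zones: int, k: int) -> float:
    """Exact binomial probability that k or more independent zones are down."""
    return sum(
        comb(n_zones, j) * p_down ** j * (1 - p_down) ** (n_zones - j)
        for j in range(k, n_zones + 1)
    )

# Assumed inputs: per-AZ unavailability taken from the .995 instance threshold.
p = 0.005
approx_region_uptime = 1 - pairwise_outage_estimate(p, 3)
exact_region_uptime = 1 - at_least_k_down(p, 3, 2)

# Ratio of allowed downtime at the top tier: instance (.995) vs. region (.999).
implied_improvement = (1 - 0.995) / (1 - 0.999)

print(f"independent, pairwise estimate: {approx_region_uptime:.6f}")
print(f"independent, exact binomial:    {exact_region_uptime:.8f}")
print(f"improvement implied by the SLA: {implied_improvement:.1f}x")
```

Under independence the allowed region downtime would be on the order of 7.5e-5, versus the 1e-3 the region threshold actually allows, which is where the "about 5x" rather than "about 67x" observation comes from.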

And note there's really no SLA asserting independence of regions... You just have the instance level and region level guarantees.

Further, note that the SLA very clearly excludes some causes of multi-AZ failures within a region: force majeure, and regional internet access issues beyond the "demarcation point" of the service.
