I didn't read my tone as uncivil, just harsh. I guess it came across harsher than intended. I'll try to cool it a bit more next time, but I have to say it's not like the rest of HN is taking this advice to heart when they're criticizing AWS. I realize that this isn't a defense (whataboutism), but I guess it's fine to "speak truth to power" or something? Anyway, point noted and I'll try to keep my snark down.
> Amazon seeks to avoid common-mode failures between AZs (and thus regions). This doesn't mean that Amazon attains this goal. And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.
Right, so which common-mode failures are occurring here? What I'm seeing in this thread and previous AWS threads is a lot of hearsay. Stuff like "the AWS console isn't loading" or "I don't have that problem on Linode!" or "the McDonald's app isn't working so everything is broken thanks to AWS!" I'd love to see a postmortem document, like this one, actually uncover one of these common-mode failures. Not because I doubt they exist (any real system has bugs, and I have no doubt a real distributed system has real limitations); I just haven't seen it borne out in real-world experience at my current company or at other companies I've worked at which used AWS pretty heavily.
While I don't work at AWS, my company also publishes an SLA and we refund our customers when we dip below that SLA. When an outage occurs, SLA-impacting or not, we spend a _lot_ of time getting to the bottom of what happened and documenting what went wrong. Frequently it's multiple things going wrong at once, causing a sort of cascading failure that we didn't catch or couldn't reproduce in chaos testing. Part of the process of architecting solutions for high scale (~ billions/trillions of weekly requests) is to work through the AWS docs and make sure we select the right architecture to get the guarantees we seek. I'd like to see evidence of common-mode failures and the defensive guarantees that failed, in order to show proof of them, or proof positive through a dashboard or something, before I'm willing to malign AWS so easily.
> And the larger point: as I'm sure you're aware, building a distributed system that attains higher uptimes by crossing multiple AZs is hard and costly and can only be justified in some cases.
Sure, if you're not operating high-reliability services at high scale, it's true, you don't need cross-AZ or cross-region failover. But if you chose, whether because of the balance sheet or out of ignorance, not to take advantage of AWS's reliability features, then you shouldn't get to complain that AWS is unreliable. Their guarantees are written on their SLA pages.