(N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)
What he said was perfectly cogent.
Outages in us-east-1 AZ us-east-1a have caused outages in us-west-1a, which is a different region and a different AZ.
Or, to put it in the terms of reliability engineering: even though these are abstracted as independent systems, in reality there are common-mode failures that can cause outages to propagate.
So, if you span multiple availability zones, you are not spared from events that will impact all of them.
It's up to the _user_ of AWS to design around this level of reliability. This isn't any different than not using AWS. I can run my web business on the super cheap by running it out of my house. Of course, then my site's availability is based around the uptime of my residential internet connection, my residential power, my own ability to keep my server plugged into power, and general reliability of my server's components. I can try to make things more reliable by putting it into a DC, but if a backhoe takes out the fiber to that DC, then the DC will become unavailable.
It's up to the _user_ to architect their services to be reliable. AWS isn't magic reliability sauce you sprinkle on your web apps to make them stay up for longer. AWS clearly states in their SLA pages what their EC2 instance SLAs are in a given AZ; it's 99.5% availability for a given EC2 instance in a given region and AZ. This is roughly 1.82 days, or 43.8 hours, of downtime in a year. If you add a SPOF around a single EC2 instance in a given AZ then your system has, at best, a 99.5% availability SLA. Remember the cloud is all about leveraging large amounts of commodity hardware instead of leveraging large, high-reliability mainframe-style design. This isn't a secret. It's openly called out, like in Nishtala et al.'s "Scaling Memcache at Facebook" [1] from 2013!
The background of all of this is that it costs money, in terms of knowledgeable engineers (not like the kinds in this comment thread who are conflating availability zones and regions) who understand these issues. Most companies don't care; they're okay with being down for a couple days a year. But if you want to design high reliability architectures, there are plenty of senior engineers willing to help, _if_ you're willing to pay their salaries.
If you want to come up with a lower cognitive overhead cloud solution for high reliability services that's economical for companies, be my guest. I think we'd all welcome innovation in this space.
[1]: https://www.usenix.org/system/files/conference/nsdi13/nsdi13...
This is what kreeben was referring to - not some abstract misconception about the difference between AZs and Regions, but an actual real world incident in which a failure in one AZ had an impact in other Regions.
I think this breaks the site guidelines. Worse, I don't think the other people are wrong: being in a different region implies being in a different availability zone.
That is, I've read the comments to say "they're not only in different AZs, they're in different regions". You seem determined to pick a reading that lets you feel smugly superior about your level of knowledge, and then cast out digs at other people based on that presumed superiority.
You can't engineer around AWS AZ common-mode failures using AWS.
The moment you have failures that are common-mode rather than independent, you can't just multiply failure probabilities together to know your outage times.
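A toy Python model of why the multiplication breaks down. Everything here is an assumption for illustration: `p` is each zone's own independent failure probability, `c` is the probability of a shared event that takes both zones down at once, and the numbers are invented, not AWS figures.

```python
def both_down_independent(p: float) -> float:
    """Naive estimate: two zones fail independently, each with probability p."""
    return p * p


def both_down_with_common_mode(p: float, c: float) -> float:
    """Two zones that also share a common-mode failure of probability c.

    Either the shared event happens (both zones down), or it does not
    and both zones happen to fail on their own.
    """
    return c + (1 - c) * p * p


if __name__ == "__main__":
    p, c = 0.005, 0.001  # hypothetical values
    naive = both_down_independent(p)
    actual = both_down_with_common_mode(p, c)
    # Even a small shared failure mode swamps the p*p term.
    print(naive, actual, actual / naive)
```

With these made-up inputs the common-mode term dominates: the combined figure is about 0.001, roughly forty times the naive 0.000025.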
It's not. .1% of 365*24 = 87.6 hours of downtime - that's over 3 days of complete downtime every year!
For a more complete list of their SLAs for every service: https://aws.amazon.com/legal/service-level-agreements/?aws-s...
They only refund 100% when availability falls below 95%! Between 95% and 99% the refund is 30%. I believe the real target is above 99.9% though, as that results in 0 refund to the customer. What that means is, 3 days of downtime is acceptable!
Alternatively, you can return to your own datacenter and find out first hand that it's not as easy to deliver that as you may think. You too will have power outages, network provider disruptions, and the occasional "oh shit, did someone just kick that power cord out?" or complete disk array meltdowns.
Anywho, they have a lot more room in their published SLAs than you think.
Edit: as someone correctly pointed out, I made a typo in my math. It is only ~9 hours of allotted downtime. Keep in mind that this is per service though - meaning each service can have a different 9 hours of downtime before they need to pay out 10% of that one service. I still stand by my statement that their SLAs have a lot of wiggle room that people should take more seriously.
The uptime.is website is a handy resource for these calculations. For example, http://uptime.is/99.9 says
"SLA level of 99.9 % uptime/availability results in the following periods of allowed downtime/unavailability:
Daily: 1m 26s
Weekly: 10m 4s
Monthly: 43m 49s
Quarterly: 2h 11m 29s
Yearly: 8h 45m 56s"
0.001 (.1%) * 8760 (365d*24h) = 8.76h
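For anyone wondering why the quoted yearly figure (8h 45m 56s) is a few minutes more than 8.76h: the quoted numbers are consistent with a 365.2425-day average year and truncation to whole seconds. That is my inference from the figures, not something the site is quoted as saying here. A rough Python sketch under that assumption (names are mine):

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.2425  # assumption: average Gregorian year

PERIOD_DAYS = {
    "Daily": 1,
    "Weekly": 7,
    "Monthly": DAYS_PER_YEAR / 12,
    "Quarterly": DAYS_PER_YEAR / 4,
    "Yearly": DAYS_PER_YEAR,
}


def format_duration(seconds: float) -> str:
    """Render seconds as e.g. '2h 11m 29s', truncating fractional seconds."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours}h {minutes}m {secs}s"
    return f"{minutes}m {secs}s"


def allowed_downtime(availability_pct: float) -> dict:
    """Allowed downtime per period for a given availability percentage."""
    down_fraction = 1 - availability_pct / 100
    return {
        name: format_duration(down_fraction * days * SECONDS_PER_DAY)
        for name, days in PERIOD_DAYS.items()
    }


if __name__ == "__main__":
    for period, downtime in allowed_downtime(99.9).items():
        print(f"{period}: {downtime}")
```

Swapping `DAYS_PER_YEAR` for a flat 365 gives the 8.76h figure instead, so both numbers in this thread are right for their own definition of a year.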
Alternatively, the common industry standard in infrastructure (at the place I work, at least) is 4 nines, so 99.99% availability, which is around 52 mins a year or 4 mins a month iirc. There's not as much room as you'd think! :)
Maybe this is the problem. 99.9% isn't being used by AWS the way people use it in conversation; it has a definite meaning, and they'll refund you based on that definite meaning.
The only problem with that is, there are no independent availability zones.
What we do have, though, is an architecture where errors propagate cross-zone until they can't propagate any further, because services can't take any more requests, because they froze, because they weren't designed for a split brain scenario, and then, half the internet goes down.
There are - they can be as independent as you need them to be.
Errors won't necessarily propagate cross-zone. If they do, someone either screwed up, or they made a trade-off. Screwing up is easy, so you need to do chaos testing to make sure your system will survive as intended.