
0 points · Karrot_Kream · 4y ago

Availability Zones aren't the same thing as regions. AWS regions have multiple Availability Zones. Individual availability zones publish lower reliability SLAs, so you need to load balance across multiple independent availability zones in a region to reach higher reliability. Per-AZ SLAs are discussed in more detail here [1].

(N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)

[1]: https://aws.amazon.com/compute/sla/

12 comments · 3 top-level

mlyle · 4y ago · 4 in thread

> (N.B. I find HN commentary on AWS outages pretty depressing because it becomes pretty obvious that folks don't understand cloud networking concepts at all.)

What he said was perfectly cogent.

Outages in us-east-1 AZ us-east-1a have caused outages in us-west-1a, which is a different region and a different AZ.

Or, to put it in the terms of reliability engineering: even though these are abstracted as independent systems, in reality there are common-mode failures that can cause outages to propagate.

So, if you span multiple availability zones, you are not spared from events that will impact all of them.

Karrot_Kream (OP) · 4y ago

> Or, to put it in the terms of reliability engineering: even though these are abstracted as independent systems, in reality there are common-mode failures that can cause outages to propagate.

It's up to the _user_ of AWS to design around this level of reliability. This isn't any different than not using AWS. I can run my web business on the super cheap by running it out of my house. Of course, then my site's availability is based around the uptime of my residential internet connection, my residential power, my own ability to keep my server plugged into power, and general reliability of my server's components. I can try to make things more reliable by putting it into a DC, but if a backhoe takes out the fiber to that DC, then the DC will become unavailable.

It's up to the _user_ to architect their services to be reliable. AWS isn't magic reliability sauce you sprinkle on your web apps to make them stay up for longer. AWS clearly states in their SLA pages what their EC2 instance SLAs are in a given AZ; it's 99.5% availability for a given EC2 instance in a given region and AZ. That allows roughly 1.82 days, or about 43.8 hours, of downtime in a year. If you add a SPOF around a single EC2 instance in a given AZ then your system has at best a 99.5% availability SLA. Remember the cloud is all about leveraging large amounts of commodity hardware instead of large, high-reliability mainframe-style designs. This isn't a secret. It's openly called out, like in Nishtala et al.'s "Scaling Memcache at Facebook" [1] from 2013!

The background of all of this is that it costs money, in terms of knowledgeable engineers (not like the kinds in this comment thread who are conflating availability zones and regions) who understand these issues. Most companies don't care; they're okay with being down for a couple days a year. But if you want to design high reliability architectures, there are plenty of senior engineers willing to help, _if_ you're willing to pay their salaries.

If you want to come up with a lower cognitive overhead cloud solution for high reliability services that's economical for companies, be my guest. I think we'd all welcome innovation in this space.

[1]: https://www.usenix.org/system/files/conference/nsdi13/nsdi13...

roughly · 4y ago

During a recent AWS outage, the STS service running in us-east-1 was unavailable. Unfortunately, all of the other _regions_ (not AZs, but _regions_) rely on the STS service in us-east-1, which meant that customers who had built around Amazon’s published reliability model had services in every region impacted by an outage in one specific availability zone.

This is what kreeben was referring to - not some abstract misconception about the difference between AZs and Regions, but an actual real world incident in which a failure in one AZ had an impact in other Regions.

mlyle · 4y ago

> knowledgeable engineers (not like the kinds in this comment thread who are conflating availability zones and regions)

I think this breaks the site guidelines. Worse, I don't think the other people are wrong: being in a different region implies being in a different availability zone.

That is, I've read the comments to say "they're not only in different AZs, they're in different regions". You seem determined to pick a reading that lets you feel smugly superior about your level of knowledge, and then cast out digs at other people based on that presumed superiority.

mlyle · 4y ago

Yes, but the underlying point you're willfully missing is:

You can't engineer around AWS AZ common-mode failures using AWS.

The moment you have failures that are common-mode rather than independent, you can't just multiply together failure probabilities to know your outage times.
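
A rough sketch of why the multiplication breaks down, using a deliberately simple model: each AZ has some probability of a local failure, and a shared dependency has some probability of failing and taking both AZs down together. The model, the function names, and the 0.1% common-mode figure are all assumptions chosen for illustration, not measured AWS numbers.

```python
HOURS_PER_YEAR = 365 * 24


def both_down_if_independent(p_az: float) -> float:
    """Probability that two AZs are down at once if failures are independent."""
    return p_az * p_az


def both_down_with_common_mode(p_az_only: float, p_common: float) -> float:
    """Probability that two AZs are down at once when a shared dependency exists.

    p_common:  probability the shared dependency is down (takes out both AZs).
    p_az_only: probability of an AZ-local failure, independent between AZs.
    """
    return p_common + (1.0 - p_common) * p_az_only * p_az_only


p_az = 0.005      # 99.5% per AZ, as quoted upthread
p_common = 0.001  # hypothetical shared-dependency unavailability

naive = both_down_if_independent(p_az)
realistic = both_down_with_common_mode(p_az, p_common)

print(f"{naive * HOURS_PER_YEAR:.2f} hours/year")      # 0.22 hours/year
print(f"{realistic * HOURS_PER_YEAR:.2f} hours/year")  # 8.98 hours/year
```

In this toy model the common-mode term dominates: the combined outage time is set almost entirely by the shared dependency, no matter how many AZs you multiply together.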

johnmarcus · 4y ago · 3 in thread

Yup, so true. People think redundant == 100% uptime, or that when they advertise 99.9% uptime, it's the same thing as 100% minus a tiny bit for "glitches".

It's not. .1% of 365*24 = 87.6 hours of downtime - that's over 3 days of complete downtime every year!

For a more complete list of their SLA's for every service: https://aws.amazon.com/legal/service-level-agreements/?aws-s...

They only refund 100% when they fall below 95% availability! Between 95% and 99% the refund is 30%. I believe the real target is above 99.9% though, as that results in 0 refund to the customer. What that means is, 3 days of downtime is acceptable!

Alternatively, you can return to your own datacenter and find out firsthand that it's not as easy to deliver that as you may think. You too will have power outages, network provider disruptions, and the occasional "oh shit, did someone just kick that power cord out?" or complete disk array meltdown.

Anywho, they have a lot more room in their published SLA's than you think.

Edit: as someone correctly pointed out, I made a typo in my math. It is only ~9 hours of allotted downtime. Keep in mind that this is per service though, meaning each service can have a different 9 hours of downtime before they need to pay out 10% of that one service. I still stand by my statement that their SLAs have a lot of wiggle room that people should take more seriously.

sciurus · 4y ago

As someone else said, your math is off. Your point is still reasonable, though.

The uptime.is website is a handy resource for these calculations. For example, http://uptime.is/99.9 says

"SLA level of 99.9 % uptime/availability results in the following periods of allowed downtime/unavailability:

    Daily: 1m 26s
    Weekly: 10m 4s
    Monthly: 43m 49s
    Quarterly: 2h 11m 29s
    Yearly: 8h 45m 56s"

mqnfred · 4y ago

Your computation is incorrect: 3 days out of 365 is roughly 1% downtime, not 0.1%. I believe your error stems from treating .1% as 0.01. Indeed:

0.001 (.1%) * 8760 (365d*24h) = 8.76h

Alternatively, the common industry standard in infrastructure (at the place I work, at least) is 4 nines, so 99.99% availability, which is around 52 mins a year or 4 mins a month iirc. There's not as much room as you'd think! :)
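
For anyone who wants to check these figures without a website, a small sketch using only the standard library follows. The period table and the `allowed_downtime` name are illustrative choices; it assumes a 365-day year and an average month of 365/12 days, so results can differ by a few seconds from calculators that use a slightly longer average year.

```python
from datetime import timedelta

# Assumed period lengths: a 365-day year and an average month of 365/12 days.
PERIODS = {
    "daily": timedelta(days=1),
    "monthly": timedelta(days=365) / 12,
    "yearly": timedelta(days=365),
}


def allowed_downtime(availability_percent: float) -> dict:
    """Map each period name to the downtime allowed at the given availability."""
    fraction_down = 1.0 - availability_percent / 100.0
    return {name: period * fraction_down for name, period in PERIODS.items()}


three_nines = allowed_downtime(99.9)
four_nines = allowed_downtime(99.99)

print(round(three_nines["yearly"].total_seconds() / 3600, 2))  # 8.76 (hours)
print(round(four_nines["yearly"].total_seconds() / 60, 2))     # 52.56 (minutes)
print(round(four_nines["monthly"].total_seconds() / 60, 2))    # 4.38 (minutes)
```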

Karrot_Kream (OP) · 4y ago

> Yup, so true. People think redundant == 100% uptime, or that when they advertise 99.9% uptime, it's the same thing as 100% minus a tiny bit for "glitches".

Maybe this is the problem. 99.9% isn't being used by AWS the way people use it in conversation; it has a definite meaning, and they'll refund you based on that definite meaning.

kreeben · 4y ago · 2 in thread

>> you need to load balance across multiple independent availability zones

The only problem with that is, there are no independent availability zones.

What we do have, though, is an architecture where errors propagate cross-zone until they can't propagate any further, because services can't take any more requests, because they froze, because they weren't designed for a split brain scenario, and then, half the internet goes down.

outworlder · 4y ago

> The only problem with that is, there are no independent availability zones.

There are - they can be as independent as you need them to be.

Errors won't necessarily propagate cross-zone. If they do, someone either screwed up, or they made a trade-off. Screwing up is easy, so you need to do chaos testing to make sure your system will survive as intended.

kreeben · 4y ago

I'm not talking about my global app. I'm talking about the system I deploy to, the actual plumbing, and how a huge turd in a western toilet causes east's sewerage system to over-flow.
