Applying 5 Whys To Amazon EC2 Outage

(somic.org)

52 points · somic · 13y ago · 14 comments

ibejoeb · 13y ago

> this outage is the most worrisome of all AWS service disruptions that I know about

I don't think anyone was especially happy with it. I think AWS, as an entity, is probably just as unhappy as its customers.

I was happy with their response, and I was happy with it during the outage last year, too. They're adapting, and I believe they're constantly getting better. It's still a pretty new thing, this utility computing service. You can't reasonably expect them to expect the unexpected. I'm quite sure that even if they didn't apply the "5 whys," or make them publicly available, they are doing something to address the control plane issue.

I'm confident that the service will improve. Some things just need to be battle hardened.

nowarninglabel · 13y ago

The problem I have with the whole thing is that if you look at their status page, they still don't report that they had any outage. The worst you will ever see is a yellow triangle with a message of "connectivity issues". Amazon is pathologically obsessed with denying that any outages occur, which is understandable given their business model, but since they do actually have outages, it makes them look scummy.

ibejoeb · 13y ago

I agree. I think there's a policy problem. If a problem is isolated to a single AZ, you should see a hazard triangle. If the region is affected, it needs to be classified as a service disruption.

The whole problem with the AZ thing is that they're geographically congruent. Major weather events are pretty likely to mess you up. Remember the disk latency spikes from that little earthquake?

It costs more, but that they operate properties all over the world, and that they're operable under a common paradigm (for most services), is the truly compelling feature of AWS. I keep my major operation on the east and fail over to the west if there's a significant event. It's a little more labor intensive, but it works.

matt2000 · 13y ago

Is it true that they're getting better? As I understood it, the major outage last year was a cascading EBS failure caused by one AZ, which this was too. I personally do not have the feeling that EBS behavior is completely understood in different failure scenarios.

match · 13y ago

Interesting question; let's try to compare the two events. The 2011 event involved roughly 13% of EBS volumes in the affected zone, including multi-AZ control plane impact; the 2012 event involved 7% of EC2 instances in the affected zone. These two events are different, since one was a power event and the other a network event, but let's see how they compare in number of impacted volumes. It's not exactly clear how to compare these numbers, but if you assume nearly all EC2 instances (7% were affected; were any EBS servers affected? possibly the same %, or none, or somewhere in between) have at least the boot volume and possibly more attached, then maybe that's roughly 7-10+% of the volumes in the affected zone. Assuming a respectable growth rate (http://huanliu.wordpress.com/2012/03/13/amazon-data-center-s...), these events may have been around the same size (I'd be curious to hear other arguments for/against this guess).

If you compare the recovery time (ballpark, feel free to break down the timelines in your copious amounts of free time):

2011:

  12:47AM, Apr 21 - Event started, API impaired across all availability zones 

  12:00PM, Apr 21 - API recovered in non-affected zones
                        "Customers also experienced elevated error rates until Noon 
                         PDT on April 21st when attempting to launch new EBS-backed 
                         EC2 instances in Availability Zones other than the affected 
                         zone."

  12:30PM, Apr 22 - Nearly all volumes in affected zone restored
                        "all but about 2.2% of the volumes in the affected
                         Availability Zone were restored by 12:30PM PDT on 
                         April 22nd"

  6:15PM, Apr 23 - API restored for affected zone
                        "At 6:15 PM PDT on April 23rd, API access to EBS resources 
                         was restored in the affected Availability Zone."

2012:

  20:04, July 2 - Some number of racks lose power due to drained UPSs

  21:10, July 2 - API restored
                        "8:04pm PDT to 9:10pm PDT, customers were not able to launch
                         new EC2 instances, create EBS volumes, or attach volumes in
                         any Availability Zone in the US-East-1 Region. At 9:10pm PDT,
                         control plane functionality was restored for the Region."

  02:45, July 3 - Vast majority of volumes restored to customers
                        "By 2:45am PDT, 90% of outstanding volumes had been turned
                         over to customers."

http://aws.amazon.com/message/65648/ http://aws.amazon.com/message/67457/
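
For anyone who wants elapsed times rather than wall-clock times, here is a rough sketch that subtracts each event's start from the milestones listed above. The timestamps and dates are copied from this comment as written (all PDT); the names `EVENTS`, `format_elapsed`, and `elapsed_table` are made up for illustration and are not from either AWS summary.

```python
from datetime import datetime

# Timestamps copied from the comment's timeline (PDT, dates as written there).
# Each event has a start time and a list of (milestone label, timestamp).
EVENTS = {
    "2011": {
        "start": datetime(2011, 4, 21, 0, 47),
        "milestones": [
            ("API recovered in non-affected zones", datetime(2011, 4, 21, 12, 0)),
            ("Nearly all volumes restored", datetime(2011, 4, 22, 12, 30)),
            ("API restored for affected zone", datetime(2011, 4, 23, 18, 15)),
        ],
    },
    "2012": {
        "start": datetime(2012, 7, 2, 20, 4),
        "milestones": [
            ("API restored", datetime(2012, 7, 2, 21, 10)),
            ("Vast majority of volumes restored", datetime(2012, 7, 3, 2, 45)),
        ],
    },
}


def format_elapsed(delta):
    """Render a timedelta as hours and minutes, e.g. '1h06m'."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h{minutes:02d}m"


def elapsed_table(events):
    """Map each event to a list of (milestone label, time since event start)."""
    return {
        name: [
            (label, format_elapsed(when - event["start"]))
            for label, when in event["milestones"]
        ]
        for name, event in events.items()
    }


if __name__ == "__main__":
    for name, rows in elapsed_table(EVENTS).items():
        print(name)
        for label, elapsed in rows:
            print(f"  {elapsed:>7}  {label}")
```

The milestones are not strictly like-for-like: the 2011 volume milestone is "all but about 2.2%" restored, while the 2012 one is 90% of outstanding volumes, so the resulting durations are still ballpark figures.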

Yes I'm painting with broad strokes here, and feel free to argue the details (we always do). But I do think this at least shows some improvement to answer the previous poster's question.

[edits to try to fix the formatting, fixed mis-paste]

burke · 13y ago

"To me, this outage is the most worrisome of all AWS service disruptions that I know about. In a nutshell: AWS effectively lost its control plane for entire region as a result of a failure within a single AZ. This was not supposed to be possible."

I find it funny how we have this assumption that if we don't architect across multiple AZs or regions we shouldn't be surprised when our service goes down because of an AWS failure, but that if we do, we're "pretty safe" -- and then Amazon itself experiences failure spanning AZs from a single-AZ failure.

Dylan16807 · 13y ago

Yes, the outages that happen don't worry me much; things will happen. But they have inter-AZ issues in the management system disturbingly often.

Zombieball · 13y ago

Correct me if I am wrong, but I believe if your application was designed to operate across multiple 'regions' your app would have indeed been safe from this failure.

mokeefe · 13y ago

Sure, but then your app needs to communicate across the Internet if you share data across regions. This can be expensive and/or slow and/or unreliable. http://aws.amazon.com/ec2/faqs/#how_will_I_be_charged_for_da...

crazygringo · 13y ago

The first why is actually the what, and the last why is unanswered, so there are only 3 whys... kind of disappointing based on the title :(

pbreit · 13y ago

There was 1 what and 5 whys. The last why being unanswered is the whole point of the post.

catshirt · 13y ago

"From 8:04pm PDT to 9:10pm PDT, customers were not able to launch new EC2 instances, create EBS volumes, or attach volumes in any Availability Zone in the US-East-1 Region."

"'The control planes for EC2 and EBS were significantly impacted by the power failure' in a single AZ."

Neither of these things is a reason for the disruption; they are side effects of it. Not much "why" happening in the article altogether.
