Anyone else experiencing a problem?
Their uptime is much higher on average than that of any IT team I've ever been part of.
Most of the time, people here might see good portions of their infrastructure go away, but the impact isn't statistically significant to overall region health, so they don't post an outage.
Don't ask me what those thresholds are, but that's how it's determined.
You might have an incident affecting just 2% of API calls and less than 2% of the user base (even that would be unusually large, and a source of big drama internally). The service could be super stable and extremely reliable, but the unaffected 98% could get completely the wrong idea if they saw an incident on the status page (and from a PR perspective, the same goes for anyone evaluating the platform).
A service dashboard is an extremely blunt tool for communicating service status. It renders an extremely nuanced situation down to "All good, maybe, no, DEAD".
To give a rough example, one service I was familiar with had a "page everyone on the team" level of incident. API availability tanked, badly. It looked atrocious; it seemed like hardly any requests were getting through successfully. You'd have every expectation that they should at least post a yellow alert, if not approach red. It turned out that a single customer's requests were failing (I forget why), but due to a bug in the customer's software consuming the API, every time it got a 500 response it would immediately resend the request, every single time, with no timeout or retry limit. It reached such a terrific pace that this one customer made up the vast majority of all requests hitting the endpoint. Every other customer using the service was completely fine. If you'd looked at the API graphs you'd think "POST YELLOW, POST YELLOW, NOW NOW NOW!", but because they took the time to figure out the actual impact, they found that posting would have been totally the wrong thing to do.
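The failure mode above can be sketched in a few lines. This is a hypothetical simulation, not the actual customer's code; the client counts and retry rate are made up for illustration, and the backoff helper shows the usual fix (capped exponential backoff with jitter):

```python
import random

HEALTHY_CLIENTS = 999          # each sends ~1 request/second
BROKEN_RETRY_RATE = 10_000     # tight retry loop: ~10k resends/second

def traffic_share(broken_rate):
    """Fraction of all endpoint traffic coming from the one broken client."""
    total = HEALTHY_CLIENTS + broken_rate
    return broken_rate / total

# Unbounded immediate retries: the single failing customer dominates
# the graphs even though every other customer is fine.
print(f"broken client's share of traffic: {traffic_share(BROKEN_RETRY_RATE):.1%}")  # ~90.9%

def backoff_schedule(max_retries=5, base=0.5, cap=30.0):
    """What a well-behaved client does instead: a bounded number of
    retries, each delayed by capped exponential backoff with full jitter."""
    return [random.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]
```

The point of the jitter is that thousands of failing clients don't all retry at the same instant; the point of the cap and retry limit is that one bug can't turn into a self-inflicted denial of service.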
Service health dashboards are a neat idea, but one in desperate need of a rethink and overhaul. They have some value when you're a smaller service, but they just don't scale accurately with the platform.
I'm not sure what the real solution is. They've somehow got to pull together terabytes of logs and/or metrics to make an accurate assessment of the situation, and do it in a matter of minutes, so as to provide accurate updates without needlessly panicking customers.
Red's for heat death of the universe.
Another thing to look into is EC2 Auto Recovery [1]. I don't know if this would've kicked in with today's event, but it's worth setting up as an extra safety net.
[1] https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazo...
edit: I'm basing this off the status page which indicated that only one AZ was impacted.
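For anyone setting this up: Auto Recovery is driven by a CloudWatch alarm on the system status check with the `ec2:recover` action. A minimal sketch of the alarm parameters, with the instance ID and region as placeholders (the period and evaluation values follow AWS's examples, but verify against the current docs; the boto3 call is shown but not executed here):

```python
# CloudWatch alarm that triggers EC2 auto recovery when the system
# (host-level) status check fails. Instance ID and region are
# placeholders -- substitute your own.
REGION = "ap-southeast-2"
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance

alarm_params = {
    "AlarmName": f"auto-recover-{INSTANCE_ID}",
    "Namespace": "AWS/EC2",
    "MetricName": "StatusCheckFailed_System",  # host/hardware failures only
    "Dimensions": [{"Name": "InstanceId", "Value": INSTANCE_ID}],
    "Statistic": "Minimum",
    "Period": 60,                 # one-minute datapoints
    "EvaluationPeriods": 2,       # two consecutive failed checks
    "Threshold": 0,
    "ComparisonOperator": "GreaterThanThreshold",
    # The recover action restarts the instance on fresh hardware,
    # preserving instance ID, private IPs, and EBS volumes.
    "AlarmActions": [f"arn:aws:automate:{REGION}:ec2:recover"],
}

# With boto3 this would be applied as:
#   import boto3
#   boto3.client("cloudwatch", region_name=REGION).put_metric_alarm(**alarm_params)
```

Worth noting the limitation the parent alludes to: recovery relaunches the instance within the same AZ, so it helps with individual host failures but likely not with an AZ-wide power event like this one.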
Both AZs are directly under the deluge, and I don't for a second believe only one AZ is affected.
The size of the storm can be seen here http://www.bom.gov.au/products/IDR713.loop.shtml#skip
10:47 PM PDT We are investigating increased connectivity issues for EC2 instances in the AP-SOUTHEAST-2 Region.
11:08 PM PDT We continue to investigate connectivity issues for some instances in a single Availability Zone and increased API error rates for the EC2 APIs in the AP-SOUTHEAST-2 Region.
11:49 PM PDT We can confirm that instances have experienced a power event within a single Availability Zone in the AP-SOUTHEAST-2 Region. Error rates for the EC2 APIs have improved and launches of new EC2 instances are succeeding within the other Availability Zones in the Region.
Jun 5, 12:31 AM PDT We have restored power to the affected Availability Zone and are working to restore connectivity to the affected instances.
I'm joking of course, but that's what ran through my mind while reading that timeline.
What a mess.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-reg...