Anyone else experiencing a problem?
Their uptime is much higher on average than that of any IT team I've ever been part of.
Most of the time, people here might see good portions of their infrastructure go away, but the impact isn't statistically significant to overall region health, so they don't post an outage.
Don't ask me what those thresholds are, but that's how it's determined.
You might have an incident affecting just 2% of API calls and less than 2% of the user base (even that would be unusually large, and a source of big drama internally). The service could be super stable and extremely reliable, but the unaffected 98% could get completely the wrong idea if they saw an incident on the status page (and from a PR perspective, the same goes for anyone evaluating the platform).
A service dashboard is an extremely blunt tool for communicating service status. It renders an extremely nuanced situation down to "All good, maybe, no, DEAD".
To give a rough example, one service I was familiar with had a "page everyone on the team" level of incident. API availability tanked, badly. It looked atrocious; it seemed like hardly any requests were getting through successfully. You'd have every expectation that they should at least post a yellow alert, if not approach red. It turned out that a single customer's requests were failing (I forget why), but due to a bug in the customer's software consuming the API, every time it got a 500 response it would immediately resend the request, every single time, with no timeout or retry limit. It reached such a terrific pace that this one customer made up the vast majority of all requests hitting the endpoint. Every other customer using the service was completely fine. If you'd looked at the API graphs you'd think "POST YELLOW, POST YELLOW, NOW NOW NOW!", but because they took the time to figure out the actual impact, they found that posting would have been totally the wrong thing to do.
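The failure mode above can be sketched in a few lines. This is a hypothetical simulation, not the actual customer's code; the client counts and retry rate are made up for illustration, and the backoff helper shows the usual fix (capped exponential backoff with jitter):

```python
import random

HEALTHY_CLIENTS = 999          # each sends ~1 request/second
BROKEN_RETRY_RATE = 10_000     # tight retry loop: ~10k resends/second

def traffic_share(broken_rate):
    """Fraction of all endpoint traffic coming from the one broken client."""
    total = HEALTHY_CLIENTS + broken_rate
    return broken_rate / total

# Unbounded immediate retries: the single failing customer dominates
# the graphs even though every other customer is fine.
print(f"broken client's share of traffic: {traffic_share(BROKEN_RETRY_RATE):.1%}")  # ~90.9%

def backoff_schedule(max_retries=5, base=0.5, cap=30.0):
    """What a well-behaved client does instead: a bounded number of
    retries, each delayed by capped exponential backoff with full jitter."""
    return [random.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]
```

The point of the jitter is that thousands of failing clients don't all retry at the same instant; the point of the cap and retry limit is that one bug can't turn into a self-inflicted denial of service.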
Service health dashboards are a neat idea, but one in desperate need of a rethink and overhaul. They have some value when you're a smaller service, but they just don't scale accurately with the platform.
I'm not sure what the real solution is. They've somehow got to pull together terabytes of logs and/or metrics to make an accurate assessment of the situation, and do it in a matter of minutes, so as to provide accurate updates without needlessly panicking customers.
Red's for heat death of the universe.
Another thing to look into is EC2 Auto Recovery [1]. I don't know if this would've kicked in with today's event, but it's worth setting up as an extra safety net.
[1] https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazo...
edit: I'm basing this off the status page which indicated that only one AZ was impacted.
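For anyone setting this up: Auto Recovery is driven by a CloudWatch alarm on the system status check with the `ec2:recover` action. A minimal sketch of the alarm parameters, with the instance ID and region as placeholders (the period and evaluation values follow AWS's examples, but verify against the current docs; the boto3 call is shown but not executed here):

```python
# CloudWatch alarm that triggers EC2 auto recovery when the system
# (host-level) status check fails. Instance ID and region are
# placeholders -- substitute your own.
REGION = "ap-southeast-2"
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance

alarm_params = {
    "AlarmName": f"auto-recover-{INSTANCE_ID}",
    "Namespace": "AWS/EC2",
    "MetricName": "StatusCheckFailed_System",  # host/hardware failures only
    "Dimensions": [{"Name": "InstanceId", "Value": INSTANCE_ID}],
    "Statistic": "Minimum",
    "Period": 60,                 # one-minute datapoints
    "EvaluationPeriods": 2,       # two consecutive failed checks
    "Threshold": 0,
    "ComparisonOperator": "GreaterThanThreshold",
    # The recover action restarts the instance on fresh hardware,
    # preserving instance ID, private IPs, and EBS volumes.
    "AlarmActions": [f"arn:aws:automate:{REGION}:ec2:recover"],
}

# With boto3 this would be applied as:
#   import boto3
#   boto3.client("cloudwatch", region_name=REGION).put_metric_alarm(**alarm_params)
```

Worth noting the limitation the parent alludes to: recovery relaunches the instance within the same AZ, so it helps with individual host failures but likely not with an AZ-wide power event like this one.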
Both AZs are directly under the deluge, and I don't for a second believe only one AZ is affected.
The size of the storm can be seen here http://www.bom.gov.au/products/IDR713.loop.shtml#skip
10:47 PM PDT We are investigating increased connectivity issues for EC2 instances in the AP-SOUTHEAST-2 Region.
11:08 PM PDT We continue to investigate connectivity issues for some instances in a single Availability Zone and increased API error rates for the EC2 APIs in the AP-SOUTHEAST-2 Region.
11:49 PM PDT We can confirm that instances have experienced a power event within a single Availability Zone in the AP-SOUTHEAST-2 Region. Error rates for the EC2 APIs have improved and launches of new EC2 instances are succeeding within the other Availability Zones in the Region.
Jun 5, 12:31 AM PDT We have restored power to the affected Availability Zone and are working to restore connectivity to the affected instances.
I'm joking of course, but that's what ran through my mind while reading that timeline.
What a mess.
See http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-reg...