Amazon EC2 and RDS in US-EAST zone down

(status.aws.amazon.com)

90 points · akhkharu · 14y ago · 94 comments

Multi-AZ deployments are affected too.

mikebo14y ago· 11 in thread

Worst part of this outage: paying for a multi-az RDS instance and having failover totally, completely, fail.

keithnoizu14y ago

I'm paying about $2,300 a month and even something as basic as failover isn't working. I'm not happy.

shiftpgdn14y ago

At $2300/month you could redundantly colo or lease VERY powerful servers in 3-4 data centers around the country.

nadahalli14y ago

I feel for you :-(

Amazon is not cheap, and they have failed way too many times in recent memory.

But the api, oh the api - it's crack, and I can't live without it.

werkshy14y ago

Luckily my RDS wasn't affected, but ELB merrily sent traffic to the affected zone for 30 minutes. (Either that or part of the ELB was in the affected zone and was not removed from rotation.)

We pay a lot to stay multi-AZ and it seems Amazon keeps finding ways to show us their single points of failure.

gouranga14y ago

That sucks badly.

Similar thing happened to me a while ago with a vendor. When your management team summons you to ask why the hell their site is down, you can't point fingers at the vendor if their marketing literature says it doesn't go down.

Sticky situation.

TazeTSchnitzel14y ago

Can't you tell management that it isn't as reliable as they claim?

its_so_on14y ago

If you don't host your data in several alternative dimensions so that the same events wouldn't transpire in all of them - why not assume you'll encounter the occasional outage?

malachismith14y ago

Do we all agree that we are completely over AWS-EAST now? It's NOT worth the cost savings.

rabbitfang14y ago

The Oregon (us-west-2) region is the same price as the Virginia (us-east-1) region.

res0nat0r14y ago

Did/does your standby replica in another AZ have any instance notifications stating there is a failure? The outage report claims there were just EBS problems in only one AZ.

mikebo14y ago

No, nothing unusual with our standby replica. It's not even clear if it was the standby or our primary that was in the affected AZ.

Multi-AZ RDS does synchronous replication to the standby instance -- I'm guessing something broke in there. Hopefully AWS will follow up with a post-mortem as they usually do. There are lots of frustrated Multi-AZ RDS customers on their forums.

stevefink14y ago· 6 in thread

Cpu0 : 0.3%us, 0.0%sy, 0.0%ni, 0.0%id, 99.7%wa, 0.0%hi, 0.0%si, 0.0%st <-- EBS subsystem is completely unreachable. I/O wait times are tanked across the board for me (I'm in US-EAST-1).
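
For anyone who wants to catch this pattern automatically, here is a minimal sketch that parses a top(1) CPU line like the one above and flags when nearly all time is spent in I/O wait. The function names and the 90% cutoff are illustrative assumptions, not anything AWS or top provides.

```python
import re

def parse_cpu_line(line):
    """Parse a top(1) CPU summary line into a {field: percent} dict."""
    # Matches value/name pairs such as "99.7%wa" or "0.3%us".
    pairs = re.findall(r"(\d+(?:\.\d+)?)%(\w+)", line)
    return {name: float(value) for value, name in pairs}

def is_io_starved(line, wa_threshold=90.0):
    """Return True when iowait ("wa") is at or above the threshold.

    The default threshold is an arbitrary cutoff chosen for illustration.
    """
    return parse_cpu_line(line).get("wa", 0.0) >= wa_threshold

sample = "Cpu0 : 0.3%us, 0.0%sy, 0.0%ni, 0.0%id, 99.7%wa, 0.0%hi, 0.0%si, 0.0%st"
print(parse_cpu_line(sample)["wa"], is_io_starved(sample))
```

A sustained high "wa" value with near-zero "us" and "sy" usually means the CPU is idle and waiting on storage, which is consistent with an unreachable EBS volume.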

nirvdrum14y ago

What zone? I really wish Amazon would provide that info, instead of saying that it only affects one zone.

gabrtv14y ago

AFAIK zones are randomized. 1a for me is 1d for you.

aaronharder14y ago

Unfortunately, there is no meaningful way for them to say which zone because zone labels are different for each user.

iharris14y ago

I have instances in two different zones - both are down, although I don't know if AWS's randomization means that my 1a and 1d are actually located in the same logical zone.

stevefink14y ago

Both my MySQL master (I'm not using RDS) and Redis Master servers are affected and are located in zone us-east-1a.

zedwill14y ago

It is affecting me on US-EAST-1B

bad_user14y ago· 4 in thread

I got notified by Pingdom that my domain was down before AWS had any info on that status page of theirs. IMHO, they should improve on the latency of their alerts.

NathanKP14y ago

Same here. In fact, the AWS dashboard was still showing 2/2 checks passed for some 20 minutes after Pingdom told me my site was down.

Then the AWS dashboard finally updated and told me that 3 minutes ago my instances became unreachable. That is pretty poor. AWS should be able to know right away and email me themselves.

RegEx14y ago

I've learned to ignore the checks passed for quite a while, especially for servers on load balance.

iharris14y ago

SNS sent me an e-mail of my instance alarms pretty quickly.

EDIT: My status checks were slow to update like the sibling comment stated, although the alarms that measure system resources triggered almost immediately when everything blew up. I think the status checks refresh at a certain interval, but those aren't really meant for real-time monitoring AFAIK.
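
The point running through this subthread is that an external probe noticed the outage before the provider's own status checks did. A rough sketch of that idea follows: poll a public URL from outside AWS and raise a flag after a few consecutive failures. The class name, the threshold of three, and the probe helper are assumptions made for illustration; this is not how Pingdom or the AWS status checks actually work.

```python
import urllib.error
import urllib.request

def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, DNS failures, refused connections and timeouts.
        return False

class DownDetector:
    """Flags an outage after `threshold` consecutive failed probes."""

    def __init__(self, probe, threshold=3):
        self.probe = probe          # zero-argument callable returning True/False
        self.threshold = threshold  # consecutive failures needed to report "down"
        self.failures = 0

    def check(self):
        """Run one probe; return True once the failure streak reaches the threshold."""
        if self.probe():
            self.failures = 0
        else:
            self.failures += 1
        return self.failures >= self.threshold
```

To use it, build a detector with something like `DownDetector(lambda: http_probe("https://example.com/"))` (a placeholder URL) and call `check()` on a timer. Requiring a streak of failures trades a little detection latency for fewer false alarms from a single dropped request.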

keithnoizu14y ago

By over fifteen minutes in my case. Possibly thirty. WTH.

pearle14y ago· 4 in thread

Anyone have any details on why us-east-1 seems to be less reliable than the other regions? Is it the oldest?

jaylevitt14y ago

According to this calculation (which attempted to probe all the racks in EC2), over 70% of EC2 lives in us-east.

http://huanliu.wordpress.com/2012/03/13/amazon-data-center-s...

sausagefeet14y ago

I'm under the impression it's the most used.

NoPiece14y ago

It probably is the most used, being a cheaper alternative to us-west, but are you suggesting it fails more because it is used more? It does seem that the big AWS outages (in the US) have been concentrated in us-east. I have wondered if it is just because us-east is newer so they haven't had as much time to work things out, or that the us-west team is a little better?

edit: btw, I am not dismissing "used more" as a valid theory. More use = more hardware = more complexity which could lead to more failures.

malachismith14y ago

It's the oldest, yes.

rdl14y ago· 3 in thread

I'm curious why no public PaaS spans multiple AWS regions.

malachismith14y ago

1) because AWS East is so much cheaper (and none of us like spending money) 2) AppFog actually is multi region (and multi IaaS as well)

kanwisher14y ago

Oregon is the same price as AWS East, but it seems to have a smaller set of boxes. I have gotten errors in the past about not having any more servers to allocate.

rdl14y ago

I'd tolerate multi-AZ as a baseline.

Thanks for AppFog -- I hadn't heard of them, but will check them out.

rabble14y ago· 3 in thread

Good time to consider Google's Compute Engine as an alternative? What will we call it, GCE?

jfoutz14y ago

Currently, it is a limited beta. Also, it looks to be more expensive.

malachismith14y ago

Actually, if you do the normalization to make it apples to apples (and adjust for the difference in RAM) it looks price competitive. My numbers make it look slightly more expensive than AWS EAST (teh suck) and slightly less expensive than AWS WEST.

vachi14y ago

ahh Acronyms

tolos14y ago· 2 in thread

Every time (two out of two), by the time I click on an "X is down" link, the service/website is working again. Surely there is a better platform for alerting about outages than ycombinator?

pjscott14y ago

Pingdom does a good job of it, if you point it at a public-facing web site you particularly care about. I'm not affiliated with them; I've just been woken up by them.

bmelton14y ago

I was down for approximately three hours this morning. I don't know when this submission was posted, but I made one shortly after discovering the outage myself.

Either way, if you're using RDS, even if this didn't affect you, it's discussion-worthy. I was affected, and we're building a not-yet-launched product that allows us the time to consider "Is Amazon really where we want to be?". The more failure I'm aware of, the more informed that decision is.

KenCochrane14y ago· 2 in thread

9:32 AM PDT Connectivity has been restored to the affected subset of EC2 instances and EBS volumes in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. Some of the affected EBS volumes are still re-mirroring causing increased IO latency for those volumes.

KenCochrane14y ago

I'm still seeing issues, some instances that aren't starting, and others I'm still not able to connect to. So I'm not sure what they are talking about.

bad_user14y ago

For what it's worth, my small website is online again.

keithnoizu14y ago· 1 in thread

I feel like you can't really say you're in the green when you still have customers unable to use your service. My instance is still stuck in failover.

"9:39 AM PDT Networking connectivity has been restored to most of the affected RDS Database Instances in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. We are continuing to work on restoring connectivity to the remaining affected RDS Database Instances."

gooeyblob14y ago

Absolutely agree - that's just silly. Their status page is close to useless.

mattwdelong14y ago· 1 in thread

It's not entirely down as I can still access my instances. I'm in us-east-1b.

grourk14y ago

Your us-east-1b might be my us-east-1a.

zedwill14y ago· 1 in thread

Interestingly enough, not only is EBS down, but ELB cannot register instances even if they are not EBS-based and are completely operational.

I have some live instances running without EBS disks that I can not place behind the ELB as it is not working.

oasisbob14y ago

> I have some live instances running without EBS disks that I can not place behind the ELB as it is not working.

ELBs are sometimes EBS backed.

gregholmberg14y ago

Individual availability zones can be identified using the API.

   ec2-describe-reserved-instances-offerings --region

will tell you what the zone's identifier is.

After you list the permanent identifiers, you can match them up to find out if your us-east-1a matches my -1d.

This Alestic article [0] shows how to label them all.

[0] "Matching EC2 Availability Zones Across AWS Accounts" http://alestic.com/2009/07/ec2-availability-zones
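
The matching step described above can be sketched as a simple join on the shared identifiers. This assumes, as the comment says, that the offering identifiers are permanent and the same for every account. Each account's mapping of identifier to zone label would have to be built from that account's own command output; the IDs and labels below are made up purely for illustration.

```python
def match_zones(mine, theirs):
    """Map my zone labels to another account's labels via shared offering IDs.

    Both arguments are dicts of {offering_id: zone_label}, one per account.
    """
    mapping = {}
    for offering_id, my_zone in mine.items():
        their_zone = theirs.get(offering_id)
        if their_zone is not None:
            # Collect into a set so inconsistent data shows up as multiple labels.
            mapping.setdefault(my_zone, set()).add(their_zone)
    return mapping

# Hypothetical offering IDs and zone labels, for illustration only.
account_a = {"offering-111": "us-east-1a", "offering-222": "us-east-1b"}
account_b = {"offering-111": "us-east-1d", "offering-222": "us-east-1a"}
print(match_zones(account_a, account_b))
```

If any of my labels maps to more than one label in the other account, the inputs are inconsistent and the assumption about permanent identifiers should be rechecked.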

dwhsix14y ago

Keep in mind AZs are different per account. My us-east-1b is not necessarily your us-east-1b (as someone reminded me on Twitter just now).

pjscott14y ago

EC2 comes with a free Chaos Monkey service. It's called EC2.

I know, they're trying to make it reliable and they've got a bunch of very hard problems to solve. That doesn't change the fact that sometimes some of my servers just permanently stop responding to pings until you stop-start them, or get crazy-slow I/O, or get hit by these once-in-a-while-and-always-at-night outages.

It's great when you suddenly need a hundred more servers, though.

pearle14y ago

I'm running in us-east-1 and my EC2 instances and EBS volumes are still responding ok for the moment...

Fingers crossed (just deployed to AWS less than 2 weeks ago).

DigitalSea14y ago

Issue #3298392 for EC2 this month. This is ridiculous, so many websites rely on EC2 and it's proving to be extremely unreliable. Cloud computing is definitely not the answer to everything it would seem.

mattbillenstein14y ago

I suggest that until Amazon uses RDS for their own database, you don't either...

anuraj14y ago

Mine is okay

ahmedaly14y ago

dotcloud was down also but it's now up. (they rely on ec2)

ahmedaly14y ago

My instances are not down either. I will back them up now in case things go bad.

NathanKP14y ago

I am experiencing two out of four instances in us-east-1e unreachable.

misiti378014y ago

my instances in us-east-1c are fine

malachismith14y ago

Goat rodeo.

cupcake_death14y ago

Yep - Forums are exploding
