Amazon is not cheap, and they have failed way too many times in recent memory.
But the API, oh the API - it's crack, and I can't live without it.
We pay a lot to stay multi-AZ and it seems Amazon keeps finding ways to show us their single points of failure.
Similar thing happened to me a while ago with a vendor. When your management team summons you to ask why the hell their site is down, you can't point fingers at the vendor if their marketing literature says it doesn't go down.
Sticky situation.
Multi-AZ RDS does synchronous replication to the standby instance -- I'm guessing something broke in there. Hopefully AWS will follow up with a post-mortem as they usually do. Lots of frustrated Multi-AZ RDS customers on their forums.
Then the AWS dashboard finally updated and told me that 3 minutes ago my instances became unreachable. That is pretty poor. AWS should be able to know right away and email me themselves.
EDIT: My status checks were slow to update like the sibling comment stated, although the alarms that measure system resources triggered almost immediately when everything blew up. I think the status checks refresh at a certain interval, but those aren't really meant for real-time monitoring AFAIK.
http://huanliu.wordpress.com/2012/03/13/amazon-data-center-s...
edit: btw, I am not dismissing "used more" as a valid theory. More use = more hardware = more complexity which could lead to more failures.
Either way, if you're using RDS, even if this didn't affect you, it's discussion-worthy. I was affected, and we're building a not-yet-launched product, which gives us the time to consider "Is Amazon really where we want to be?". The more failures I'm aware of, the more informed that decision is.
"9:39 AM PDT Networking connectivity has been restored to most of the affected RDS Database Instances in the single Availability Zone in the US-EAST-1 region. New instance launches are completing normally. We are continuing to work on restoring connectivity to the remaining affected RDS Database Instances."
I have some live instances running without EBS disks that I cannot place behind the ELB, because the ELB is not working.
ELBs are sometimes EBS backed.
ec2-describe-reserved-instances-offerings --region
will tell you what the zone's identifier is. After you list the permanent identifiers, you can match them up to find out if your us-east-1a matches my -1d.
This Alestic article shows how to label them all.
[0] "Matching EC2 Availability Zones Across AWS Accounts" http://alestic.com/2009/07/ec2-availability-zones
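For what it's worth, the matching step itself could look something like the sketch below. It assumes the offering IDs printed by that command are identical across accounts for the same physical zone, which is the premise of the article as I understand it. The function name match_zones and the sample IDs are made up for illustration; real input would be the (offering ID, zone name) pairs from each account's own listing.

```python
# Hypothetical helper: match availability-zone names between two AWS accounts.
# Assumption: each account has already listed its reserved-instance offerings
# and collected (offering_id, zone_name) pairs. The sample IDs below are made up.

def match_zones(mine, theirs):
    """Return {my_zone_name: their_zone_name} for offerings seen by both accounts."""
    their_zone_by_offering = dict(theirs)
    mapping = {}
    for offering_id, my_zone in mine:
        their_zone = their_zone_by_offering.get(offering_id)
        if their_zone is None:
            continue  # offering not present in the other account's listing
        previous = mapping.setdefault(my_zone, their_zone)
        if previous != their_zone:
            # The same zone name should never map to two different zones.
            raise ValueError(
                f"conflicting matches for {my_zone}: {previous} vs {their_zone}"
            )
    return mapping


mine = [("offering-aaa", "us-east-1a"), ("offering-bbb", "us-east-1b")]
theirs = [("offering-aaa", "us-east-1d"), ("offering-bbb", "us-east-1a")]
print(match_zones(mine, theirs))
```

With the made-up data above, my us-east-1a lines up with the other account's us-east-1d, which is exactly the kind of mismatch that makes "which zone is down?" threads confusing.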
I know, they're trying to make it reliable and they've got a bunch of very hard problems to solve. That doesn't change the fact that sometimes some of my servers just permanently stop responding to pings until you stop-start them, or get crazy-slow I/O, or get hit by these once-in-a-while-and-always-at-night outages.
It's great when you suddenly need a hundred more servers, though.
Fingers crossed (just deployed to AWS less than 2 weeks ago).