Amazon says:
Jun 15, 4:03 AM PDT The RDS service is now operating normally. All affected Multi-AZ RDS instances operated normally throughout the power event after failing over. We were able to recover many Single AZ instances successfully, but storage volumes attached to some Single-AZ instances could not be restored, resulting in those instances being placed in Storage failure mode. Customers with automated backups turned on for an affected database instance have the option of initiating a Point-in-Time Restore operation. This will launch a new database instance using a backup of the affected database instance from before the event. To do this, follow these steps:
1) Log into the AWS Management console
2) Access the RDS tab, and select DB Instances on the left-side navigation
3) Select the affected database instance
4) Click on the "Restore to Point in Time" button
5) Select "Use Latest Restorable Time"
6) Select a DB instance class that is at least the same size as the original DB instance
7) Make sure No Preference is selected for Availability Zone
8) Launch DB Instance and connect your application
We will be following up here with the root cause of this event.
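For anyone who would rather script the restore than click through the console, here is a rough sketch of the same steps using boto3's `restore_db_instance_to_point_in_time` call. The instance identifiers and instance class are made-up placeholders, and leaving out `AvailabilityZone` is my assumption for how to get the console's "No Preference" behaviour. Treat it as a starting point, not as Amazon's recommended procedure.

```python
from typing import Any, Dict


def build_restore_request(source_id: str, target_id: str,
                          instance_class: str) -> Dict[str, Any]:
    """Build the keyword arguments for an RDS point-in-time restore."""
    return {
        "SourceDBInstanceIdentifier": source_id,  # step 3: the affected instance
        "TargetDBInstanceIdentifier": target_id,  # name for the new instance
        "UseLatestRestorableTime": True,          # step 5
        "DBInstanceClass": instance_class,        # step 6: at least the original size
        # AvailabilityZone is deliberately omitted (step 7, assumed to
        # correspond to "No Preference" in the console).
    }


def restore_latest(source_id: str, target_id: str, instance_class: str):
    """Launch the restore. Needs boto3 and AWS credentials to be configured."""
    import boto3  # imported lazily so the request builder works without the SDK

    rds = boto3.client("rds")
    request = build_restore_request(source_id, target_id, instance_class)
    return rds.restore_db_instance_to_point_in_time(**request)


if __name__ == "__main__":
    # Hypothetical identifiers, printed only; nothing is sent to AWS here.
    print(build_restore_request("prod-db", "prod-db-restored", "db.m1.large"))
```

The restore creates a new instance with a new endpoint, so the application still has to be pointed at it afterwards (step 8).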
At the same time, if you've not spotted by now that EBS (Elastic Block Store, which backs RDS) is not reliable and not to be trusted, then you have to look at yourself too.
EBS is by far the worst product AWS offers. You simply should not use it without a very good reason, and if you do need to use it, you have to assume any given drive image will disappear at any moment, as it did here.
Beyond that, any time you're running a database, no matter who the provider is, if you're not doing backups every day or hour, then you're not doing things right.
A real engineer should know better, but otherwise, it's people trusting what a major company claims.
If it was a fly-by-night organization I would totally agree with you, but Amazon is a major multinational. It seems to me as reasonable for an outsider to trust the claims they make as it is for me, a car industry outsider, to trust the claims that Chevrolet makes about my car.
Do you know how to properly evaluate the quality of everything in your life?
Hope you don't need a doctor, lawyer, or plumber soon.
However, for EBS volumes, Amazon is very clear about the expected data loss rate:
"The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot. As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that will typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives."
What it does say is this:
"As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume."
Which to any engineer, "real" or otherwise, should be a pretty strong signal that something bad will happen.
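To put a number on "something bad will happen": a quick back-of-the-envelope sketch using the quoted 0.1% to 0.5% AFR. It assumes volume failures are independent, which is my simplification and almost certainly optimistic for a shared power event like this one. The fleet sizes are arbitrary examples.

```python
def p_any_loss(afr: float, volumes: int) -> float:
    """Chance that at least one of `volumes` volumes is lost within a year.

    Assumes each volume fails independently with probability `afr`.
    """
    return 1.0 - (1.0 - afr) ** volumes


if __name__ == "__main__":
    for afr in (0.001, 0.005):      # the quoted AFR range
        for n in (1, 10, 100):      # arbitrary example fleet sizes
            print(f"AFR {afr:.1%}, {n:>3} volumes: "
                  f"{p_any_loss(afr, n):.1%} chance of losing at least one per year")
```

At the upper end of the range, a fleet of 100 volumes comes out at roughly a 39% chance of losing at least one volume in a year under that independence assumption, so "plan for it" is not paranoia.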
Where do they do this? All the docs I've seen are pretty clear that EC2/EBS stuff could disappear and that you have to plan a fault-tolerant system.
EBS is fine as long as you're aware of the constraints and plan accordingly.
> you have to assume any given drive image will disappear at any moment
Duh! You mean it behaves exactly as documented? That is outrageous.
They cluster their databases across multiple instances, availability zones and regions, and back them up constantly to S3. Their ultimate recovery plan is to restore from backup if necessary, which is pretty much the same as assuming EBS will corrupt your data.
While it's far from trivial to do it that way, they seem to be the most successful user of the AWS systems.
You can also RAID EBS volumes together for real-time redundancy, but that adds cost and complexity.
For people who don't want to spend the money on a hot standby, you can still produce the database log files from MySQL (or whatever you're using), and copy them off the machine onto S3 every few seconds, minutes, etc.
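A minimal sketch of that log-shipping idea, assuming MySQL binary logs with the common `mysql-bin.NNNNNN` naming and a hypothetical bucket name. The newest log is skipped because the server is still writing to it, so the data at risk is whatever sits in the current log until it rotates. The `s3_uploader` helper uses boto3's `upload_file`; everything else is plain Python and the function names are my own.

```python
import os
from typing import Callable, Iterable, List, Set


def pending_logs(names: Iterable[str], shipped: Set[str],
                 prefix: str = "mysql-bin.") -> List[str]:
    """Return closed binary logs that have not been shipped yet, oldest first."""
    logs = sorted(n for n in names
                  if n.startswith(prefix) and not n.endswith(".index"))
    closed = logs[:-1]  # the newest file is still being written; skip it
    return [n for n in closed if n not in shipped]


def ship_logs(log_dir: str, shipped: Set[str],
              upload: Callable[[str, str], None]) -> List[str]:
    """Upload every pending log and record it as shipped."""
    sent = []
    for name in pending_logs(os.listdir(log_dir), shipped):
        upload(os.path.join(log_dir, name), name)  # (local path, object key)
        shipped.add(name)
        sent.append(name)
    return sent


def s3_uploader(bucket: str) -> Callable[[str, str], None]:
    """Build an upload callable backed by S3. Needs boto3 and credentials."""
    import boto3  # imported lazily so the selection logic runs without the SDK

    s3 = boto3.client("s3")
    return lambda path, key: s3.upload_file(path, bucket, key)


if __name__ == "__main__":
    # Dry run against made-up file names; nothing is uploaded.
    names = ["mysql-bin.000001", "mysql-bin.000002", "mysql-bin.index"]
    print(pending_logs(names, shipped=set()))
```

Run `ship_logs(log_dir, shipped, s3_uploader("my-binlog-bucket"))` from cron or a loop at whatever interval you can tolerate losing. The `shipped` set lives in memory here, so a real job would need to persist it or list the bucket instead.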
There's no end of solutions to the problem but, for whatever reason, a lot of people don't see EBS as a potential problem area.
Anyway, I don't think we are ready to invest a large amount of money in Multi-AZ deployments given the doubtful reliability. Cloud solutions, even single-AZ ones, should not lose data.
Which failure? The networking issue which had nothing to do with RDS and left your data unaffected?
> Anyway, I don't think we are ready to invest a large amount of money in Multi-AZ deployments given the doubtful reliability. Cloud solutions, even single-AZ ones, should not lose data.
Any server can go down. The very modest increase for a multi-AZ setup buys you real, meaningful improvements as you just learned. I'm sorry that you had to learn a lesson the hard way but there's a reason why AWS recommends a multi-AZ deployment for failover and it's not revenue.
The next step up would require you having multiple widely separated servers, which is where you really start talking about large amounts of money because you're talking about non-trivial engineering and taking on the operational overhead of 24x7 support.
You mean you think all cloud db solutions should implement replication for you? There aren't very many backup solutions that never lose any data.
Good backups are the best defense.
The reason I moved to AWS was because when I added a new (cheap) hard drive to my dedi, they didn't put a partition table on the disk. When the machine rebooted, the superblock got overwritten and I lost access to the file system. (I did manage to recreate the superblock and get the data out, but jeez...)
One time I made a ticket and somehow my record in the trouble ticket system got screwed up and I couldn't put more tickets in. Some wizard fixed that in the SQL monitor after I talked to 3 other people who had no idea this could happen.
As for costs, it's not so simple. I've got a processing job I run each week that costs $6 of CPU time because I pay just for what I need. I wouldn't want to run it on a dedi because it's a beast.
Could you explain that a bit more?
If you were running your own database, you surely would have had rigorous backups because the responsibility was on you.
Assume that if a service can fail, it will. If data can be lost, it will be. Then, plan accordingly.
EDIT: grammar