Amazon RDS failure - data has been lost

82 points · akhkharu · 14y ago · 47 comments

Our RDS instance is in a "failure" state after 8 hours of downtime. We have to restore from a point-in-time backup which does not have actual data.

Amazon says:

Jun 15, 4:03 AM PDT The RDS service is now operating normally. All affected Multi-AZ RDS instances operated normally throughout the power event after failing over. We were able to recover many Single AZ instances successfully, but storage volumes attached to some Single-AZ instances could not be restored, resulting in those instances being placed in Storage failure mode. Customers with automated backups turned on for an affected database instance have the option of initiating a Point-in-Time Restore operation. This will launch a new database instance using a backup of the affected database instance from before the event. To do this, follow these steps:

1) Log into the AWS Management console
2) Access the RDS tab, and select DB Instances on the left-side navigation
3) Select the affected database instance
4) Click on the "Restore to Point in Time" button
5) Select "Use Latest Restorable Time"
6) Select a DB instance class that is at least the same size as the original DB instance
7) Make sure No Preference is selected for Availability Zone
8) Launch DB Instance and connect your application

We will be following up here with the root cause of this event.
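
For anyone who would rather script the console steps above, here is a minimal sketch, assuming the boto3 SDK and already configured AWS credentials. The instance identifiers and instance class in the usage note are hypothetical placeholders, and this is not part of Amazon's statement.

```python
def build_restore_request(source_id, target_id, instance_class):
    """Build the arguments for a point-in-time restore (console steps 3 to 7)."""
    return {
        "SourceDBInstanceIdentifier": source_id,  # step 3: the affected instance
        "TargetDBInstanceIdentifier": target_id,  # name for the new instance
        "UseLatestRestorableTime": True,          # step 5
        "DBInstanceClass": instance_class,        # step 6: at least the original size
        # Step 7: AvailabilityZone is deliberately omitted ("No Preference").
    }


def restore_latest(source_id, target_id, instance_class):
    """Launch the restore. Requires boto3 and valid AWS credentials."""
    import boto3  # imported here so the request builder works without boto3

    rds = boto3.client("rds")
    return rds.restore_db_instance_to_point_in_time(
        **build_restore_request(source_id, target_id, instance_class)
    )
```

Calling `restore_latest("mydb", "mydb-restored", "db.m1.large")` would start a new instance. As with the console route, anything written after the latest restorable time is not recovered.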

28 comments · 7 top-level

EwanToo · 14y ago · 11 in thread

RDS should not have lost data, and if I were a user of it, I'd be annoyed too.

At the same time, if you've not spotted by now that EBS (Elastic Block Store, which powers RDS) is not reliable and not to be trusted, then you have to look at yourself too.

EBS is by far the worst product AWS offers. You simply should not use it without a very good reason, and if you do need to use it, you have to assume any given drive image will disappear at any moment, as it did here.

Beyond that, any time you're running a database, no matter who the provider is, if you're not doing backups every day or hour, then you're not doing things right.

JohnHaugeland · 14y ago

They go to great effort to tell prospective customers that it's extremely reliable, providing claims of obscene numbers of nines.

A real engineer should know better, but otherwise, it's people trusting what a major company claims.

If it was a fly by night organization I would totally agree with you, but Amazon is a major multinational. It seems to me as reasonable for an outsider to trust the claims they make as it is for me, a car industry outsider, to trust the claims that Chevrolet makes about my car.

Do you know how to properly evaluate the quality of everything in your life?

Hope you don't need a doctor, lawyer, or plumber soon.

electrum · 14y ago

Amazon only claims "obscene number of nines" for S3 durability (99.999999999%). And this claim seems to be accurate: I've never seen a publicly reported case of anyone losing data. Anytime you read their forums about people reporting data loss, a typical response is AWS staff saying "we see delete requests for those objects on date X" with the users responding "oh, oops, we had this background delete process".

However, for EBS volumes, Amazon is very clear about the expected data loss rate:

"The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot. As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that will typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives."
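
To put the quoted AFR range in perspective, here is a rough sketch of how the risk adds up across several volumes. It assumes volume failures are independent, which is my simplification and not something Amazon states; a zone-wide power event like this one is correlated, so real-world risk can be worse.

```python
def p_any_volume_lost(afr, volumes):
    """Probability that at least one of `volumes` volumes fails within a year.

    Assumes independent failures, each with annual failure rate `afr`.
    """
    return 1.0 - (1.0 - afr) ** volumes


if __name__ == "__main__":
    # The two ends of the AFR range from the quote above.
    for afr in (0.001, 0.005):
        for n in (1, 10, 100):
            print(f"AFR {afr:.1%}, {n:>3} volumes: {p_any_volume_lost(afr, n):.1%}")
```

Under that assumption, a fleet of 100 volumes at the 0.5% end has roughly a 39% chance of losing at least one volume in a year, so "rare per volume" is not the same as "rare for you".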

efsavage · 14y ago

I don't remember ever seeing "obscene numbers of nines" claimed by Amazon. For S3, yes, but not EBS. The '9' character doesn't even appear once on http://aws.amazon.com/ebs/.

What it does say is this:

"As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume."

Which to any engineer, "real" or otherwise, should be a pretty strong signal that something bad will happen.

ceejayoz · 14y ago

> They go to great effort to tell prospective customers that it's extremely reliable, providing claims of obscene numbers of nines.

Where do they do this? All the docs I've seen are pretty clear that EC2/EBS stuff could disappear and that you have to plan a fault-tolerant system.

moe · 14y ago

> you simply should not use it without a very good reason

EBS is fine as long as you're aware of the constraints and plan accordingly.

> you have to assume any given drive image will disappear at any moment

Duh! You mean it behaves exactly as documented? That is outrageous.

jahewson · 14y ago

You get what you pay for. Single-AZ databases were lost due to a failure in a single AZ, which Amazon tells you will happen. If you want durability you need Multi-AZ, which is the only place I'd put a production database.

kennu · 14y ago

What would you recommend for persistent disk storage on AWS instead of EBS then? Assuming you need to put database files somewhere where they don't disappear when instances terminate, and your database can access them.

EwanToo · 14y ago

The Netflix approach (which definitely isn't for everyone), is to use no persistent storage.

They cluster their databases across multiple instances, availability zones and regions, and back them up constantly to S3. Their ultimate recovery plan is to restore from backup if necessary, which is pretty much the same as assuming EBS will corrupt your data.

While it's far from trivial to do it that way, they seem to be the most successful user of the AWS systems.

electrum · 14y ago

You use EBS and back it up using the built-in snapshot functionality, and/or back it up another way. RDS, which is presumably based on EBS, provides backups and snapshots for precisely this reason.

snorkel · 14y ago

You can snapshot EBS volumes; however, your EBS disk I/O will be blocked while the snapshot is being made, so you can't do it often on a large busy volume.

You can also RAID EBS volumes together for real-time redundancy, but that adds cost and complexity.

EwanToo · 14y ago

True, you're probably better off running the replication at the database layer, which can normally be either synchronous or asynchronous depending on what people need.

For people who don't want to spend the money on a hot standby, you can still produce the database log files from MySQL (or whatever you're using), and copy them off the machine onto S3 every few seconds, minutes, etc.

There's no end of solutions to the problem but, for whatever reason, a lot of people don't see EBS as a potential problem area.
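
The log-shipping idea above (copy the database log files off the machine every so often) can be sketched as follows. This is only an illustration: the `mysql-bin.` prefix assumes MySQL's default binary log naming, and `upload` is a hypothetical callable that you supply yourself.

```python
from pathlib import Path


def closed_logs(log_dir, prefix="mysql-bin."):
    """Return binary log files that are safe to copy, oldest first.

    The newest file is skipped because the server is still writing to it.
    """
    logs = sorted(
        p for p in Path(log_dir).glob(prefix + "*") if p.suffix != ".index"
    )
    return logs[:-1]


def ship_new_logs(log_dir, already_shipped, upload):
    """Upload each closed log not shipped yet; return the names shipped this run."""
    shipped_now = []
    for path in closed_logs(log_dir):
        if path.name not in already_shipped:
            upload(path)  # expected to raise on failure so the file is retried
            already_shipped.add(path.name)
            shipped_now.append(path.name)
    return shipped_now
```

Run `ship_new_logs` from cron or a loop at whatever interval matches the amount of data you can afford to lose. With boto3, `upload` could be something like `lambda p: s3.upload_file(str(p), "my-backup-bucket", "binlogs/" + p.name)`, where the bucket name is a placeholder.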

justincormack · 14y ago · 4 in thread

Use multi AZ then, which performed as expected. There have been so many warnings about single AZ that you would hope people get it by now.

akhkharu (OP) · 14y ago

As far as I remember, the previous failure that happened at Amazon earlier this year also affected Multi-AZ deployments.

Anyway, I don't think we are ready to invest a large amount of money in Multi-AZ deployments given their doubtful reliability. Cloud solutions, even with a single AZ, should not lose data.

acdha · 14y ago

> As far as I remember, the previous failure that happened at Amazon earlier this year also affected Multi-AZ deployments.

Which failure? The networking issue which had nothing to do with RDS and left your data unaffected?

> Anyway, I don't think we are ready to invest a large amount of money in Multi-AZ deployments given their doubtful reliability. Cloud solutions, even with a single AZ, should not lose data.

Any server can go down. The very modest increase for a multi-AZ setup buys you real, meaningful improvements as you just learned. I'm sorry that you had to learn a lesson the hard way but there's a reason why AWS recommends a multi-AZ deployment for failover and it's not revenue.

The next step up would require you having multiple widely separated servers, which is where you really start talking about large amounts of money because you're talking about non-trivial engineering and taking on the operational overhead of 24x7 support.

nevinera · 14y ago

> Cloud solutions, even with a single AZ, should not lose data.

You mean you think all cloud db solutions should implement replication for you? There aren't very many backup solutions that never lose any data.

snorkel · 14y ago

Sure, just duplicate your entire stack in at least two zones and replicate all data in real time, and then you just need to convince your boss/investors/yourself that spending 2X on hosting is worth it. Once you add up the costs you soon realize that risking several hours of downtime once per year is more acceptable than doubling hosting costs. For anyone who is not hosting air traffic control or banking systems it's really not worth it. Seriously, if your web service is hosting social brain farts or selling cup holders, and not landing space shuttles, then it can be offline for a few hours per year.
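
The trade-off described above boils down to simple arithmetic. Here is a sketch with made-up inputs; the dollar figures in the usage note are hypothetical, not from this thread, and the model only counts downtime, not the value of any data that is lost for good.

```python
def cheaper_option(monthly_hosting, downtime_hours_per_year, loss_per_hour):
    """Compare a year of doubled hosting against the expected cost of downtime."""
    extra_redundancy_cost = 12 * monthly_hosting  # a second copy of the stack
    expected_downtime_cost = downtime_hours_per_year * loss_per_hour
    if expected_downtime_cost > extra_redundancy_cost:
        return "second zone"
    return "single zone"
```

With $2,000 a month in hosting and 8 hours of downtime a year, `cheaper_option(2000, 8, 500)` favors a single zone, while `cheaper_option(2000, 8, 5000)` favors paying for the second zone. Where your service sits between those two is the real question.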

PaulHoule · 14y ago · 3 in thread

If you had a database running on a dedi you could get trashed by a server failure too.

Good backups are the best defense.

shiftpgdn · 14y ago

To me this is a fallacious argument. Dedicated servers are wildly cheaper than RDS/AWS. Isn't that the whole point of AWS? To have a team of experts managing your hosting to prevent a failure like this?

PaulHoule · 14y ago

I find it's very expensive to talk with salespeople to get my dedis configured properly. Particularly when they screw it up anyway.

The reason I moved to AWS was because when I added a new (cheap) hard drive to my dedi, they didn't put a partition table on the disk. When the machine rebooted, the superblock got overwritten and I lost access to the file system. (I did manage to recreate the superblock and get the data out, but jeez...)

One time I made a ticket and somehow my record in the trouble ticket system got screwed up and I couldn't put more tickets in. Some wizard fixed that in the SQL monitor after I talked to 3 other people who had no idea this could happen.

As for costs, it's not so simple. I've got a processing job I run each week that costs $6 of CPU time because I pay just for what I need. I wouldn't want to run it on a dedi because it's a beast.

akhkharu (OP) · 14y ago

Yeah, but I have more control over it to prevent such failures.

purephase · 14y ago · 3 in thread

I'm not sure I understand the "which does not have actual data" part of your statement.

Could you explain that a bit more?

akhkharu (OP) · 14y ago

The point-in-time backup was created before the actual failure and does not contain the latest data (~1 hour).

philjohn · 14y ago

Surely you realised that is the case with a point-in-time backup? If you absolutely cannot lose data then as others have said, Multi AZ is required, or, at the very least, have a transaction log that you can replay (once again, hosted somewhere else).

horatiumocian · 14y ago

Probably he means that the data from the backup is not up-to-date (i.e. not actual).

bananashake · 14y ago

Why do you think the "Restore to Point in Time" failed to work? That puzzles me the most in this catastrophe and no one has addressed it. In theory, with Point-in-Time restoration you should not lose data from a failure on just the storage where the InnoDB data is stored.

mschalle · 14y ago

Always assume Murphy's law will hold, regardless of what service provider you use.

If you were running your own database, you surely would have had rigorous backups because the responsibility was on you.

Assume that if a service can fail, it will. If data can be lost, it will be. Then, plan accordingly.

EDIT: grammar

debacle · 14y ago

But but...the cloud.
