Heroku's AWS outage post-mortem (status.heroku.com)

202 points · mileszs · 15y ago · 80 comments

ekidd 15y ago · 9 in thread

Kudos to Heroku for taking full responsibility, and for planning to engineer around these kinds of Amazon problems in the future.

In particular, I'm delighted to hear that they plan to perform continuous backups on their shared databases:

> 3) CONTINUOUS DATABASE BACKUPS FOR ALL. One reason why we were able to fix the dedicated databases quicker has to do with the way that we do backups on them. In the new Heroku PostgreSQL service, we have a continuous backup mechanism that allows for automated recovery of databases... We are in the process of rolling out this updated backup system to all of our shared database servers; it’s already running on some of them and we are aiming to have it deployed to the remainder of our fleet in the next two weeks.

Combined with multi-region support, this should make Heroku far more resilient in the future.

callmeed 15y ago

Kudos? For nothing but the words "heroku takes 100% of the responsibility ..."?

Sorry, but that's not cutting it for me right now. I pay Heroku $250 a month and I was down for 60 hours (not 16). Our app isn't even out of private beta so I fully expected to be paying Heroku $2-3K/month by the end of the year. Now, I'm not sure I'll stay.

If you're really taking 100% responsibility, then consider pro-rating the bills of affected paying customers (based on the downtime).
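For a sense of scale, pro-rating by downtime is simple arithmetic. A rough sketch, assuming a flat $250/month plan, a 30-day billing month, and the 60 hours of downtime mentioned above; the function name and the 30-day month are illustrative assumptions, not Heroku's actual billing rules:

```python
def prorated_credit(monthly_fee: float, hours_down: float, days_in_month: int = 30) -> float:
    """Credit proportional to the fraction of the billing month spent down.

    Hypothetical helper for illustration only; not a real billing API.
    """
    hours_in_month = days_in_month * 24
    return monthly_fee * hours_down / hours_in_month


credit = prorated_credit(250.0, 60.0)
print(f"${credit:.2f}")  # $20.83
```

A credit of this size is small next to the monthly bill, which is why the dispute is more about principle than about the amount.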

ekidd 15y ago

I've run both cloud and non-cloud applications. In my experience, you won't get 99.95% annual uptime over 5 years without a full-time sysadmin, the ability to provision a complete offsite infrastructure and fail over to it within a few hours, and a backup/restore process that you rigorously test every month or so.

You generally won't get all that for $2K to $3K a month. Sure, you can drop $15K on an expensive database server, and co-locate it somewhere. But that only works until somebody takes a backhoe to your fiber, your RAID controller fails catastrophically, somebody pwns your production server, your sysadmin flakes out, or you discover that your backup scripts have been broken for months.

Realistically, if you're only spending $2-3K per month on hosting and administration, you'll eventually experience one or more of the above, and your site may be down for a day or more.

This isn't to say that I'm happy about Heroku's long downtime. One of my clients was offline almost as long as you were. But I'm pleased that Heroku recognizes just how badly they screwed up, and that they're taking the two most important steps they can to prevent a recurrence: multi-region support, and continuous backups for everyone. Multi-region support may not be sufficient to protect against cascading Amazon outages, but it's a good start.
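To put the 99.95% figure in perspective, here is a back-of-the-envelope calculation. It is only a sketch: it assumes a 365-day year and treats the 60 hours reported upthread as the only downtime in that year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(uptime_pct: float) -> float:
    """Hours of downtime per year permitted by a given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)


def achieved_uptime_pct(hours_down: float) -> float:
    """Uptime percentage for a year containing the given hours of downtime."""
    return 100 * (1 - hours_down / HOURS_PER_YEAR)


print(round(allowed_downtime_hours(99.95), 2))  # 4.38 hours per year
print(round(achieved_uptime_pct(60), 2))        # 99.32 percent
```

In other words, a single 60-hour outage uses up more than a decade of the downtime budget that 99.95% allows.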

ghshephard 15y ago

"If you're really taking 100% responsibility, then consider pro-rating the bills of affected paying customers (based on the downtime)."

That's implied, at the very least, by the phrase "heroku takes 100% of the responsibility. "

I would be very surprised if they didn't offer more than that.

toast76 15y ago

Your biggest concern is you want a $20 refund?

They can keep their $20 in my view, as long as they ensure it never happens again.

Joakal 15y ago

Do they have an SLA? I presume they can pass on some of the refunds from Amazon.

railsguy1 15y ago

You're going to complain about a refund of what, ~ $22, from a service with no SLA for a site in 'private beta'?

You must think pretty highly of your app. Why don't you take it out of 'private beta' and let the rest of us look at it?

tmeasday 15y ago

Is this reasonable? I'm sure a lot of Amazon-hosted companies are thinking similar thoughts.

But being on multiple Availability Zones was supposed to be bulletproof (according to Amazon). Now that we know that wasn't the case, is being hosted on multiple regions going to provide the necessary level of protection?

Is it an over-reaction to say that relying completely on Amazon could now be seen as irresponsible to your users, given the magnitude of this event?

PakG1 15y ago

The one problem I have with such concerns is this: what other viable options are there? Google AppSpot? Windows Azure? Perhaps. But AWS is flexible, with very few stack limitations. The only other alternative, I think, is to go back to the pre-cloud era, when hosting was much more expensive and outages were still possible, especially when you couldn't keep up with big traffic spikes.

Honestly, I would prefer this kind of mass outage to the alternative. It's cheaper, easier, and I bet you there's still better uptime overall.

seandougall 15y ago

We could just as easily say that relying completely on Heroku is irresponsible to our own users.

Just as Heroku took responsibility for the unexpected weaknesses their reliance on a single region created, I believe their customers should take responsibility for the unexpected weaknesses our reliance on a single hosting provider has created.

Heroku still has the value of added resiliency, even if it's not 110% bulletproof. Ultimately, we're responsible for the architecture design of our own sites.

oomkiller 15y ago · 5 in thread

I'd really love to know some details on the continuous backup stuff. Sounds cool.

bbatsell 15y ago

Not sure why it was dead-ed (possibly a double-post), but here's the answer from an author of it in case you don't have showdead on:

fdr 1 hour ago | link | parent [dead] | on: Heroku's AWS outage post-mortem

The mechanism is PostgreSQL continuous archiving.

http://github.com/heroku/wal-e

This tool is still quite nascent. It received quite a trial by fire, not having been revealed (before this point) as a value-added feature of the service on a wide scale.

joevandyk 15y ago

I started using WAL-E a couple days ago for one of my own sites.

WAL-E is a program that PostgreSQL can use to push database changes to S3.

Depending on how you configure PostgreSQL checkpoints, the most data you'd lose is somewhere between a couple of seconds and a minute. I'd assume Heroku would make it a couple of seconds. The downside to more frequent backups is more storage space: each checkpoint (WAL archive) stored on S3 is a minimum of 75k or so, even if there weren't any changes.
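The storage trade-off can be estimated roughly. In stock PostgreSQL, the setting that forces a WAL segment to be archived after a fixed interval is `archive_timeout`. The sketch below assumes that setting controls the interval, assumes an archive is pushed every interval even when the database is idle (as described above), and takes the ~75 KB per-archive floor at face value. The function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def idle_archive_mb_per_day(archive_timeout_s: int, min_archive_kb: float = 75.0) -> float:
    """Approximate S3 storage added per day by an idle database.

    Assumes one minimum-size archive per archive_timeout interval.
    """
    archives_per_day = SECONDS_PER_DAY // archive_timeout_s
    return archives_per_day * min_archive_kb / 1024


for timeout_s in (5, 60):
    mb = idle_archive_mb_per_day(timeout_s)
    print(f"archive_timeout={timeout_s}s -> about {mb:.0f} MB/day")
```

Under these assumptions a one-minute interval costs on the order of 100 MB per day for an idle database, and a five-second interval costs roughly twelve times that.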

fdr 15y ago

Yes, a double post.

mechanical_fish 15y ago

If I'm translating it correctly, this phrase is referring to database replication.

I'm not deeply familiar with the Postgres versions of that (to my regret), but for the MySQL version you can read something like this:

http://dev.mysql.com/doc/refman/5.0/en/replication.html

Better yet, find yourself a copy of High Performance MySQL.

fdr 15y ago

WAL archive replay is also used for the PostgreSQL hot standby feature, aka replication. Combined with streaming you can get sub-second latency, but there's no reason you couldn't skip streaming and use WAL-E to syndicate WAL to thousands of replicas with hot standby enabled (albeit with lag). Use your imagination if you want to write an event-driven, high-performance WAL streaming server. I haven't found the use case yet.

MySQL has long relied on statement-based replication, which can lead to server drift in the case of any nondeterministic query. This is a total killer for extensibility of the database, as well as correctness in general. It also has a row-based-replication variant that showed up around 2008 that represents a significant improvement, but the search results for "mysql rbr" might give you pause...

My guess is this is why Amazon made the sensible choice to back their RDS (MySQL) product via DRBD and synchronous, block-device level replication, because there is no good application-level option for MySQL that is to be trusted. This technique can also be used with PostgreSQL. However, use of DRBD tends to have punishing performance impact, is complicated, and is not very suitable for a hot standby unless you write very complex shared-storage database software like Oracle RAC, hence why so much effort went into WAL-streaming hot standby in the PostgreSQL community. The DRBD option is venerable, dating back to the LiveJournal early days as their MySQL HA option, and probably before that (Credit to LiveJournal for well-documenting their HA setup, including their use of DRBD).
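The drift problem with statement-based replication can be shown with a toy model. This is not MySQL code; it is a minimal sketch with hypothetical classes in which a "statement" is a Python callable and the nondeterministic query is simulated with a per-server random number generator.

```python
import random


class Node:
    """Toy database server: a list of rows plus its own source of nondeterminism."""

    def __init__(self, seed: int):
        self.rng = random.Random(seed)
        self.rows = []

    def execute(self, statement):
        # Run the statement locally; the result depends on this node's rng.
        row = statement(self.rng)
        self.rows.append(row)
        return row

    def apply_row(self, row):
        # Row-based replication: store the value the primary produced.
        self.rows.append(row)


def insert_random_token(rng):
    # Stands in for a nondeterministic statement such as INSERT ... VALUES (RAND()).
    return rng.randint(0, 10**9)


primary = Node(seed=1)
statement_replica = Node(seed=2)
row_replica = Node(seed=3)

for _ in range(3):
    row = primary.execute(insert_random_token)
    statement_replica.execute(insert_random_token)  # statement-based: re-run the statement
    row_replica.apply_row(row)                      # row-based: copy the resulting row

print("row-based replica matches primary:", primary.rows == row_replica.rows)
print("statement-based replica matches primary:", primary.rows == statement_replica.rows)
```

The row-based replica ends up identical to the primary because it copies results, while the statement-based replica re-evaluates the nondeterministic expression and diverges.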

dpcan 15y ago · 4 in thread

What the hell? Why is everyone taking responsibility and giving Amazon a free ride? I'm a firm believer that only victims make excuses, and it's admirable to take responsibility, and maybe they should have more redundancy in place, but the way AWS has been advertised, most of us felt this kind of thing should never happen even without a 100% uptime guarantee.

So, take 100% of the responsibility, but I wouldn't think any less of heroku if they only took 50%.

dholowiski 15y ago

I pay Heroku to host my Rails apps, not Amazon. I don't give a flying fk what kind of back end infrastructure they use, as long as people can get to my app.

It is surprising they don't talk refunds for the downtime, if they are taking responsibility. I'd imagine we will see this coming soon?

dpcan 15y ago

So you are saying if they are taking 100% of the responsibility they assume 100% of the liability? This is exactly why I'm suggesting they may have put their foot in their mouth. SOME of the fault reasonably lies with Amazon in my opinion, and I personally would not have cared if they took 50%. That's all really.

spicyj 15y ago

Because Heroku isn't positioned as a way to manage your EC2 instances, it's a whole hosting solution. How much more confusing would it be for customers if Heroku said that they take responsibility for downtime that's not Amazon's fault? What counts as Amazon's fault? What about customers that know only Rails and don't care about how Heroku works on the backend?

dpcan 15y ago

I get what they are doing, and I would probably handle it the same way, but I think it is perfectly OK for them to place some blame here, as their downtime WAS technically because AWS was down. I think they took it too far by taking 100% of the blame. I'm on Heroku's side; I don't think it was fair to themselves to take 100% of the blame.

adriand 15y ago · 3 in thread

I'm very impressed by how they take responsibility for this, in their words: "HEROKU TAKES 100% OF THE RESPONSIBILITY FOR THE DOWNTIME AFFECTING OUR CUSTOMERS LAST WEEK."

It would be easy, tempting and, heck, even reasonable to assign at least a portion of the blame to Amazon. Their approach is interesting because their customers already know that, but are likely to appreciate their forthright acceptance of responsibility.

It's a good lesson. If I'm being totally honest I'd have to admit that, as a developer, I sometimes blame external services or events for things that I have at least partial control over. Perhaps I should adopt Heroku's approach instead.

theoj 15y ago

Do they have another option? If no site had survived the outage, then they may have been on to something. But with some sites surviving the outage, they just have no excuse.

Sure you can blame AWS because they said multiple availability zones in the same region would work. But at the same time there is an expectation that a site like Heroku is knowledgeable enough and sophisticated enough to intelligently process what AWS says and determine what's appropriate for them.

jarin 15y ago

It's sort of counter-intuitive, but taking responsibility for something (even if it's not directly your fault) often has the effect of deflecting some of the anger from your customers/clients/boss/etc.

Personally, I prefer to just get the blame part out of the way by taking responsibility and concentrate on the important things: fixing the problem and making sure it doesn't happen again.

I think that deep down, people aren't that concerned with whose fault it was. They just want to know that someone is going to fix it.

ghshephard 15y ago

The reason why you don't want to take responsibility is that the liability comes along with that. If Heroku took the position that the AWS outage was a force majeure, then their liability for recompense to their customers would have been minimized.

By suggesting they take responsibility, they also are in a position where they have to make good for all of the downtime their customers experienced.

Short term - that will be an expensive decision. Long term, I think it's the right thing to do. It certainly builds up my confidence level in them.

awicklander 15y ago · 2 in thread

They gloss over their biggest failure; they weren't communicating or interacting with their customers at all.

* http://twitter.com/#!/heroku
* http://twitter.com/#!/herokustatus

daniel02216 15y ago

They show a history of updates to their status blog and to the herokustatus Twitter account since April 21, what do you mean by 'they weren't communicating'?

runningdogx 15y ago

From 9:07 to 20:43, the status updates were generic and not very helpful in answering two questions customers want to know:

1. What exactly is going on?
2. When will it be fixed?

In the middle of a crisis, saying "we're aware of the problem, and we're working hard to fix it," for hours does not really count as communication. It increases customer aggravation rather than decreasing it. Customers want to know answers to the above two questions. They don't care that you know about the problem and that you're working on it, unless you're not doing those two things in which case they will be (and should be) furious; those two things are expected.

Barring the ability to tell your customers "we will be back up at X:00", I think the best approach is to share as much information as you can without getting into proprietary information. That's why I think GP considered their communication a failure. That's why I consider their communication a failure, although I've seen this pattern enough from different companies that I don't hold it against Heroku as long as they learn from it.

trezor 15y ago · 2 in thread

And now reddit is down again (posting/submitting is impossible). Probably yet another Amazon issue.

In all fairness I've read that the reddit devs have made lots of boneheaded mistakes in their general infrastructure-design, but it still seems Amazon is not a very reliable platform to build your stuff on. Platforms built on top of Amazon's even less so.

showerst 15y ago

I think boneheaded is a strong word. They're solving a problem that very few sites have to solve (huge traffic with low cache-ability) with vastly fewer resources than the others who do solve it (FB, Twitter, etc.) have.

AMZN in general is a pretty solid 'platform' (especially if you're not using EBS), but because this whole 'cloud' thing is still partially uncharted territory, there are still holes, and you can't treat it like a normal web host.

akl 15y ago

Are you seriously not aware that reddit falls over pretty much all the time on its own?

watchandwait 15y ago · 1 in thread

The AWS outage is definitely not over. Apparently RDS is built on EBS, and not everything has been restored; I can tell you that first hand.

watchandwait 15y ago

UPDATE: we were fully restored after midnight last night. It is a very happy feeling!

greattypo 15y ago · 1 in thread

It's impressive that they're taking full responsibility, but I'm very surprised there's no mention of refunds.

JonWood 15y ago

Given that Heroku charges based on the time your application is up I wouldn't be surprised if everyone just gets a bill which doesn't include the time their sites were offline.

chrishenn 15y ago

> Our monitoring systems picked up the problems right away. The on-call engineer quickly determined the magnitude of the problem and woke up the on-call Incident Commander. The IC contacted AWS, and began waking Heroku engineers to work on the problem. Once it became clear that this was going to be a lengthy outage, the Ops team instituted an emergency incident commander rotation of 8 hours per shift, keeping a fresh mind in charge of the situation at all time. Our support, data, and other engineering teams also worked around the clock.

The system they are using (IC, ops, engineer teams, operational periods) is extremely similar to the Incident Command System. The ICS was developed about 40 years ago for fighting wildfires, but now most government agencies use it to manage any type of incident.

I've experienced it first hand and can say it works very well, but I have never seen it used in this context. The great thing about it is its expandability: it will work for teams of nearly any size. I'd be interested in seeing if any other technology companies/backend teams are using it.

http://en.wikipedia.org/wiki/Incident_command_system

waxman 15y ago

Thank you for taking full responsibility.

Everyone makes mistakes, so what matters is how you deal with them. This was the right way to respond. Thanks.

markbao 15y ago

I wish Amazon was as good at communication and accountability as Heroku is.

chrisbaglieri 15y ago

"Block storage is not a cloud-friendly technology".

Based on every post-mortem I've read thus far, it's clear that how AWS and its customers approach EBS will change.

bdb 15y ago

Where is Amazon's?

chubs 15y ago

This is why I love hosting on Heroku: they'll work their butts off to get it fixed when it's down, and I don't have to lift a finger. However, EBS has long been known to be a turd; it's a pity they relied on it. Plus, if they had a way to bring it back up in a different region (e.g. the Euro AWS infrastructure) at the flick of a switch, that'd make me less nervous...

AffableSpatula 15y ago

I don't think this is particularly 'honorable' or anything like that; it's the only sensible stance for them to take.

Let's be realistic about this: for most people using Heroku the alternative would have been bare EC2, which could easily have suffered the same fate.

Everyone should feel positive that they got to spend ~60 hours just sitting around moaning about being let down, instead of having to sweat their nuts off attempting to rehabilitate crazy, suicidal infrastructure.

Even taking this downtime into account, Heroku is still cost effective for me in a lot of cases.

metageek 15y ago

>It's a big project, and it will inescapably require pushing more configuration options out to users (for example, pointing your DNS at a router chosen by geographic homing

Heroku should save the customers this pain, by setting up anycast:

https://secure.wikimedia.org/wikipedia/en/wiki/Anycast#Domai...

chrisbaglieri 15y ago

I wish more companies (hell, people) were as forthright, pragmatic, and sensible as the Heroku gang. Their breakdown of and response to the outage are exactly what I, as a paying customer, want to hear.

Kudos!

mtw 15y ago

What about also spreading to multiple providers (e.g. also using Rackspace Cloud)? They'd be less dependent on Amazon and its issues.
