In particular, I'm delighted to hear that they plan to perform continuous backups on their shared databases:
3) CONTINUOUS DATABASE BACKUPS FOR ALL. One reason why we were able to fix the dedicated databases quicker has to do with the way that we do backups on them. In the new Heroku PostgreSQL service, we have a continuous backup mechanism that allows for automated recovery of databases... We are in the process of rolling out this updated backup system to all of our shared database servers; it’s already running on some of them and we are aiming to have it deployed to the remainder of our fleet in the next two weeks.
Combined with multi-region support, this should make Heroku far more resilient in the future.
Sorry, but that's not cutting it for me right now. I pay Heroku $250 a month and I was down for 60 hours (not 16). Our app isn't even out of private beta so I fully expected to be paying Heroku $2-3K/month by the end of the year. Now, I'm not sure I'll stay.
If you're really taking 100% responsibility, then consider pro-rating the bills of affected paying customers (based on the downtime).
You generally won't get all that for $2K to $3K a month. Sure, you can drop $15K on an expensive database server, and co-locate it somewhere. But that only works until somebody takes a backhoe to your fiber, your RAID controller fails catastrophically, somebody pwns your production server, your sysadmin flakes out, or you discover that your backup scripts have been broken for months.
Realistically, if you're only spending $2-3K per month on hosting and administration, you'll eventually experience one or more of the above, and your site may be down for a day or more.
This isn't to say that I'm happy about Heroku's long downtime. One of my clients was offline almost as long as you were. But I'm pleased that Heroku recognizes just how badly they screwed up, and that they're taking the two most important steps they can to prevent a recurrence: multi-region support, and continuous backups for everyone. Multi-region support may not be sufficient to protect against cascading Amazon outages, but it's a good start.
That's implied, at the very least, by the phrase "heroku takes 100% of the responsibility."
I would be very surprised if they didn't offer more than that.
They can keep their $20 in my view, as long as they ensure it never happens again.
You must think pretty highly of your app. Why don't you take it out of 'private beta' and let the rest of us look at it?
But being on multiple Availability Zones was supposed to be bulletproof (according to Amazon). Now that we know that wasn't the case, is being hosted on multiple regions going to provide the necessary level of protection?
Is it an over-reaction to say that relying completely on Amazon could now be seen as irresponsible to your users, given the magnitude of this event?
Honestly, I would prefer this kind of mass outage to the alternative. It's cheaper, easier, and I bet you there's still better uptime overall.
Just as Heroku took responsibility for the unexpected weaknesses their reliance on a single region created, I believe their customers should take responsibility for the unexpected weaknesses our reliance on a single hosting provider has created.
Heroku still has the value of added resiliency, even if it's not 110% bulletproof. Ultimately, we're responsible for the architecture design of our own sites.
fdr 1 hour ago | link | parent [dead] | on: Heroku's AWS outage post-mortem
The mechanism is PostgreSQL continuous archiving.
http://github.com/heroku/wal-e
This tool is still quite nascent. It received quite a trial by fire, since before this point it had not yet been rolled out on a wide scale as a value-added feature of the service.
WAL-E is a program that postgresql can use to push database changes to S3.
Depending on how you configure postgresql checkpoints, the most data you'd lose is somewhere between a couple of seconds and a minute. I'd assume Heroku would make it a couple of seconds. The downside to more frequent backups is more storage space: each checkpoint (WAL archive) stored on S3 is a minimum of 75k or so, even if there weren't any changes.
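To put rough numbers on that trade-off, here's a back-of-the-envelope sketch. It takes the ~75k minimum per archived segment at face value and assumes one segment gets pushed per archive interval; the function name and the intervals are made up for illustration, and segments that contain real changes will be bigger than this floor.

```python
# Rough lower bound on S3 storage for WAL archives of an idle database.
# Assumption: every archive interval pushes one segment of at least ~75 KB
# (the figure quoted in this thread, not something measured here).

MIN_SEGMENT_KB = 75
SECONDS_PER_DAY = 24 * 60 * 60


def min_archive_kb_per_day(archive_interval_seconds: int) -> float:
    """Minimum KB archived per day if one segment is pushed every interval."""
    segments_per_day = SECONDS_PER_DAY / archive_interval_seconds
    return segments_per_day * MIN_SEGMENT_KB


# Hypothetical intervals, chosen to span "a couple of seconds" to "a minute".
for interval in (2, 10, 60):
    mb_per_day = min_archive_kb_per_day(interval) / 1024
    print(f"every {interval:>2}s: at least {mb_per_day:,.0f} MB/day")
```

Under those assumptions a one-minute interval is 1,440 segments a day, a bit over 100 MB, while a two-second interval is 43,200 segments, a bit over 3 GB, all before a single row changes.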
I'm not deeply familiar with the Postgres versions of that (to my regret), but for the MySQL version you can read something like this:
http://dev.mysql.com/doc/refman/5.0/en/replication.html
Better yet, find yourself a copy of High Performance MySQL.
MySQL has long relied on statement-based replication, which can lead to server drift in the case of any nondeterministic query. This is a total killer for extensibility of the database, as well as correctness in general. It also has a row-based replication variant that showed up around 2008 and represents a significant improvement, but the search results for "mysql rbr" might give you pause...
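For anyone who hasn't run into the drift problem, here's a toy sketch of it in Python. Nothing in it is MySQL's actual replication code: the "database" is just a dict, the nondeterministic statement is a stand-in for something like an UPDATE that calls a random function, and all the function names are invented for the example.

```python
import random

# Toy model: a "database" is a dict of row id -> value.
# run_statement stands in for a nondeterministic query (hypothetical),
# e.g. an UPDATE that sets a column to a random value.


def run_statement(db: dict, rng: random.Random) -> dict:
    """Execute the nondeterministic statement; return the rows it changed."""
    changed = {row_id: rng.random() for row_id in db}
    db.update(changed)
    return changed


def replicate_statement_based(replica: dict, rng: random.Random) -> None:
    """Statement-based: the replica re-executes the statement itself."""
    run_statement(replica, rng)


def replicate_row_based(replica: dict, changed_rows: dict) -> None:
    """Row-based: the replica applies the row values the primary produced."""
    replica.update(changed_rows)


primary = {1: 0.0, 2: 0.0}
sbr_replica = dict(primary)
rbr_replica = dict(primary)

# Each server gets its own source of randomness, as separate machines would.
changed = run_statement(primary, random.Random(1))
replicate_statement_based(sbr_replica, random.Random(2))
replicate_row_based(rbr_replica, changed)

print("statement-based replica matches primary:", sbr_replica == primary)
print("row-based replica matches primary:", rbr_replica == primary)
```

The statement-based replica ends up with different values because it rolled its own dice, while the row-based replica copies what the primary actually wrote. Real systems have workarounds for some nondeterministic functions, so treat this as the shape of the problem rather than a description of any particular server.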
My guess is that this is why Amazon made the sensible choice to back their RDS (MySQL) product with DRBD and synchronous, block-device-level replication: there is no application-level option for MySQL that can be trusted. This technique can also be used with PostgreSQL. However, DRBD tends to have a punishing performance impact, is complicated, and is not very suitable for a hot standby unless you write very complex shared-storage database software like Oracle RAC, which is why so much effort went into WAL-streaming hot standby in the PostgreSQL community. The DRBD option is venerable, dating back to LiveJournal's early days as their MySQL HA option, and probably before that. (Credit to LiveJournal for documenting their HA setup well, including their use of DRBD.)
So, take 100% of the responsibility, but I wouldn't think any less of heroku if they only took 50%.
It is surprising they don't talk refunds for the downtime, if they are taking responsibility. I'd imagine we will see this coming soon?
It would be easy, tempting and, heck, even reasonable to assign at least a portion of the blame to Amazon. Their approach is interesting because their customers already know that, but are likely to appreciate their forthright acceptance of responsibility.
It's a good lesson. If I'm being totally honest I'd have to admit that, as a developer, I sometimes blame external services or events for things that I have at least partial control over. Perhaps I should adopt Heroku's approach instead.
Sure, you can blame AWS because they said multiple availability zones in the same region would work. But at the same time there is an expectation that a site like Heroku is knowledgeable enough and sophisticated enough to intelligently process what AWS says and determine what's appropriate for them.
Personally, I prefer to just get the blame part out of the way by taking responsibility and concentrate on the important things: fixing the problem and making sure it doesn't happen again.
I think that deep down, people aren't that concerned with whose fault it was. They just want to know that someone is going to fix it.
By suggesting they take responsibility, they also are in a position where they have to make good for all of the downtime their customers experienced.
Short term - that will be an expensive decision. Long term, I think it's the right thing to do. It certainly builds up my confidence level in them.
* http://twitter.com/#!/heroku
* http://twitter.com/#!/herokustatus
1. What exactly is going on?
2. When will it be fixed?
In the middle of a crisis, saying "we're aware of the problem, and we're working hard to fix it," for hours does not really count as communication. It increases customer aggravation rather than decreasing it. Customers want to know answers to the above two questions. They don't care that you know about the problem and that you're working on it, unless you're not doing those two things in which case they will be (and should be) furious; those two things are expected.
Barring the ability to tell your customers "we will be back up at X:00", I think the best approach is to share as much information as you can without getting into proprietary information. That's why I think GP considered their communication a failure. That's why I consider their communication a failure, although I've seen this pattern enough from different companies that I don't hold it against Heroku as long as they learn from it.
In all fairness, I've read that the reddit devs have made lots of boneheaded mistakes in their general infrastructure design, but it still seems Amazon is not a very reliable platform to build your stuff on. Platforms built on top of Amazon's even less so.
AMZN in general is a pretty solid 'platform' (especially if you're not using EBS), but because this whole 'cloud' thing is still partially uncharted territory, there are still holes, and you can't treat it like a normal web host.
The system they are using (IC, ops, engineer teams, operational periods) is extremely similar to the Incident Command System. The ICS was developed about 40 years ago for fighting wildfires, but now most government agencies use it to manage any type of incident.
I've experienced it first hand and can say it works very well, but I have never seen it used in this context. The great thing about it is its expandability: it will work for teams of nearly any size. I'd be interested in seeing if any other technology companies/backend teams are using it.
Everyone makes mistakes, so what matters is how you deal with them. This was the right way to respond. Thanks.
Based on every post-mortem I've read thus far, it's clear that how AWS and its customers approach EBS will change.
Let's be realistic about this; for most people using heroku the alternative would have been bare ec2, and could easily have suffered the same fate as on heroku.
Everyone should feel positive that they got to spend ~60 hours just sitting around moaning about being let down, instead of having to sweat their nuts off attempting to rehabilitate crazy, suicidal infrastructure.
Even taking this downtime into account, heroku is still cost effective for me in a lot of cases.
Heroku should save the customers this pain, by setting up anycast:
https://secure.wikimedia.org/wikipedia/en/wiki/Anycast#Domai...
Kudos!