Every outage I read about, something like that happened. At least Asana didn't blame the technology they were using.
Customers appreciate transparency, but perhaps delving into the fine details of the investigation (various hypotheses, overlooked warning signs, yada yada) might actually end up leaving the customer more unsettled than they would have been otherwise.
Today I learned that Asana had a bunch of bad deploys and put the icing on the cake with one that resulted in an outage the next day.
This is coming from someone who runs an ad server - if that ad server goes down it's damn near catastrophic for my customers and their customers. When we do have a (rare) outage, I sweat it out, reassure customers that people are on it, and give a brief, accurate, and high level explanation without getting into the gruesome details.
I'm not saying my approach is best, but I do think trying not to scare people with your explanation is worth considering.
They require us to actually do the work of identifying the issues and writing up what happened and why. I realize that having a customer contract shouldn't be a requirement for doing this, but human psychology is a funny thing. I can turn to my PM and say "I have to do this, it's part of the contract" and they immediately back off.
I agree it might not be the best solution but it's definitely better than not doing them.
The latter is useful, for example, when my boss asks me to evaluate whether to continue using a service after an incident. If I can't get enough information to make a recommendation, I might propose a switch out of distrust, especially when the problem was related to security or privacy.
... That kind of defeats the purpose of "dogfooding". Sure, you have to use the same code (hopefully) but it doesn't give you the same experience.
If you're working in Slack or chat, you've got a minimum of half a dozen people typing and putting out suggestions and offering to investigate something. That's all time stamped. And even if you're not doing that real-time, you may be using something like a GitHub issue to discuss the problem via comments, which are also time-stamped.
At the moment of the incident, probably no one is going "Ah, it's 8:01, better write down that I identified the problem." It's most likely "hey, I think I got it, one sec" and then that works. Or doesn't. But hopefully it does.
Judging from the number of 'sorry's in the text, it seems like postmortems have slowly evolved into a very specialized form of semi-fictional stage drama in which the audience is pandered to excessively through hyperbolic apology.
We roll back by reverting to a previous release on the load balancers, which is usually pretty much instant. The previous releases were bad and had themselves been rolled back, which is a rare situation for us. So there was a bit of scrambling through the chat logs to determine a safe (non-rolled-back) release we could roll back to. Then the high CPU made our rollback really, really slow. Then we still had old processes running the bad release, and killing them on webservers with high CPU took a while to actually work. Then it took a bit of time for load to come down on its own. All of this took place within the 8:08-8:29 window reported in the post. And I'm still simplifying a lot.
Also, when you only deploy twice a day, it's harder to tell which of the included changes caused the problem. That's an argument for more frequent deploys!
In that case, should you be doing daily deployments to production?
Are the daily drops predominantly bug fixes, or also a regular drip of new functionality?
I think the old world of quarterly releases was also bad for other reasons. I'm curious about the right middle point.
Every time a company like Asana comes clean about outages and software quality issues, the canon of knowledge improves. Thank you for sharing!
Performance is the hardest thing to integration test for. Keeping careful track of CPU/memory/network/disk load with automated alerts can help.
(Fancy systems like running a traffic replica can help, too, but at a much higher cost.)
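The cheap version of "automated alerts on CPU/memory load" is just a rolling average with a threshold. A minimal sketch, with the window size and 90% threshold as illustrative assumptions:

```python
from collections import deque

def make_cpu_alerter(window: int = 5, threshold: float = 90.0):
    """Return a callable that records CPU samples and fires once the
    rolling average over the last `window` samples exceeds `threshold`."""
    samples = deque(maxlen=window)

    def record(cpu_percent: float) -> bool:
        samples.append(cpu_percent)
        # Only alert once the window is full, to avoid startup noise.
        return len(samples) == window and sum(samples) / window > threshold

    return record

alert = make_cpu_alerter()
readings = [40, 55, 93, 95, 97, 98, 99]
fired = [alert(r) for r in readings]
print(fired)  # [False, False, False, False, False, False, True]
```

Real monitoring stacks express the same idea declaratively, but the point stands: a sustained-load rule like this would have paged someone well inside the outage window.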
Additionally, CPU alarms on the web servers should've informed them that the app was inaccessible because the web servers did not have sufficient resources to serve requests. This can be alleviated prior to pinpointing the cause by a) spinning up more web servers and adding them to the load balancer; or b) redirecting portions of the traffic to a static "try again later" page hosted on a CDN or static-only server. This can be done at the DNS level.
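Option (b) above can also be done at the load-balancer layer rather than DNS: deterministically shed a fraction of requests to the static fallback page so the surviving servers stay under capacity. A hedged sketch; the fallback URL, hashing scheme, and shed fraction are all illustrative assumptions:

```python
import hashlib

# Hypothetical static "try again later" page on a CDN.
FALLBACK_URL = "https://status.example.com/try-again.html"

def route(request_id: str, shed_fraction: float) -> str:
    """Send `shed_fraction` of traffic to the fallback page.
    Hashing the request/client id keeps the decision sticky, so a
    given client sees a consistent response while load recovers."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    if bucket < shed_fraction * 100:
        return FALLBACK_URL
    return "upstream"

# Shed nothing under normal load; shed everything in a full brown-out.
print(route("client-42", 0.0))  # upstream
print(route("client-42", 1.0))  # the fallback URL
```

The same knob lets you ramp traffic back gradually as CPU recovers, instead of the thundering-herd you get from flipping DNS all at once.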
Let this be a lesson to all of us. Have basic dashboards and alarming.
I'm not sure what you're using for dashboards but Datadog makes it pretty easy to find this stuff. I'm not a Datadog shill and I actually am not a huge fan of the product, but it's what we use and it's been a big help over our previous Munin installation.
Other process changes that could prevent this are good load testing in a stage environment and getting your company using the real prod code on the real prod infrastructure as its main/default install. A lot of the benefits of "dogfooding" are lost if you're using alpha code on dev-only boxes (as you state that you are in another comment).
As another commenter said, I'm not sure that postmortems like this are valuable unless the problem was particularly complex/interesting. I'm sure that a lot of people at Asana know how to fix this and that it's just a matter of getting management to allow them to do so. I'm sure you owe your customers an explanation of some sort, but I don't know if you need to get into details that say "Yeah, it was just a pretty typical organizational failure, we really should've known better". Everyone has those, but it's best not to publicize them too much.
I'm not going to hold it against Asana because I've worked at a lot of companies and I know how this goes, but when people come here and analyze the cause, as a postmortem invites the readers to do, you seem a little defensive. Perhaps it's best to keep the explanation more brief/vague when it's not a complex failure.