Every outage I read about, something like that happened. At least Asana didn't blame the technology they were using.
Customers appreciate transparency, but perhaps delving into the fine details of the investigation (various hypotheses, overlooked warning signs, yada yada) might actually end up leaving the customer more unsettled than they would have been otherwise.
Today I learned that Asana had a bunch of bad deploys and put the icing on the cake with one that resulted in an outage the next day.
This is coming from someone who runs an ad server - if that ad server goes down it's damn near catastrophic for my customers and their customers. When we do have a (rare) outage, I sweat it out, reassure customers that people are on it, and give a brief, accurate, and high level explanation without getting into the gruesome details.
I'm not saying my approach is best, but I do think trying not to scare people with your explanation is worth considering.
They require us to actually do the work of identifying the issues and writing up what happened and why. I realize that having a customer contract shouldn't be a requirement for doing this, but human psychology is a funny thing. I can turn to my PM and say "I have to do this, it's part of the contract" and they immediately back off.
I agree it might not be the best solution but it's definitely better than not doing them.
The latter is useful, for example, when my boss asks me to evaluate whether to continue using a service after an incident. If I can't get enough information to make a recommendation, I might propose a switch out of distrust, especially when the problem was related to security or privacy.
... That kind of defeats the purpose of "dogfooding". Sure, you have to use the same code (hopefully) but it doesn't give you the same experience.
If you're working in Slack or chat, you've got a minimum of half a dozen people typing and putting out suggestions and offering to investigate something. That's all time stamped. And even if you're not doing that real-time, you may be using something like a GitHub issue to discuss the problem via comments, which are also time-stamped.
At the moment of the incident, probably no one is going "Ah, it's 8:01, better write down that I identified the problem." It's most likely "hey, I think I got it, one sec" and then that works. Or doesn't. But hopefully it does.
Judging from the number of 'sorry's in the text, it seems like postmortems have slowly evolved into a very specialized form of semi-fictional stage drama in which the audience is pandered to excessively through hyperbolic apology.
We roll back by reverting to a previous release on the load balancers, which is usually pretty much instant. The previous releases were bad and had themselves been rolled back, which is a rare situation for us. So there was a bit of scrambling through the chat logs to determine a safe (non-rolled-back) release we could roll back to. Then the high CPU made our rollback really, really slow. Then we still had old processes running the bad release, and killing them on webservers with high CPU took a while to actually work. Then it took a bit of time for load to come down on its own. All of this took place within the 8:08-8:29 window reported in the post. And I'm still simplifying a lot.
Also, when you only deploy twice a day, it's harder to tell which of the included changes caused the problem. That's an argument for more frequent deploys!
In that case, should you be doing daily deployments to production?
Are the daily drops predominantly bug fixes, or also a regular drip of new functionality?
I think the old world of quarterly releases was also bad for other reasons. I'm curious about the right middle point.
Every time a company like Asana comes clean about outages and software quality issues, the canon of knowledge improves. Thank you for sharing!
Performance is the hardest thing to integration test for. Keeping careful track of CPU/memory/network/disk load with automated alerts can help.
(Fancy systems like running a traffic replica can help, too, but at a much higher cost.)
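The cheap version of "automated alerts on CPU/memory load" is just a rolling average with a threshold. A minimal sketch, with the window size and 90% threshold as illustrative assumptions:

```python
from collections import deque

def make_cpu_alerter(window: int = 5, threshold: float = 90.0):
    """Return a callable that records CPU samples and fires once the
    rolling average over the last `window` samples exceeds `threshold`."""
    samples = deque(maxlen=window)

    def record(cpu_percent: float) -> bool:
        samples.append(cpu_percent)
        # Only alert once the window is full, to avoid startup noise.
        return len(samples) == window and sum(samples) / window > threshold

    return record

alert = make_cpu_alerter()
readings = [40, 55, 93, 95, 97, 98, 99]
fired = [alert(r) for r in readings]
print(fired)  # [False, False, False, False, False, False, True]
```

Real monitoring stacks express the same idea declaratively, but the point stands: a sustained-load rule like this would have paged someone well inside the outage window.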
Additionally, CPU alarms on the web servers should've informed them that the app was inaccessible because the web servers did not have sufficient resources to serve requests. This can be alleviated prior to pinpointing the cause by a) spinning up more web servers and adding them to the load balancer; or b) redirecting portions of the traffic to a static "try again later" page hosted on a CDN or static-only server. This can be done at the DNS level.
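Option (b) above can also be done at the load-balancer layer rather than DNS: deterministically shed a fraction of requests to the static fallback page so the surviving servers stay under capacity. A hedged sketch; the fallback URL, hashing scheme, and shed fraction are all illustrative assumptions:

```python
import hashlib

# Hypothetical static "try again later" page on a CDN.
FALLBACK_URL = "https://status.example.com/try-again.html"

def route(request_id: str, shed_fraction: float) -> str:
    """Send `shed_fraction` of traffic to the fallback page.
    Hashing the request/client id keeps the decision sticky, so a
    given client sees a consistent response while load recovers."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    if bucket < shed_fraction * 100:
        return FALLBACK_URL
    return "upstream"

# Shed nothing under normal load; shed everything in a full brown-out.
print(route("client-42", 0.0))  # upstream
print(route("client-42", 1.0))  # the fallback URL
```

The same knob lets you ramp traffic back gradually as CPU recovers, instead of the thundering-herd you get from flipping DNS all at once.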
Let this be a lesson to all of us. Have basic dashboards and alarming.
I'm not sure what you're using for dashboards but Datadog makes it pretty easy to find this stuff. I'm not a Datadog shill and I actually am not a huge fan of the product, but it's what we use and it's been a big help over our previous Munin installation.
Other process changes that could prevent this are good load testing in a stage environment and getting your company using the real prod code on the real prod infrastructure as its main/default install. A lot of the benefits of "dogfooding" are lost if you're using alpha code on dev-only boxes (as you state that you are in another comment).
As another commenter said, I'm not sure that postmortems like this are valuable unless the problem was particularly complex/interesting. I'm sure that a lot of people at Asana know how to fix this and that it's just a matter of getting management to allow them to do so. I'm sure you owe your customers an explanation of some sort, but I don't know if you need to get into details that say "Yeah, it was just a pretty typical organizational failure, we really should've known better". Everyone has those, but it's best not to publicize them too much.
I'm not going to hold it against Asana because I've worked at a lot of companies and I know how this goes, but when people come here and analyze the cause, as a postmortem invites the readers to do, you seem a little defensive. Perhaps it's best to keep the explanation more brief/vague when it's not a complex failure.