People need to be blamed, and responsibility for actions taken (without covering asses)
I have no empathy for Fastly-the-company. I hate the fact that the Internet is centralized around CDNs. I wish this idea of 'but we _must_ run a CDN for our 1QPM blog!' would die in a fire. But I can still empathize with the Fastly engineers handling this shitstorm right now.
People must be held accountable to have good incentives to reduce such outages in the future.
I do agree though that we should always be compassionate and realistic with other humans.
How do you make sure that mistakes don't happen, then? Do you blame and fire people who make mistakes, and hope that the next person put in the same spot doesn't make a mistake? Or do you figure out what caused that person to make the mistake and ensure there are processes in place so that next time this is less likely to happen?
Extrinsic motivators like 'we will give you a bonus' or 'we will fire you' are surprisingly bad at getting people to not fuck things up.
v2. "The issue was caused by a previously unidentified pathway that caused a feedback loop and overloaded our servers in a cascading fashion (or whatever). We have implemented a fix for this and updated our testing and deployment processes to stop similar cascades."
Which solves the problem long term?
As an architect making product choices, v2 wins every time.
(With the caveat that if the cause reveals a fundamental problem with the larger processes/professionalism/culture of the company, especially around security, then I'm not buying that product, and I'm migrating away if we already use it.)
Holding specific people "accountable" for outages doesn't incentivize reducing outages; it incentivizes not getting caught for having caused the outage.
As a result, post-mortems turn into finger-pointing games instead of finding and resolving the root cause of the issue, which costs the company more money in the long run when a political scapegoat is found but the actual bug in the code is not.
I feel like this requires some nuance.
Don't blame an IC for introducing a bug or misconfiguration that led to the outage.
Do consider blaming (and firing!) management if, during the postmortem, it turns out that management stood in the way of fixing systemic problems.
Ultimately, rule #1 should be: don't blame somebody unless malice or gross negligence is proven. Rule #2 should be the assumption that ICs will not have done either. Rule #3 is that sometimes, individual responsibility is required.
Do a post-mortem, work out root causes, work as a unit to ensure this doesn't happen again.
Obviously if there are levels of gross negligence or misconduct discovered during post-mortem, that will need to be dealt with accordingly, but coming into this with an attitude of "we must find someone to blame and incur repercussions" isn't healthy at all.
We are humans - don't forget that.
> Notices will be posted here when we re-route traffic, upgrade hardware, or in the extremely rare case our network isn’t serving traffic. - status.fastly.com
The extremely rare case happened for an hour, which is a very long time in internet time.
- ignoring warnings
- acting against known-to-them best practices
- repeating a previous mistake
But, again, these are just indicators, not a checklist.
Interestingly, any of these can also happen due to stress, burnout, and a generally broken company/team culture. That includes a CYA culture where, if people don't do something fast, they will be blamed for it, and thus they need to move fast and break things.
"An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization."
The best way to tackle mistakes in a team is to ensure the process in place corrects them. The only way to do that is a post-mortem: learning from the mistake. If you blame it on some engineer who did it, that engineer will eventually be replaced by someone else, who may make the same mistake.
And we, especially companies, typically only learn if there is something at stake: stock price, a job, customers, liability, etc.
(Call me old fashioned, but what I learned from it, having no stake in the game, is we are truly demolishing the resilient, decentralised nature of the internet; or already have done so)
Post-mortems make far more interesting submissions IMO, but I suppose people up-vote 'yes down for me too'.
We do not have a system that adjusts to "oops".
A good leader will take the hit (and the repercussions) for their underlings, compensate customers where compensation can make it better (and offer to make it easy to use fallbacks if this happens again) -- and internally fix the problem so it can't happen again, without throwing anyone to the dogs.
What I think this syntactically invalid sentence is trying to say is:
People need to be blamed, and held responsible for actions taken.
Why do people need to be blamed? Why do we need to make someone the scapegoat? What does being held responsible look like?
Let's say we find some sacrificial engineer to pin this on:
* does the downtime magically disappear?
* does the engineer's suffering (say, losing their job or whatever) make your downtime meaningful? Will you somehow recoup your lost revenue from it?
* does the fact that there's a scapegoat mean that everyone else at fastly is perfect and it's ok to keep using them?
Empathy and responsibility are not mutually exclusive.
This. People talk about "HugOps", "empathy", and all that, but a worldwide incident affecting a huge number of time-critical customers (e.g. trading, HFT, cargo, food delivery) for an hour has catastrophic consequences.
I hope the engineers also understand the other side and why we are paying huge sums of cash for their service.