I agree that the improvements made to their deployment tooling were good and necessary; they take the human temptation to skip steps out of the equation.
But this exemplifies a major problem our industry suffers from, in that it is just taken as a given that critical errors will sometimes make their way into production servers and the best we can do is reduce the impact.
I find this absolutely unacceptable. How about we short circuit the process and identify ways to stop that from happening? Were there enough code reviews? Did automated testing fail here? Yes, I'm familiar with the halting problem and the limitations of formal verification on Turing-complete languages, but I don't believe that's an excuse.
This is tantamount to saying "yeah sometimes our airplanes crash, so from now on we'll just make sure we have fewer passengers ride in the newer models".
All server software has one or more "infinite loops." It is a fundamental object in all listeners.
Plus when they say infinite loop, I assumed they meant it continuously entered a crash/restart cycle rather than a while(true) {} in a line of code.
I think the reality on the ground is that bug-free software is a myth. All you can do is have processes (like gradual deployment) to mitigate the damage it can do, rather than making it your goal to write the mythical perfect code.
> But this exemplifies a major problem our industry suffers from, in that it is just taken as a given that critical errors will sometimes make their way into production servers and the best we can do is reduce the impact. I find this absolutely unacceptable
It is a major problem. It costs billions every year. But what can be done? If there were a magic wand solution, I'm sure people would be scrambling to deploy it, as it would save them money.
> How about we short circuit the process and identify ways to stop that from happening? Were there enough code reviews? Did automated testing fail here?
This seems like a somewhat naive view of software development in general. Like what I'd call a "mathematician's view," in the sense that they think large complex systems can be reduced to a simple quantifiable process.
Code reviews and, more importantly, unit tests can help find bugs. But inter-connectivity between large complex systems is harder to test against, and harder to code review (because the bugs don't exist on any single line of code, or even in any single block).
Which would be a pretty reasonable thing to say if you had a large portion of the population on a single plane. What this all comes down to is: problems (small or big; stupidly simple or ridiculously complex) will happen. Isolating problems to the smallest number of people possible is the responsible course of action.
That, of course, isn't mutually exclusive with doing better from a software engineering front.
Things break. You cannot design something to be perfect. That is why there is redundancy in every critical system. You are better off having your system gracefully recover from failure than trying to design the system perfectly. That might mean spreading the traffic across multiple instances within one data center so when one breaks (for any reason), the others pick up the slack. The next level is spreading across multiple data centers so if one place goes down, you have another to pick up the slack. Arguably the next level would be going across multiple providers, but to me that seems like overkill.
Given that, if Microsoft rolls it out to 5% of servers and those crap out, then for individual customers who are properly spread over multiple instances that is roughly equivalent to a spate of hard drive failures. This only breaks down when they roll the broken stuff out to 100% at once.
Generally it comes down to the 'insurance' argument - why didn't we spend the time (read: money) to test for and prevent X?
The answer comes down to the risk/benefit.
It's possible to insure your house against total loss, against any type of threat. You could build it on the shoreline of a known hurricane location, or on top of an active volcano, and insure it for full replacement. All you have to do is deposit an amount equal to the replacement cost in an account - if you suffer total loss, spend the money and replace the house.
Where people go wrong is by thinking that problems can and should be prevented at any cost. But the issue is that thinking that way leads to excessive costs for the thing in the first place. It would be possible to design a highway system where nobody ever died. However the cost would be so high that very few highways would be built, so the advantages of cheap and easy travel are lost.
Likewise, it would be possible for Microsoft to build automated testing and checking software that never made a mistake. However, that would make Azure uncompetitive or unprofitable. It's cheaper just to hire good people and accept that occasionally, something might go wrong.
Some software is actually made to never go wrong. That software is in satellites and Mars rovers and the like. Even then mistakes happen due to the nature of complexity and probability. But the cost per line of delivered code for a satellite is orders of magnitude higher than the cost of Azure management code.
You really only need to look at problems when the cost of the fix is much lower than the cost of the potential loss. That's why planes are safer than cars - because the loss of a big plane and its passengers is a very costly event.
Not only was it deployed to everyone but it was deployed, untested, to the wrong place. Otherwise deploy would have been just as broad but successful.
I really do love tests and all, but they only get you so far. In fact, you're far more often bitten by things that are outside your frame of reference, and those are exactly the ones you don't take into account when designing a testing pipeline.
One defense is a "canary" deployment process (they used the term "flighting") to ensure major changes are rolled out slowly enough to detect major performance shifts. Had their deployment process worked correctly, they may have been able to roll back the change without incident.
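For what it's worth, here is a minimal sketch of what a staged ("canary") rollout with a health gate could look like. This is not Azure's tooling; `staged_rollout`, `apply_change`, `revert_change` and `is_healthy` are hypothetical names made up for illustration, and it assumes a revert is always possible, which was not true for the front-ends already stuck in the loop.

```python
from typing import Callable, Iterable, List, Tuple


def staged_rollout(
    stages: Iterable[List[str]],
    apply_change: Callable[[str], None],
    revert_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[bool, List[str]]:
    """Apply a change one stage at a time (hypothetical helper).

    After each stage, every server touched so far is health-checked.
    If any is unhealthy, everything touched is reverted and the
    rollout stops before reaching the remaining stages.

    Returns (succeeded, servers_running_the_change).
    """
    touched: List[str] = []
    for stage in stages:
        for server in stage:
            apply_change(server)
            touched.append(server)
        # Health gate: do not widen the rollout past a bad stage.
        if not all(is_healthy(server) for server in touched):
            for server in reversed(touched):
                revert_change(server)
            return False, []
    return True, touched
```

The point of the design is that the blast radius is bounded by the size of the stage that first exposes the problem: servers in later stages are never touched.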
A second defense is proactively building "safeties" and "blowoff valves" into your software. Example: if a client notices a huge spike in errors, back off before retrying a connection request, otherwise you may put the system into a positive feedback loop. Ethernet collision detection/avoidance is a great example of a safety mechanism done well.
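As a rough illustration of the "back off before retrying" idea, here is a sketch of capped exponential backoff with random jitter, similar in spirit to Ethernet's binary exponential backoff. The function name `call_with_backoff` and its parameters are my own assumptions, not something from the RCA, and it only retries on `ConnectionError`.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    attempts: int = 6,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Retry `operation` on ConnectionError, waiting longer each time.

    The wait before retry k is drawn uniformly from
    [0, min(cap, base * 2**k)], so a crowd of failing clients spreads
    out instead of hammering the server in lockstep.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

The jitter matters as much as the exponent: without it, clients that failed together retry together, which is exactly the positive feedback loop described above.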
Finally, every high-scale domain has its own problems, which experienced engineers know to worry about. In my case, at an analytics provider, one of the hardest problems we face is data retention: how much to store, at what granularity, for how long, and how that interacts with our various plan tiers. OTOH we have significant latitude to be "eventually correct" or "eventually consistent" in a way a bank, stock exchange, or other transactional financial system (e.g. credit approval) can't be. I imagine that in other domains like ad serving, video serving, game backend development, etc. there are similar "gotchas", but I don't know what they are.
“Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?”
A more complete one would look at what each engineer had contributed. If the one with the foul-up has contributed and continues to contribute significant value, and doesn't repeat the same mistakes, it's a good call.
If the careful worker both avoids errors (and costly mistakes) and exceeds other engineers in contributing value, there's a strong argument for keeping them.
There are people who simply foul things up. And there are those who avoid mistakes by simply never taking risks. You almost certainly want to discard the first. The second's value depends on the value your organization gains from innovation.
The first thing my boss said the next morning before even explaining what happened was that he was the one responsible and I was not to blame. Oh and he gave me a raise.
So Azure supports Linux VMs?! Microsoft does so little Azure advertising that I had to learn this fact from their RCA. Apparently they have supported it since 2012: http://www.techrepublic.com/blog/linux-and-open-source/micro... but it is likely that many non-users of Azure do not know this.
Although yes, their advertising of Azure and Azure features isn't very huge. The ads that I have seen are for "Microsoft Cloud" (http://www.microsoft.com/enterprise/microsoftcloud/default.a...) which is a combination of products and technologies.
I feel that Microsoft is on fairly even footing with AWS and Google Cloud.
You can view the recorded video on: https://hacksummit.org/
Another minor outage though, I think. They recently changed the permissions on /mnt, which contains the temp disk. Now various services that were using that for temp space fail to start. Easy enough to fix by chmod on start but still...
Ugh, I wouldn't want to be that guy (even if there would be no direct repercussions). That said, and as others have highlighted - kudos on the writeup and openness.
1. Since we tested the change on a subset of A for a few weeks, we can assume it will work for all of A.
2. Since we tested the change on a subset of A for a few weeks, we can assume it will work for all of A and all of B.
#1 seems reasonable, but #2 is what needed to hold true in order for there to be no problems, since the change was actually enabled for all of A and B.
But was the engineer actually advocating to enable the change in B, or was that an accident during the manual deployment?
3. Since we tested this change on a subset of A and a subset of B, we can assume it will work for all of A and all of B.
That's the thing about these kinds of bugs...they are, by definition, tricky enough to have passed testing unseen.
These kinds of "almost took X offline" incidents happen All The Time, it's just that most of the time they get caught before it gets too far. It's inevitable that a few will squeak through the nets.
Mistakes can and will happen anywhere we allow them to. If you want to prevent mistakes, write tools to help reduce the "attack surface" (areas where mistakes can be made). E.g., don't want someone to be able to do "sudo reboot" accidentally? Alias reboot to something else. It won't stop hackers but it might help fight fat fingers.
Accidents happen.
For example they discover a pilot made a mistake. But they don't end it there, they then look at the airline's training materials, see if other pilots would repeat the same mistakes, and so on until they reach a point where they have a "this won't happen again" resolution (rather than simply discovering what happened).
I feel like with Microsoft's breakdown they did the "this is what happened" post-mortem but then went to the next level and said "here's why this happened, and here is why it won't happen again."
I've found it to be pretty standard in the hosting world. I assume because if you have unexplained outages, customers leave.
A true root cause would go deeper and ask why it is that an engineer could single-handedly decide to roll out to all slices?
The surface-level answer is that the Azure platform lacked tooling. Is that the cause or an effect? I think it is an effect. There are deeper root causes.
Let's ask -- why was it that the design allowed one engineer to effectively bring down Azure?
We often stop at these RCAs when it gets uncomfortable and it starts to point upwards.
I say this to the engineer who pressed the buttons: Bravo! You did something that exposed a massive hole in Azure which may have very well prevented a much bigger embarrassment.
Because writing code which contains a large number of checks and balances is generally orders of magnitude more expensive than human trust/judgment on the Ops team. Reading the postmortem makes me think that this sort of failure could have happened to anyone, and no-one really did anything wrong. The mistake was the blob store config flag not getting flipped, which is just a natural human error. The engineer who did the roll out could have been any of us. Given what he/she knew, he/she thought they had a good soak test (and a couple of weeks is a pretty good soak test) and made a call, similar calls he/she makes a number of times every day. This one didn't pan out.
I would hazard that most companies have a big red rollout button that is reserved for trusted engineers that will do a rollout without all the checks you're requesting.
For critical infrastructure companies there is the usual rule of "four eyes" for roll outs.
So, while it may be the case that most companies will have a trusted person with the keys to the rollout car, the more critical the mission gets, the more levels of human checks are put in.
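To make the "four eyes" rule concrete, here is a small sketch of an approval gate where the author of a change cannot be the one who signs off on it. `RolloutRequest` and its fields are hypothetical names for illustration only; real change-management systems are obviously more involved.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class RolloutRequest:
    """A change that needs a second pair of eyes before deployment."""

    change_id: str
    author: str
    approvals: Set[str] = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The author's own eyes do not count as the second pair.
        if reviewer == self.author:
            raise PermissionError("author cannot approve their own rollout")
        self.approvals.add(reviewer)

    def can_deploy(self, required_approvals: int = 1) -> bool:
        # Author plus at least one distinct reviewer gives "four eyes".
        return len(self.approvals) >= required_approvals
```

Raising `required_approvals` for the most critical systems is one way to express "the more critical the mission, the more checks".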
Maybe that's what the RCA should have said -- we F-ed up designing and managing the rollout process. An engineer just fell victim to it.
".... We reverted the change globally within 30 minutes of the start of the issue which protected many Azure Blob storage Front-Ends from experiencing the issue. The Azure Blob storage Front-Ends which already entered the infinite loop were unable to accept any configuration changes due to the infinite loop. These required a restart after reverting the configuration change, extending the time to recover."
The ten+ hours extension was the vast majority of the outage time; why wasn't the reason for this given? More importantly, what will be done to prevent a similar extension in the time Azure spends belly up if at some point in the future, the Blob servers go insane and have to be restarted?
That may help address these questions. Just FYI, I am an engineer in the Azure compute team.
Cons:
- This is almost 30 days after the incident
- Look at the regions, it was global!
- This was a whole chain of issues. I count it as 5 separate issues. This goes deep into how they operate and it does not paint a picture of operational maturity:
1: configuration change for the Blob Front-Ends exposed a bug in the Blob Front-Ends
2: Blob Front-Ends infinite loop delayed the fix (I count this as a separate issue though I expect some may not)
3: As part of a plan to improve performance of the Azure Storage Service, the decision was made to push the configuration change to the entire production service
4: Update was made across most regions in a short period of time due to operational error
5: Azure infrastructure issue that impacted our ability to provide timely updates via the Service Health Dashboard
That is quite a list. [Edit : formatting only]
What would have been the optimal response time? They fixed the immediate problem as fast as they could and gave a preliminary RCA, then they did a longer-term RCA and fix. I feel this shows maturity by not rushing to immediate conclusions and trying to do a 5-Whys drill-down to fix the underlying cause. Furthermore, they also took steps to actually fix the problem by pointing out they moved the human out of the loop in one aspect and that's always a good thing (unless the replacement software is faulty itself of course).
Also, in response to the list, I believe [3&4] are actually the same thing, are they not? The operator was the one who made the 'decision' by accidentally ignoring the incremental config change policy that was in place and did it all at once. This was identified as a human error and they fixed it by enforcing incremental changes.
I agree with you though, and said so at the time. These issues seem systemic, not isolated. Cascading failures, some in code, some in ops procedures, indicate to me they still have work to do.
>In summary, Microsoft Azure had clear operating guidelines but there was a gap in the deployment tooling that relied on human decisions and protocol.
Hopefully there is a way to disable this policy adherence for when you really need to push out a configuration or code change everywhere quickly.
If we wish for a future where cloud computing will be considered reliable enough for air traffic control systems, then management of this infrastructure requires a level of dedication and commitment to process and training.
Failover zones need to be isolated not only physically, but also from command and control. A lone engineer should not have sufficient authority or capability to operationally control more than one zone. It is extremely unnerving for enterprises to see that a significant infrastructure like Azure has a root account which can take down the whole of Azure.
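A toy sketch of what "isolated command and control" could mean in code: each credential is bound to exactly one zone, and the control plane refuses any change outside it. `Credential`, `ControlPlane` and `set_flag` are invented names for illustration and say nothing about how Azure is actually built.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Credential:
    engineer: str
    zone: str  # the single zone this credential may control


class ControlPlane:
    """Holds per-zone configuration and enforces zone-scoped access."""

    def __init__(self, zones: Iterable[str]) -> None:
        self.config: Dict[str, Dict[str, str]] = {zone: {} for zone in zones}

    def set_flag(self, cred: Credential, zone: str, key: str, value: str) -> None:
        if zone not in self.config:
            raise KeyError(zone)
        # No credential reaches across zones, so one mistake stays in one zone.
        if cred.zone != zone:
            raise PermissionError(
                f"{cred.engineer} is scoped to {cred.zone}, not {zone}"
            )
        self.config[zone][key] = value
```

With this shape, pushing a change everywhere requires one credential holder per zone, so a global change cannot be made by a single engineer acting alone.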
I also hope that no one who recommended Azure to their employer got fired either.