
0 points · notacoward · 4y ago · 0 comments

In a not-too-distant alternate universe, they made the rookie assumption that every change to every system is trivially reversible, only to find that it's not always true (especially for storage or storage-adjacent systems), and ended up making things worse. Naturally, people in alternate-universe HN bashed them for that too.


4 comments · 2 top-level

yashap · 4y ago · 2 in thread

Obviously I'm on the outside looking in here, so I can't say anything with confidence. But I've been on call consistently for the past 9 years, for some decent-sized products (not Roblox scale, but on the order of 1 million active users), mitigating more outages than I can count. For any major outage, the playbook has always been something like this:

1. Which system is broken?

2. Are there any recent changes to this system? If so, can we try reverting them?

They did "1" and quickly identified Consul as the issue. They had made a significant Consul change the day before, one they were clearly cautious/worried about (i.e. they'd been slowly adopting the new Consul streaming feature, service by service, for over a month, and did a big rollout of it the previous day). And once they did identify streaming as the issue, it was indeed quick to roll back. It just seems like they never tried "2" above, which is strange to me and very contrary to my experience being on call at multiple companies.
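To make step "2" of the playbook above concrete, here is a minimal sketch of pulling rollback candidates out of a change log. Everything in it is an assumption for illustration: the `Change` record, the `rollback_candidates` helper, and the two-day lookback window are hypothetical and are not taken from any real tooling at Roblox or elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """Hypothetical change-log entry; the field names are illustrative."""
    system: str
    description: str
    deployed_at: datetime


def rollback_candidates(changes, system, incident_start,
                        lookback=timedelta(days=2)):
    """Return changes to `system` deployed within `lookback` before the
    incident started, newest first."""
    window_start = incident_start - lookback
    recent = [
        c for c in changes
        if c.system == system
        and window_start <= c.deployed_at <= incident_start
    ]
    # Newest first: the most recent change is usually the first suspect.
    return sorted(recent, key=lambda c: c.deployed_at, reverse=True)
```

A list like this only tells you what changed recently. It says nothing about whether a given change is the culprit or whether reverting it is safe.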

Karrot_Kream · 4y ago

If you're doing a slow rollout, it's not always easy to tell whether the thing you're rolling out is the culprit. I've been on the other side of this, where we had an outage and suspected a slow change we had been rolling out, especially because we opted something new into it minutes before the incident, only to realize later when the dust settled that it was completely unrelated. When you're running at high scale like Roblox and have lots of monitoring in place and multiple pieces of infrastructure at multiple levels of slow rollout, outages like this one don't quickly point to a smoking gun.

notacoward (OP) · 4y ago

What do you do when you're working on a storage system and rolling back a change leaves some data in a state that the old code can't grok properly? I've seen that cause other parts of the system (e.g. repair, re-encoding, rebalancing) mangle it even further, overwrite it, or even delete it as useless. Granted, these mostly apply to code changes rather than config, but it can also happen if code continues to evolve on both sides of a feature flag, and both versions are still in active use in some of the dozens of clusters you run. Yes, speaking from experience here.
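As a rough illustration of that failure mode, here is a minimal sketch of a pre-rollback check. It assumes on-disk data is tagged with a format version and that each release knows the highest version it can read. The `Release` type and `safe_to_roll_back` function are hypothetical names, not part of Consul, BoltDB, or any system mentioned in this thread.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Release:
    """Hypothetical release descriptor."""
    name: str
    max_readable_format: int  # highest on-disk format version this code can read


def safe_to_roll_back(target: Release, formats_on_disk: Iterable[int]) -> bool:
    """True only if every format version currently on disk is one the
    rollback target can read. An empty store is trivially safe."""
    return all(v <= target.max_readable_format for v in formats_on_disk)
```

For example, rolling back to a release that reads up to format 3 is refused if the newer code has already written format 4 data. That is exactly the case where a "quick revert" can make things worse.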

While it's true that rolling back recent changes is always one of the first things to consider, we should acknowledge that sometimes it can be worse than finding a way to roll forward. Maybe the Roblox engineers had good reason to be wary of pulling that trigger too quickly when Consul or BoltDB were involved. Maybe it even turned out, in perfect 20/20 hindsight, that forgoing that option was the wrong decision and prolonged the outage. But one of the cardinal rules of incident management is that learning depends on encouraging people to be open and honest, which we do by giving involved parties liberal benefit of the doubt for trying to do the right thing based on information they had at the time. Yes, even if that means allowing them to make mistakes.

erosenbe0 · 4y ago

Spot on. And some things are easily reversible to the extent that they alleviate the downtime, yet still leave a large data sync or ETL job to complete in their wake. Until that job is done, the effect is continued loss of function or customer data at some lesser level of severity.
