1. Which system is broken?
2. Are there any recent changes to this system? If so, can we try reverting them?
They did "1" and quickly identified Consul as the issue. They had made a significant Consul change the day before, one they were clearly cautious/worried about (i.e. they'd been slowly adopting the new Consul streaming feature, service by service, for over a month, and did a big rollout of it the previous day). And once they did identify streaming as the issue, it was indeed quick to roll back. It just seems like they never tried "2" above, which is strange to me and very contrary to my experience being on call at multiple companies.
While it's true that rolling back recent changes is always one of the first things to consider, we should acknowledge that sometimes it can be worse than finding a way to roll forward. Maybe the Roblox engineers had good reason to be wary of pulling that trigger too quickly when Consul or BoltDB were involved. Maybe it even turned out, in perfect 20/20 hindsight, that forgoing that option was the wrong decision and prolonged the outage. But one of the cardinal rules of incident management is that learning depends on encouraging people to be open and honest, which we do by giving involved parties liberal benefit of the doubt for trying to do the right thing based on the information they had at the time. Yes, even if that means allowing them to make mistakes.