tptacek4y ago

That doesn't sound accurate. Wasn't the major change they ended up rolling back Consul streaming, which they'd enabled months before, and had been slowly rolling out?

twblalock4y ago

Right, but the day before the outage, they enabled streaming for a service that didn't have it turned on. That's a discrete config change, the day before the outage.

yashap4y ago

> Several months ago, we enabled a new Consul streaming feature on a subset of our services. This feature, designed to lower the CPU usage and network bandwidth of the Consul cluster, worked as expected, so over the next few months we incrementally enabled the feature on more of our backend services. On October 27th at 14:00, one day before the outage, we enabled this feature on a backend service that is responsible for traffic routing. As part of this rollout, in order to prepare for the increased traffic we typically see at the end of the year, we also increased the number of nodes supporting traffic routing by 50%.

So they rolled out a pretty significant Consul-related change the day before their massive Consul outage began. They’d been doing a slow rollout, but ramping it up a bunch is a significant change.
