What's interesting is that when this happened, some HN comments suggested it was the return from holiday traffic that caused it. Others said, "nah, don't you think they know how to handle that by now?"
Turns out Occam's razor applied here. The simplest answer was the correct one: return-from-holiday traffic.
This usually works well, under the rationale that "the upstream provider does this for a living, so they must be better at it than us," but if your needs are unusual enough (or you're just a bit "unlucky"), it can fail too.
All this to say that the cloud isn't magic. From a risk/error prevention point of view, it's not that different from writing software for a single local machine: not every programmer needs to know how to manually do memory management, it makes a lot more sense to rely on your OS and malloc (and friends) for this, but the caveat is that you do need to account for the fact that malloc may fail. In the cloud case, one can't just assume that you'll always be able to provision a new instance, scale up a service, etc. The cloud is like a utility company: normally very reliable, but they do fail too.
Slack could have chosen one of many other AWS design patterns such as VPC peering, transit VPC, IGW routing, or colocating more services in fewer VPCs (with more granular IAM role policies to separate operator privileges), to provide an automatically scaled network fabric to connect their services.
(This isn't to criticize Slack's engineering team. They have successfully scaled their service in a short time, and I'm happy with their product overall, and with their transparency in this report. But I think AWS has the world's biggest and most scalable network fabric - it's just a matter of knowing how to harness it.)
Slack wrote an autoscaling implementation that ignored request queue depth and downsized their cluster based on CPU usage alone, so while they knew how to resolve it, I would not go so far as to say they knew how to prevent it. Ignoring the max age of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.
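The blind spot is easy to sketch. A minimal, hypothetical scaling policy (thresholds and names are made up, not Slack's actual implementation) that consults queue age before CPU:

```python
from dataclasses import dataclass

@dataclass
class WorkerStats:
    cpu_utilization: float     # 0.0 - 1.0, averaged across the fleet
    oldest_request_age: float  # seconds the oldest queued request has waited

# Hypothetical thresholds; tune against your own latency budget.
CPU_SCALE_DOWN = 0.25  # fleet looks "idle" below this
MAX_QUEUE_AGE = 0.5    # seconds; older than this means we're falling behind

def scale_decision(stats: WorkerStats) -> str:
    """Scaling on CPU alone is the trap: a fleet stuck waiting on a slow
    dependency shows low CPU while its request queue quietly ages out."""
    if stats.oldest_request_age > MAX_QUEUE_AGE:
        # Work is backing up: never scale down, regardless of CPU.
        return "scale_up"
    if stats.cpu_utilization < CPU_SCALE_DOWN:
        return "scale_down"
    return "hold"
```

The first branch is the whole point: low CPU plus an aging queue means "blocked," not "idle," and a CPU-only policy reads it as the latter.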
It is interesting though, because a lot of blog posts like "How we handled a 3000% traffic increase overnight!" boil down to "We turned up the AWS knob."
What happens when the AWS knob doesn't work?
They explained to me that they'd intentionally slam the production website with external traffic a couple of times per year, at a scheduled time in the middle of the night. Basically an order of magnitude more than they'd ever received in real life, just to try to find the breaking point. The production website would usually go down for a bit, but this was vastly better than the website going down while real users were trying to sign up for the Boston Marathon.
Slack probably should've anticipated this surge in traffic after the holidays, and it might have been able to run some better simulations and fire drills before it occurred.
Sounds like AWS knew how to handle it too.
Given how AWS has responded to past events like this, I'd bet there's an internal post-mortem and they'll add mechanisms to fix this scaling bottleneck for everyone.
Although one thing I'm not clear on is if this was really an AWS issue or if Slack hit one of the documented limits of Transit Gateway (such as bandwidth), after which AWS started dropping packets. If that's the case then I don't see what AWS could have done here, other than perhaps have ways to monitor those limits, if they don't already. The details here are a bit fuzzy in the post.
TL;DR it’s still your responsibility to understand the limitations of your infrastructure decisions and engineer your systems accordingly.
Slack didn’t know how to handle it, they paid AWS hoping the product did what it said on the tin. They didn’t test for this case and got bit.
They have millions of clients they could have coordinated for a load test: pick a time, disable the cache, and fall back to the cache if things started failing.
One approach to problems of scale is to bound scale itself: split the system into multiple disparate silos that never interact with each other under any circumstances, except maybe for quick, constant-time, scale-independent decisions.
In short, do things that don't need scale.
Or maybe their monitoring and response staff was just coming back online.
"My bet is that this incident is caused by a big release after a post-holiday "code freeze". "
I mean, considering Slack is mostly used as a workplace chat mechanism, they should have faced this kind of scenario before and had a solution for it by now.
- Disable autoscaling if appropriate during outage. For example if the web server is degraded, it's probably best to make sure that the backends don't autoscale down.
- Panic mode in Envoy is amazing!
- Ability to quickly scale your services is important, but that metric should also take into account how quickly the underlying infrastructure can scale. Your pods could spin up in 15 seconds but k8s nodes will not!
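Those last two lessons combine into a simple guard. A hedged sketch (all names and numbers are hypothetical, not from Slack's postmortem): veto scale-down while the serving path is degraded, and budget for node spin-up time, not just pod spin-up time:

```python
# Hypothetical constants: pods land in seconds, fresh k8s nodes in minutes.
POD_STARTUP_S = 15
NODE_STARTUP_S = 300
RECOVERY_BUDGET_S = 60  # how fast we must be able to re-add capacity

def effective_scale_up_time(spare_nodes: int) -> int:
    # Pods start fast only if a node already has room for them;
    # otherwise the real lead time is node provisioning.
    return POD_STARTUP_S if spare_nodes > 0 else NODE_STARTUP_S

def allow_scale_down(web_tier_healthy: bool, spare_nodes: int) -> bool:
    # A degraded web tier under-reports load to the backends;
    # downscaling them during an outage makes recovery worse.
    if not web_tier_healthy:
        return False
    # Only shrink if capacity could come back within the recovery budget.
    return effective_scale_up_time(spare_nodes) <= RECOVERY_BUDGET_S
```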
I'll also say that I'm interested in ubiquitous mTLS so that you don't have to isolate teams with VPCs and opaque proxies. I don't think we have widely-available technology around yet that eliminates the need for what Slack seems to have here, but trusting the network has always seemed like a bad idea to me, and this shows how a workaround can go wrong. (Of course, to avoid issues like the confused deputy problem, which Slack suffered from, you need some service to issue certs to applications as they scale up that will be accepted by services that it is allowed to talk to and rejected by all other services. In that case, this postmortem would have said "we scaled up our web frontends, but the service that issues them certificates to talk to the backend exploded in a big ball of fire, so we were down." Ya just can't win ;)
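The authorization half of that idea fits in a few lines. A hypothetical sketch (identities and ACL contents are made up, SPIFFE-style names used for illustration): the workload's cert names the calling service, and each backend checks the caller against its own allowlist instead of trusting the network:

```python
# Identity-based authz on top of mTLS (all names hypothetical). The cert's
# SAN identifies the calling service; the backend consults its own ACL, so
# merely being on the right network path grants no access.
SERVICE_ACL = {
    # target service -> set of caller identities it will accept
    "backend": {"spiffe://example.org/web-frontend"},
    "cert-issuer": {"spiffe://example.org/provisioner"},
}

def authorize(caller_identity: str, target: str) -> bool:
    return caller_identity in SERVICE_ACL.get(target, set())
```

The confused-deputy angle shows up in the second entry: the frontend's identity is accepted by the backend but rejected everywhere else, no matter what network it's on.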
I agree with you that mTLS is the future. It exists within many companies internally (as a VPC alternative!) and works great. There are some problems around the certificate issuer being a central point of failure, but these are known problems with well-understood solutions.
I think there’s mostly a non-technical barrier to be overcome here, where the non-technical executives need to understand that closed network != better security. mTLS’s time in the sun will only come when the aforementioned sales pitch is less effective (or even counterproductive!) for Enterprise Inc., I think.
I wish it were better supported though.
> We’ve also set ourselves a reminder (a Slack reminder, of course) to request a preemptive upscaling of our TGWs at the end of the next holiday season.
Probably the only way to see a problem like this is a flat line on the bandwidth graph, but as the article suggests they had packet drops, which don't appear in the CloudWatch metrics. AWS should add those metrics, IMO.
Traffic picked up heavily on some website or app, AWS didn't auto-scale fast enough or at all and the very systems that are designed to be elastic just tumbled down to a grinding halt?
I used to work in professional kitchens before software, and it feels a lot like the pressure of a really busy night as a line cook. Some people love it.
Hearing about others' similar experiences makes me feel a connection to them, and often teaches me something.
https://www.eschrade.com/page/why-is-fastcgi-w-nginx-so-much...
[0]: https://slack.engineering/hacklang-at-slack-a-better-php/
mod_python was abandoned around a decade ago. It crashes on Python 2.7.
mod_perl was dropped in 2012 with the release of Apache 2.4. It was kicked out of the project but continues to exist as a separate project (not sure if it works at all).
Sounds like the monitoring system needs a monitoring system.
For Prometheus users, I wrote alertmanager-status to let a third-party "website up?" monitoring server check your alertmanager: https://github.com/jrockway/alertmanager-status
(I also wrote one of the main Google Fiber monitoring systems back when I was at Google. We spent quite a bit of time on monitoring monitoring, because whenever there was an actual incident people would ask us "is this real, or just the monitoring system being down?" Previous monitoring systems were flaky so people were kind of conditioned to ignore the improved system -- so we had to have a lot of dashboards to show them that there was really an ongoing issue.)
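"Monitoring the monitoring" is often done as a dead man's switch. A minimal sketch of the pattern (names and intervals are hypothetical, not alertmanager-status's actual API): the primary monitoring system must check in periodically, and a tiny, independent watchdog alerts when the heartbeats stop:

```python
# Hypothetical watchdog constants: expect a check-in every minute,
# tolerate a few missed beats before declaring the monitor itself down.
HEARTBEAT_INTERVAL_S = 60
MISSED_BEATS_BEFORE_ALERT = 3

def monitoring_is_down(last_heartbeat: float, now: float) -> bool:
    """True when the *monitoring system itself* has gone silent.

    The watchdog runs outside the monitored infrastructure, so "no data"
    is distinguishable from "no problems."
    """
    return now - last_heartbeat > HEARTBEAT_INTERVAL_S * MISSED_BEATS_BEFORE_ALERT
```

The key design choice is that the watchdog inverts the alert: silence is the failure condition, which is exactly what a down monitoring system produces.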
I wonder how many VPCs people have before transitioning over to TGW.
https://slack.engineering/building-the-next-evolution-of-clo...
I think you really have to look at metric-based autoscaling and say: is it worth the X% savings per month? Or would I rather avoid the occasional severe headaches caused by autoscaling messing up my day? Obviously this depends on company scale and how much your load varies. I'd rather have an excess of capacity than any impact on users.
The "configuring and testing new instances" part also sounds very fishy to me. Configuration should be done when creating the image and launch template, while testing should be the job of the load balancing layer. Why do we need a separate "provision-service" to piece everything together?
Who knows the different ways they may be able to get out of that. I assume this wasn't one of those times.
I am really not impressed... with the state of IT. I could not have done better, but isn't it too bad that we've built these towers of sand that keep knocking each other over?
It's similar to the whole "buildings from long ago last much longer than today's" argument. It's true in the literal sense, but it ignores the fact that we've gotten much better at reducing the cost of things like skyscrapers and bridges.
In our pursuit of efficiency, we do things like JIT delivery, dropshipping, scaling, building to the minimum spec. Sometimes, we get it wrong and it comes tumbling down (covid, HN hug of death, earthquakes).