This usually works well, under the rationale that "the upstream provider does this for a living, so they must be better than us at it", but if your needs are unusual enough (or you're just a bit "unlucky"), it can fail too.
All this to say that the cloud isn't magic. From a risk/error-prevention point of view, it's not that different from writing software for a single local machine: not every programmer needs to know how to do manual memory management, and it makes a lot more sense to rely on your OS and malloc (and friends) for this, but the caveat is that you do need to account for the fact that malloc may fail. In the cloud case, you can't just assume that you'll always be able to provision a new instance, scale up a service, etc. The cloud is like a utility company: normally very reliable, but outages do happen.
Isn't that literally supposed to be the sales pitch for the cloud? Get away from the infrastructure as a whole so you can focus on code, and let the cloud providers wave their magic wand to enable scaling?
If you're saying the story is now "rely on them to auto-scale, until they don't", then why would I bother? Now you're telling me I need to go back to having infrastructure experts, which means I can save a TON of money by going with a hosting provider that allows allocation of resources via API (which is basically all of them).
The cloud isn't some magic thing that solves all scaling problems, it's a tool that gives you strong primitives (and once you're a large enough customer, an active partner) to help you solve your scaling problems.
Slack knew how to set up their infrastructure. Nothing in the postmortem implies AWS was misconfigured. AWS spotted the problem and fixed it entirely on their side.
Nothing in this report suggests that Slack has unique usage patterns. Users returning to work after Christmas is not a phenomenon unique to Slack.
Their problems were:
1. The AWS infrastructure broke due to an event as predictable as the start of the year. That's on Amazon.
2. Their infrastructure is too complicated. Bad auto-scaling heuristics created chaos by shutting down machines whilst engineers were logged into them (and it's not like this was even a good way to save money), and their separation of Slack into many different AWS accounts created weird bottlenecks they had no way to understand or fix.
3. They were unable to diagnose the root cause and the outage ended when AWS noticed the problem and fixed their gateway system themselves.
> The cloud isn't some magic thing that solves all scaling problems
In this case it actually created scaling problems where none needed to exist. AWS is expensive compared to dedicated machines in a colo. Part of the justification for that high cost is seamless scalability and ability to 'flex'.
But Slack doesn't need the ability to flex here. Scaling down over the holidays and then back up once people returned to work just isn't that important for them - it's unlikely there were a large number of jobs queued up waiting to run on their spare hardware for a few days anyway. It just wasn't a good way to save money: a massive outage certainly cost them far more than they'll ever save.
I don't think anyone who's got any reasonable level of experience is expecting that it's a magic wand.
There are, though, some things in AWS (and surely other cloud providers) where you get no useful signals or controls: they're entirely managed by the cloud provider, based on its own internal metrics and scaling behaviours.
Behind the scenes, their load balancer services give you no indication of how heavily loaded they are, nor do you get any direct control over how many load balancers there are or how big they are.
In some cases you can hack around this by pre-warming infrastructure with fake traffic - but that assumes you have the metrics, and the knowledge that you even need to do this, in the first place.
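One common shape for that synthetic pre-warming traffic is a gradual ramp, so the provider's opaque scaling machinery sees a steady increase rather than a sudden step. A rough sketch of computing such a schedule (the rates and step counts are made-up illustration values, not AWS guidance):

```python
def warmup_schedule(start_rps: float, target_rps: float, steps: int) -> list[float]:
    """Linearly ramp synthetic request rates from start_rps to target_rps.

    Returns one target rate per step; a load generator would hold each
    rate for some interval to give opaque autoscaling time to catch up.
    """
    if steps < 2:
        return [target_rps]
    delta = (target_rps - start_rps) / (steps - 1)
    return [start_rps + delta * i for i in range(steps)]


# e.g. ramp from 100 to 1000 requests/sec over 10 steps
for rps in warmup_schedule(100, 1000, 10):
    print(f"hold {rps:.0f} req/s")  # feed each rate to your load generator
```

The catch, as noted above, is that you only know to do this at all if you've already learned (usually the hard way) that the managed component needs warming.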
This applies to all sorts of things - there are hidden caps and other capacity limits all over AWS's platform that you don't know about until you hit them. There are even capacity limits that you can know about, because they're publicly documented, but AWS lies and won't tell you the actual limit being applied to your account: the console and documentation say one thing, but in reality it's a lot lower.
If such a capacity limit results in an outage, well, tough luck.
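If you suspect the enforced limit is lower than the documented one, one option is to probe for it empirically in a sandbox account (doing this in production could itself trip the limit). A sketch using binary search, where `try_allocate` is a hypothetical stand-in for whatever request hits the cap, and all the numbers are invented:

```python
def probe_actual_limit(try_allocate, documented_limit: int) -> int:
    """Binary-search the largest n for which try_allocate(n) succeeds.

    try_allocate(n) -> bool is a hypothetical stand-in for an API call
    that attempts to use n units of some capped resource.
    """
    lo, hi = 0, documented_limit
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if try_allocate(mid):
            lo = mid  # mid units fit; the real limit is at least mid
        else:
            hi = mid - 1  # mid units rejected; the real limit is below mid
    return lo


# Simulated account where the real cap is 350 despite a documented 1000.
hidden_cap = 350
actual = probe_actual_limit(lambda n: n <= hidden_cap, documented_limit=1000)
print(actual)  # → 350
```

This assumes allocations are cheap to attempt and roll back; for resources where a failed or partial allocation has side effects, empirical probing may not be practical at all.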
AWS is pretty good about documenting the limits of their systems, SLAs, how to configure them, etc. They don't just say you should wave a magic wand -- and even if they did say that, professional software engineers know better.
"a hosting provider that allows allocation of resources via API" is exactly what AWS is. Your infrastructure experts come into the picture because they need to know which resources to request, how to estimate the scale they need, and how to configure them properly. They should also be doing performance testing to see if the claimed performance really holds up.
Clearly there are limits even with the largest cloud providers. You'll have to engage in a bit of critical thought as to whether you're going to get near those limits and what that might mean for your product. Obviously that's easier said than done, but you could argue that the cloud providers are still giving you reasonable value if you can pass the buck on a given issue for x years.
>Our own serving systems scale quickly to meet these kinds of peaks in demand (and have always done so successfully after the holidays in previous years). However, our TGWs did not scale fast enough. During the incident, AWS engineers were alerted to our packet drops by their own internal monitoring, and increased our TGW capacity manually. By 10:40am PST that change had rolled out across all Availability Zones and our network returned to normal, as did our error rates and latency.
Yes.
But a sales pitch is the most positive framing of the product possible. I wouldn't rely on the sales pitch when making the decision about how much you should depend on the cloud.
Heh, a while ago I joked that one way to scale is to "make it somebody else's problem", with the proviso that you need to make sure that the someone else can handle the load. And then (due to the context) a commenter balked at the idea that a big player like YouTube would be unable to handle the scaling of their core business.
https://news.ycombinator.com/item?id=23170685
(If they're really blaming it on AWS, it really takes guts to do it so publicly, I think.)
The issue was a transit gateway, a core network component. If they weren't in the cloud, this would have been a router, so they "outsourced" it in the same way an on-prem service outsources routing to Cisco. I guess the difference is they might have had better visibility into the Cisco router and known it was overloaded.