possibly related: https://news.ycombinator.com/item?id=32267154
Now I guess we have to move to us-west-2. :)
Update: looks like it's only one zone anyway, so my statement still stands!
Edit: although, one of our vendors that uses AWS has said that they think ELB registration is impacted (but I don't recall if that's regional?) and R53 is impacted (which is supposed to be global, IIRC). Dunno how much truth there is to it as we don't use AWS directly.
Thanks!
us-west-2 has had outages as well but it is less common, even rare. I've been pushing companies to make their initial deployments onto us-west-2 for over ten years now. I occasionally get kudos messages in my inbox :)
So anyone who is in us-west-2 is there intentionally, which makes me assume there is a smaller footprint there (but I have no idea).
That being said, there is still an added cost and complexity to operating in multiple AZs, because you have to synchronize data across the AZs. You also need enough reserved capacity to absorb the load when you lose an AZ: if you're running lean across three zones, each serving 33% of your traffic, the two that remain suddenly need to serve 50% each.
The bigger companies with overhead reservations will grab all the instances before you can launch any on-demand capacity during an AZ failure.
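The capacity math above is simple but easy to forget when provisioning. A quick sketch (the helper name is made up for illustration, not any AWS API) of the load each surviving zone absorbs when one of N zones drops out:

```python
# Sketch: how much traffic each surviving zone carries after losing one AZ.
# (Hypothetical helper for illustration only.)

def per_zone_load_after_failure(zones: int) -> float:
    """Fraction of total traffic each surviving zone carries after one AZ fails."""
    if zones < 2:
        raise ValueError("need at least 2 zones to survive a failure")
    return 1.0 / (zones - 1)

# Three zones at ~33% each: lose one and the survivors each carry 50%.
print(per_zone_load_after_failure(3))  # 0.5
```

So running lean at ~70% utilization per zone in a three-zone setup already leaves no room for a single AZ loss.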
If you're Amazon, where every second is millions of dollars in transactions, you care more than a startup that gets one request per minute. Even if you accept the risk, you still care when your DC goes down.
Also, a large chunk of AWS is managed from a single data center so if that one goes down you may still have issues with your service in another data center.
Also, I think many, but not all, of the services I use work okay across multiple regions.
On top of that, I was looking at the documentation for KMS keys yesterday, and a KMS key can be multiregion, but if you don't create it as multiregion from the start, you can't update the multiregion attribute. So you need to create a new KMS key and update everything to use the new multiregion key.
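If memory serves, the CLI reflects this: multi-Region is a create-time flag, and replicas are made with a separate call. A rough sketch, assuming the AWS CLI is configured (the key ID and regions are placeholders):

```shell
# Multi-Region must be chosen at creation time; it can't be toggled later.
aws kms create-key --multi-region --description "replacement multi-Region key"

# Replicate the new key into another region (key-id is a placeholder).
aws kms replicate-key \
  --key-id mrk-1234abcd12ab34cd56ef1234567890ab \
  --replica-region us-west-2
```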
Zone downtime still falls under an AWS SLA, so you know roughly how much downtime to expect, and for a lot of businesses that downtime is acceptable.
- Is it worth spending 20% more on IT to keep our site up 99.99% of the time vs 99%?
- Is it worth having 3 suppliers for every part our business depends on, each contracted to be able to supply 2x more in case another supplier has issues? And paying a big premium for that?
- Is it worth having offices across the globe, fully staffed and trained to take on any problem, in case there's a big electrical outage/pandemic/etc. in another part of the world?
I'm not saying that some of those outages aren't the result of clowny/incompetent design. But "site sometimes goes down" can often be a very valid option.
However, in my experience, the people doing the calculations on that risk have no incentive to cover it. Their bonus has no link to the uptime and they can blame $INFRA for the lost millions and still meet their targets and get promoted / crosshired.
The people who warned them and asked for funding are the ones working late and having conf calls with the true stakeholders.
People are just unaware, and probably making bad calls in the name of being "portable".
I also disagree that it is inherently more costly to run a service in multiple locations.
Not really.
What's more likely is that their companies have other priorities. Multi-AZ architectures are more expensive to run, but that's normally not the issue. What's really costly is testing their assumptions.
Sure, by deploying your system in a Kubernetes cluster spread across 3 AZs with an HA database, you are supposedly covered against failures. Except when it actually happened, it turned out your system couldn't really survive a sudden 30% capacity loss like you expected, and the ASG churn is now causing havoc with the pods that did survive.
Complex systems often fail in non-trivial ways. If you are not chaos-monkeying regularly, you won't know about those cases until they happen. At which time it's too late.
(Or worse, the redundancy causes a subtle failure like data loss.)
If you're using any managed AWS services, you have to rely on those services being AZ fault-tolerant themselves. In AWS-speak, they may well be (just with "elevated error rates" for a few minutes while load balancing shifts traffic away from a bad AZ), but as an AWS customer you still feel the impact. As an example, one of our CodePipelines failed its deployment step with an InternalError from CloudFormation, even though the underlying stack deployment actually succeeded. When we went to retry that stage, it couldn't succeed because the changeset it was supposed to apply no longer existed. It required pushing a dummy change to unblock the pipeline.
Similarly, many customers run Lambdas outside of VPCs that theoretically shouldn't be tied to an AZ. You're still reliant on the AWS Lambda team to shift traffic away from a failing AZ, and until they do that, you'll see "elevated error rates" as well.
1) AWS is already really expensive on a single AZ, and replicating to a second AZ would almost double your costs. I can't help but point out that an old-school bare-metal setup on something like Hetzner/OVH/etc. becomes significantly more cost-effective, since you're not using AWS's advantages in this area anyway. And as we've seen in practice, AWS is nowhere near more reliable: how many times have AWS AZs gone down, versus the bare-metal HN server, which had its only significant outage very recently? That makes sense when you consider that the AWS control plane is orders of magnitude more complex than an old-school bare-metal server, which just needs power and a network port.
2) It is extremely hard to build reliable systems over time, since during non-outage periods everything appears to work fine even if you've accidentally introduced a hard dependency on a single AZ. It's even harder to account for second-order effects, such as an inter-AZ link suddenly becoming saturated during an outage. I'm personally not confident at all in Amazon's (or frankly, any public cloud provider's) ability to actually guarantee seamless failover, since the only way to prove it works is a real outage that induces those second-order effects. And no cloud provider is going to run intentional, regularly scheduled outages for testing: that would hurt anyone who deliberately runs in a single AZ, essentially pricing them out of the market by forcing them to either commit to the cost increase of multi-AZ or move to a provider that doesn't do scheduled test outages.
Take advantage of AWS (or Azure, or DO) until you're big enough that bringing the action in-house is a financially and technically prudent option.
I suspect, behind the scenes, AWS fails to absorb the massive influx in requests and network traffic as AZs shift around.
I would think regions with more AZs (like us-east-1) would handle an AZ failure better, since there are more AZs to spread the load across.
What's more surprising, imo, is large apps like New Relic and Zoom, which you'd expect to be resilient (multi-region/multi-cloud), taking a hit.
We had to resolve it by manually shutting down all the servers in the affected AZ. Which is normally not needed.
There are of course a lot of companies that aren't architected with multi-AZ at all (or choose not to be). Those companies are having an even worse time right now. But because the servers generally still appeared healthy, this can affect some well-architected apps too.
The only reason we knew to shut them down at all was because AWS told us the exact AZ in their status update. We were beginning the process of pinging each one individually to try to find them (because, again, all the health checks were fine).
P.S. AWS has said they have resolved the issue for almost 2 hours now and we are still having issues with us-east-2a.
First: RDS. I saw one of our RDS instances do a failover to the secondary zone because the primary was in the zone that had the power outage. RDS failovers are not free and have a small window of downtime (60-120s as claimed by AWS[1]).
Second: EKS (Kubernetes). One of our Kubernetes EC2 worker nodes (in EKS) went down because it was in the zone with the power outage. Kubernetes did a decent job at re-scheduling pods, but there were edge cases for sure. Mainly with Consul and Traefik running inside of the Kubernetes cluster. Finally, when the Kubernetes EC2 worker node came back up, nearly nothing got scheduled back to it. I had to manually re-deploy to get pod distribution even again. Though the last issue might be something I can improve on by using the new Kubernetes attribute topologySpreadConstraints[2].
[1] https://aws.amazon.com/premiumsupport/knowledge-center/rds-f... [2] https://kubernetes.io/docs/concepts/scheduling-eviction/topo...
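For reference, a minimal pod-spec snippet of the kind of topologySpreadConstraints[2] I mean (the app label is made up for illustration):

```yaml
# Spread pods evenly across zones; tolerate imbalance rather than block scheduling.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway  # use DoNotSchedule to hard-enforce
      labelSelector:
        matchLabels:
          app: my-app  # placeholder label
```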
AWS AZs don't even have consistent naming across AWS accounts.
10:25 AM PDT We can confirm that some instances within a single Availability Zone (USE2-AZ1) in the US-EAST-2 Region have experienced a loss of power. The loss of power is affecting part of a single data center within the affected Availability Zone. Power has been restored to the affected facility and at this stage the majority of the affected EC2 instances have recovered. We expect to recover the vast majority of EC2 instances within the next hour. For customers that need immediate recovery, we recommend failing away from the affected Availability Zone as other Availability Zones are not affected by this issue.
It could be that their shared-fate scope is an entire data hall, or a set of rows, or even an entire building given that an AZ is made up of multiple datacenters. I don't know that AWS has ever published any kind of sub-AZ guarantees around reliability.
Datacenter power has all kinds of interesting failure modes. I've seen outages caused by a cat climbing into a substation, rats building a nest in a generator, fire-fighting in another part of the building causing flooding in the high-voltage switching room, etc.
But that system breaks down here when you need to know whether you are in an affected zone. Is there a way to map an account’s AZ name to the canonical one which apparently exists?
Examples about how these relate: https://stackoverflow.com/questions/63283340/aws-map-between...
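As far as I know, the mapping is exposed by the EC2 API itself: each zone has both an account-specific ZoneName and a canonical ZoneId. A sketch, assuming the AWS CLI is set up:

```shell
# List the account-specific zone names alongside the canonical zone IDs.
aws ec2 describe-availability-zones \
  --region us-east-2 \
  --query 'AvailabilityZones[].[ZoneName,ZoneId]' \
  --output table
```

That lets you check whether your "us-east-2a" is actually the affected use2-az1.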
I'm also pretty sure that GCP's zone identifiers are absolute (and, this time, consistent throughout), since their documentation (which renders the same in incognito mode or whatever) references which zones have which microarchitectures and instance types.
$ dig news.ycombinator.com
;; ANSWER SECTION:
news.ycombinator.com. 1 IN A 50.112.136.166
$ dig -x 50.112.136.166
;; ANSWER SECTION:
166.136.112.50.in-addr.arpa. 300 IN PTR ec2-50-112-136-166.us-west-2.compute.amazonaws.com.
Saving a couple of keypresses, just in case: https://docs.google.com/spreadsheets/d/1Gcq_h760CgINKjuwj7Wu... (from https://awsmaniac.com/aws-outages/)
Always check HN before trying to diagnose weird issues that shouldn't be connected
This isn't good, and someone who can do something about it needs to.
The nature of our business means it wasn't a big deal, but I could imagine lots of people were in the same boat.
Having everything well-architected on AWS is...well, it's a problem for reasons of monopoly and cost, but it's not a problem for availability.
https://aws.amazon.com/compute/sla/
99.5% availability allows up to about 3 and a half hours of downtime a month. 99.99% means around 4 minutes a month. So if you can't handle hours of downtime, you should definitely be multi-AZ.
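Those numbers are straightforward to derive. A quick sketch, assuming a 30-day month:

```python
# Downtime budget implied by an availability target, per 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def downtime_minutes(availability_pct: float) -> float:
    """Minutes of allowed downtime per month at the given availability."""
    return MINUTES_PER_MONTH * (1 - availability_pct / 100)

print(downtime_minutes(99.5))   # ~216 minutes, i.e. about 3.5 hours
print(downtime_minutes(99.99))  # ~4.3 minutes
```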
Datacenters do end up completely dying now and then, you really want to have a good strategy in that case. Or not, if that's not required.
Edit: 2 minutes after I post this it starts working.
https://www.oracle.com/customers/zoom/
I'm not sure if Zoom has any Critical infra in AWS though.
I don't work on cloud stuff, so I'm genuinely unsure if this is a joke.
You'll need to see which availability zone ID (e.g., use2-az3) corresponds to each zone in your account: https://aws.amazon.com/premiumsupport/knowledge-center/vpc-m...
edit: AWS identified this as a power loss in a single zone, use2-az1.
And if I had read the page the link points to more carefully, that's exactly the reason.
Physical identifiers for availability zones look like "use2-az1"; refs:
https://docs.aws.amazon.com/ram/latest/userguide/working-wit...
https://docs.aws.amazon.com/prescriptive-guidance/latest/pat...
us-east-1 is the region you're thinking of that has issues. Mostly due to being the largest region (I think?) and like you mentioned, the oldest.
Also, because shared and global AWS resources are (or at least often behave as if they are) intimately tied to us-east-1.
Agreed, money can't solve everything. BUT, this seems like an extremely solvable problem. That's why I'm so surprised.