And while, yes, building across multiple regions and AZs is a thing, AWS has had a string of issues where us-east-1 problems have broader impacts, which makes things far less redundant and resilient than AWS implies.
All the identity and access services for the public cloud outside of China (aka "IAM for the aws partition" to employees) are centralized in us-east-1. This centralization is essentially necessary in order to have a cohesive view of an account, its billing, and its permissions.
And IAM is not a wholly independent software stack: it relies on DynamoDB and a few other services, which in turn have a circular dependency on IAM.
During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones. When I worked there, I remember at least one case where my team's on-calls were advised not to close ssh sessions or AWS console browser tabs, for fear that we'd be locked out until the outage was over.
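A rough sketch of the defensive habit that implies (hypothetical role ARN and cache path; assumes boto3, and that the role's max session duration allows it): prefetch credentials with a long lifetime and cache them, since existing tokens may keep working when new grants fail.

    import json
    import boto3

    sts = boto3.client("sts", region_name="us-west-2")
    # Grab credentials *before* you need them, with as long a lifetime
    # as the role allows (up to 12 hours if it's configured that way).
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/OnCallOps",  # hypothetical
        RoleSessionName="oncall-prefetch",
        DurationSeconds=12 * 3600,
    )["Credentials"]

    # Persist so a fresh shell can reuse them without a new STS call.
    with open("/tmp/cached-creds.json", "w") as f:
        json.dump(creds, f, default=str)  # Expiration is a datetime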
But then you want to use the same stack across providers, and all the proprietary technologies (even when hidden from you by things like Terraform) suddenly lose their luster.
What people usually mean is “resilience up to a reasonable level of risk and cost”.
Multi-cloud simply isn’t cost-beneficial for 99.9% of problems.
And for a lot of businesses who talk about risk, saying “we followed AWS best practices but AWS went down” is an acceptable answer to the question of liability.
If you are in a position where AWS going down is a reasonable risk, then you’re already in a specialised enough domain to have engineers who understand how to deliver HA across different vendors.
[Nitpick] There are a few more AWS partitions like GovCloud:
Yeah, "govcloud" is technically available to the public, although there are other partitions reserved for government use that are not, and the naming is a big hairy mess. Many service teams don't have any US-citizens-in-the-USA working for them, and they cannot in any way adequately support these regions.
My on-call experience improved significantly when I moved from the US to Canada, and I got taken off the (extremely thin!) list of engineers eligible to ssh into RDS instances in Govcloud. There were so few USA-citizen-in-USA engineers that I had been getting tickets for services and instances in Govcloud about which I had only the very thinnest knowledge… and then I was limited in my ability to consult with others who were actually experts. The customers in Govcloud paid a premium to be there, I got paged for a bunch of tickets which I was ill-prepared to handle, and it was generally a bad experience for everyone.
Working with the airgapped secret/top-secret partitions was even worse. You would get paged incessantly and then someone who was cleared for access but knew almost nothing about the service in question would have to go to a SCIF in the DC area, and you would exchange screenshots and text instructions with a turnaround time of hours or days.
Better make sure the only DNS operations you run during an outage are data plane queries and health check failovers.
There are a bunch of caveats, but it’s worth enabling if you’re changing DNS all the time (as most AWS networking doodads like to do).
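For the curious, a minimal sketch of that setup with boto3 (hypothetical zone ID, names, and IPs): the create calls below are control-plane operations, so they have to happen before any outage; once in place, the failover itself is pure data plane (DNS resolution plus health-check evaluation).

    import boto3

    r53 = boto3.client("route53")

    # Control-plane setup, done ahead of time during a calm window.
    check = r53.create_health_check(
        CallerReference="primary-check-1",  # hypothetical
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "primary.example.com",
            "Port": 443,
            "ResourcePath": "/health",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )

    # PRIMARY is served while its health check passes; Route 53's data
    # plane flips answers to SECONDARY on its own if it starts failing.
    r53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",  # hypothetical
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "app.example.com", "Type": "A", "TTL": 60,
                "SetIdentifier": "primary", "Failover": "PRIMARY",
                "HealthCheckId": check["HealthCheck"]["Id"],
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            }},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "app.example.com", "Type": "A", "TTL": 60,
                "SetIdentifier": "secondary", "Failover": "SECONDARY",
                "ResourceRecords": [{"Value": "198.51.100.10"}],
            }},
        ]},
    )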
Folks built in other regions believing they were fully isolated only to discover later during an outage that they were not.
I'm glad I never had to get that deep into the failure chain.
When you dogfood your own Rube Goldberg machine.
I’m 99% ;) certain that dependencies of foundational services are a well-discussed topic.
This is highly misleading. It's true that there's a handful of global AWS services - but only their control planes operate from a single region (e.g. us-east-1). Their data planes are regionally isolated or globally distributed.[1]
The only time you'd normally use a service control plane is to deploy changes, e.g. when you create, read, update or delete service resources or update configuration during a change window.
Workloads should be designed for "static stability", as recommended by AWS.[2] A statically stable workload only depends upon the data planes of the services it uses at runtime. Statically stable workloads are designed to continue operating as normal even if there's a service event impairing one or more control planes (including for global services).
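To make the distinction concrete, here's a hedged sketch (hypothetical names; assumes boto3). The fragile path leans on a control plane at the worst possible moment; the statically stable path only touches data planes, because the standby capacity was provisioned in advance.

    import boto3

    # Fragile recovery: depends on the EC2 control plane mid-outage.
    # If the control plane is impaired, this is exactly what fails.
    def recover_by_provisioning():
        ec2 = boto3.client("ec2", region_name="us-west-2")
        ec2.run_instances(ImageId="ami-00000000",  # hypothetical AMI
                          MinCount=4, MaxCount=4)

    # Statically stable: the standby fleet is already running, traffic
    # shifts via pre-configured DNS failover, and the only calls made
    # at recovery time are data-plane reads/writes, which stay up.
    def recover_statically():
        ddb = boto3.client("dynamodb", region_name="us-west-2")
        ddb.get_item(
            TableName="sessions",  # hypothetical, pre-created table
            Key={"id": {"S": "example"}},
        )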
> During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.
This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is a data plane-only service and runs independently in each region [3]. The IAM data plane, which enforces access control, is also regional.
If the IAM control plane is impaired, you might not be able to create new IAM roles (a control plane operation) - but you can continue generating and using temporary credentials for existing IAM roles (data plane operations) within the region your workload is running in. This allows statically stable workloads to continue using IAM without interruption.
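To make that concrete (assumes boto3; the region is illustrative), you can pin SDK clients to the regional STS endpoint rather than the legacy global one, so token grants never leave the region your workload runs in:

    import boto3
    from botocore.config import Config

    # Use sts.us-west-2.amazonaws.com rather than the legacy global
    # endpoint; newer SDKs default to this. Equivalent environment
    # variable: AWS_STS_REGIONAL_ENDPOINTS=regional
    sts = boto3.client(
        "sts",
        region_name="us-west-2",
        config=Config(sts_regional_endpoints="regional"),
    )
    print(sts.get_caller_identity()["Arn"])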
[1] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
"Global AWS services still follow the conventional AWS design pattern of separating the control plane and data plane in order to achieve static stability. The significant difference for most global services is that their control plane is hosted in a single AWS Region, while their data plane is globally distributed."
[2] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
"...eliminating dependencies on control planes (the APIs that implement changes to resources) in your recovery path helps produce more resilient workloads."
[3] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
"STS is a data plane-only service that is separate from IAM, and does not depend on the IAM control plane."
I disagree, though, that my post was "highly misleading" despite this omission.
As a practical matter, some services fail to achieve the "static stability" you describe, in terms of not depending on other services’ control planes.
And also, many on-call ops and firefighting tasks (to say nothing of canaries and other automated tests) depend on other services’ control planes.
And above all, many AWS engineers (myself very much included even after years there) don't have a clear understanding of the boundaries of other services’ control planes. https://news.ycombinator.com/item?id=48078254
> > During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.
> This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is a data plane-only service and runs independently in each region.
I didn't mention STS in the comment to which you're responding. The service that I worked on the most, RDS, required ssh'ing into live instances to solve basically all non-trivial problems (I'd guess 80% of the tickets I saw actually resolved required it). And I have no idea how STS was involved in generating the ephemeral Midway-signed ssh keys required for it… but whenever there were us-east-1 IAM outages we'd have big problems opening new sessions, while less-capable web-console-based ops tools with long-lived credentials would keep working.
And honestly, everybody else's stuff is in use-1, so at least your failures are correlated with your customers lol.
Yeah, but why put your eggs in that basket? I moved all our services from east to west/oregon a decade ago and haven't looked back.
1. The severity and frequency of us-east-1 outages are vastly overstated. It's fine. These us-east-1 outages almost never affect us. This one didn't; not even our instances in the affected AZ. Only that recent IAM outage affected us a little bit, and it affected every other region, too, since IAM's control plane is centrally hosted in us-east-1. Everybody's uptime depends on us-east-1.
2. We're physically close to us-east-1 and have Direct Connect. We're 1 millisecond away from us-east-1. It would be silly to connect to us-east-1 and then take a latency hit and pay cross-region data transfer cost on all traffic to hop over to another region. That would only make sense if we were in both regions, and that is not worth the cost given #1. If we only have a single region, it has to be us-east-1.
3. us-east-1 gets new features first. New AWS features are relevant to us with shocking regularity, and we get them as soon as they're announced.
4. OP is right about the safety in numbers. Our service isn't life-or-death; nobody will die if we're down, so it's just a matter of whether they're upset. When there is a us-east-1 outage, it's headline news and I can link the news report to anyone who asks. That genuinely absolves us every time. When we're down, everybody else is down, too.
Is it not a selling point to be able to say "we're still up while our competitors are down"?
In fantasy magic dream land loads are distributed evenly across different cloud providers.
A single point of failure doesn't exist.
It worked out with my first girlfriend. The twins are fluent in English and Korean. They know not to depend only on AWS when deploying a large-scale service.
Healthcare in the US is affordable.
All types of magical stuff exist here.
But no. It's another day. AWS us-east-1 can take down most of the internet.
But even then, the load balancer needs to run somewhere, and that becomes a new single point of failure.
I’m sure someone smarter than me has figured this out.
You were dating twins as a form of redundancy?!
The last Azure outage I heard about wasn't even on the HN front page.
I’ve heard people say that the underlying physical infrastructure is older, but I think that’s a bit of speculation, although reasonable. The current outage is attributed to a “thermal event”, which does indeed suggest underlying physical hardware.
It’s also the most complex region for AWS themselves, as it hosts the control planes for many of their global services.
If you do this for resiliency, be prepared to pay the capacity tax (2 regions means 2x capacity, 3 regions means 1.5x), have the machines already running in a multi-region setup (don't expect to be able to spin up instances or even get capacity during an outage), and be ready to deal with the added complexity of multi-region hosting.
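If it helps, here's the arithmetic behind that tax (a sketch, assuming you must still serve full peak load after losing any one region): each of N regions has to carry 1/(N-1) of peak, so total provisioning is N/(N-1) times a single-region baseline.

    # Each of n regions must absorb 1/(n-1) of peak if one fails, so
    # total provisioned capacity is n/(n-1) times baseline.
    for n in range(2, 6):
        print(f"{n} regions -> {n / (n - 1):.2f}x capacity")
    # 2 regions -> 2.00x
    # 3 regions -> 1.50x
    # 4 regions -> 1.33x
    # 5 regions -> 1.25x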
Some SaaS apps had issues.
The Internet was fine.
This is physical reality. The internet was designed to route around this.
Just because some app devs do a lazy job doesn't mean the entire infrastructure as designed is garbage.
Just because some app devs are over reliant on a single cloud service doesn't mean the Internet is broken.