And while, yes, building across multiple regions and AZs is a thing, AWS has had a string of issues where us-east-1 problems have broader impacts, which makes things far less redundant and resilient than AWS implies.
All the identity and access services for the public cloud outside of China (aka "IAM for the aws partition" to employees) are centralized in us-east-1. This centralization is essentially necessary in order to have a cohesive view of an account, its billing, and its permissions.
And IAM is not a wholly independent software stack: it relies on DynamoDB and a few other services, which in turn have a circular dependency on IAM.
During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones. When I worked there, I remember at least one case where my team's on-calls were advised not to close ssh sessions or AWS console browser tabs, for fear that we'd be locked out until the outage was over.
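A rough sketch of the defensive habit that implies (hypothetical role ARN and cache path; assumes boto3, and that the role's max session duration allows it): prefetch credentials with a long lifetime and cache them, since existing tokens may keep working when new grants fail.

    import json
    import boto3

    sts = boto3.client("sts", region_name="us-west-2")
    # Grab credentials *before* you need them, with as long a lifetime
    # as the role allows (up to 12 hours if it's configured that way).
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/OnCallOps",  # hypothetical
        RoleSessionName="oncall-prefetch",
        DurationSeconds=12 * 3600,
    )["Credentials"]

    # Persist so a fresh shell can reuse them without a new STS call.
    with open("/tmp/cached-creds.json", "w") as f:
        json.dump(creds, f, default=str)  # Expiration is a datetime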
But then you want to use the same stack across providers, and all the proprietary technologies (even when hidden from you by things like Terraform) suddenly lose their luster.
What people usually mean is “resilience up to a reasonable level of risk and cost”.
Multi-cloud simply isn’t cost-beneficial for 99.9% of problems.
And for a lot of businesses who talk about risk, saying “we followed AWS best practices but AWS went down” is an acceptable answer to the question of liability.
If you are in a position where AWS going down is a reasonable risk, then you’re already in a specialised enough domain to have engineers who understand how to deliver HA across different vendors.
[Nitpick] There are a few more AWS partitions like GovCloud:
Yeah, "govcloud" is technically available to the public, although there are other partitions reserved for government use that are not, and the naming is a big hairy mess. Many service teams don't have any US-citizens-in-the-USA working for them, and they cannot in any way adequately support these regions.
My on-call experience improved significantly when I moved from the US to Canada, and I got taken off the (extremely thin!) list of engineers eligible to ssh into RDS instances in Govcloud. There were so few USA-citizen-in-USA engineers that I had been getting tickets for services and instances in Govcloud about which I had only the very thinnest knowledge… and then I was limited in my ability to consult with others who were actually experts. The customers in Govcloud paid a premium to be there, I got paged for a bunch of tickets which I was ill-prepared to handle, and it was generally a bad experience for everyone.
Working with the airgapped secret/top-secret partitions was even worse. You would get paged incessantly and then someone who was cleared for access but knew almost nothing about the service in question would have to go to a SCIF in the DC area, and you would exchange screenshots and text instructions with a turnaround time of hours or days.
Better make sure the only DNS operations you run during an outage are data plane queries and health check failovers.
There are a bunch of caveats, but it’s worth enabling if you’re changing DNS all the time (as most AWS networking doodads like to do).
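For the curious, a minimal sketch of that setup with boto3 (hypothetical zone ID, names, and IPs): the create calls below are control-plane operations, so they have to happen before any outage; once in place, the failover itself is pure data plane (DNS resolution plus health-check evaluation).

    import boto3

    r53 = boto3.client("route53")

    # Control-plane setup, done ahead of time during a calm window.
    check = r53.create_health_check(
        CallerReference="primary-check-1",  # hypothetical
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "primary.example.com",
            "Port": 443,
            "ResourcePath": "/health",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )

    # PRIMARY is served while its health check passes; Route 53's data
    # plane flips answers to SECONDARY on its own if it starts failing.
    r53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",  # hypothetical
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "app.example.com", "Type": "A", "TTL": 60,
                "SetIdentifier": "primary", "Failover": "PRIMARY",
                "HealthCheckId": check["HealthCheck"]["Id"],
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            }},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "app.example.com", "Type": "A", "TTL": 60,
                "SetIdentifier": "secondary", "Failover": "SECONDARY",
                "ResourceRecords": [{"Value": "198.51.100.10"}],
            }},
        ]},
    )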
Folks built in other regions believing they were fully isolated only to discover later during an outage that they were not.
I'm glad I never had to get that deep into the failure chain.
When you dogfood your own Rube Goldberg machine.
I’m 99% ;) certain that dependencies of foundational services are a well-discussed topic.
This is highly misleading. It's true that there's a handful of global AWS services - but only their control planes operate from a single region (e.g. us-east-1). Their data planes are regionally isolated or globally distributed.[1]
The only time you'd normally use a service control plane is to deploy changes, e.g. when you create, read, update or delete service resources or update configuration during a change window.
Workloads should be designed for "static stability", as recommended by AWS.[2] A statically stable workload only depends upon the data planes of the services it uses at runtime. Statically stable workloads are designed to continue operating as normal even if there's a service event impairing one or more control planes (including for global services).
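To make the distinction concrete, here's a hedged sketch (hypothetical names; assumes boto3). The fragile path leans on a control plane at the worst possible moment; the statically stable path only touches data planes, because the standby capacity was provisioned in advance.

    import boto3

    # Fragile recovery: depends on the EC2 control plane mid-outage.
    # If the control plane is impaired, this is exactly what fails.
    def recover_by_provisioning():
        ec2 = boto3.client("ec2", region_name="us-west-2")
        ec2.run_instances(ImageId="ami-00000000",  # hypothetical AMI
                          MinCount=4, MaxCount=4)

    # Statically stable: the standby fleet is already running, traffic
    # shifts via pre-configured DNS failover, and the only calls made
    # at recovery time are data-plane reads/writes, which stay up.
    def recover_statically():
        ddb = boto3.client("dynamodb", region_name="us-west-2")
        ddb.get_item(
            TableName="sessions",  # hypothetical, pre-created table
            Key={"id": {"S": "example"}},
        )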
> During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.
This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is a data plane-only service and runs independently in each region [3]. The IAM data plane, which enforces access control, is also regional.
If the IAM control plane is impaired, you might not be able to create new IAM roles (a control plane operation) - but you can continue generating and using temporary credentials for existing IAM roles (data plane operations) within the region your workload is running in. This allows statically stable workloads to continue using IAM without interruption.
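To make that concrete (assumes boto3; the region is illustrative), you can pin SDK clients to the regional STS endpoint rather than the legacy global one, so token grants never leave the region your workload runs in:

    import boto3
    from botocore.config import Config

    # Use sts.us-west-2.amazonaws.com rather than the legacy global
    # endpoint; newer SDKs default to this. Equivalent environment
    # variable: AWS_STS_REGIONAL_ENDPOINTS=regional
    sts = boto3.client(
        "sts",
        region_name="us-west-2",
        config=Config(sts_regional_endpoints="regional"),
    )
    print(sts.get_caller_identity()["Arn"])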
[1] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
"Global AWS services still follow the conventional AWS design pattern of separating the control plane and data plane in order to achieve static stability. The significant difference for most global services is that their control plane is hosted in a single AWS Region, while their data plane is globally distributed."
[2] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
"...eliminating dependencies on control planes (the APIs that implement changes to resources) in your recovery path helps produce more resilient workloads."
[3] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
"STS is a data plane-only service that is separate from IAM, and does not depend on the IAM control plane."
I disagree, though, that my post was "highly misleading" despite this omission.
As a practical matter, some services fail to achieve the "static stability" you describe, in terms of not depending on other services’ control planes.
And also, many on-call ops and firefighting tasks (to say nothing of canaries and other automated tests) depend on other services’ control planes.
And above all, many AWS engineers (myself very much included even after years there) don't have a clear understanding of the boundaries of other services’ control planes. https://news.ycombinator.com/item?id=48078254
> > During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.
> This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is a data plane-only service and runs independently in each region.
I didn't mention STS in the comment to which you're responding. The service that I worked on the most, RDS, required ssh'ing into live instances to solve basically all non-trivial problems (I'd guess 80% of the tickets I saw actually resolved required it). And I have no idea how STS was involved in generating the ephemeral Midway-signed ssh keys required for it… but whenever there were us-east-1 IAM outages we'd have big problems opening new sessions, while less-capable web-console-based ops tools with long-lived credentials would keep working.
And honestly, everybody else's stuff is in use-1, so at least your failures are correlated with your customers lol.
Yeah, but why put your eggs in that basket? I moved all our services from east to west/oregon a decade ago and haven't looked back.
1. The severity and frequency of us-east-1 outages are vastly overstated. It's fine. These us-east-1 outages almost never affect us. This one didn't; not even our instances in the affected AZ. Only that recent IAM outage affected us a little bit, and it affected every other region, too, since IAM's control plane is centrally hosted in us-east-1. Everybody's uptime depends on us-east-1.
2. We're physically close to us-east-1 and have Direct Connect. We're 1 millisecond away from us-east-1. It would be silly to connect to us-east-1 and then take a latency hit and pay cross-region data transfer cost on all traffic to hop over to another region. That would only make sense if we were in both regions, and that is not worth the cost given #1. If we only have a single region, it has to be us-east-1.
3. us-east-1 gets new features first. New AWS features are relevant to us with shocking regularity, and we get them as soon as they're announced.
4. OP is right about the safety in numbers. Our service isn't life-or-death; nobody will die if we're down, so it's just a matter of whether they're upset. When there is a us-east-1 outage, it's headline news and I can link the news report to anyone who asks. That genuinely absolves us every time. When we're down, everybody else is down, too.
Is it not a selling point to be able to say "we're still up while our competitors are down"?
In fantasy magic dream land loads are distributed evenly across different cloud providers.
A single point of failure doesn't exist.
It worked out with my first girlfriend. The twins are fluent in English and Korean. They know not to depend only on AWS when deploying a large-scale service.
Healthcare in the US is affordable.
All types of magical stuff exist here.
But no. It's another day. AWS us-east-1 can take down most of the internet.
But even then, the load balancer needs to run somewhere, and that becomes a new single point of failure.
I’m sure someone smarter than me has figured this out.
You were dating twins as a form of redundancy?!
The last Azure outage I heard about wasn't even on the HN front page.
I’ve heard people say that the underlying physical infrastructure is older, but I think that’s a bit of speculation, although reasonable. The current outage is attributed to a “thermal event”, which does indeed suggest underlying physical hardware.
It’s also the most complex region for AWS themselves, as it hosts the control planes for many of their global services.
If you do this for resiliency, be prepared to pay the capacity tax (2 regions means 2x capacity, 3 regions means 1.5x), have the machines already running in a multi-region setup (don't expect to be able to spin up instances or even get capacity during an outage), and be ready to deal with the added complexity of multi-region hosting.
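If it helps, here's the arithmetic behind that tax (a sketch, assuming you must still serve full peak load after losing any one region): each of N regions has to carry 1/(N-1) of peak, so total provisioning is N/(N-1) times a single-region baseline.

    # Each of n regions must absorb 1/(n-1) of peak if one fails, so
    # total provisioned capacity is n/(n-1) times baseline.
    for n in range(2, 6):
        print(f"{n} regions -> {n / (n - 1):.2f}x capacity")
    # 2 regions -> 2.00x
    # 3 regions -> 1.50x
    # 4 regions -> 1.33x
    # 5 regions -> 1.25x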
Some SaaS apps had issues.
The Internet was fine.
This is physical reality. The internet was designed to route around this.
Just because some app devs do a lazy job doesn't mean the entire infrastructure as designed is garbage.
Just because some app devs are over reliant on a single cloud service doesn't mean the Internet is broken.