possibly related: https://news.ycombinator.com/item?id=32267154
Now I guess we have to move to us-west-2. :)
Update: looks like it's only one zone anyway, so my statement still stands!
Edit: although, one of our vendors that uses AWS has said that they think ELB registration is impacted (but I don't recall if that's regional?) and R53 is impacted (which is supposed to be global, IIRC). Dunno how much truth there is to it as we don't use AWS directly.
Thanks!
us-west-2 has had outages as well but it is less common, even rare. I've been pushing companies to make their initial deployments onto us-west-2 for over ten years now. I occasionally get kudos messages in my inbox :)
So anyone who is in us-west-2 is there intentionally, which makes me assume there is a smaller footprint there (but I have no idea).
That being said, there is still an added cost and complexity to operating in multiple AZs, because you have to synchronize data across the AZs. You also need enough reserved capacity to absorb the load when you lose an AZ: if you're running lean across three zones, each serving 33% of your traffic, the two that remain suddenly need to serve 50% each.
The bigger companies with overhead reservations will grab all the instances before you can launch any on-demand capacity during an AZ failure.
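The capacity math above is simple but easy to forget when provisioning. A quick sketch (the helper name is made up for illustration, not any AWS API) of the load each surviving zone absorbs when one of N zones drops out:

```python
# Sketch: how much traffic each surviving zone carries after losing one AZ.
# (Hypothetical helper for illustration only.)

def per_zone_load_after_failure(zones: int) -> float:
    """Fraction of total traffic each surviving zone carries after one AZ fails."""
    if zones < 2:
        raise ValueError("need at least 2 zones to survive a failure")
    return 1.0 / (zones - 1)

# Three zones at ~33% each: lose one and the survivors each carry 50%.
print(per_zone_load_after_failure(3))  # 0.5
```

So running lean at ~70% utilization per zone in a three-zone setup already leaves no room for a single AZ loss.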
If you're Amazon, where every second is millions of dollars in transactions, you care more than a startup that gets one request per minute. Even if you accept the risk, you still care when your DC goes down.
Also, a large chunk of AWS is managed from a single data center so if that one goes down you may still have issues with your service in another data center.
Also, I think many, but not all, of the services I use work okay across multiple regions.
On top of that, I was looking at the documentation for KMS keys yesterday, and a KMS key can be multiregion, but if you don't create it as multiregion from the start, you can't update the multiregion attribute. So you need to create a new KMS key and update everything to use the new multiregion key.
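If memory serves, the CLI reflects this: multi-Region is a create-time flag, and replicas are made with a separate call. A rough sketch, assuming the AWS CLI is configured (the key ID and regions are placeholders):

```shell
# Multi-Region must be chosen at creation time; it can't be toggled later.
aws kms create-key --multi-region --description "replacement multi-Region key"

# Replicate the new key into another region (key-id is a placeholder).
aws kms replicate-key \
  --key-id mrk-1234abcd12ab34cd56ef1234567890ab \
  --replica-region us-west-2
```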
Zone downtime still falls under an AWS SLA, so you know roughly how much downtime to expect, and for a lot of businesses that downtime is acceptable.
- Is it worth spending 20% more on IT to keep our site up 99.99% of the time vs 99%?
- Is it worth having 3 suppliers for every part our business depends on, each contracted to be able to supply 2x more in case another supplier has issues? And paying a big premium for that?
- Is it worth having offices across the globe, fully staffed and trained to take on any problem, in case there's a big electrical outage/pandemic/etc. in another part of the world?
I'm not saying that some of those outages aren't the result of clowny/incompetent design. But "site sometimes goes down" can often be a very valid option.
However, in my experience, the people doing the calculations on that risk have no incentive to cover it. Their bonus has no link to the uptime and they can blame $INFRA for the lost millions and still meet their targets and get promoted / crosshired.
The people who warned them and asked for funding are the ones working late and having conf calls with the true stakeholders.
People are just unaware, and probably making bad calls in the name of being "portable".
I also disagree that it is inherently more costly to run a service in multiple locations.
Not really.
What's more likely is that their companies have other priorities. Multi-AZ architectures are more expensive to run, but that's normally not the issue. What's really costly is testing their assumptions.
Sure, by deploying your system in a Kubernetes cluster spread across 3 AZs with an HA database, you are supposedly covered against failures. Except when it actually happened, it turned out your system couldn't really survive a sudden 30% capacity loss like you expected, and the ASG churn is now causing havoc with the pods that did survive.
Complex systems often fail in non-trivial ways. If you are not chaos-monkeying regularly, you won't know about those cases until they happen. At which time it's too late.
(Or worse, the redundancy causes a subtle failure like data loss.)
If you're using any managed AWS services, you have to rely on those services being AZ fault-tolerant themselves. In AWS-speak, they may well be (just with "elevated error rates" for a few minutes while load balancing shifts traffic away from a bad AZ), but as an AWS customer you still feel the impact. As an example, one of our CodePipelines failed its deployment step with an InternalError from CloudFormation, even though the underlying stack deployment actually succeeded. When we went to retry that stage, it couldn't succeed because the changeset it was supposed to apply no longer existed. It required pushing a dummy change to unblock the pipeline.
Similarly, many customers run Lambdas outside of VPCs that theoretically shouldn't be tied to an AZ. You're still reliant on the AWS Lambda team to shift traffic away from a failing AZ, and until they do that, you'll see "elevated error rates" as well.
1) AWS is already really expensive on a single AZ, and replicating to a second AZ would almost double your costs. I can't help but point out that an old-school bare-metal setup on something like Hetzner/OVH/etc. becomes significantly more cost-effective, since you're not using AWS's advantages in this area anyway. And as we've seen in practice, AWS is nowhere near more reliable: how many times have AWS AZs gone down, versus the bare-metal HN server, which had its only significant outage very recently? That makes sense when you consider that the AWS control plane is orders of magnitude more complex than an old-school bare-metal server, which just needs power and a network port.
2) It is extremely hard to build reliable systems over time, since during non-outage periods everything appears to work fine even if you've accidentally introduced a hard dependency on a single AZ. It's even harder to account for second-order effects, such as an inter-AZ link suddenly becoming saturated during an outage. I'm personally not confident at all in Amazon's (or frankly, any public cloud provider's) ability to actually guarantee seamless failover, since the only way to prove it works is a real outage that induces those second-order effects. And no cloud provider is going to run intentional, regularly scheduled outages for testing: that would hurt anyone who deliberately runs in a single AZ, essentially pricing them out of the market by forcing them to either commit to the cost increase of multi-AZ or move to a provider that doesn't do scheduled test outages.
Take advantage of AWS (or Azure, or DO) until you're big enough that bringing the action in-house is a financially and technically prudent option.
I suspect, behind the scenes, AWS fails to absorb the massive influx in requests and network traffic as AZs shift around.
I would think regions with more AZs (like us-east-1) would handle an AZ failure better, since there are more AZs to spread the load across.
What's more surprising, imo, is large apps like New Relic and Zoom, which you'd expect to be resilient (multi-region/multi-cloud), taking a hit.
We had to resolve it by manually shutting down all the servers in the affected AZ. Which is normally not needed.
There are of course a lot of companies that aren't architected with multi-AZ at all (or choose not to be). Those companies are having an even worse time right now. But because the servers generally still appeared healthy, this can affect some well-architected apps too.
The only reason we knew to shut them down at all was because AWS told us the exact AZ in their status update. We were beginning the process of pinging each one individually to try to find them (because, again, all the health checks were fine).
P.S. AWS has said they have resolved the issue for almost 2 hours now and we are still having issues with us-east-2a.
First: RDS. I saw one of our RDS instances do a failover to the secondary zone because the primary was in the zone that had the power outage. RDS failovers are not free and have a small window of downtime (60-120s as claimed by AWS[1]).
Second: EKS (Kubernetes). One of our Kubernetes EC2 worker nodes (in EKS) went down because it was in the zone with the power outage. Kubernetes did a decent job at re-scheduling pods, but there were edge cases for sure. Mainly with Consul and Traefik running inside of the Kubernetes cluster. Finally, when the Kubernetes EC2 worker node came back up, nearly nothing got scheduled back to it. I had to manually re-deploy to get pod distribution even again. Though the last issue might be something I can improve on by using the new Kubernetes attribute topologySpreadConstraints[2].
[1] https://aws.amazon.com/premiumsupport/knowledge-center/rds-f... [2] https://kubernetes.io/docs/concepts/scheduling-eviction/topo...
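For reference, a minimal pod-spec snippet of the kind of topologySpreadConstraints[2] I mean (the app label is made up for illustration):

```yaml
# Spread pods evenly across zones; tolerate imbalance rather than block scheduling.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway  # use DoNotSchedule to hard-enforce
      labelSelector:
        matchLabels:
          app: my-app  # placeholder label
```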
AWS AZs don't even have consistent naming across AWS accounts.
10:25 AM PDT We can confirm that some instances within a single Availability Zone (USE2-AZ1) in the US-EAST-2 Region have experienced a loss of power. The loss of power is affecting part of a single data center within the affected Availability Zone. Power has been restored to the affected facility and at this stage the majority of the affected EC2 instances have recovered. We expect to recover the vast majority of EC2 instances within the next hour. For customers that need immediate recovery, we recommend failing away from the affected Availability Zone as other Availability Zones are not affected by this issue.
It could be that their shared-fate scope is an entire data hall, or a set of rows, or even an entire building given that an AZ is made up of multiple datacenters. I don't know that AWS has ever published any kind of sub-AZ guarantees around reliability.
Datacenter power has all kinds of interesting failure modes. I've seen outages caused by a cat climbing into a substation, rats building a nest in a generator, fire-fighting in another part of the building causing flooding in the high-voltage switching room, etc.
But that system breaks down here when you need to know whether you are in an affected zone. Is there a way to map an account’s AZ name to the canonical one which apparently exists?
Examples about how these relate: https://stackoverflow.com/questions/63283340/aws-map-between...
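As far as I know, the mapping is exposed by the EC2 API itself: each zone has both an account-specific ZoneName and a canonical ZoneId. A sketch, assuming the AWS CLI is set up:

```shell
# List the account-specific zone names alongside the canonical zone IDs.
aws ec2 describe-availability-zones \
  --region us-east-2 \
  --query 'AvailabilityZones[].[ZoneName,ZoneId]' \
  --output table
```

That lets you check whether your "us-east-2a" is actually the affected use2-az1.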
I'm also pretty sure that GCP's zone identifiers are absolute (and, this time, consistent throughout), since their documentation (which renders the same in incognito mode or whatever) references which zones have which microarchitectures and instance types.
$ dig news.ycombinator.com
;; ANSWER SECTION:
news.ycombinator.com. 1 IN A 50.112.136.166
$ dig -x 50.112.136.166
;; ANSWER SECTION:
166.136.112.50.in-addr.arpa. 300 IN PTR ec2-50-112-136-166.us-west-2.compute.amazonaws.com.
Saving a couple of keypresses, just in case: https://docs.google.com/spreadsheets/d/1Gcq_h760CgINKjuwj7Wu... (from https://awsmaniac.com/aws-outages/)
Always check HN before trying to diagnose weird issues that shouldn't be connected
This isn't good, and someone who can do something about it needs to.
The nature of our business means it wasn't a big deal, but I could imagine lots of people were in the same boat.
Having everything well-architected on AWS is...well, it's a problem for reasons of monopoly and cost, but it's not a problem for availability.
https://aws.amazon.com/compute/sla/
99.5% availability allows up to about 3 and a half hours of downtime a month. 99.99% means around 4 minutes a month. So if you can't handle hours of downtime, you should definitely be multi-AZ.
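Those numbers are straightforward to derive. A quick sketch, assuming a 30-day month:

```python
# Downtime budget implied by an availability target, per 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def downtime_minutes(availability_pct: float) -> float:
    """Minutes of allowed downtime per month at the given availability."""
    return MINUTES_PER_MONTH * (1 - availability_pct / 100)

print(downtime_minutes(99.5))   # ~216 minutes, i.e. about 3.5 hours
print(downtime_minutes(99.99))  # ~4.3 minutes
```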
Datacenters do end up completely dying now and then, you really want to have a good strategy in that case. Or not, if that's not required.
Edit: 2 minutes after I post this it starts working.
https://www.oracle.com/customers/zoom/
I'm not sure if Zoom has any Critical infra in AWS though.
I don't work on cloud stuff, so I'm genuinely unsure if this is a joke.
You'll need to see which availability zone ID (e.g., use2-az3) corresponds to each zone in your account: https://aws.amazon.com/premiumsupport/knowledge-center/vpc-m...
edit: AWS identified this as a power loss in a single zone, use2-az1.
And if I had read the page the link points to more carefully, that's exactly the reason.
Physical identifiers for availability zones look like "use2-az1"; refs:
https://docs.aws.amazon.com/ram/latest/userguide/working-wit...
https://docs.aws.amazon.com/prescriptive-guidance/latest/pat...
us-east-1 is the region you're thinking of that has issues. Mostly due to being the largest region (I think?) and like you mentioned, the oldest.
Also, because shared and global AWS resources are (or at least often behave as if they are) intimately tied to us-east-1.
Agreed, money can't solve everything. BUT, this seems like an extremely solvable problem. That's why I'm so surprised.