AWS's us-east-1 region is experiencing issues

(health.aws.amazon.com)

156 points · zaltekk · 4y ago · 160 comments

87 comments · 22 top-level

halestock4y ago· 25 in thread

I can't help but wonder, with the increases in attrition across the industry, are we hitting some kind of tipping point where the institutional knowledge in these massive tech corporations is disappearing?

Mistakes happen all the time but when all the people who intimately know how these systems work leave for other opportunities, disasters are bound to happen more and more.

thethethethe4y ago

Disc: Googler opinions are my own.

I've been an SRE on a tier 1 GCP product for over three years, and this is not the case. In my experience, our systems have only gotten more reliable and easier to understand since I joined.

It's not like there are only a few key knowledge holders that single-handedly prevent outages from happening. In reality, you don't need to know shit about how a system works internally to prevent outages if things are set up correctly.

In theory, my dog should be able to sit on my laptop and submit a prod breaking change without any fear that it will make it to prod and damage the system, because automated testing/canary should catch it and, if it does make it to prod, we should be able to detect and mitigate it using probes or whitebox monitoring before the change affects more users.

This happens for 99.9% of potential issues and is completely invisible to users. However, it's what's not caught (the remaining 0.1%) that actually matters.
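For anyone who hasn't seen a canary gate, here is a minimal sketch of the idea in Python. It is only an illustration, not how Google or any particular pipeline does it: the function name, the `max_ratio` and `min_requests` thresholds, and the shape of the metrics are all hypothetical.

```python
def canary_is_healthy(baseline_errors, baseline_requests,
                      canary_errors, canary_requests,
                      max_ratio=2.0, min_requests=100):
    """Return True if the canary's error rate looks acceptable versus baseline.

    All thresholds are illustrative assumptions, not recommended values.
    """
    if canary_requests < min_requests:
        # Too little canary traffic to judge, so hold the rollout.
        return False
    if baseline_requests:
        baseline_rate = baseline_errors / baseline_requests
    else:
        baseline_rate = 0.0
    canary_rate = canary_errors / canary_requests
    # Small absolute floor so a zero-error baseline does not block every change.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return canary_rate <= allowed
```

A rollout step would call this after the canary has taken some traffic and stop (or roll back) when it returns False. Real systems usually compare many more signals than a single error rate.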

dijit4y ago

I’ll be honest. My external impression of Amazon and Google could not be more distant in this regard.

Google doesn’t have nearly as hard a time retaining good people as Amazon does.

zwirbl4y ago

Just like the tech priests in Warhammer 40k, keeping occult old engineering, that no one could build anymore, running

SketchySeaBeast4y ago

If we want to normalize letting long term support people call themselves tech priests I'd very much appreciate it.

"What were your duties at your last position?" "Performing the daily ministrations and singing the praise of the machine god."

hughrr4y ago

So today I find out my job title is tech priest. I was happy with necromancer before. Does it come with a pay rise?

Not familiar with 40k. Was it a similar idea to nuclear-power-as-religion from Foundation?

9wzYQbTYsAIc4y ago

Outside of the mega-fang industry, I’m wondering the same thing.

The Great Resignation had to have taken a huge toll on regular enterprises. There are probably going to be some unlucky (or lucky, depending on how hardcore they are) people in the position of maintaining aging legacy systems and retrofitting them into the future.

COBOL, for example, is becoming a lucrative language for people in the financial and insurance industries. Legacy Java is all over the place, I’m sure. Legacy .NET is in the middle of a huge industry retrofit, (.NET 5 was the official post-legacy rebrand and they’re on to .NET 6+ now).

Lascaille4y ago

>The Great Resignation had to have taken a huge toll on regular enterprises.

The Great Resignation was people leaving jobs they didn't want (front of house/service industry/gig) for jobs they did want (career-track jobs) after those jobs resumed hiring again after the pandemic settled down.

Labor force participation went up not down due to 'the great resignation.'

You're right, but that's been true since the beginning of the tech boom (but isn't exclusive to tech) when no one works for a place for several decades. Companies weather this in different ways but attrition has always been around.

What's causing people to believe that the latest round of attrition is any different?

hkt4y ago

I'd speculate that perhaps more senior people are moving, and/or a greater overall rate of attrition combined with much more complex technologies and organisations. In other words, it might be harder to become good at jobs now, and fewer people stick with them. Just a hunch but definitely seems to be where the incentives point with loyalty penalties and tech bloat.

anonporridge4y ago

Promoting high employee turnover could actually be a very effective strategy for a company's long term sustainability.

If your company is hostile to people sticking around for decades, then it makes it that much less likely that you end up stuck with a machine that relies heavily on poorly documented tribal knowledge that's likely to start falling apart as your core people start cashing out.

9wzYQbTYsAIc4y ago

> What's causing people to believe that the latest round of attrition is any different?

The Great Recession

The Great Resignation

The Great Dying (due to COVID-19)

steveBK1234y ago

Yes, absolutely. Within my own org of ~50 people, 15% have resigned/contracts ending during Q1 (after 15% in Q4). Of the remaining 85%.. 20% have been around since before COVID / 65% joined during COVID.

Of our senior engineers & team leads, 70% have joined in last 6-9 months.

Only 3 full time senior engineers with 2 years or more tenure.

We've grown during COVID but we've also just burned through people.

Turnover has hit the point where we stopped doing going away zoom toasts.. people just sort of disappeared.

Newest member of my team has been in the company for 6 years and on my team for 4. I was in the local pub the other day and there was a retirement do for someone who had been here for 35 years, which certainly isn't exceptional (40 years is more noteworthy, and I've known a couple of people who made it to 50 years)

9wzYQbTYsAIc4y ago

Similar story in smaller orgs, from what I’ve heard - people don’t even bother with the going-away stuff any more.

I think this is a transient issue. When you're in growth mode you make a huge series of hacks to just keep things running and then when you leave.... well, it's a problem. But if the business is robust, and lives beyond you, what replaces your work is better documented, better tested, and maintainable.

That's the dream. Obviously there are companies that sink between v1 and v2, but that's life.

Fundamentally I think the cloud business is robust, it's a fundamentally reasonable way of organising things (for enough people), which is why it attracts customers despite being arguably more expensive.

I've been in this situation in much smaller scales, and yes, you'll see massive drop in productivity but that's the cost of going from prototype to product.

replygirl4y ago

if us-east-1 is "in growth mode", what that we rely on can we possibly expect to ever reach maturity?

Yep. They literally need to start doubling pay to retain people. The attrition this year is going to be devastating.

That's the problem we're out to solve with robusta.dev.

We're slowly but surely converting the world's institutional technical knowledge into re-usable and automated runbooks.

hughrr4y ago

I’m just going to have to spend all day fixing the runbooks as well as the technology ;)

Resolved in 7 mins. Can you do better?

My monitoring doesn't remember the last time we had a service outage lasting 7 seconds, let alone 7 minutes.

9wzYQbTYsAIc4y ago

Nice work, keep it up. It’s surely not easy managing that many companies' computer systems.

newobj4y ago

No

9wzYQbTYsAIc4y ago

That’s good to hear. How wide is your scope of experience?

operator14y ago· 14 in thread

What’s up with all of the multi-platform outages lately? Seems abnormal looking at historical data. Are there issues affecting the internet backbone or something? Or just a coincidence?

300bps4y ago

Important to keep in mind that AWS has 250 services in 84 Availability Zones in 26 regions.

This outage is reportedly impacting 5 services in 1 region.

For those impacted, pretty terrible. But as a heavy user of AWS, I’ve seen these notices posted multiple times on HN and haven’t been impacted by one yet.

late2part4y ago

Counterpoint:

us-east-1 is their largest region, someone told me it's 50% of their revenue

multi-region failover is awfully, awfully expensive.

In my last 7 years I imagine we had ~1 outage a month on average from AWS failures, but who knows if my imagination is accurate.

quxbar4y ago

For businesses with uptime guarantees and lots of boxes to spin up in failover scenario, this has been a very eventful 12 months. At least that's what I'm experiencing.

super_linear4y ago

Absolutely no way to prove this but maybe Q1 deadlines coming up and people trying to launch things and make changes?

frays4y ago

Increase in attrition across the industry.

A lot of institutional knowledge in these massive tech corporations is disappearing and we're starting to reach the tipping point.

thethethethe4y ago

See my comment here: https://news.ycombinator.com/item?id=30621190

But there's always been attrition. What is different now that is affecting attrition rates and their effects?

nix0n4y ago

A handful of large-traffic sites have recently, and relatively suddenly, started blocking traffic from a large region. That's a major change in flow.

WinterMount2234y ago

Could you be more specific?

thethethethe4y ago

> What’s up with all of the multi-platform outages lately? Seems abnormal looking at historical data.

Source?

Russian war is another juicy possibility

adamrezich4y ago

told myself I'd click this submission's comments link, CTRL+F `Russia`, & quit HN for the day if anything came up, thanks for not disappointing

9wzYQbTYsAIc4y ago

Elevated risk of cyberattacks due to foreign meddling.

https://www.cisa.gov/shields-up

btgeekboy4y ago

This is a pretty big claim to make. Do you have any sources that back it up?

xilni4y ago· 13 in thread

This is why you are strongly urged not to rely on one region or AZ.

pid-14y ago

Given the total amount of money I've lost due to a single AZ being down, it was totally worth it to NOT go multi az or multi region so far.

Multi AZ isn't that hard, but generally requires extra costs (one nat gw per az, etc...)

But multi region in AWS is a royal pain in the ass. Many services (like SSO) do not play well with multi region setups, making things really complicated even if you IaCed your whole stack.

evrydayhustling4y ago

Those costs are the actual reason you are encouraged to go multi-AZ!

(I actually love that we have strategies and infrastructure for multi-region... it just tends to come up at scales and for applications where it is not justified.)

systemvoltage4y ago

Seems like it would be conflict of interest to increase robustness of single AZ (so it never goes down or has its own redundancy) vs. increased revenues from multi AZ deployment.

What's the point of cloud if we have to manage robustness of their own infrastructure. I can understand if that's due to natural disasters and earthquakes, but the idea should be that a single AZ should never go down barring extraordinary circumstances. AWS should be auto-balancing, handling downtimes of a single AZ without the customer ever noticing it.

It might not be a good analogy, but if a single Cloudflare edge datacenter goes down, it will automatically route traffic through others. Transparent and painless to the customer. I understand AWS is huge, and different services have different redundancy mechanisms, but just conceptually it feels like they're in a conflict of interest to increase robustness of their data centers - "We told you to have multi-AZ deployment, not our fault".

Another way to put this is make sure as an AWS customer, to 3x multiply all costs + management of multi-AZ deployment into your total costs.

9wzYQbTYsAIc4y ago

> What's the point of cloud if we have to manage robustness of their own infrastructure.

Worth deliberating on. I’m curious as to what the lifetime cost of ownership for an on-prem data center is relative to lifetime cost of operating in the cloud.

They would simply charge for the privilege. An EC2 'always on' or whatever option that enabled your instance to live migrate between availability zones would be a nice and expensive option.

Johnny5554y ago

I would strongly urge not using us-east-1 -- of all the regions we're in, it's by far the most problematic. Use us-east-2 if you need good latency to the East Coast.

Not sure if it's still the case, but when I was there us-east-1 was a SPOF for some services world wide. I think if dynamodb went down in the region it was a big, big issue.

m344y ago

Might be true for running stuff in different regions/AZs but if the provisioning region is down (e.g. deploying lambda@edge) one does not really have an alternative

tyingq4y ago

Good advice, though AWS still has some services that don't work completely independently. Cloudfront, because of certificates. Route53. The control API for IAM (adding/removing roles, etc). And I wish they didn't have global-looking endpoints (like https://sts.amazonaws.com) that aren't really global or resilient.

ranman4y ago

STS will let you use regional endpoints now, right?
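For reference, regional STS endpoints follow the pattern `https://sts.<region>.amazonaws.com`. A minimal Python sketch of building one, assuming the standard `aws` partition only (the China regions use a different domain suffix); the helper name is hypothetical:

```python
def sts_regional_endpoint(region: str) -> str:
    """Build the regional STS endpoint URL for a standard-partition region.

    Hypothetical helper: it only formats the URL and does not check that
    the region actually exists.
    """
    if not region or not region.replace("-", "").isalnum():
        raise ValueError(f"unexpected region name: {region!r}")
    return f"https://sts.{region}.amazonaws.com"
```

With boto3, a URL like this can be passed as `endpoint_url` to `boto3.client("sts", region_name=..., endpoint_url=...)` so calls do not go through the global-looking `sts.amazonaws.com` endpoint.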

didip4y ago

Multi AZ is great and should be by default, but multi Region is expensive.

hughrr4y ago

This. We have multi AZ in more than one region and I occasionally dream of Bezos wearing only a top hat and waistcoat laughing maniacally while diving into a large vat of gold coins.

jamesfinlayson4y ago

Not always possible - Australia (currently) only has one availability zone and if you're in a regulated industry (banking or government stuff) they require data to be in Australia.

saltypal4y ago· 6 in thread

Based on our telemetry, this started as NXDOMAINs for sqs.us-east-1.amazonaws.com beginning in modest volumes at 20:43 UTC and becoming a total outage at 20:48 UTC. Naturally, it was completely resolved by 20:57, 5 minutes before anything was posted in the "Personal Health Dashboard" in the AWS console.

It takes a while to find a Vice President, I guess.
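A rough sketch of the kind of probe that would surface this, in Python. This is an assumption about how such telemetry might be collected, not a description of the setup above; the function name and the injectable `resolver` parameter are hypothetical.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if hostname resolves to at least one address."""
    try:
        return len(resolver(hostname, 443)) > 0
    except socket.gaierror:
        # Raised for name resolution failures. That includes NXDOMAIN,
        # but also other resolver errors, so treat it as "did not resolve".
        return False
```

Calling `resolves("sqs.us-east-1.amazonaws.com")` on a schedule and counting the False results over time would give a failure curve to compare against the status page.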

mcqueenjordan4y ago

Or perhaps triaging, root-causing, and fixing the issue is the highest-order bit?

Different people have different responsibilities. At Amazon scale, the comms and people doing a deep dive to fix stuff will not be the same.

nostrebored4y ago

It definitely is. For an issue like this, you will see relevant teams and delegates looped in very quickly. Getting approved wording about an outage requires some very senior people though. Often they have to be paged in as well.

Having worked at a few other large tech companies now -- Amazon's incident response process is honestly great. It's one of the things I miss about working there.

Separate teams. We have a tiny team and even _we_ appoint a group to fix and a group or individual to do nothing but communicate.

sure, but if those people are updating the status pages to say something isn't right and we're looking into it, we're doomed.

mhio4y ago

The truth assuaging usually takes 15-30 minutes.

jasoneckert4y ago· 2 in thread

Maybe the reason AWS keeps going down is because they run all their stuff on-prem...

merb4y ago

I'm not sure if gcloud or azure would help. I run two servers on hetzner, which is way cheaper than azure/gcloud; they would be better off there.

9wzYQbTYsAIc4y ago

They might benefit from migrating to the Azure cloud. I’ve heard that some of the Windows servers actually run faster than some of the Linux servers on Azure.

etaioinshrdlu4y ago· 2 in thread

Does AWS have a plan to improve this region?

Do they acknowledge the problem?

It's been a joke for years how bad us-east-1 is.

consumer4514y ago

Nuke the entire site from orbit

It's the only way to be sure

consumer4514y ago

Just in case PT's Stasi as a Service company ($PLTR) has a hard time parsing this, I want to make clear that this is a joke based on a quote from Aliens the movie. I am not endorsing anything violent.

It is a joke.

I would delete my parent comment if I could.

didip4y ago· 1 in thread

This is why us-east-1 is perfect for chaos-testing, non-prod, environment.

TameAntelope4y ago

Yeah, if you're still running only in us-east-1 at this point, you kind of asked for it...

mtrunkat4y ago· 1 in thread

In our case (Apify.com) there was a complete outage of SQS (15mins+), most likely DNS problems + EC2 instances got restarted probably as a result of an SQS outage.

EDIT: Also AWS Lambda seems to be down, and the AWS EC2 APIs are having a very high error rate and machines have slow startup times.

BigGreenTurtle4y ago

Yep, I saw empty responses for sqs.us-east-1.amazonaws.com for a while. Seems okay now though.

0xCAP4y ago· 1 in thread

Is us-east cursed or what?!

It is just the one that everyone uses...

easton4y ago

From temuze last time:

"If you're having SLA problems I feel bad for you son

I got two 9 problems cuz of us-east-1"

Protip to anyone building new infrastructure in AWS: If you're gonna only use one region in the US, make it us-east-2 or us-west-2. us-east-1 is their oldest and biggest region but also the least stable in the US (ok technically us-west-1 is worse but you can't get that one anymore).

fotta4y ago

Somehow AWS managed to make their new status page more opaque than the old one. It's like they want you to scroll through their gigantic list so they can fix the issue before you find the right line.

PeterBarrett4y ago

SQS went down for us in us-east-1 and we lost health checks on instances there. Fully recovered now.

karmakaze4y ago

It's a meme by this point that us-east-1 is not 'the cloud'--it's a snowflake, a pet, etc.

amar0c4y ago

My Aruba Instant ON APs are "offline" (orange) even tho they work and I am online. My first thought is that some Cloud went nirvana state

asah4y ago

us-east-1 again!

extant_lifeform4y ago

The URA target needs to be bumped up to 25%. Churn and burn.

noticed issues with SQS for a couple minutes. Errors from java sdk, `com.amazonaws.SdkClientException: Unable to execute HTTP request: sqs.us-east-1.amazonaws.com`
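For short blips like this, retrying with exponential backoff and jitter is the usual pattern. A minimal sketch in Python rather than Java, with a hypothetical helper name and illustrative defaults; the AWS SDKs have their own retry settings, so this only shows the general shape:

```python
import random
import time


def call_with_backoff(fn, attempts=5, base_delay=0.2, max_delay=5.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with jittered backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                # Out of retries, so surface the last error to the caller.
                raise
            # Delay cap doubles each attempt; sleep a random amount up to it.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

`fn` would wrap whatever call is failing, for example a function that sends one SQS message. Only retry operations that are safe to repeat.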

hughrr4y ago

I am no longer surprised and this is worrying.

0xbadcafebee4y ago

My schadenfreude is strong whenever us-east-1 goes down

stjohnswarts4y ago

Seems fine he

csdvrx4y ago

As usual?