365 days * 24 hours * 0.0001 is roughly 53 minutes of allowed downtime per year, so an outage of ~8 hours means they already lost the 99.99% status.
If the server didn't work, the tool to measure it didn't work either! Genius.
When your SLA only holds because the measurement window is a joke, you know you goofed.
"Five nines, but you didn't say which nines. 89.9999...", etc.
The duration of the outage relative to that measurement window is (8 h / 33,602 h) * 100% ≈ 0.024%, so the uptime is 99.976%: slightly worse than 99.99%, but clearly better than 99.90%.
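If you want to sanity-check the arithmetic, here's a quick sketch; the 8 h outage and the 33,602 h window are the numbers from above, everything else is just the standard availability formula:

    # Quick availability arithmetic; numbers match the thread above.
    HOURS_PER_YEAR = 365 * 24  # 8,760

    def allowed_downtime_hours(target: float, window_hours: float = HOURS_PER_YEAR) -> float:
        """Downtime budget for a given availability target over a window."""
        return window_hours * (1 - target)

    def availability(outage_hours: float, window_hours: float) -> float:
        """Observed availability given total outage time in a window."""
        return 1 - outage_hours / window_hours

    for target in (0.999, 0.9999, 0.99999):
        print(f"{target:.3%} over a year allows "
              f"{allowed_downtime_hours(target):.2f} h of downtime")

    # ~8 h outage over the ~33,602 h window discussed above:
    print(f"observed availability: {availability(8, 33_602):.4%}")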
They used to be five nines, and people used to say it wasn't worth their while to prepare for an outage. With less than four nines, the perception might shift, but likely not enough to induce a mass migration to outage-resistant designs.
From reading the EC2 SLA I don't think this is covered. https://aws.amazon.com/compute/sla/
The reason is that the SLA says "For the Instance-Level SLA, your Single EC2 Instance has no external connectivity." Instances that were already created kept working, so this isn't covered; the SLA doesn't cover creation of new instances.
[0] Fraction is ~ 1
The refund they give you isn't going to make a dent in the lost revenue.
We were more honest, and it probably cost us at least once in not getting business.
If you as a customer ask for five 9s per month, with a service credit of 10% of at-risk fees for missing it, on a deal where my GM is 30%, I can just amortise that cost and bake it into my fee.
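A back-of-the-envelope sketch of that amortisation; the 10% credit and 30% gross margin come from the comment above, the fee and miss probability are made up:

    # Rough SLA-credit amortisation. All inputs except the 10% credit
    # and 30% GM are illustrative guesses, not real contract numbers.
    monthly_fee = 100_000      # hypothetical at-risk fee per month
    credit_rate = 0.10         # service credit: 10% of at-risk fees
    gross_margin = 0.30        # seller's gross margin on the deal
    p_miss = 0.05              # guess: chance of missing five 9s in a given month

    expected_credit = monthly_fee * credit_rate * p_miss
    print(f"expected credit cost: ${expected_credit:,.0f}/month "
          f"({expected_credit / monthly_fee:.2%} of the fee)")
    print(f"gross margin on the deal: ${monthly_fee * gross_margin:,.0f}/month")
    # With these made-up numbers the expected payout is ~0.5% of the fee,
    # which a 30% margin can absorb -- or it can simply be priced in.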
I don't think anyone would quote availability as "availability across every region I'm in", would they?
While this is their most important region, there are a lot of clients that are probably unaffected if they're not in us-east-1.
They COULD be affected even if they don't have anything there, because of the AWS services relying on it. I'm just saying that most customers that are multi-region should be able to ride out their east region being down and just keep humming along.
Our company decided years ago to use any region other than us-east-1.
Of course, that doesn't help with services that are 'global', which usually means us-east-1.
1. The main one: it's the cheapest region, so when people select where to run their services they pick it because "why pay more?"
2. It's the default. Many tutorials and articles online show it in the examples, and many deployment and other devops tools use it as a default value (see the sketch below).
3. Related to n.2. AI models generate cloud configs and code examples with it unless asked otherwise.
4. Its location makes it Europe-friendly, too. If you have a small service and you'd like to capture a European and North American audience from a single location, us-east-1 is a very good choice.
5. Many Amazon features are available in that region first and then spread out to other locations.
6. It's also a region where other cloud providers and hosting companies offer their services; often there's space available in a data center not far from the racks AWS runs on. In hybrid-cloud scenarios where you want to connect bits of your infrastructure running on AWS to some physical hardware via dedicated fiber-optic lines, us-east-1 is the place to do it.
7. Yes, for AWS deployments it's an experimental location that has higher risks of downtime compared to other regions, but in practice when a sizable part of us-east-1 is down other AWS services across the world tend to go down, too (along with half of the internet). So, is it really that risky to run over there, relatively speaking?
It's the world's default hosting location, and today's outages show it.
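To make point 2 concrete, here's a minimal boto3 sketch; the us-east-2 fallback and the env-var handling are just illustrative choices, not a recommendation of any particular region:

    # Sketch of point 2: tutorials and generated configs tend to hardcode us-east-1.
    import os
    import boto3

    # What many copy-pasted examples do:
    s3_tutorial_style = boto3.client("s3", region_name="us-east-1")

    # Making the region an explicit, overridable choice instead:
    region = os.environ.get("AWS_REGION", "us-east-2")  # fallback is illustrative
    s3 = boto3.client("s3", region_name=region)
    print(s3.meta.region_name)  # confirms which region this client targets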
In every SKU I've ever looked at / priced out, all of the AWS NA regions have ~equal pricing. What's cheaper specifically in us-east-1?
> Europe-friendly
Why not us-east-2?
> Many Amazon features are available in that region first and then spread out to other locations.
Well, yeah, that's why it breaks. Using not-us-east-1 is like using an LTS OS release: you don't get the newest hotness, but it's much more stable as a "build it and leave it alone" target.
> It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks.
This is a better argument, but in practice, it's very niche — 2-5ms of speed-of-light delay doesn't matter to anyone but HFT folks; anyone else can be in a DC one state away with a pre-arranged tier1-bypassing direct interconnect, and do fine. (This is why OVH is listed on https://www.cloudinfrastructuremap.com/ despite being a smaller provider: their DCs have such interconnects.)
For that matter, if you want "low-latency to North America and Europe, and high-throughput lowish-latency peering to many other providers" — why not Montreal [ca-central-1]? Quebec might sound "too far north", but from the fiber-path perspective of anywhere else in NA or Europe, it's essentially interchangeable with Virginia.
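A rough sanity check on the latency point; the distances are ballpark great-circle figures and real fiber paths run longer, so treat this as an order-of-magnitude sketch:

    # Order-of-magnitude fiber latency: light in fiber covers roughly
    # 200 km per millisecond (about 2/3 of c). Distances below are rough
    # great-circle estimates; real fiber routes add 20-50% or more.
    FIBER_KM_PER_MS = 200.0

    routes_km = {
        "Virginia <-> New York": 500,
        "Virginia <-> London": 6_000,
        "Montreal <-> New York": 540,
        "Montreal <-> London": 5_200,
    }

    for route, km in routes_km.items():
        one_way_ms = km / FIBER_KM_PER_MS
        print(f"{route}: ~{one_way_ms:.1f} ms one way, ~{2 * one_way_ms:.1f} ms RTT")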
This is the biggest one, isn't it? I thought Route 53 isn't even available in any other region.
And looked at from the perspective of an individual company, as a customer of AWS, the occasional outage is usually an acceptable part of doing business.
However, today we’ve seen a failure that has wiped out a huge number of companies globally, used by hundreds of millions - maybe billions - of people, all at the same time. AWS has something like 30% of the infra market, so you can imagine (and most people reading this will to some extent have experienced) the scale of the disruption.
And the reality is that whilst bigger companies, like Zoom, are getting a lot of the attention here, we have no idea what other critical and/or life and death services might have been impacted. As an example that many of us would be familiar with, how many houses have been successfully burgled today because Ring has been down for around 8 out of the last 15 hours (at least as I measure it)?
I don’t think that’s OK, and I question the wisdom of companies choosing AWS as their default infra and hosting provider. It simply doesn’t seem to be very responsible to be in the same pond as so many others.
Were I a legislator I would now be casting a somewhat baleful eye at AWS as a potentially dangerous monopoly, and see what I might be able to do to force organisations to choose from amongst a much larger pool of potential infra providers and platforms, and I would be doing that because these kinds of incidents will only become more serious as time goes on.
I'm not sure a lot of companies have really looked at the cost of multi-region resiliency and hot failovers vs. being down for 6 hours every year or so, and then written that check.
I would think a lot of clients would want that.
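A crude way to frame that trade-off; every number below is a placeholder rather than anyone's real figures:

    # Crude single-point comparison of "pay for multi-region" vs "eat the
    # outage". Every number here is a placeholder for illustration only.
    hourly_revenue_at_risk = 50_000             # revenue lost per hour of downtime
    expected_outage_hours_per_year = 6          # roughly one bad day a year
    multi_region_extra_cost_per_year = 400_000  # extra infra + engineering effort

    expected_downtime_cost = hourly_revenue_at_risk * expected_outage_hours_per_year
    print(f"expected downtime cost/year:  ${expected_downtime_cost:,.0f}")
    print(f"multi-region extra cost/year: ${multi_region_extra_cost_per_year:,.0f}")
    # With these made-up numbers, writing the 6-hour check is the cheaper
    # option; the calculus flips as the revenue at risk goes up.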
On AWS's side, I think us-east-1 is legacy infrastructure because it was the first region, and things have to be made replicable.
For others on AWS who aren't AWS themselves: because AWS outbound data transfer is exorbitantly expensive. I'm building on AWS, and AWS's outbound data transfer costs are a primary design consideration for potential distribution/replication of services.
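For a sense of scale, a rough sketch; the $0.09/GB figure is the commonly quoted first-tier internet egress rate (check current pricing), and the traffic volume is made up:

    # Ballpark AWS egress cost. $0.09/GB is the commonly cited first-tier
    # internet data-transfer-out rate; verify against current pricing.
    # The monthly volume is a made-up example.
    egress_rate_per_gb = 0.09
    monthly_egress_gb = 50_000  # e.g. 50 TB/month out to the internet

    monthly_cost = monthly_egress_gb * egress_rate_per_gb
    print(f"~${monthly_cost:,.0f}/month just for data leaving AWS")
    # Cross-region replication traffic is billed too (at a lower per-GB
    # rate), which is why replication topology becomes a design constraint.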
us-east-1 is so the government can slurp up all the data. /tin-foil hat
The other concerns could have to do with the impact of failover to the backup regions.
Our stuff is all in us-east-1. Ops was a total shitshow today (mostly because many 3rd-party services besides AWS were down or slow), but our prod service was largely "ok": fewer than 5% of customers were significantly impacted, because existing instances got to keep running.
I think we got a bit lucky, but no actual SLAs were violated. I tagged the postmortem as Low impact despite the stress this caused internally.
We definitely learnt something here about both our software and our 3rd party dependencies.
That number had dropped to 1,190 by 4:22 AM Pacific (7:22 AM Eastern).
However, that number is back up with a vengeance: 9,230 reports as of 9:32 AM Pacific (12:32 PM Eastern).
Part of that could be explained by more people making reports as the U.S. west coast awoke. But I also have a feeling that they aren't yet on top of the problem.
Amazon says service is now just "degraded" and recovering, but searching for products on Amazon.com still does not work for me. https://health.aws.amazon.com/health/status
When this is fixed, I am very interested in seeing recorded spend for Sunday and Monday.
https://health.aws.amazon.com/health/status?path=open-issues
The closest to their identification of a root cause seems to be this one:
"Oct 20 8:43 AM PDT We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations."
I.e. lots of folks who weren't expected to work today, and/or the scramble to round them up to work the problem.
In my experience, the teams at AWS are pretty diverse, reflecting the diversity in the area. Even if a lot of the Indian employees are taking the day off, there should be plenty of other employees to back them up. A culturally diverse employee base should mitigate this sort of problem.
If it does turn out that the outage was prolonged due to one or two key engineers being unreachable for the holiday, that's an indictment of AWS for allowing these single points of failure to occur, not for hiring Indians.
There are 153k Amazon employees based in India according to LinkedIn.
If it doesn’t stop, that means it has a battery backup. But you can still make life more bearable. Switch off all your breakers (you probably have a master breaker for this), then open up the alarm box and either pull the battery or - if it’s non-removable - take the box off the wall, put it in a sealed container, and put the sealed container somewhere… else. Somewhere you can’t hear it or can barely hear it until the battery runs down.
Meanwhile you can turn the power back on but make sure you’ve taped the bare ends of the alarm power cable, or otherwise electrically insulated them, until you’re able to reinstall it.
I would honestly do your box option. Stuff it in there with some pillows and leave it in the shed for a while.
By the way, Twilio is also down, so all those login SMS verification codes aren’t being delivered right now.
Rest and vest CEOs
He got a lot of impossible shit done as COO.
They do need a more product minded person though. If Jobs was still around we’d have smart jewelry by now. And the Apple Watch would be thin af.
A lot of these are second order dependencies like Astronomer, Atlassian, Confluent, Snowflake, Datadog, etc... the joys of using hosted solutions to everything.
When the NAS shit the bed, we lost half of production and all our run books. And we didn’t have autoscaling yet. Wouldn’t for another 2 years.
Our group is a bunch of people that has no problem getting angry and raising voices. The whole team was so volcanically angry that it got real quiet for several days. Like everyone knew if anyone unclenched that there would be assault charges.
I remember a meme years ago about Nestle. It was something like: GO ON, BOYCOTT US - I BET YOU CAN’T - WE MAKE EVERYTHING.
The same meme would work for AWS today.
Not really, there are enough alternatives.
And it’s not like there aren’t other brands of chocolate either…
https://en.wikipedia.org/wiki/List_of_military_slang_terms#F...
Not to be confused with "Foobar" which apparently originated at MIT: https://en.wikipedia.org/wiki/Foobar
TIL, an interesting footnote about "foo" there:
'During the United States v. Microsoft Corp. trial, evidence was presented that Microsoft had tried to use the Web Services Interoperability organization (WS-I) as a means to stifle competition, including e-mails in which top executives including Bill Gates and Steve Ballmer referred to the WS-I using the codename "foo".[13]'
There are documented uses of FUBAR back into the '40s.
Somewhat common. Comes from the US military in WW2.