0 pointsjacquesm8mo ago0 comments

Every week or so we interview a company and ask them if they have a fall-back plan in case AWS goes down or their cloud account disappears. They always have this deer-in-the-headlights look. 'That can't happen, right?'

Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.

61 comments · 18 top-level

padjo8mo ago· 26 in thread

Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years. I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.

mlrtime8mo ago

>Yes it does happen but very rarely to the tune of a few hours every 5-10 years.

It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (that affect most downstream aws services) multiple times a year. Usually never the same root cause.

Not very many people realize that there are some services that still run only in us-east-1.

4 more replies

snowwrestler8mo ago

I would take the opposite view, the little AWS outages are an opportunity to test your disaster recovery plan, which is worth doing even if it takes a little time.

It’s not hard to imagine events that would keep AWS dark for a long period of time, especially if you’re just in one region. The outage today was in us-east-1. Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.

1 more reply

davedx8mo ago

> I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.

Absurd claim.

Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.

6 more replies

Waterluvian8mo ago

Without having a well-defined risk profile that they’re designing to satisfy, everyone’s just kind of shooting from the hip with their opinions on what’s too much or too little.

2 more replies

throw0101d8mo ago

> Planning for an AWS outage […]

What about if your account gets deleted? Or compromised and all your instances/services deleted?

I think the idea is to be able to have things continue running on not-AWS.

1 more reply

psychoslave8mo ago

This is planning for the future based on the best of the past. Not completely irrational, and if you can't afford a plan B, okayish.

But thinking the AWS SLA is granted forever, and that everyone should put all their eggs in it because "everyone does it", is neither wise nor safe. Those who can afford it, and there are many businesses like that out there, should have a plan B. And actually AWS should not necessarily be plan A.

Nothing is forever. Not the Roman empire, not the Inca empire, not the Chinese dynasties, not US geopolitical supremacy. It's not a question of if but when. It doesn't need to come with a lot of suffering, but if we don't systematically organise for a humanity that spreads well-being for everyone in a resilient way, we will go through a lot more tragic consequences when this or that single point of failure finally falls.

kxrm8mo ago

Completely agree, but I think companies need to be aware of the AWS risks with third parties as well. Many services were unable to communicate with customers.

Hosting your services on AWS while having a status page on AWS during an AWS outage is an easily avoidable problem.

lumost8mo ago

This depends on the scale of the company. A fully functional DR plan probably costs 10% of the infra spend + people time for operationalization. For most small/medium businesses it's a waste to plan for a once-per-3-10-year event. If you’re a large or legacy firm the above costs are trivial and in some cases it may become a fiduciary risk not to take it seriously.
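The trade-off above can be put into numbers. What follows is only a rough sketch: the function name and every figure in the example are hypothetical, people time is left out, and it assumes the DR plan would fully prevent the lost revenue. It computes how long a single outage has to last before the DR spend accumulated between outages pays for itself.

```python
def dr_break_even_hours(annual_infra_spend, dr_cost_fraction,
                        years_between_outages, revenue_per_hour):
    """Outage length (hours) at which DR spend equals the revenue it would save.

    All inputs are illustrative assumptions, not measured data.
    """
    # Money spent on DR between one outage and the next.
    dr_spend = annual_infra_spend * dr_cost_fraction * years_between_outages
    # Hours of lost revenue that the same money represents.
    return dr_spend / revenue_per_hour


# Hypothetical small company: $200k/year infra, DR at 10% of that,
# one major outage every 5 years, $2k of revenue per hour.
hours = dr_break_even_hours(200_000, 0.10, 5, 2_000)
print(f"DR pays off only if the outage lasts longer than {hours:.0f} hours")
```

With these made-up inputs the break-even point is about 50 hours, far longer than a typical regional outage. Raising the revenue per hour, as for a large firm, shrinks that number quickly.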

1 more reply

maerF0x08mo ago

Using AWS instead of a server in the closet is step 1.

Step 2 is multi-AZ

Step 3 is multi-region

Step 4 is multi-cloud.

Each company can work on its next step, but most will not have positive EROI going from 2 to 3+

2 more replies

coffeebeqn8mo ago

We started that planning process at my previous company after one such outage but it became clear very quickly that the costs of such resilience would be 2-3x hosting costs in perpetuity and who knows how many manhours. Being down for an hour was a lot more palatable to everyone

pyrale8mo ago

What if AWS dumps you because your country/company didn't please the commander in chief enough?

If your resilience plan is to trust a third party, that means you don't really care about going down, doesn't it?

Besides that, as the above poster said, the issue with top tier cloud providers (or cloudflare, or google, etc) is not just that you rely on them, it is that enough people rely on them that you may suffer even if you don't.

dangoldin8mo ago

I worked at an adtech company where we invested a bit in HA across AZ + regions. Lo and behold there was an AWS outage and we stayed up. Too bad our customers didn't and we still took the revenue hit.

Lesson here is that your approach will depend on your industry and peers. Every market will have its own philosophy and requirements here.

Spooky238mo ago

Sure, if your blog or whatever goes down, who cares. But otherwise you should be thinking about disaster planning and resilience.

AWS US-East 1 has many outages. Anything significant should account for that.

antihero8mo ago

My website running on an old laptop in my cupboard is doing just fine.

2 more replies

lucideer8mo ago

> to the tune of a few hours every 5-10 years

I presume this means you must not be working for a company running anything at scale on AWS.

1 more reply

nucleardog8mo ago

> Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years.

Not only that, but as you're seeing with this and the last few dozen outages... when us-east-1 goes down, a solid chunk of what many consumers consider the "internet" goes down. It's perceived less as "app C is down" and more as "the internet is broken today".

1 more reply

YouAreWRONGtoo8mo ago

More like 2-3 times per year and this is not counting smaller outages or simply APIs that don't do what they document.

1 more reply

croes8mo ago

Telefonica is moving its 5G core network to AWS

https://aws.amazon.com/blogs/industries/o2-telefonica-moves-...

A few hours could be a problem.

Not to mention it creates a valuable single point of failure for a hostile attack.

zaphirplane8mo ago

> tune of a few hours every 5-10 years

You know that’s not true, the last us-east-1 one was 2 years ago? But other services have bad days, and the foundational ones drag others along.

2 more replies

sreekanth8508mo ago

Depends on how serious you are with SLA's.

indoordin0saur8mo ago

Been doing this for about 8 years and I've worked through a serious AWS disruption at least 5 times in that time.

temperceve8mo ago

Depends on the business. For 99% of them this is for sure the right answer.

delfinom8mo ago

In before a meteor strike takes out an AWS region and they can't restore data.

kelseydh8mo ago

It seems like this can be mostly avoided by not using us-east-1.

DiffEq8mo ago

Maybe; but Parler had no plan and is now nothing... because AWS decided to shut them off. Always have a good plan...

jacquesmOP8mo ago

Thank you for illustrating my point. You didn't even bother to read the second paragraph.

3 more replies

hvb28mo ago· 4 in thread

> The internet got its main strengths from the fact that it was completely decentralized.

Decentralized in terms of many companies making up the internet. Yes we've seen heavy consolidation in now having less than 10 companies make up the bulk of the internet.

The problem here isn't caused by companies choosing one cloud provider over the other. It's the economies of scale leading us to a few large companies in any sector.

lentil_soup8mo ago

> Decentralized in terms of many companies making up the internet

Not companies, the protocols are decentralized, and at some point it was mostly non-companies. Anyone can hook up a computer and start serving requests, which was/is a radical concept. We've lost a lot, unfortunately.

1 more reply

jacquesmOP8mo ago

I think one reason is that people are just bad at statistics. Chance of materialization * impact = small. Sure. Over a short enough time that's true for any kind of risk. But companies tend to live for years, decades even and sometimes longer than that. If we're going to put all of those precious eggs in one basket, as long as the basket is substantially stronger than the eggs we're fine, right? Until the day someone drops the basket. And over a long enough time span all risks eventually materialize. So we're playing this game, and usually we come out ahead.

But trust me, 10 seconds after this outage is solved everybody will have forgotten about the possibility.
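The "over a long enough time span" argument is easy to check with arithmetic. Here is a minimal sketch, assuming (hypothetically) a fixed, independent chance each year that the risk materializes. The 2% figure is made up for illustration and is not an AWS statistic.

```python
def probability_at_least_once(annual_probability, years):
    """Chance that an event with a fixed, independent yearly probability
    happens at least once over the given number of years."""
    # P(at least once) = 1 - P(never) = 1 - (1 - p)^n
    return 1.0 - (1.0 - annual_probability) ** years


# Hypothetical 2% yearly chance of losing the basket.
for years in (1, 5, 10, 30):
    p = probability_at_least_once(0.02, years)
    print(f"{years:>2} years: {p:.1%}")
```

A risk that looks negligible in any single year climbs toward a coin flip over the lifetime of a long-lived company, which is the gap between the short-term and long-term view described above.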

1 more reply

chasd008mo ago

Decentralized with respect to connectivity. If a construction crew cuts a fiber bundle routing protocols will route around the damage and packets keep showing up at the destination. Or, only a localized group of users will be affected. That level of decentralization is not what we have at higher levels in the stack with AWS being a good example.

Even connectivity has its points of failure. I've touched with my own hands fiber runs that, with a few quick snips from a wire cutter, could bring sizable portions of the Internet offline. Granted that was a long time ago so those points of failure may no longer exist.

psychoslave8mo ago

Well, that is exactly what resilient distributed networks are about. Not so much the technical details we implement them through, but the social relationships and the balance of political decision power.

Be it a company or a state, concentration of power that exceeds by a large margin what is needed for its purpose is always a sure way to spread corruption, create feedback loops around a single point of failure, and buy everyone a ticket to some dystopian reality with a level of certainty that beats anything an SLA will ever give us.

ho_schi8mo ago· 3 in thread

The internet is a weak infrastructure, relying on a few big cables and data centers. And through AWS and Cloudflare it has become worse? Was it ever true that the internet is resilient? I doubt it.

Resilient systems work autonomously and can synchronize - but don't need to synchronize.

    * Git is resilient.
    * Native E-Mail clients - with local storage enabled - are somewhat resilient.
    * A local package repository is - somewhat resilient.
    * A local file-sharing app (not Warp/ Magic-Wormhole -> needs relay) is resilient if it uses only local WiFi or Bluetooth.

We're building weak infrastructure. A lot of stuff shall work locally and only optionally use the internet.

CaptainOfCoit8mo ago

The internet seems resilient enough for all intents and purposes, we haven't had a global internet-wide catastrophe impacting the entire internet as far as I know, but we have gotten close to it sometimes (thanks BGP).

But the web, that's the fragile, centralized and weak point currently, and seems to be what you're referring to rather.

Maybe nitpicky, but I feel like it's important to distinguish between "the web" and "the internet".

3 more replies

bombcar8mo ago

The Internet was much more resilient when it was just that - an internetwork of connected networks; each of which could and did operate autonomously.

Now we have computers that shit themselves if DNS isn’t working, let alone LANs that can operate disconnected from the Internet as a whole.

And partially working, or indicating that it works (when it doesn’t), is usually even worse.

smaudet8mo ago

If you take into account "the web" vs "the internet", as others have mentioned:

Yes the Internet has stayed stable.

The Web, as defined by a bunch of servers running complex software, probably much less so.

Just the fact that it must necessarily be more complex means that it has more failure modes...

rco87868mo ago· 3 in thread

If AWS goes down unexpectedly and never comes back up it's much more likely that we're in the middle of some enormous global conflict where day to day survival takes priority over making your app work than AWS just deciding to abandon their cloud business on a whim.

CaptainOfCoit8mo ago

Can also be much easier than that. Say you live in Mexico, hosting servers with AWS in the US because you have US customers. But suddenly the government decides to place sanctions on Mexico, and US entities are no longer allowed to do business with Mexicans, so all Mexican AWS accounts get shut down.

For you as a Mexican the end result is the same, AWS went away, and considering there already is a list of countries that cannot use AWS, GitHub and a bunch of other "essential" services, it's not hard to imagine that that list might grow in the future.

chasd008mo ago

What's most realistic is something like a major scandal at AWS. The FBI seizes control and no bytes come in or out until the investigation is complete. A multi-year total outage, effectively.

apexalpha8mo ago

Or Trump decided your country does not deserve it.

1 more reply

raincole8mo ago· 2 in thread

Most companies just aren't important enough to worry about "AWS never come back up." Planning for this case is just like planning for a terrorist blowing up your entire office. If you're the Pentagon sure you'd better have a plan for that. But most companies are not the Pentagon.

paulddraper8mo ago

Exactly.

And FWIW, "AWS is down"....only one region (out of 36) of AWS is down.

You can do the multi-region failover, though that's still possibly overkill for most.

Frieren8mo ago

> Most companies just aren't important enough to worry about "AWS never come back up."

But a large enough number of "not too big to fail" companies becomes a too-big-to-fail event. Too many medium-sized companies have a total dependency on AWS or Google or Microsoft, if not several at the same time.

We live in an increasingly fragile society, one step closer to critical failure because big tech is not regulated in the same way as other infrastructure.

2 more replies

anal_reactor8mo ago· 2 in thread

First, planning for an AWS outage is pointless. Unless you provide a service tied to national security or something, your customers are going to understand that when there's a global internet outage your service doesn't work either. The cost of maintaining a working failover across multiple cloud providers is just too high compared to the potential benefits. It's astonishing that so few engineers understand the fact that maintaining a technically beautiful solution costs time and money, which might not make a justified business case.

Second, preparing for the disappearance of AWS is even more silly. The chances that it will happen are orders of magnitude smaller than the chance that the cost of preparing for such an event will kill your business.

Let me ask you: how do you prepare your website for the complete collapse of western society? Will you be able to adapt your business model to a post-apocalyptic world where it's only cockroaches?

jacquesmOP8mo ago

> Let me ask you: how do you prepare your website for the complete collapse of western society?

How did we go from 'you could lose your AWS account' to 'complete collapse of western society'? Do websites even matter in that context?

> Second, preparing for the disappearance of AWS is even more silly.

What's silly is not thinking ahead.

psychoslave8mo ago

>Let me ask you: how do you prepare your website for the complete collapse of western society?

That's the main topic that has been going through my mind lately, if you replace "my website" with "Wikimedia movement".

We need a far better social, juridical and technical architecture regarding resilience, as hostile agendas are on the rise at all levels against sourced, trackable, global volunteer community knowledge bases.

JamesSwift8mo ago· 1 in thread

What good is jumping through extraordinary hoops to be multi cloud if docker, netlify, stripe, intercom, npm, etc all go down along with us-east-1?

fisf8mo ago

Because you should not depend on one payment provider and pull unvendored images, packages, etc directly into your deployment.

There is no reason to have such brittle infra.

1 more reply

freetanga8mo ago· 1 in thread

Additionally I find that most hyperscalers are trying to lock you in, by tailoring industry-standard services with custom features which end up growing roots and making a multi-vendor setup or a lift and shift problematic.

Need to keep eyes peeled at all levels of the organization, as many of these enter through day-to-day…

jacquesmOP8mo ago

Yes, they're really good at that. This is just 'embrace and extend'. We all know the third.

bschne8mo ago· 1 in thread

I find this hard to judge in the abstract, but I'm not quite convinced the situation for the modal company today is worse than their answer to "what if your colo rack catches fire" would have been twenty years ago.

jacquesmOP8mo ago

> "what if your colo rack catches fire"

I've actually had that.

https://www.webmasterworld.com/webmaster/3663978.htm

1 more reply

pmontra8mo ago

In the case of a customer of mine the AWS outage manifested itself as Twilio failing to deliver SMSes. The fallback plan has been disabling the rotation of our two SMS providers and sending all messages with the remaining one. But what if the other one had something on AWS too? Or maybe both of them have something else vital on Azure, or Google Cloud, which will fail next week and stop our service. Who knows?

For small and medium sized companies it's not easy to perform accurate due diligence.
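The fallback described above was done by hand (switching off the rotation), but the same idea can be automated per message. The sketch below is hypothetical: the provider callables stand in for real SMS client calls and the names are made up. It tries each provider in order and reports which one accepted the message. It does nothing about the deeper problem raised here, that both providers may share a hidden dependency.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every configured provider rejected the message."""


def send_with_failover(message: str,
                       providers: Sequence[Callable[[str], None]]) -> int:
    """Try each provider in order; return the index of the one that worked."""
    errors = []
    for index, send in enumerate(providers):
        try:
            send(message)      # a provider signals failure by raising
            return index
        except Exception as exc:
            errors.append(exc)  # remember why it failed, then try the next
    raise AllProvidersFailed(errors)


# Stand-ins for two real SMS providers (hypothetical).
delivered = []

def provider_down(message: str) -> None:
    raise ConnectionError("provider unreachable")

def provider_up(message: str) -> None:
    delivered.append(message)

used = send_with_failover("Your code is 1234", [provider_down, provider_up])
print(f"delivered via provider #{used}")
```

In practice the per-provider calls would also need timeouts, otherwise a hanging provider delays every message instead of failing over.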

rglover8mo ago

It would behoove a lot of devs to learn the basics of Linux sysadmin and how to set up a basic deployment with a VPS. Once you understand that, you'll realize how much of "modern infra" is really just a mix of over-reliance on AWS and throwing compute at underperforming code. Our addiction to complexity (and burning money on the illusion of infinite stability) is already strangling us and will continue to.

OfflineSergio8mo ago

I don't think it's worth it, but let's say I did it: what if others that I depend on don't do it? I still won't be fully functional and only one of us has spent a bunch of money.

csomar8mo ago

> Now imagine for a bit that it will never come back up.

Given the current geopolitical circumstances, that's not a far fetched scenario. Especially for us-east-1; or anything in the D.C. metro area.

bongodongobob8mo ago

You simply cannot avoid it. There are so many applications and services that use AWS. Companies can't sit on 100% in-house software stacks.

invalidusernam38mo ago

What if the fall-back also never comes back up?

kbar138mo ago

the correct answer for those companies is "we have it on the roadmap but for right now accept the risk"

Keyframe8mo ago

At least we've got github steady with our code and IaaC, right? Right?!

saltyoldman8mo ago

Contrast this with the top post.
