1) Can you make your on-prem infrastructure go down less than Amazon's?
2) Is it worth it?
In my experience, most people grossly underestimate how expensive it is to build reliable infrastructure, while at the same time overestimating how important it is for their services to run uninterrupted.
--
EDIT: I am not arguing that you shouldn't build your own, more reliable infrastructure. AWS is just a point on a spectrum of possible compromises between cost and reliability. It might not be right for you. If it is too expensive -- go for cheaper options with less reliability.
If it is too unreliable -- go build your own, but make sure you are not making a huge mistake, because you may not understand what it actually costs to build to AWS's level.
For example, personally, not having to focus on infra reliability makes it possible for me to focus on other things that are more important to my company. Do I care about outages? Of course I do, but I understand that doing this better than AWS does would cost me a huge amount of focus on something that is not a core goal of what we are doing. I would rather spend that time thinking about how to hire/retain better people and how to make my product better.
And adding all the complexity of running this infra to my company would make the entire organisation less flexible, which is also a cost.
So you can't look at the cost of running the infra like a bill of materials for parts and services.
And if there is an outage, it is good to know there is a huge organisation trying to fix it while my small organisation can focus on preparing for what to do when it comes back up.
As a counterpoint, though, my last place had a large Java app, split between colo'd metal and AWS. Seemed like the colo'd stuff failed more (bad RAM mostly, a few CPUs, and an occasional PSU). Entirely anecdotal.
Obviously it depends on what you need, but for a small to medium web app that needs a load-balancer, a few app servers, a database and a cache, yes, absolutely - all of these have been solved problems for over a decade and aren't rocket science to install & maintain.
> Is it worth it?
I'd argue that the "worth" is less about immunity to occasional outages and more about the continuous savings in price per performance & not having to pay for bandwidth.
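The bandwidth point above is easy to sanity-check with a back-of-envelope comparison: metered cloud egress scales with traffic, while a colo port is typically a flat commit. All prices below are illustrative assumptions, not quotes.

```python
# Back-of-envelope comparison of metered cloud egress vs. a flat colo
# bandwidth commit. Both rates are made-up assumptions for illustration.

CLOUD_EGRESS_PER_GB = 0.09   # assumed $/GB metered egress rate
COLO_FLAT_PER_MONTH = 500.0  # assumed flat fee for an unmetered port


def monthly_egress_cost_cloud(gb_out: float) -> float:
    """Metered egress: cost scales linearly with traffic."""
    return gb_out * CLOUD_EGRESS_PER_GB


def breakeven_gb() -> float:
    """Traffic level above which the flat colo port is cheaper."""
    return COLO_FLAT_PER_MONTH / CLOUD_EGRESS_PER_GB


print(monthly_egress_cost_cloud(10_000))  # 10 TB/month of egress: 900.0
print(round(breakeven_gb()))              # 5556
```

The interesting output is the breakeven: under these assumed rates, past a few TB per month the flat port wins, which is the "continuous savings" the comment is pointing at.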
> overestimate how important it is for their services to run uninterrupted.
Agreed. However when running on-prem, should your service go down and you need it back up, you can do something about it. With the cloud, you have no choice but to wait.
You need multiple physical links running to different ISPs, because builders working on properties further down the street could accidentally cut through your fibre. Or the ISP itself could suffer an outage.
You need a backup generator, and to be a short distance from a petrol station so you can refuel quickly and regularly during longer power outages. You absolutely do not want to run out of diesel!
You need redundancy for every piece of hardware AND you need to test that failover works, because the last thing you need is for a core switch to fail and traffic not to route over the secondary core switch as expected.
You need multiple air-con units, powered off different mains inputs, so that if the electrics fail on one unit it doesn't take out the others. I guarantee you that if the air con fails, it will be on the hottest day of the year, and no amount of portable units will stop your servers from overheating.
You need a beefy UPS with multiple batteries. Ideally multiple UPSs, with each UPS powering a different rail on your racks, so that if one UPS fails your hardware is still powered from the other rail. And you need to regularly check the battery status and loads on the UPS. Remember that the backup generator takes a second or two to kick in, so you need something to keep power to the servers and networking hardware uninterrupted. And since all your hardware is powered via the UPS, if that dies you still lose power even if the building is powered.
And then you need to duplicate all of the above in a second location, just in case the first location still goes down.
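The UPS-to-generator handover described above can be sanity-checked with simple arithmetic: usable battery energy divided by load gives the bridge time. The battery and load figures below are made-up examples, not a sizing guide.

```python
# Rough UPS runtime check for the generator handover described above.
# Battery capacity, load, and efficiency are illustrative assumptions.

def ups_runtime_minutes(battery_wh: float, load_w: float,
                        efficiency: float = 0.9) -> float:
    """Approximate runtime in minutes: usable energy divided by draw."""
    return battery_wh * efficiency / load_w * 60


# Example: two 1500 Wh battery strings feeding a 2 kW rack.
runtime = ups_runtime_minutes(battery_wh=3000, load_w=2000)
print(round(runtime, 1))  # 81.0 minutes of bridge time

# The generator only needs seconds to kick in, but you want minutes of
# margin in case it needs a retry or a manual start.
assert runtime > 5
```

The point of doing this on paper is the last comment in the code: you size for minutes of margin, not for the nominal two-second cutover.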
By the way, all of the possible failure points I’ve raised above HAVE failed on me when managing HA on prem.
The reason people move to the cloud for HA is because rolling your own is like rolling your own encryption: it’s hard, error prone, expensive, and even when you have the right people on the team there’s still a good chance you’ll fuck it up. AWS, for all its faults, does make this side of the job easier.
At my last job we provided redundant paths (including entry to your building) as an add-on service. So you might not need two ISPs if you're only worried about fiber cuts. You could still be worried about things like "we think all Juniper routers in the world will die at the exact same instant", in which case you need to make sure you pick an ISP that uses Cisco equipment. And of course, it's possible that your ISP pushes a bad route and breaks the entirety of their link to the rest of the Internet.
I don't see why the petrol station needs to be a short distance away. Unless the plan is to walk to the petrol station and back (which should not be the plan[1]), anyplace within reasonable driving distance should do.
[1] long duration electrical outages will often take out everything a short distance away, and the petrol stations usually have electric pumps.
If you are going to the level of the above, you go with co-location in purpose-built centers at a wholesale level. The "layer 1" is all done to the specs you state and you don't have to worry about it.
On-prem rarely actually means physically on-prem at any scale beyond a small IT office room. It means co-locating in purpose-built datacenters.
I'm sure examples exist, but the days of large corporate datacenters are pretty much over - it's just inertia keeping the old ones going before they move to somewhere like Equinix or DRT. With the wholesalers you can basically design things to spec, and they build out a 10k sq ft, 2MW critical-load room for you a few months later.
A few organizations will find it worthwhile to continue to build at this scale (e.g. Visa, the government) but it's exceptionally small.
My building has a natural gas backup generator.
If you're outsourcing that, you'd likely have to pay a boatload just for someone to be available for help, let alone the actual tasks themselves. Like you said, if you're on-prem and something goes down, you can do something. But you've gotta have the personnel to actually do something.
That said, I think you're spot-on as long as you have the skillset already.
I hear this argument a lot, but every startup I've been involved with had a full-time DevOps engineer wrangling Terraform & YAML files - that same engineer can be assigned to manage the bare-metal infrastructure.
Backup is cheap when you're focused on what you're backing up.
In this case, the game isn't "going down less than Amazon", it's about going down uncorrelated to Amazon. Though that's getting harder!
"In more than one way" doesn't have to be local, but it may be across multiple cloud services. Still, "local" is nice in that it doesn't require the Internet. ("The Internet" doesn't tend to go down, but the portion you are on certainly can.) Of course, as workers disperse, "local" means less and less nowadays.
It's possible to go down in a mostly uncorrelated way to Amazon by just being down all the time.
Obviously this is implicit in your comment, but I'll say it anyway: your backups need to actually work when you need them. You need to test them (really test them) to make sure they're not secretly non-functional in some subtle way when Amazon is really down.
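One way to "really test" a backup, sketched below: restore it into a scratch directory and compare checksums against the live data, rather than trusting that the archive exists. The tar-based backup format and paths are hypothetical, not anyone's actual setup.

```python
# Minimal sketch of a backup restore test: extract the archive into a
# temporary directory and verify every live file has a byte-identical
# restored copy. The tar format here is an assumption for illustration.

import hashlib
import pathlib
import tarfile
import tempfile


def sha256(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_backup(archive: str, source_dir: str) -> bool:
    """Restore the archive and confirm every source file matches it."""
    src = pathlib.Path(source_dir)
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        for f in src.rglob("*"):
            if f.is_file():
                restored = pathlib.Path(scratch) / f.relative_to(src.parent)
                if not restored.is_file() or sha256(restored) != sha256(f):
                    return False
    return True
```

This catches the subtle failure modes the comment is warning about: an archive that restores but is missing files, or one whose contents silently diverged from production.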
Over the last two years, my track record has destroyed AWS. I've got a single Mac Mini with two VMs on it, plugged into a UPS with enough power to keep it running for about three hours. It's never had a second of unplanned downtime.
About 15 years ago I got sick of maintaining my own stuff. I stopped building Linux desktops and bought an Apple laptop. I moved my email, calendars, contacts, chat, photos, etc, to Google. But lately I've swung 180 degrees and have been undoing all those decisions. It's not as much of a PITA as I remember. Maybe I'm better at it now? Or maybe it will become a PITA and I'll swing right back.
EDIT: I realize you're talking in a commercial sense and I'm talking about a homelab sense. Still, take my anecdote for what it's worth. :D
I think you're right about what we over- & under-estimate, but that we also under-estimate the inflection point for when it makes sense to begin relying on major cloud services. Put another way: we over-estimate our requirements, causing us to pessimistically reach for services that have problems that we'd otherwise never have.
For extra safety, and extra work, you could even use Azure as a backup if you're not locked into AWS.
Global services such as Route 53, Cognito, the default cloud console, and CloudFront are managed out of us-east-1.
If us-east-1 is unavailable, as is commonly the case, and you depend on those systems, you are also down.
It does not matter if you're in timbuktu-1; you are dead in the water.
It is a myth that Amazon availability zones are truly independent.
Please stop blaming the victim: you can do everything right and still fail if you are not aware of this, and you are perpetuating that unawareness.
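One practical takeaway from the us-east-1 point is to audit your own dependency list for services that look multi-region but are pinned to a single region anyway. A sketch, where the inventory format is made up and the pinned-service list just restates the services named above:

```python
# Sketch: flag hidden us-east-1 dependencies in a service inventory.
# The inventory format is hypothetical; the pinned set reflects the
# global services the comment above says are managed out of us-east-1.

US_EAST_1_PINNED = {"route53", "cognito", "cloudfront", "console"}


def hidden_single_region_deps(inventory: dict[str, str]) -> list[str]:
    """Return services that depend on us-east-1, either because they
    are deployed there or because they are a pinned global service."""
    return sorted(
        name for name, region in inventory.items()
        if region == "us-east-1" or name in US_EAST_1_PINNED
    )


deps = hidden_single_region_deps({
    "app-servers": "eu-west-1",
    "route53": "global",
    "cloudfront": "global",
    "backups": "us-east-1",
})
print(deps)  # ['backups', 'cloudfront', 'route53']
```

The surprise for most teams is the "global" entries: they don't carry a region tag in anyone's diagram, yet they fail with us-east-1 all the same.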
> are not truly independent of each other
Indeed. They are even on the same planet!
> please stop blaming the victim
Excuse me?
It's now hard to say how frequently Amazon's infrastructure goes down. The incident rate seems to have increased.
...My home Internet is even scoring better than Amazon right now, in fact. Yours probably is too.
In my experience problem number 3 is the hardest to solve.