1) Can you make your on-prem infrastructure go down less than Amazon's?
2) Is it worth it?
In my experience, most people grossly underestimate how expensive it is to build reliable infrastructure, while at the same time overestimating how important it is for their services to run uninterrupted.
--
EDIT: I am not arguing that you shouldn't build your own, more reliable infrastructure. AWS is just a point on a spectrum of possible compromises between cost and reliability. It might not be right for you. If it is too expensive -- go for cheaper options with less reliability.
If it is too unreliable -- go build your own, but make sure you are not making a huge mistake, because you may not understand what it actually costs to build to AWS's level.
For example, personally, not having to focus on infra reliability makes it possible for me to focus on other things that are more important to my company. Do I care about outages? Of course I do, but I understand that doing this better than AWS does would cost me a huge amount of focus on something that is not a core goal of what we are doing. I would rather spend that time thinking about how to hire/retain better people and how to make my product better.
And adding all the complexity of running this infra to my company would make the entire organisation less flexible, which is also a cost.
So you can't look at the cost of running the infra like a bill of materials for parts and services.
And if there is an outage, it is good to know there is a huge organisation trying to fix it while my small organisation can focus on preparing for what to do when it comes back up.
As a counterpoint, though, my last place had a large Java app, split between colo'd metal and AWS. Seemed like the colo'd stuff failed more (bad RAM mostly, a few CPUs, and an occasional PSU). Entirely anecdotal.
Obviously it depends on what you need, but for a small to medium web app that needs a load-balancer, a few app servers, a database and a cache, yes, absolutely - all of these have been solved problems for over a decade and aren't rocket science to install & maintain.
> Is it worth it?
I'd argue that the "worth" is less about immunity to occasional outages and more about the continuous savings in price per performance & not having to pay for bandwidth.
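The bandwidth point above is easy to sanity-check with a back-of-envelope comparison: metered cloud egress scales with traffic, while a colo port is typically a flat commit. All prices below are illustrative assumptions, not quotes.

```python
# Back-of-envelope comparison of metered cloud egress vs. a flat colo
# bandwidth commit. Both rates are made-up assumptions for illustration.

CLOUD_EGRESS_PER_GB = 0.09   # assumed $/GB metered egress rate
COLO_FLAT_PER_MONTH = 500.0  # assumed flat fee for an unmetered port


def monthly_egress_cost_cloud(gb_out: float) -> float:
    """Metered egress: cost scales linearly with traffic."""
    return gb_out * CLOUD_EGRESS_PER_GB


def breakeven_gb() -> float:
    """Traffic level above which the flat colo port is cheaper."""
    return COLO_FLAT_PER_MONTH / CLOUD_EGRESS_PER_GB


print(monthly_egress_cost_cloud(10_000))  # 10 TB/month of egress: 900.0
print(round(breakeven_gb()))              # 5556
```

The interesting output is the breakeven: under these assumed rates, past a few TB per month the flat port wins, which is the "continuous savings" the comment is pointing at.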
> overestimate how important it is for their services to run uninterrupted.
Agreed. However when running on-prem, should your service go down and you need it back up, you can do something about it. With the cloud, you have no choice but to wait.
You need multiple physical links running to different ISPs, because builders working on properties further down the street could accidentally cut through your fibre. Or the ISP itself could suffer an outage.
You need a backup generator, and to be a short distance from a petrol station so you can refuel quickly and regularly during longer power outages. You absolutely do not want to run out of diesel!
You need redundancy for every piece of hardware AND you need to test that failover works, because the last thing you need is for a core switch to fail and traffic not to route over the secondary core switch as expected.
You need multiple air-con units, powered off different mains inputs, so that if the electrics fail on one unit it doesn't take out the others. I guarantee you that if the air con fails, it will be on the hottest day of the year, and no amount of portable units will stop your servers from overheating.
You need a beefy UPS with multiple batteries. Ideally multiple UPSs, with each UPS powering a different rail on your racks, so that if one UPS fails your hardware is still powered from the other rail. And you need to regularly check the battery status and loads on the UPS. Remember that the backup generator takes a second or two to kick in, so you need something to keep power to the servers and networking hardware uninterrupted. And since all your hardware is powered via the UPS, if that dies you still lose power even if the building is powered.
And then you need to duplicate all of the above in a second location, just in case the first location still goes down.
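The UPS-to-generator handover described above can be sanity-checked with simple arithmetic: usable battery energy divided by load gives the bridge time. The battery and load figures below are made-up examples, not a sizing guide.

```python
# Rough UPS runtime check for the generator handover described above.
# Battery capacity, load, and efficiency are illustrative assumptions.

def ups_runtime_minutes(battery_wh: float, load_w: float,
                        efficiency: float = 0.9) -> float:
    """Approximate runtime in minutes: usable energy divided by draw."""
    return battery_wh * efficiency / load_w * 60


# Example: two 1500 Wh battery strings feeding a 2 kW rack.
runtime = ups_runtime_minutes(battery_wh=3000, load_w=2000)
print(round(runtime, 1))  # 81.0 minutes of bridge time

# The generator only needs seconds to kick in, but you want minutes of
# margin in case it needs a retry or a manual start.
assert runtime > 5
```

The point of doing this on paper is the last comment in the code: you size for minutes of margin, not for the nominal two-second cutover.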
By the way, all of the possible failure points I’ve raised above HAVE failed on me when managing HA on prem.
The reason people move to the cloud for HA is because rolling your own is like rolling your own encryption: it’s hard, error prone, expensive, and even when you have the right people on the team there’s still a good chance you’ll fuck it up. AWS, for all its faults, does make this side of the job easier.
At my last job we provided redundant paths (including entry to your building) as an add-on service. So you might not need two ISPs if you're only worried about fiber cuts. You could still be worried about things like "we think all Juniper routers in the world will die at the exact same instant", in which case you need to make sure you pick an ISP that uses Cisco equipment. And of course, it's possible that your ISP pushes a bad route and breaks the entirety of their link to the rest of the Internet.
I don't see why the petrol station needs to be a short distance away. Unless the plan is to walk to the petrol station and back (which should not be the plan[1]), anyplace within reasonable driving distance should do.
[1] long duration electrical outages will often take out everything a short distance away, and the petrol stations usually have electric pumps.
If you are going to the level of the above, you go with co-location in purpose-built centers at a wholesale level. The "layer 1" is all done to the specs you state and you don't have to worry about it.
On-prem rarely actually means physically on-prem at any scale beyond a small IT office room. It means co-locating in purpose-built datacenters.
I'm sure examples exist, but the days of large corporate datacenters are pretty much over - it's just inertia keeping the old ones going before they move to somewhere like Equinix or DRT. With the wholesalers you can basically design things to spec, and they build out a 10k sq ft, 2MW critical-load room for you a few months later.
A few organizations will find it worthwhile to continue to build at this scale (e.g. Visa, the government) but it's exceptionally small.
My building has a natural gas backup generator.
If you're outsourcing that, you'd likely have to pay a boatload just for someone to be available for help, let alone the actual tasks themselves. Like you said, if you're on-prem and something goes down, you can do something. But you've gotta have the personnel to actually do something.
That said, I think you're spot-on as long as you have the skillset already.
I hear this argument a lot, but every startup I've been involved with had a full-time DevOps engineer wrangling Terraform & YAML files - that same engineer can be assigned to manage the bare-metal infrastructure.
Backup is cheap when you're focused on what you're backing up.
In this case, the game isn't "going down less than Amazon", it's about going down uncorrelated to Amazon. Though that's getting harder!
"In more than one way" doesn't have to be local, but it may be across multiple cloud services. Still, "local" is nice in that it doesn't require the Internet. ("The Internet" doesn't tend to go down, but the portion you are on certainly can.) Of course, as workers disperse, "local" means less and less nowadays.
It's possible to go down in a mostly uncorrelated way to Amazon by just being down all the time.
Obviously this is implicit in your comment, but I'll say it anyway: your backups need to actually work when you need them. You need to test them (really test them) to make sure they're not secretly non-functional in some subtle way when Amazon is really down.
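One way to "really test" a backup, sketched below: restore it into a scratch directory and compare checksums against the live data, rather than trusting that the archive exists. The tar-based backup format and paths are hypothetical, not anyone's actual setup.

```python
# Minimal sketch of a backup restore test: extract the archive into a
# temporary directory and verify every live file has a byte-identical
# restored copy. The tar format here is an assumption for illustration.

import hashlib
import pathlib
import tarfile
import tempfile


def sha256(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_backup(archive: str, source_dir: str) -> bool:
    """Restore the archive and confirm every source file matches it."""
    src = pathlib.Path(source_dir)
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        for f in src.rglob("*"):
            if f.is_file():
                restored = pathlib.Path(scratch) / f.relative_to(src.parent)
                if not restored.is_file() or sha256(restored) != sha256(f):
                    return False
    return True
```

This catches the subtle failure modes the comment is warning about: an archive that restores but is missing files, or one whose contents silently diverged from production.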
Over the last two years, my track record has destroyed AWS. I've got a single Mac Mini with two VMs on it, plugged into a UPS with enough power to keep it running for about three hours. It's never had a second of unplanned downtime.
About 15 years ago I got sick of maintaining my own stuff. I stopped building Linux desktops and bought an Apple laptop. I moved my email, calendars, contacts, chat, photos, etc, to Google. But lately I've swung 180 degrees and have been undoing all those decisions. It's not as much of a PITA as I remember. Maybe I'm better at it now? Or maybe it will become a PITA and I'll swing right back.
EDIT: I realize you're talking in a commercial sense and I'm talking about a homelab sense. Still, take my anecdote for what it's worth. :D
I think you're right about what we over- & under-estimate, but that we also under-estimate the inflection point for when it makes sense to begin relying on major cloud services. Put another way: we over-estimate our requirements, causing us to pessimistically reach for services that have problems that we'd otherwise never have.
For extra safety, and extra work, you could even use Azure as a backup if you're not locked into AWS.
Global services such as Route 53, Cognito, the default cloud console, and CloudFront are managed out of us-east-1.
If us-east-1 is unavailable, as is commonly the case, and you depend on those systems, you are also down.
It does not matter if you're in timbuktu-1; you are dead in the water.
It is a myth that Amazon availability zones are truly independent.
Please stop blaming the victim: you can do everything right and still fail if you are not aware of this, and you are perpetuating that unawareness.
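One practical takeaway from the us-east-1 point is to audit your own dependency list for services that look multi-region but are pinned to a single region anyway. A sketch, where the inventory format is made up and the pinned-service list just restates the services named above:

```python
# Sketch: flag hidden us-east-1 dependencies in a service inventory.
# The inventory format is hypothetical; the pinned set reflects the
# global services the comment above says are managed out of us-east-1.

US_EAST_1_PINNED = {"route53", "cognito", "cloudfront", "console"}


def hidden_single_region_deps(inventory: dict[str, str]) -> list[str]:
    """Return services that depend on us-east-1, either because they
    are deployed there or because they are a pinned global service."""
    return sorted(
        name for name, region in inventory.items()
        if region == "us-east-1" or name in US_EAST_1_PINNED
    )


deps = hidden_single_region_deps({
    "app-servers": "eu-west-1",
    "route53": "global",
    "cloudfront": "global",
    "backups": "us-east-1",
})
print(deps)  # ['backups', 'cloudfront', 'route53']
```

The surprise for most teams is the "global" entries: they don't carry a region tag in anyone's diagram, yet they fail with us-east-1 all the same.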
> are not truly independent of each other
Indeed. They are even on the same planet!
> please stop blaming the victim
Excuse me?
It's now hard to say how frequently Amazon's infrastructure goes down. The incident rate seems to have increased.
...My home Internet is even scoring better than Amazon right now, in fact. Yours probably is too.
In my experience problem number 3 is the hardest to solve.