But once those are set up, how is it different? AWS is quite clear with their responsibility model that you still have to tune your DB, for example. And for the setup, just as there are Terraform modules to do everything under the sun, there are Ansible (or Chef, or Salt…) playbooks to do the same. For both, you _should_ know what all of the options are doing.
The only way I see this sentiment being true is that a dev team, with no infrastructure experience, can more easily spin up a lot of infra – likely in a sub-optimal fashion – to run their application. When it inevitably breaks, they can then throw money at the problem via vertical scaling, rather than addressing the root cause.
I've worked on plenty of teams with relatively small apps, and the difference between:
1. Cloud: "open up the cloud console and start a VM"
2. Owned hardware: "price out a server, order it, find a suitable datacenter, sign a contract, get it racked, etc."
is quite large.
#1 is 15 minutes for a single team lead.
#2 requires the team to agree on hardware specs, get management approval, finance approval, executives signing contracts. And through all this you don't have anything online yet for... weeks?
If your team or your app is large, this probably all averages out in favor of #2. But small teams often don't have the bandwidth or the budget.
Our AWS account is managed by an SRE team. It’s a 3-day turnaround to get any resources provisioned, and if you don’t get the exact spec right (forgot to specify the IOPS on the volume? Oops), it’s another 3-day turnaround. Already started work when you request an adjustment? Better hope your initial request specified backups correctly, or you’re starting again.
The overhead is absolutely enormous, and I actually don’t even have billing access to the AWS account that I’m responsible for.
Now imagine having to deal with procurement to purchase hardware for your needs. 6 months later you have a server. Oh you need a SAN for object storage? There goes another 6 months.
That's an anti-pattern (we call it "the account") in the AWS architecture.
AWS internally just uses multiple accounts, so a team can get their own account with centrally-enforced guardrails. It also greatly simplifies billing.
How many things never end up happening because of this, when all a team needed was a sliver of resources at the start?
You buy a couple of beastly things with dozens of cores. You can buy twice as much capacity as you actually use and still be well under the cost of cloud VMs. Then it's still VMs and adding one is just as fast. When the load gets above 80% someone goes through the running VMs and decides if it's time to do some house cleaning or it's time to buy another host, but no one is ever waiting on approval because you can use the reserve capacity immediately while sorting it out.
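That housekeeping check can literally be a cron'd script. A hypothetical sketch (host sizes and per-VM core counts are invented; in practice you'd pull them from `virsh dominfo` or your inventory):

```shell
#!/bin/sh
# Hypothetical capacity check for the scheme above: sum the cores
# committed to VMs and report the percentage of physical capacity used.
# capacity_pct TOTAL_CORES VM_CORES... -> integer percent
capacity_pct() {
    total=$1; shift
    committed=0
    for c in "$@"; do committed=$((committed + c)); done
    echo $((committed * 100 / total))
}

# Example: two 64-core hosts, five VMs (all numbers made up).
pct=$(capacity_pct 128 16 16 32 24 16)
echo "committed: ${pct}% of 128 cores"
if [ "$pct" -ge 80 ]; then
    echo "over 80%: time to clean house or order another host"
fi
```

Run it from cron and mail the output; nobody ever waits on approval, because the reserve capacity is already racked.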
One (the only?) indisputable benefit of cloud is the ability to scale up faster (elasticity), but most companies don’t really need that. And if you do end up needing it after all, then it’s a good problem to have, as they say.
If your load is very spiky, it might make more sense to use cloud. You pay more for the baseline, but if your spikes are big enough it can still be cheaper than provisioning your own hardware to handle the highest loads.
Of course there's also possibly a hybrid approach, you run your own hardware for base load and augment with cloud for spikes. But that's more complicated.
#1: A cloud VM comes with an obligation for someone at the company to maintain it. The cloud does not excuse anyone from doing this.
#2: Sounds like a dysfunctional system. Sure, it may be common, but a medium sized org could easily have some datacenter space and allow any team to rent a server or an instance, or to buy a server and pay some nominal price for the IT team to keep it working. This isn’t actually rocket science.
Sure, keeping a fifteen year old server working safely is a chore, but so is maintaining a fifteen-year-old VM instance!
Being handed VMs by a provider is one thing; running the hypervisor on your own equipment is another.
I worked at a supposedly properly staffed company that had raised hundreds of millions in investment, and it was the same thing. VMs running 5-year-old distros that hadn't been updated in years. 600-day uptimes, no kernel patches, ancient versions of Postgres, Python 2.7 code everywhere, etc. This wasn't 10 years ago. This was 2 years ago!
But your comparison isn't fair. The difference between running your own hardware and using the cloud (which is perhaps not even the relevant comparison but let's run with it) is the difference between:
1. Open up the cloud console, and
2. You already have the hardware so you just run "virsh" or, more likely, do nothing at all because you own the API so you have already included this in your Ansible or Salt or whatever you use for setting up a server.
Because ordering a new physical box isn't really comparable to starting a new VM, is it?
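To make the "already included in your Ansible" point concrete, a hypothetical playbook fragment (`community.libvirt.virt` is the real Ansible module for this; the VM name and template are invented):

```yaml
# Hypothetical tasks: defining and starting a guest is just another
# config step, sitting alongside the rest of your server setup.
- name: Define the app VM from a template
  community.libvirt.virt:
    command: define
    xml: "{{ lookup('template', 'app-vm.xml.j2') }}"

- name: Ensure the app VM is running
  community.libvirt.virt:
    name: app-vm
    state: running
```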
"Cloud" has changed that by providing an API to do this, thus enabling an IaC approach to building combined hardware-and-software architectures.
Open their management console, press order now, 15 mins later get your server's IP address.
That is, most hosting providers give you the option of virtual or dedicated hardware. So does Amazon (metal instances).
Like, "cloud" was always an ill-defined term, but in the case of "how do I provision full servers" I think there's no qualitative difference between Amazon and other hosting providers. Quantitative, sure.
I supported an application with a team of about three people for a regional headquarters in the DoD. We had one stack of aging hardware that was racked, on a handshake agreement with another team, in a nearby facility under that other team's control. We had to periodically request physical access for maintenance tasks and the facility routinely lost power, suffered local network outages, etc. So we decided that we needed new hardware and more of it spread across the region to avoid the shaky single-point-of-failure.
That began a three year process of: waiting for budget to be available for the hardware / license / support purchases; pitching PowerPoints to senior management to argue for that budget (and getting updated quotes every time from the vendors); working out agreements with other teams at new facilities to rack the hardware; traveling to those sites to install stuff; and working through the cybersecurity compliance stuff for each site. I left before everything was finished, so I don't know how they ultimately dealt with needing, say, someone to physically reseat a cable in Japan (an international flight away).
You can start by using a cloud only for VMs and running your services on them yourself (IaaS, perhaps with a little PaaS). Very serviceable.
If it's your own hardware, if you don't have IaC of some kind – even something as crude as a shell script – then a failure may well mean you need to manually set everything up again.
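Even the crude-shell-script version goes a long way. A minimal sketch, assuming an nginx front end (the paths and config are illustrative; a real script would go on to install packages, restore backups, and so on):

```shell
#!/bin/sh
# Crude IaC: one idempotent script that can rebuild the box from nothing.
set -eu
DEST="${DEST:-./stage}"    # set DEST=/ (as root) on the real server

install -d "$DEST/etc/nginx/sites-enabled"
cat > "$DEST/etc/nginx/sites-enabled/app.conf" <<'EOF'
server {
    listen 80;
    location / { proxy_pass http://127.0.0.1:8000; }
}
EOF
echo "wrote $DEST/etc/nginx/sites-enabled/app.conf"
# ...then: install packages, enable services, restore the latest
# database backup, load firewall rules, etc.
```

Because every step lives in the script, "the server caught fire" becomes "run the script on a new box and restore the last backup".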
- Some sort of firewall or network access control. Being able to say "allow http/s from the world (optionally minus some abuser IPs that cause problems), and allow SSH from developers (by IP, key, or both)" at a separate layer from nginx is prudent. Can be ip/tables config on servers or a separate firewall appliance.
- Some mechanism of managing storage persistence for the database, e.g. backups, RAID, data files stored on fast network-attached storage, db-level replication. Not losing all user data if you lose the DB server is table stakes.
- Something watching external logging or telemetry to let administrators know when errors (e.g. server failures, overload events, spikes in 500s returned) occur. This could be as simple as Pingdom or as involved as automated alerting based on load balancer metrics. Relying on users to report downtime events is not a good approach.
- Some sort of CDN, for applications with a frontend component. This isn't required for fundamental web hosting, but for sites with a frontend and even moderate (10s/sec) hit rates, it can become required for cost/performance; CDNs help with egress congestion (and fees, if you're paying for metered bandwidth).
- Some means of replacing infrastructure from nothing. If the server catches fire or the hosting provider nukes it, having a way to get back to where you were is important. Written procedures are fine if you can handle long downtime while replacing things, but even for a handful of application components those procedures get pretty lengthy, so you start wishing for automation.
- Some mechanism for deploying new code, replacing infrastructure, or migrating data. Again, written procedures are OK, but they become unwieldy very early on ('stop app, stop postgres, upgrade the postgres version, start postgres, then apply application migrations to ensure compatibility with the new version of postgres, then start app--oops, forgot to take a postgres backup/forgot that upgrading postgres would break the replication stream, gotta write that down for next time...').
...and that's just for a very, very basic web hosting application--one that doesn't need caches, blob stores, or the ability to quickly scale out application server or database capacity.
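The firewall item, for instance, is only a handful of lines either way. A hypothetical nftables sketch (the developer addresses are placeholders from a documentation range):

```
# Hypothetical /etc/nftables.conf: web open to the world, SSH only from
# known developer addresses (203.0.113.0/24 is a documentation range).
table inet filter {
    set developers {
        type ipv4_addr
        elements = { 203.0.113.10, 203.0.113.11 }
    }
    chain input {
        type filter hook input priority 0; policy drop;
        ct state established,related accept
        iif "lo" accept
        tcp dport { 80, 443 } accept
        tcp dport 22 ip saddr @developers accept
    }
}
```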
Each of those things can be accomplished the traditional way--and you're right that sometimes that way is easier for a given item in the list (especially if your maintainers have expertise in that item)! But in aggregate, having a cloud provider handle each of those concerns tends to be easier overall and doesn't require nearly as much in-house expertise.
I believe it can work, so maybe there are really successful implementations of this out there, I just haven't seen it myself yet!
But when you start factoring in internal processes and incompetent IT departments, suddenly that's not actually a viable option in many real-world scenarios.
Did you reply to the right comment? Do you think "politics" is something you solve with Ansible?
It's related to the first part. Re: the second, IME if you let dev teams run wild with "managing their own infra," the org as a whole eventually pays for that when the dozen bespoke stacks all hit various bottlenecks, and no one actually understands how they work, or how to troubleshoot them.
I keep being told that "reducing friction" and "increasing velocity" are good things; I vehemently disagree. It might be good for short-term profits, but it is poison for long-term success.
As always, good rules are good, and bad rules are bad.
Like most people on the internet, you are assuming only one of those sets exists. But you are just assuming a different set from everybody you are criticizing.