The unpredictability of cost is actually the prime reason we stuck to our own cloud, where we rent the iron at a known rate (technically we buy the hardware that the company hosts, but it's not really ours; we just use it till it breaks). A known rate is just better for a public sector budget than paying by mileage, at least if anyone outside of the IT department bothers to look into what they are signing off on.
The really interesting part will be where we go from here. Moving from self-hosted to rented iron that we run our virtual servers on was a fairly simple move that would be easy to reverse. The move into the cloud is even easier, but unless you're careful, it could be very costly to get out.
The last 5 years for me have been soul-crushing as someone who actually enjoys managing datacenters. We have seen time and time again that having your own DC leads to much better visibility and control over spending, as well as lower cost. Not to mention the huge advantage when negotiating with cloud vendors if you are a mid-size or larger company.
So time and time again I have had to transition out of environments you can reason about into AWS and become a glorified support engineer, but I guess that's what companies need nowadays: someone who will read the docs the other engineers don't want to and troubleshoot all the issues, because AWS is so easy.
I'm glad I got to learn how the "cloud" works, though, as I likely never would have been drawn to infra and programming otherwise in this day and age.
As a developer I've put up with over-subscribed VMware clouds and I vastly prefer the Azure/AWS option.
I hear they might need to move it all to Azure soon!
IT has always had issues with people cargo-culting solutions without understanding the details, and the cloud is no different.
Cloud boxes are insanely expensive (easily 10x the price of the equivalent in-house box, taking hosting, power, cooling, and hardware into account).
To make this work, you need a combination of variable demand and only paying for partial salaries (your cloud boxes are mutualized with other people's boxes).
If you're a reasonably big company (tens of thousands of servers), with fairly stable demand and adequate capacity planning, you won't necessarily save a huge amount of money by outsourcing your DCs. You can argue that the GCP/AWS guys are better than you at running fleets of servers and data centers, but at 10x the price, it's worth double-checking. If all I do is raw compute 100% of the time at very large scale, it's extremely likely I want to do it myself.
Obviously, there's more to the cloud than raw hardware, starting with all kinds of managed services, which can be worth it. Again, you'll have to do the maths: 10x for the boxes, then extra for the distributed DB? Does it give me a competitive advantage? Better time to market?
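The maths here is just back-of-envelope arithmetic; a toy sketch, with every number made up for illustration (plug in your own rates):

```python
# Back-of-envelope cloud-vs-on-prem comparison. All numbers are hypothetical.
ONPREM_BOX_MONTHLY = 300.0  # amortized hw + power + cooling + hosting, per box
CLOUD_MULTIPLIER = 10.0     # the "10x" claim above
N_BOXES = 1000
UTILIZATION = 1.0           # fraction of capacity you actually need (stable demand)

onprem = N_BOXES * ONPREM_BOX_MONTHLY
cloud = N_BOXES * ONPREM_BOX_MONTHLY * CLOUD_MULTIPLIER * UTILIZATION

# At full, stable utilization the cloud bill dominates; it only breaks even
# if your average utilization drops below 1 / multiplier.
break_even_utilization = 1.0 / CLOUD_MULTIPLIER

print(f"on-prem: ${onprem:,.0f}/mo, cloud: ${cloud:,.0f}/mo")
print(f"break-even utilization: {break_even_utilization:.0%}")
```

This ignores managed services, staffing, and time to market, which is exactly why the maths has to be done per case rather than assumed either way.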
In the end, there are good use cases, and bad use cases for the cloud, and I don't think it's as clear cut as what you say.
EDIT: if cloud hardware prices were not completely ridiculous (say 2x), then it might suddenly be a lot more compelling, and I would most likely agree with you (security / regulatory issues aside).
It's only unpredictable and unquantifiable if you don't look at it. Is the problem that they don't have the right tools to look at it yet?
The premise seems off to me. Of course people have a hard time predicting the cost of an autoscaling infrastructure that they haven't had for a long time.
Presumably they moved off of a fixed-size infrastructure to get to Kubernetes, where they were either paying for excess capacity on some days, or paying in the form of poor performance when demand exceeded supply.
Five percent accuracy seems like a high bar, and you would want a year or two to understand your seasonality and growth rate, etc.
Seriously, most k8s projects I have been involved with required so much effort to bootstrap and keep it going, it just blew me away! The experience for the average developer was just frustrating and infuriating: AWS ECS to the rescue!
Some will argue: vendor lock in! Really? I bet most services out there are already vendor locked in, just go with the flow and make your life easier.
I have seen companies fail because they invested so much in building infrastructure, supposedly free of vendor lock-in (or so they thought), that they lost sight of the goal and did not invest enough in building the actual product: no revenue -> party is over.
Don't make the same mistake.
If you want to run a complex service consisting of multiple microservices, auto scaling and so on, nothing beats Kubernetes. But you're right, most small businesses just need a simple web site, and for them, an Amazon Lightsail VM might suffice.
IMHO, unless you have a huge fleet of bare-metal machines, K8s is overhead you don't need.
Most everything else seems to be the same thing as if you ran with EKS, just different names for everything.
Setting it up yourself, though, no, I wouldn't do that unless I had a large enough team to maintain it.
Terraform has served me well, it does come with some pain as well, but nothing compared to k8s.
How was the stock comp for engineers at these companies?
Everything was fine at the beginning: we quickly built a CRUD app in Node.js, hosted on K8s on Google Cloud, doing something useful for our target market. We demoed with the client and everything went well.
Then they hired a tech lead from another failed startup as a CTO. He started bringing in all his friends from his previous employer as minor exec titles (all from the same country, so the company split into a French crowd and everyone else). Stock compensation for the execs was pretty good, and they even had a 50k bonus, or so I heard.
Slowly but surely they started pushing for our backend to be rewritten in Go for no reason - which I resisted - and then we decided to move to AWS + Nomad. This was 2015, and the tooling was even more minimal than what HashiCorp offers nowadays - so they basically started building a K8s on top of Nomad to replicate the capabilities we had before.
I complained that we didn't have time for this with the impending demos, and said it was a stupid decision.
I was promptly fired for that - and with me out of the way, they threw away our Node.js application and started rewriting everything in Go.
They never made another demo in time, and the investor eventually pulled the plug on further funding and acqui-hired the company for little money. The founders were pissed and called me later on to update me on how it went and to apologise. Fin.
Instead, what they really needed was to focus on their customers and build them a useful product that simply worked. They spent months building and deploying their own k8s cluster: EKS was "vendor lock-in", so that wasn't a good choice for them. But guess what: all their infrastructure was already running in AWS anyway, and their product was already vendor-locked-in: RDS, S3, etc ...
Also ... to make things even more complex, they thought they needed to go all-in on super-distributed microservices: it took literally months to get new "services" up and running in production. It was a s*t show!
One of the many stories of "let's break the monolith, embrace microservices, thus k8s" ... gone horribly wrong.
Eventually most of the engineers left ....
The trouble with all of this is that it doesn't really account for how the respondents use Kubernetes. What type of workloads are they running and how variable are those workloads? Would the organizations struggling to predict costs still struggle using another solution if their workloads are highly variable? Are they trading fixed costs for scalability in the face of those variable workloads? It's certainly possible to set upper bounds to autoscaling and to run fixed sized workloads in Kubernetes.
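On that last point, capping autoscaling spend is a one-liner in a HorizontalPodAutoscaler; a minimal sketch, where the Deployment name and all the numbers are made up:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web                # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10          # hard ceiling: worst-case spend is 10 pods, not unbounded
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

With a bound like this, the cost question becomes "what does the ceiling cost", which is answerable, rather than "what might autoscaling do", which isn't.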
Perhaps the best takeaway from the article is that there is an opportunity to develop better cost management tools or offer consulting services in this domain. I know there are a few companies out there hoping to offer services in this space already.
Suggestion for next article -> "Software a black hole of unpredictable spend"
I can't speak to the SRE side; I can imagine the complexity there. But are these challenges greater than maintaining, observing, and modifying a Rube Goldberg machine of managed AWS services?
At a previous place we set up a cluster on AWS. This was before EKS. We started out with kops initially, but later used the generated CFN YAML. It was no easy feat. There were a lot of gotchas, moving pieces, and much more, and all of those moving pieces had their own gotchas. Plus lots of competition in the area with not a lot of comparisons, since it was early days. Many things were not fully stable. A lot of issues. We got there in the end, but it wasn't easy.
On the flip side, we were able to onboard people with their services in days, not weeks (previously the company ran their own datacentres). Teams were allowed to go to AWS directly, or go to our K8s cluster. I was able to observe the lead time of 3 teams, 2 chose k8s, 1 chose AWS. Those going to K8s were able to get their prod environment running within a week. The other team took a month to do their dev environment. All three teams deployed a single stateless service.
This is obviously anecdotal, but I was really impressed with the user friendliness of kubernetes for the consumers.
Nowadays, 4 years later, it's a different story, with every major provider managing the clusters for you. At my current company we haven't taken the jump yet, so unfortunately I can't fully compare, but from the little I've played with EKS, it's as easy as simple CRUD operations.
In all seriousness, what other side is there? The SRE's role is to make sure you never encounter kubernetes. eg... you have a git repo and some branches - if you push to them, deployments magically happen. As a non-SRE, what parts of kubernetes would you actually be touching or interacting with?
I’m not surprised they don’t like it.
It's not hard, you just need the tools (kubecost, etc)
How do you attribute the partial usage of the node? Is it 2 cores billed to pod A, 1 core billed to pod B, and 1 core billed to some random team?
Or do you have 2/3 of the node billed to pod A and 1/3 of the node billed to pod B?
Now deal with this permutation across all the various variables.
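FWIW, the convention I've seen cost tools use is to bill each pod in proportion to the cores attributed to it and to dump the leftover into an explicit idle bucket, rather than onto a random team. A toy sketch of that scheme in Python, with all numbers hypothetical:

```python
# Toy pod-level cost attribution for a single 4-core node.
# All prices and core counts are hypothetical.
NODE_CORES = 4
NODE_COST_PER_HOUR = 0.20

# Cores attributed per pod (e.g. the max of request and measured usage).
pods = {"pod-a": 2.0, "pod-b": 1.0}

cost_per_core = NODE_COST_PER_HOUR / NODE_CORES
bill = {name: cores * cost_per_core for name, cores in pods.items()}

# Unclaimed capacity goes to a shared "idle" bucket instead of a team.
idle_cores = NODE_CORES - sum(pods.values())
bill["idle/shared"] = idle_cores * cost_per_core

for who, cost in bill.items():
    print(f"{who}: ${cost:.3f}/h")
```

The permutations (memory vs CPU, spot vs on-demand, per-namespace rollups) are all variations on this same proportional split, which is why the tooling matters: doing it by hand across every dimension doesn't scale.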
https://cloud.google.com/kubernetes-engine/docs/tutorials/au...
It's still really basic but I'd love to hear your feedback!
We have been hearing this a lot from our customers who use EKS. They are running single clusters as shared infrastructure so have no insight into which workloads are contributing the most costs. This is true with other shared infra like data pipelines.
We are currently working on a solution for pod-level cost insights; if anyone is interested in signing up for the beta, shoot an email to ben@vantage.sh
1. Companies had infrastructure critical to the success of Kubernetes owned by teams that opposed deploying it.
2. The primary person shepherding Kubernetes into the company's environment had not done their due diligence on which workloads were appropriate for Kubernetes and which were not, or on how applications would integrate across mixed environments when required.
3. The principal tech resources at the company were not educated about containerization, Kubernetes, and the intricacies of container networking but were on the hook internally for the implementation.
What ends up driving the "black hole of unpredictable spend" is that companies are sold (either internally or externally) on a relatively short migration timeframe, but that timeframe is contingent on the company having appropriate infrastructure, staffing, and no key persons internally blocking said migration. If any factor is out of whack, the migration timeline can quickly approach infinity.
While it is true that there are startups that could run everything they need for their first 10k customers on 5 VMs w/ Nginx & MySQL, yet decide to build grandiose Kubernetes environments they don't need, the opposite is also true: there are huge enterprises that could in reality massively benefit from Kubernetes but for "political" reasons can't get it done even after spending millions of dollars, so they are stuck mired in their "legacy" environments. Networking, in particular, is a huge barrier to entry for enterprise Kubernetes deployments, and those deployments are almost always stymied by people, not technology, because most enterprises have some Boomer network admin who doesn't actually know anything about networking beyond the Cisco gear running things.
So, what do companies do? They go to AWS or GCP and they just run up a /massive/ bill, as they very very slowly migrate (often rewrite) their legacy systems to the cloud. This is of course astronomically and unnecessarily expensive, but it's generally not the fault of the underlying technology. AWS and Google are happy to bilk major enterprises as well, and often sell them a bill of goods they can't deliver on.