As a software consultant myself, I'd probably stop the conversation right there and ask why they are building such a robust distributed system — SQS, SNS, etc — without any customers. Still want to be deployed in AWS? Toss the damn app on a single EC2 instance...
Kubernetes has its value even for small scale workloads like that, but it’s still a few steps more than, say, running a Capistrano script to push your code to a small Linux box with a database on a second one.
You’ll get really far on minimal resources these days, especially with cheaper ARM boxes that offer far more bang for your buck. Paying 1k+ a month to AWS/GCP/Azure is total insanity when you’re not even averaging a single active user a day.
It absolutely can be, sure. But solutions like Vercel, Cloudflare Workers, Supabase, etc. can be excellent and inexpensive for those use cases.
That’s just not a realistic or necessary approach for everyone.
AWS is engineered for excruciatingly detailed billing, right down to the moment you consume or release capacity. Managing that spend is exhausting.
My business runs on under $200/mo in Linode compute resources and the performance is significantly better than on similarly situated EC2 instances. We were spending that on databases alone with AWS and getting a fraction of the performance.
I make extensive use of “pure” Linode Kubernetes Engine k8s. It’s portable to any other Kubernetes cluster, and it lets me take my stack _anywhere_, even to a rack in the nearest data center willing to rent me space, if I really wanted.
If you're outsourcing operations to AWS or whomever, a couple largish instances and a couple supporting services can get you pretty much that same thing, for a bit more money and a bit less control over performance-consistency.
All that HA/scaling/clustering/cloud stuff is expensive, not just in monetary terms, but in performance terms. If you don't actually need it, a high percentage of your compute & (especially) your network traffic may be going to that, rather than actually serving the product. It also adds a hell of a lot of complexity, which comes at a significant time-cost for development, unless you want your defect rate to shoot up.
> But if more developers just learned how to make a website on linux, with a db, a webserver, and an application.
And hell, nothing's stopping you from writing 12-factor apps and deploying containers, and scripting your server set-up and config, even if you don't go straight for heavy, "scalable" architecture. Even if your server's a beige Linux box in a closet. Enough benefits that the effort's probably a wash at worst (hey, documentation you can execute is the best documentation!) even if you never need to switch architectures, and then you'll have a relatively easy time of it, if you do end up needing to.
they also had some rabbitmq-on-k8s system going that fell over during small tests because they couldn’t get k8s to actually scale it. (which then convinced them they needed k8s, and bigger nodes)
sigh
Back in the day, it would have required a whole procedure to buy that hardware, have it set up, etc. Now you can needlessly spend $10k per month with just a few clicks!
To be honest I wasn't hired to challenge their entire setup, only to make it more cost effective.
So I chose the most straightforward way I could think of that would allow us to come up with a cost effective setup that will be scalable, fault tolerant and simple to maintain later on.
It all probably started with such a single instance running Docker compose, but then over time it evolved into this setup.
The ideal setup I mentioned would have been also cost effective, scalable and resilient.
That's baffling to me, but that perspective is out there too.
I think this is one of those things that really depends on the use case. If they are performing expensive inference, I think having any queue is better than no queue. Going from a synchronous system to an asynchronous one is not easy and it's not something you would want anyone to be paged for once it starts to matter. Getting SQS/SNS up and running now could be a couple hours of work today and is practically free if your traffic is low.
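A minimal sketch of the decoupling a queue buys you, using Python's stdlib `queue` as a stand-in for SQS (the `run_inference` function here is hypothetical, standing in for the expensive call):

```python
import queue
import threading

jobs = queue.Queue()   # stands in for SQS/RabbitMQ/etc.
results = {}

def run_inference(payload):
    # Hypothetical stand-in for an expensive model call.
    return payload.upper()

def worker():
    # Drains the queue at its own pace, independent of request handlers.
    while True:
        job_id, payload = jobs.get()
        results[job_id] = run_inference(payload)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# A request handler just enqueues and returns immediately.
jobs.put((1, "hello"))
jobs.put((2, "world"))
jobs.join()  # block until both jobs are processed
print(results[1], results[2])  # prints "HELLO WORLD"
```

Swapping the in-process queue for SQS later changes the transport, not the shape of the code, which is why retrofitting the async boundary early is cheap.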
Similarly, I have a number of side projects that run extremely cheaply just using ECS and Fargate. I don't even think about Kubernetes really; it's just a PaaS to me that I'm shipping ARM binaries to. As a result I don't think very hard about autoscaling, failover, load balancing or deployment. A GitHub Action just pushes master to ECS and everything "just works".
One is a queuing service, the other one is a VM.
So instead of using SQS that has $0 cost when there are no customers, you suggest I install, configure and run RabbitMQ on an EC2, to save $0 when there are no customers?
Or save $1 when I have 100 customers? SQS is dirt cheap.
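Back-of-the-envelope, assuming the commonly quoted standard-queue pricing of a 1M-request monthly free tier and roughly $0.40 per million requests after that (these figures are assumptions; check the current AWS price list):

```python
def sqs_monthly_cost(requests, free_tier=1_000_000, per_million=0.40):
    """Rough SQS standard-queue cost in USD; pricing figures are assumed."""
    billable = max(0, requests - free_tier)
    return billable / 1_000_000 * per_million

# 100 customers sending ~1,000 messages each stays inside the free tier:
print(sqs_monthly_cost(100_000))      # 0.0
# Even 10M requests/month is pocket change:
print(sqs_monthly_cost(10_000_000))   # ~3.6
```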
The point of SQS or any other usage-based AWS _developer_ service compared to DIY is that you can be up and running in minutes at a minuscule cost.
I agree with you about over-engineering and building a distributed "microservices" architecture when you have no customers.
But I'll pick SQS any time of the day when I need queueing functionality to increase my developer velocity so I can focus on building value rather than wasting my life installing, configuring and running anything on EC2.
> when I need queueing functionality to increase my developer velocity so I can focus on building value rather than wasting my life installing, configuring and running anything on EC2.
SQS still requires configuration, which means you either need to use the (terrible) AWS console UI or spin up a whole Terraform/CloudFormation/CDK/etc stack, not to mention that merely connecting to it requires correctly setting up AWS IAM (so you don't use a key that gives access to your entire AWS account). Vim'ing the RabbitMQ config file in contrast doesn't seem so bad, and even just using a static hardcoded password means the worst an attacker can do is take down your queue instead of taking over your entire cloud infra.
I do think ddb and lambda hit a sweet spot for costs on ramping up. The rest, though, really struggle.
Elsewhere in the comments, there's a suggestion that this kind of thing isn't appropriate for "hobby projects" and early stage, but I disagree. Those are the times when you really want something you can step away from without doing a disservice to your customers (i.e. letting packages go out of date and get vulnerable), and that costs you as little as possible in a steady state, so you can focus on acquiring customers instead of fuddling around with the guts.
The ideal trade-off is a single Kubernetes cluster with as much in the cluster as makes sense for the team and stage of the project. As you say, toss the app on a single node to start, but the control plane is tremendously valuable from the onset of most projects.
A startup that outgrows a single EC2 server will be making enough money to hire more people and scale the system properly beyond the initial design. Until then, trade everything away for development velocity.
Kubernetes is not the right tool for this startup. Kubernetes is what large, old-school non-tech companies use to orchestrate resources, because it’s easier to find someone that “knows k8s” (no one knows k8s unless they’re consulting) than it is to find someone that can build properly distributed systems (in the eyes of whoever is in charge of hiring).
Disney: We'd like to launch a new streaming service.
Consultant: Great! You have no customers right now so you can run it on a singleton EC2 instance until you outgrow that scale!
Disney: ...We expect 20 million people to sign up in the first week
I'm pretty sure "follow the forecast" is exactly what motivated that post.
In other words, the infrastructure is overkill for the initial forecast of customers.
They're not working for Disney.
Your comment is really pretty ignorant of how these tools interact. Using serverless primitives is the opposite of leaving nodes running for no reason.
It's not really surprising that AWS's K8S setup isn't great, while their own implementation ties in more closely with the other services they offer. It's lock-in. AWS provides just enough K8S to tick the box on a spec sheet, but has little incentive to go beyond that.
You can do everything from the CLI with kubectl of course, but there are also a bunch of apps that will work with any K8S cluster:
https://medium.com/dictcp/kubernetes-gui-clients-in-2020-kub...
It's very nice to have a consistent interface across multiple cloud providers.
> The team didn't have much DevOps expertise in-house, so a Kubernetes setup, even using a managed service like EKS, would have been way too complex for them at this stage, not to mention the additional costs of running the control plane which they wanted to avoid.
The control plane cost makes sense, but I can't imagine learning Terraform to set up ECS is that much easier than learning Yaml to configure k8s. Unless EKS is much harder to use than GKE.
Eventually EKS was built to satisfy customers who insisted these issues were just FUD from AWS to lock customers into AWS infrastructure. However, what I have seen since is a basic progression: a customer uses k8s on-prem and is fanatical about its use. They try to use it in AWS, and it's about as successful as on-prem. Their peers squint at it and say "but wouldn't this be easier with ECS/Fargate?" The k8s folks lose their influence and a migration to ECS happens. I've seen this happen inside AWS working with customers, and in three megacorps I've worked on cloud strategies for. I've yet to encounter a counterexample, and this was sort of what Andy predicted at the time. I'm not saying there aren't counterexamples, or that this isn't a conspiracy against k8s to get your dollars locked into AWS.
On standards Andy always said that at some point cloud stuff would converge into a standards process but at the moment too little is known about patterns that work for standards to be practical. Any company launching into standards this early would get bogged down and open the door to their competitors innovating around them and setting the future standard once the time is right for it. Obviously not an unbiased viewpoint, but a view that’s fairly canonical at Amazon.
I mean..the customers are not wrong.
I do think that as organizations grow, the ability for components to be defined in smaller units without being enmeshed in a big-ass tf dependency graph is a big draw of the controller model. The flipside is this comes with accepting the operational overhead of k8s plus the attendant controllers/operators you're running and hiring/staffing accordingly. There are ways you can structure your terraform that avoids creating the tight coupling some folks don't like where you have to literally define the entire universe to change a machine image. Not to mention, there do exist tools that allow you to inspect and visualize tf state.
Right now, Terraform maximalism requires reproducible builds, which is not something most orgs can achieve.
Citation needed.
K8s has a whole bunch of footguns that people who don't want to manage infra can easily blunder into.
Terraform and ECS is not immature, and it's fairly simple to maintain, especially if they are just pushing updates without significant infra changes (i.e. bumping the container version).
> Engineering time is expensive
which is why ECS is probably better, because its good enough for running a few containers that talk to a load balancer.
They will continue to make it more appealing to lock your software into their platform than to go with their thinner facilities for OSS, doing the minimum to keep up to date with trends in open source, just enough to lure you in and create “easier” paths until you can’t afford to leave.
We have this problem with Azure - sure it’s easier to get a knucklehead to push buttons and get an app running, but after years you’ll be scrambling to reduce costs. Good luck with that when all of your terraforms use Azure Resource Manager and all of your source code uses Azure Functions. Being stuck with microsoft/amazon and a team of engineers who spent their time learning vendor-specific skills instead of the open source tech that enables it, sounds awful.
Hahahaha
> ECS is also relatively simple and not so far from their Docker-compose setup, but much more flexible and scalable. It also enables us to convert their somewhat stateful pets to identically looking stateless cattle that could be converted to Spot instances later.
Have you ever built something in ECS? I have, and it is missing HUGE SWATHS of the convenient functionality that EKS provides. It lacks the network effect of being a widely-used product, so searching for solutions is a constant struggle. It breaks and nobody knows how to help.
"Not far from their docker-compose setup..." What are you even talking about? ECS is massively more complex than docker-compose and the main similarity I see between them is that they both run docker. It's similar to docker-compose if you ignore the fact that you need permissions, load balancers, networking, etc. Which is the hard part, NOT running some containers on EC2, by the way.
It has its own bizarre and verbose container deployment spec that is less portable, less flexible, less feature-ful, and less widely used than EKS.
> ECS will also offer ECS container logs and metrics out of the box, giving us better visibility into the application and enabling us to right-size each service based on its actual resource consumption, in the end allowing us to reduce the number of instances in the ECS cluster once everything is optimized.
Something you also get with EKS. So half of the reasons you have claimed ECS was the right choice are now in the garbage.
What you DON'T get with ECS is awesome working-out-of-the-box open source software like External Secrets, External DNS, LetsEncrypt, the Amazon Ingress Controller, argo rollouts, services, ingresses, cronjobs... I could go on and on.
They are going to try to hire DevOps engineers, who will all have to ramp up on (and likely complain about) ECS, instead of having people walk in already prepared, ready to start implementing high-quality software on a system they already know.
The AWS ecosystem has much of this baked-in. (Parameter Store, Certificate Manager, etc) Vendor lock-in is of course a concern, but for many, a theoretical one.
If you can choose an option that is going to be way less work even if it's "more complex" that is often the right choice as long as you understand what that complexity is and can pierce through the covers if necessary.
ECS is a deployment tool. Kubernetes is a dev-to-ci-to-prod tool, providing same environment for standard workload specs across the full development cycle, and a single way to inject common features into the standard workloads.
- Setting up certs (managed as TF)
- Setting up ALBs (managed as TF)
- Setting up the actual service definition (often done as JSON, passed into TF)
Possibly other things I'm forgetting.
Among other things, it requires a *developer* to know about certs and ALBs and whatever else.
With EKS, this can all be automated. The devops engineer can set it up so that deploying a service automatically sets up certs, LBs etc. Why are we removing such good abstractions for a proprietary system that is *supposed* to be less management overheads, when in reality, it causes devs to do so much more, and understand so much more?
When I was at Rad AI we went with ECS. I made a terraform module that handled literally everything you're talking about, and developers were able to use that to launch to ECS without even having to think about it. Developers literally launched things in minutes after that, and they didn't have to think about any of those underlying resources.
A major benefit of k8s that is usually massively overlooked is its RBAC system, and specifically how nice a namespace-per-team or per-service model can be.
It's probably not something a lot of people think about until they need to handle compliance and controls for SOC 2 and friends, but as someone who has done many such audits, it's always been great to be able to simply show exactly who can do what on which service in which environment, in a completely declarative way.
You can try to achieve the same thing with AWS IAM, but the sheer complexity of it makes it a hard sell to auditors, who have come to associate "Terraform == god powers", and convincing them that you have locked it down enough to safely hand to app teams is... tiresome.
Why does the developer need to care about the certs and ALBs? The devops engineer you need to set up all those controllers could as well deploy those resources from Terraform.
As I showed in the diagrams from the article, this application has a single ALB and a single cert per environment, and the internal services only talk to each other through the RabbitMQ queue.
DNS, ALB and TLS certs could be easily handled from just a few lines of Terraform, and nobody needs to touch it ever again.
With EKS you would need multiple controllers and multiple annotations controlling them, and then each controller will end up setting up a single resource per environment.
The controllers make sense if you have a ton of distinct applications sharing the same clusters, but this is not the case here, and would be overkill.
Welcome to reality, where this is not the case.
I'm currently working at a company where we're using TF and ECS, and app specific infra is supposedly owned by the service developers.
In reality, what happens is devs write up some janky terraform, potentially using the modules we provide, and then when something goes wrong, they come to us cos they accidentally messed around with the state or whatever. DNS records change. ALB listener rules need to change.
Honestly, if they had said: "So instead we set up some bare-metal EC2 instances" I would be on-board.
It was definitely not about being contrarian, but about offering, first and foremost, a more cost-effective but still relatively simple, scalable and robust alternative to their current setup.
They have a single small team of less than a dozen people, all working on a single application, with a single frontend component.
Imagine instead this team managing a K8s setup with DNS, ALB and SSL controllers that each set up a single resource. I personally find that overkill.
Acme corp loves containers as much as everyone else. Containers provide great value. However, muddling around with docker/containerd/crio without some form of orchestration is just another path to a herd of fragile, neglected pet machines.
Acme corp is very different from the Big Tech world k8s came from. Acme corp doesn't have Linux kernel contributors and language developers and an IT payroll so large that the mundane devops people are lost in the noise. Acme corp must use what prevails and doesn't mystify. The "team" managing something is frequently one person, or less.
Acme corp ends up with a collection of pet VMs, all different. Lots of stuff is containerized. Some stuff isn't. Much of it is high-value: let one of those go down and an angry so-and-so will be on the horn right now, even if they haven't noticed for weeks. Most of it is low load: there will never ever be a world where these get reworked into scalable, stateless, distributed cloud apps.
How to get from a herd of pet VMs that happen to run containers (sometimes) to an orchestrated cluster of containers?
In my imagination the answer is something that looks like a mashup of Proxmox and docker-compose. It has the following features:
-- Orchestration: micro-VMs running containers scheduled across a cluster of nodes. The "micro-VM" term deserves some definition. I don't have a precise definition. I know Firecracker is too anemic and full-featured VMs are too much. The micro-VMs of cloud-hypervisor are just about right. Above all "micro" just means simple, not necessarily small: a micro-VM that needs a lot of RAM and takes longer than 0.0003 us to start is fine.
-- Live migration: low-load, high-value applications need to stay up despite cluster node maintenance and despite never becoming candidates for re-engineering into cloud-native applications. This feature is the #1 reason the VM part is necessary: live migration is a native capability of KVM et al. that has worked well since forever, whereas containers (CRIU notwithstanding) can't be live-migrated.
-- Trivially simple support of network-transparent block storage: iSCSI and other network block storage is rampant at Acme corp because it's cheap, reliable, easy and fast enough. Re-engineering everything for dynamodb or whatever isn't an option. Fortunately, because we're running a micro-VM with its own kernel that has native support for network block storage (the other #1 reason for the VM part), we get this for free.
-- Simple operation: if it imposes a bunch of concepts that one can't already find in docker-compose it's wrong. Acme corp doesn't have the depth to deal with more and can't find that depth even if it wanted to, which it doesn't. Grug Brained Devops: not stupid, just instinctually uninterested in unnecessary abstraction, opaque jargon terminology, overengineering and fads.
Anyhow, that's my sincere attempt to answer your question. Respectfully, if you think you know of a solution you're likely wrong: I've wormed into every corner of that which prevails and it doesn't exist at the moment. That's why I claim there is an opportunity. I'm happy to be proven wrong, but you'd have to go a long way.
(disclaimer: I’m part of the team)
They introduced Terraform and dropped docker compose in favour of some Amazon proprietary container scheduler?
1 - It's simpler than K8s, but not that much simpler than your avg managed K8s offering
2 - It really locks you in the AWS ecosystem
3 - It is way less used than K8s or just running things on servers, so there are way fewer help / learning resources
I really don't see how using ECS is much better than EC2 + compose for small setups and this post didn't provide many good arguments to convince me.
I'd use it on day 1 (over EC2 + compose) just to avoid managing an OS or deployment infrastructure.
the bar for being "locked in" seems to drop further every day.
At work we use ECS Fargate, Aurora MySQL and Bitbucket pipelines to host a little over 100 client web applications. It takes about an hour to configure a new AWS account and staging/production environments for a new client using CloudFormation (and a number of manual steps), and the monthly AWS cost is around $100. There are cheaper ways and probably easier ways, but we feel like we have reached a good balance between stability, ease of use, cost and features. And we are not that worried about being tied to AWS.
Sub $15/mo to run your thing until you get real demand, yeah. But it's not new; the K8S shtick is coming from investors, not tech people. And if it's coming from the tech people, throw them out of the door.
Why are you cooking for 8000 people when 6 are coming over? Why are you building a kitchen to cook for 8000 people? Why are you renting space to fit 8000 people?
You need a table and maybe 6 chairs; who knows, they might eat standing.
Not necessarily. If you need to deal with many containerized apps that are updated and deployed regularly, k8s is a really great tool.
As a rule of thumb, I'd say < 5 - no, > 20 - yes, and everything in between - up to you.
Place I worked at had a service running on K8s with, I think, 4 pods, and it got on average one hit every 2-3 seconds during office hours (and virtually none outside those.)
I think it got the HN hug of death
unfortunately this is a deal with the devil for vendor lock-in