At places with basically no platform team, no advanced cloud setup etc, I as a dev could understand everything, and deploying mostly meant getting my jar file or whatever running on some ec2 instance.
Now I need to write a dockerfile, loads of manifests, know about ingresses, use whatever custom tooling and abstractions they've built on top of the cloud provider.
And what used to be a single guy is now a huge team just adding more complexity. And I'm having a hard time understanding if it's worth it, or just a cost sink.
At that point, if every dev-team brews their own bespoke solution for each service, you're looking at a dozen or two solutions and you'll end up with a nontrivial number of people just working on deployments and automation of their own "just scp jarfile to server". And if the solution of any team fails and the right person is unavailable, you're suddenly bleeding money because no one knows how to get it back working. Yes, "It's just a java service", but I've dealt with at least 6 ways of "just restart a java service" that aren't default init.d or systemd over the years. And most of them didn't work and had an "Oh yeah I forgot, you also have to.." shortly afterwards.
And then - at least in our line of business - you come to the fun section of "Customer Presales Questions". What's the access management for the server? Leaver/Joiner process? Separation of duties and roles? Patch cycles? Failover strategies? Backup strategies, geo-redundant backups, RTO, RPO, ... Business Continuity Plans?
I'd have to clock it, but doing 1-2 of these 100+ question sheets costs more time than integrating with our platform - if you're used to the question sheets. And then one of us can answer 90%+ of these questions based on the standards of the platform.
I think that's the heart of the issue. At a certain point, people other than the application developers bear responsibility for the operation of the application.
The situation is not _fundamentally_ different: when it's 1-5 devs, everyone is responsible for the uptime. If one of those devs doesn't know how to deploy, you're still in the same boat as a larger team when something breaks.
But the situation becomes _politically_ different: management decides there needs to be an additional layer of responsibility: the platform team. They are the backstop. And so, instead of the platform team agreeing to bear responsibility for N different bespoke systems, they prescribe some common API (kubernetes, etc).
Of course they COULD prescribe "scp to a server and restart systemd unit" but then that's just not flexible enough. Some teams want different restart strategies, and it spirals out of control. At least kubernetes supports all that flexibility in a well-documented, battle-tested package.
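For a sense of scale, the whole "scp to a server and restart systemd unit" contract fits in a few lines. This is only a rough sketch: the host, the unit name and the /opt/app directory are made up for illustration, and it only builds the commands instead of running them.

```python
import shlex
from typing import List


def build_deploy_commands(jar_path: str, host: str, unit: str,
                          remote_dir: str = "/opt/app") -> List[List[str]]:
    """Return the argv lists for a copy-and-restart deploy, without running them.

    host, unit and remote_dir are hypothetical placeholders.
    """
    remote_target = f"{host}:{remote_dir}/app.jar"
    return [
        # Copy the artifact to the server.
        ["scp", jar_path, remote_target],
        # Restart the service so it picks up the new jar.
        ["ssh", host, "sudo", "systemctl", "restart", shlex.quote(unit)],
    ]


if __name__ == "__main__":
    for argv in build_deploy_commands("build/service.jar", "deploy@app-01", "my-service"):
        print(" ".join(argv))
```

Everything a team then asks for on top (a different restart policy, a health check before cutover, a second instance) is another parameter or another branch here, which is how it spirals.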
So, the parties to blame are two, with different degrees of culpability:
1. Management decides there needs to be more shared ownership (otherwise you bleed money, which is worse when you're bigger and make more money)
2. Platform team agrees to support kubernetes, because OF COURSE our developers need all the bells and whistles, and elastic beanstalk isn't good enough because what if we need Feature X 3 years from now?
With an ec2 instance, how do you, for example, update the Java version? Store the database password? Add URLs the service is served at? If it’s done manually how do you add a second instance or upgrade the os?
Though, I agree the infra setups are usually overly complicated, and using a “high-level” service such as Heroku or one of its competitors for as long as possible, and even longer, is usually better, especially for velocity.
Literally 100s of ways to do it.
All this IAC and yaml configs and K8 are exactly like DI and IOC. You get sold on "simple", you start implementing it, and every single hurdle only has one answer: Add more of it or add more of this ecosystem into your stack (the one you just wanted to dip your toes into).
Before you know it, everything is taken over and your whole stack is now complicated, run by 50 different JSON/YAML configs, and you now need tooling and templating to get it all working or to make one tiny change.
What about when the one dev deploying their jar to an ec2 instance moves on to another company, how does the next dev even understand what this jar stuff is when they just want to push their SPA to vercel.
Allowing devs to do whatever they want works at small new orgs but you need to put some kind of shape on it as they grow.
I think that's more a problem of one person vs a team, as opposed to deploying JARs on EC2s vs cloud and a bunch of custom tooling.
If you have a team of devs and they all deploy JARs to EC2, one of them leaving won't be a problem, the rest will still know how to do it. If you were to have a single platform engineer who's built a bunch of custom tooling over a bunch of Kubernetes files, and nobody knows how it works or where the files are, and then they leave, you've got the same problem as the solo EC2 dev leaving.
But so many orgs want to believe they are Google scale and you’re stuck with premature teams such as Platform Engineering. Then it just explodes and suddenly you have a director of Platform Engineering and multiple sub teams and suddenly OKRs, ADRs, RFCs, team charters and perpetual Kubernetes upgrades and nobody can question its existence any more.
And then you:
- deploy the wrong jar
- someone else overwrites your jar
- the jar and its libraries have a vulnerability (e.g. log4j)
- the jar doesn't support the Java version installed on the EC2 instance
- the EC2 instance isn't patched and gets hacked
- the EC2 instance generates lots of logs, fills up the disk and crashes
... and many more
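The first two items are the kind of thing a small guard catches. A minimal sketch, assuming your build publishes a SHA-256 for each jar (that publishing step, and the function names, are my assumptions, not something every pipeline has):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large jars don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_expected_artifact(path: Path, expected_sha256: str) -> bool:
    """True only if the file on disk matches the checksum the build published."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

Run it before the restart and refuse to continue on a mismatch. It does nothing for the other items on the list, which is sort of the point.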
There are reasons some things are put in place. It's not just a sunk cost. Is insurance just a sunk cost? There's always a risk factor to it.
Platform doesn't need to be a huge team or lots of complexity. Lots of things can be automated but you still need to cater for important essentials.
PE teams are more necessary when your development teams grow to include people who don't know infrastructure or when your compliance and security requirements need to scale past most developers' knowledge. At that point the overhead and abstractions are worth it.
> Now I need to write a dockerfile, loads of manifests, know about ingresses, use whatever custom tooling and abstractions they've built on top of the cloud provider.
Generally speaking, on every platform team I've worked on we've been able to maintain the ability of developers to continue to interact with the raw infrastructure as code. That's a careful dance done solely for the benefit of power users. Not every PE team knows this lesson though.
We have probably 50-70 teams and well over 2000 deployable products.
There are good things, for sure. But of five teams, we’re the only one of them that is focused on the ‘customer’ (application developers).
The devops/infra teams provide ways for AppDevs to build what they need, but there seems to be no good abstractions being made.
Our team is named and presented as a team that provides common libraries, templates, ‘golden paths’, etc. But then the reality is we have barely any time for that. Instead we get tasked with projects that are indeed important from a $$$$ perspective, but it doesn’t fit well into an existing capability team.
Which is fine, but it feels dishonest to the rest of the engineers that are using our products thinking that’s the main thing we do.
Doesn’t sound like there’s a whole lot of platform engineering here.
The general aim for a platform engineering team should be a UI/CLI that allows a dev to get a new service into production in minutes. Metrics, tracing, monitoring, alert routing, logging, DB, CI/CD, service/RPC stubs etc all done for you so you can get to writing code fast and not worry about the tower of complexity underlying all of this.
Adoption of products that enable one particular flavor of org structure that is shown to be useful won't help you if you don't adjust your org structure.
Without adoption of practices that take advantage of those shifts in complexity you will never receive the full benefits from them.
I highly encourage you to research the reasons behind these shifts in complexity, particularly the ways they are intended to increase independence of teams to increase organizational scaling ability.
Empathy is another area to perhaps work on. Because I was that 'single guy' for years and guess what, I missed every family graduation, wedding and funeral for a decade; had months where I had to wake up and restart your jboss instances every 45min 24 hours a day and then still had to be in the office at 9am for meetings where you would punt a ticket to next sprint to fix it.
Platform engineering done right is like an internal SaaS provider, and you should have embeds to help you with interfacing with them. Abstraction to allow for vendor mitigation using tools like Terraform is a good practice but not super custom.
But you can be bitter and complain, working for a company who chooses solutions on the golf course; or you can figure out what your org needs, find sponsors and allies and make positive changes.
If you feel the platform group is adding hurdles, you aren't working at a place that is doing platform engineering or you aren't understanding why some requirements need to be implemented.
That said as all cloud providers are SoA based, large amounts of custom tooling is a red flag that your org is not SoA based and you are going to have a bad time anyway.
As for manifests and egress, if you didn't care about them before, you probably were releasing insecure, unreliable balls of mud. So yes there will be an adjustment to becoming more of a software engineer. But that is just the reality of working in larger systems on more professional teams.
If you dislike a SoA model, ITIL and ITSM would have been true hell.
The company wide meetings to add basic services were way more subject to bike shedding and blocking in those days.
Anyways being proactive and helping a company align with modern practices doesn't happen by itself. If you are passionate about this, run with it.
For some services, copying a jar file is a completely valid pattern in SoA FWIW, and adding complexity without value is an anti-pattern.
Just like with programming, it is easy to forget you aren't the customer and to implement features that your customers don't want or need. Assuming that you don't have a reputation for being a Karen, try communicating with the platform team and let them know what your pain points are. If they won't let you in the room make your case to someone who will sponsor you to get in the room.
But first realize that having a single person be a single point of failure, then expecting them to sacrifice their entire life to be on call for an entire organization simply isn't realistic.
I wish I hadn't lost two long term relationships and decades of family time doing it. I didn't do me or the companies I was at any favors.
Give me a blobstore and some VMs and I'll be happy, and if someone raises the slightest fuss I will point out that it's still years better than helm.
Isn't this just a file system?
helm has a tendency to sprawl really fast. the ability to template any part of a manifest can lead to “if there is even a chance the default will ever be changed we should make it a template” (at my company at least). when your entire chart is “if values.thing X, else Y”, it makes it difficult to be confident about what exactly will be deployed. especially if you only have a chart and values can be inserted in many places (helmfile). without helm template it’s sometimes impossible to reason with all the meta-templates.
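to put a number on the sprawl: every independent “if values.thing X, else Y” doubles how many different manifests the chart can render. a toy sketch (this is not helm, just a stand-in renderer with three made-up toggles):

```python
from itertools import product


def render(values: dict) -> str:
    """Toy stand-in for a chart: each toggle switches one piece of the manifest."""
    lines = ["kind: Deployment"]
    lines.append(f"replicas: {3 if values['ha'] else 1}")
    if values["metrics"]:
        lines.append("annotations: {scrape: 'true'}")
    lines.append(f"strategy: {'Recreate' if values['recreate'] else 'RollingUpdate'}")
    return "\n".join(lines)


def distinct_manifests(toggles: list) -> int:
    """Count how many different manifests the toggles can produce."""
    rendered = set()
    for combo in product([False, True], repeat=len(toggles)):
        rendered.add(render(dict(zip(toggles, combo))))
    return len(rendered)
```

three toggles already give 8 possible outputs, and the count keeps doubling with each independent toggle you add. nobody reviews all of those.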
speaking of debugging, helm template sometimes has very awful debug output and traces of where your error is. in more simple charts it’s not so bad. with library charts you can spend hours debugging, say, your deployment.yaml because it’s saying there’s an error on line 4 of the template, but line 4 is just some label or benign key, and oh finally i figured out that the deployment expects the configmap to be rendered, and the typo is in line 4 of the CM, not the deployment
helm lint doesn’t catch duplicate keys? i found that one weird
tools like argocd don’t like helm too much. yeah you can make it work, but flat yaml is so much simpler. and if you use helm + argo, it turns the helm into flat yaml anyway
i have heavy dislike of the syntax of ‘and values.thing1 values.thing2’ in a templated line. (same with ‘or’, though this might just be a personal hangup)
finally, to be positive, the one thing i really like about helm actually is the checksum function that changes a label on the deploy/sts/ds when you change config, so pods automatically roll over.
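for anyone who hasn't seen the trick: you hash the config and stamp the digest onto the pod template (helm charts usually do it as an annotation via sha256sum), so any config change changes the pod spec and triggers a rollout. a rough sketch of the idea outside helm, where the checksum/config key is just the conventional name and the function names are mine:

```python
import hashlib
import json


def config_checksum(config: dict) -> str:
    """Stable hash of the config: same content always gives the same digest."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def pod_template_metadata(config: dict) -> dict:
    """Put the digest on the pod template so a config change alters the pod spec."""
    return {"annotations": {"checksum/config": config_checksum(config)}}
```

sorting the keys before hashing matters, otherwise reordering the config without changing it would roll every pod.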
Ops folks can barely reverse a string. We should not exist as a field. Every company regardless of scale should just use 4 manually provisioned EC2 instances. If that doesn’t work your code is shit and you aren’t using Elixir / Phoenix! Anyone even attempting to use Kubernetes should be branded with the Hetzner logo and forced to work at McDonalds.
Fear not HN. McDonalds here I come.
Just last week there was a thread about a company with 250 petabytes in S3, and every comment from a developer was about how it should be moved in-house. People were linking to one piece of S3-compatible software whose only example of a live deployment was single-digit gigabytes.
Your experience might be different of course, but that doesn’t invalidate the negative experiences of others.
You should find yourself more capable in such an organization, but that is a tall order in today's world. Almost every problem you will face has been solved and help is easy to find even on your own. So being in such an organization should make you more capable of solving problems than your base ability. Someone must have already solved that problem for you and the solution must be in easy reach. That's still a tall order, most solutions you find described on the internet are also in easy reach nowadays.
Your platform can also make people less capable. It is easier to be a force multiplier with a coefficient less than one. Make it more difficult to solve problems. Require them to communicate with many teams in order to accomplish anything. Make the solution difficult to integrate and operate and highly non-standard. If the solution is non-standard and the documentation is anything short of impeccable, it will basically be unlearnable.
The book Accelerate and the research backing it indicate that the most productive teams are able to redesign their entire system without talking to anyone and to complete their work without communication and coordination with other teams. Effective platforms advance that. They make it easy for people to redesign their systems to implement good designs. Bad platforms reduce that capability beyond what a person would ordinarily have and user complaints are brushed off with a tautological, "You can't make everyone happy." Bad platforms narrow possibilities and then require product teams to beg platform teams to allow ordinary designs that were cut off.
I'm on a platform team and I struggle sometimes trying to figure out if we are helping people or hurting them hence this philosophical rant.
In my experience it's mostly automation and observability. Delegating as much stuff as possible to tools as opposed to humans, empowering the latter to fix things.
It wasn't obvious to me either, so I found the following definitions. I still think that the article doesn't say 'how PE works' as opposed to providing a bunch of tips.
"The engineering platform is created and maintained by a dedicated product team, designed to support the needs of software developers and others by providing common, reusable tools and capabilities, and interfacing to complex infrastructure." [0]
"Platform engineering is the process of designing and implementing toolchains that improve the software delivery experience. Platform engineers set up automated infrastructure and self-service controls that allow developers to work more efficiently." [1]
"Platform engineering is the “discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. Platform engineers provide an integrated product most often referred to as an ‘Internal Developer Platform’ covering the operational necessities of the entire lifecycle of an application." [2]
[0] https://www.gartner.com/en/articles/what-is-platform-enginee...
[1] https://www.howtogeek.com/devops/platform-engineering-vs-dev...
[2] https://humanitec.com/blog/sre-vs-devops-vs-platform-enginee...
The ideas presented here are a good summary of the day-to-day issues I experience as a "DevOps engineer." I wish I had the pull to instigate a shift in mindset towards a more platform-oriented mode of work, but unfortunately for me, that is not the case. Maybe it's wishful thinking, but forwarding articles and presentations like this to senior directors can be a way to influence the direction of their thinking to align more with what you believe to be in the best interest of the organization.
Every time I have been on a team where that was not the case, the result was bloated complexity and more failures in production. Usually because the platform/deployment/operations team was composed of not very skilled developers trying to justify their salary to management.
If you have worked with a platform/deployment/operations team that was competent and reduced complexity then great for you. I have yet to experience such a phenomenon. Please let me know in comments why you think your experience was different.
so… move fast and don’t break things?
must be nice to never have tradeoffs