How platform engineering works

(chadxz.dev)

112 points · chadxz · 3y ago · 49 comments


38 comments · 12 top-level

matsemann · 3y ago · 15 in thread

This may be overly negative to a whole field, but I sometimes feel the platform teams add more hurdles than "stability and velocity".

At places with basically no platform team, no advanced cloud setup etc, I as a dev could understand everything, and deploying mostly meant getting my jar file or whatever running on some ec2 instance.

Now I need to write a dockerfile, loads of manifests, know about ingresses, use whatever custom tooling and abstractions they've built on top of the cloud provider.

And what used to be a single guy is now a huge team just adding more complexity. And I'm having a hard time understanding if it's worth it, or just a cost sink.

tetha · 3y ago

Coming at it from the other side, we have 20 - 25 Java teams deploying some 50 - 70 different java services in a number of software environments. For the removal of a number of systems on the older infra, we're expecting another 10 - 15 services being onboarded to the platform.

At that point, if every dev-team brews their own bespoke solution for each service, you're looking at a dozen or two solutions and you'll end up with a nontrivial number of people just working on deployments and automation of their own "just scp jarfile to server". And if the solution of any team fails and the right person is unavailable, you're suddenly bleeding money because no one knows how to get it back working. Yes, "It's just a java service", but I've dealt with at least 6 ways of "just restart a java service" that aren't default init.d or systemd over the years. And most of them didn't work and had an "Oh yeah I forgot, you also have to.." shortly afterwards.

And then - at least in our line of business - you come to the fun section of "Customer Presales Questions". What's the access management for the server? Leaver/Joiner process? Separation of duties and roles? Patch cycles? Failover strategies? Backup strategies, geo-redundant backups, RTO, RPO, ... Business Continuity Plans?

I'd have to clock it, but doing 1-2 of these 100+ question sheets costs more time than integrating with our platform - if you're used to the question sheets. And then one of us can answer 90%+ of these questions based on the standards of the platform.

ttymck · 3y ago

> and automation of their own "just scp jarfile to server". And if the solution of any team fails and the right person is unavailable, you're suddenly bleeding money because no one knows how to get it back working.

I think that's the heart of the issue. At a certain point, people other than the application developers bear responsibility for the operation of the application.

The situation is not _fundamentally_ different: when it's 1-5 devs, everyone is responsible for the uptime. If one of those devs doesn't know how to deploy, you're still in the same boat as a larger team when something breaks.

But the situation becomes _politically_ different: management decides there needs to be an additional layer of responsibility: the platform team. They are the backstop. And so, instead of the platform team agreeing to bear responsibility for N different bespoke systems, they prescribe some common API (kubernetes, etc).

Of course they COULD prescribe "scp to a server and restart systemd unit" but then that's just not flexible enough. Some teams want different restart strategies, and it spirals out of control. At least kubernetes supports all that flexibility in a well-documented, battle-tested package.

So, the parties to blame are two, with different degrees of culpability:

1. Management decides there needs to be more shared ownership (otherwise you bleed money, which is worse when you're bigger and make more money)

2. Platform team agrees to support kubernetes, because OF COURSE our developers need all the bells and whistles, and elastic beanstalk isn't good enough because what if we need Feature X 3 years from now?


tacker2000 · 3y ago

I think it depends on the size of the org. If there are only 1-5 devs, then yes, they would be doing the devops. But in OP's case, he is managing the infra for 70 engineers, so there needs to be some formality in place; otherwise everything will spin out of control. If every engineer rolls his own server, that would quickly lead to chaos.

n_e · 3y ago

> At places with basically no platform team, no advanced cloud setup etc, I as a dev could understand everything, and deploying mostly meant getting my jar file or whatever running on some ec2 instance.

With an ec2 instance, how do you, for example, update the Java version? Store the database password? Add URLs the service is served at? If it’s done manually how do you add a second instance or upgrade the os?

Though, I agree the infra setups are usually overly complicated, and using a “high-level” service such as Heroku or one of its competitors for as long as possible, and even longer, is usually better, especially for velocity.

zo1 · 3y ago

You stop your service, do apt-get update java, and then start it again? New URLs, update your nginx config file and restart nginx. Second instance? Dunno, provision a VM, ssh into it, FTP the jar over and stick a load-balancer in front of the two. When you get to 3 instances, we can maybe talk about a shell script to automate it. Heck, before we do that, we can just flash an image of the VM and ask EC2 to start another one up.

Literally 100's of ways to do it.

All this IAC and yaml configs and K8 are exactly like DI and IOC. You get sold on "simple", you start implementing it, and every single hurdle only has one answer: Add more of it or add more of this ecosystem into your stack (the one you just wanted to dip your toes into).

Before you know it, everything is taken over and your whole stack is now complicated, run by 50 different json yaml configs, and you now need tooling and templating to get it all working or to make one tiny change.
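For illustration, the manual routine described above (copy the jar to each instance, then restart the service) can be sketched as a small script. This is a minimal sketch, not anyone's actual tooling: the host names, the `/opt/app` directory, the `myapp` unit name, and the use of `scp` plus `systemctl` over SSH are all assumptions, and the function only builds the command lines rather than running them.

```python
import shlex


def plan_deploy(jar_path, hosts, service, remote_dir="/opt/app"):
    """Return the shell commands a manual deploy would run, keyed by host.

    Nothing is executed here; the function only builds the command lines.
    All names passed in are hypothetical placeholders.
    """
    plan = {}
    for host in hosts:
        target = f"{host}:{remote_dir}/"
        plan[host] = [
            # Copy the jar to the instance.
            shlex.join(["scp", jar_path, target]),
            # Restart the service so it picks up the new jar.
            shlex.join(["ssh", host, "sudo", "systemctl", "restart", service]),
        ]
    return plan


if __name__ == "__main__":
    plan = plan_deploy("build/app.jar", ["app-1", "app-2"], "myapp")
    for host, commands in plan.items():
        print(host)
        for command in commands:
            print("  " + command)
```

Keeping the plan as data makes it easy to review before anything touches a server, which is roughly the point at which "a shell script to automate it" starts to look like a small platform.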


reillyse · 3y ago

Sounds a bit like those places aren’t at the scale where a platform team makes sense.

What about when the one dev deploying their jar to an ec2 instance moves on to another company? How does the next dev even understand what this jar stuff is when they just want to push their SPA to vercel?

Allowing devs to do whatever they want works at small new orgs but you need to put some kind of shape on it as they grow.

adrianmsmith · 3y ago

> What about when the one dev deploying their jar to an ec2 instance moves on to another company, how does the next dev even understand what this jar stuff is

I think that's more a problem of one person vs a team, as opposed to deploying JARs on EC2s vs cloud and a bunch of custom tooling.

If you have a team of devs and they all deploy JARs to EC2, one of them leaving won't be a problem, the rest will still know how to do it. If you were to have a single platform engineer who's built a bunch of custom tooling over a bunch of Kubernetes files, and nobody knows how it works or where the files are, and then they leave, you've got the same problem as the solo EC2 dev leaving.


switch007 · 3y ago

> Sounds a bit like those places aren’t at the scale where a platform team makes sense.

But so many orgs want to believe they are Google scale and you’re stuck with premature teams such as Platform Engineering. Then it just explodes and suddenly you have a director of platform Engineering and multiple sub teams and suddenly OKRs, ADRs, RFCs, team charters and perpetual Kubernetes upgrades and nobody can question its existence any more


re-thc · 3y ago

> and deploying mostly meant getting my jar file or whatever running on some ec2 instance

And then you:

- deploy the wrong jar

- someone else overwrites your jar

- the jar or one of its libraries has a vulnerability (e.g. log4j)

- the jar doesn't support the Java version installed on the EC2 instance

- the EC2 instance isn't patched and gets hacked

- the EC2 instance generates lots of logs, fills up the disk and crashes

... and many more
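A few of the failure modes above (deploying the wrong jar, or having it overwritten) can be caught with a checksum comparison before the service is restarted. A minimal sketch, assuming the build step records a SHA-256 digest for the artifact; the file name and function names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_expected_artifact(path, expected_sha256):
    """True only if the file's digest matches the one recorded at build time."""
    return sha256_of(path) == expected_sha256.strip().lower()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        jar = Path(workdir) / "app.jar"             # hypothetical artifact name
        jar.write_bytes(b"pretend jar contents")
        recorded = sha256_of(jar)                   # would come from the build step
        print(is_expected_artifact(jar, recorded))  # True
        jar.write_bytes(b"someone else's jar")      # simulate an overwrite
        print(is_expected_artifact(jar, recorded))  # False
```

This covers only two items on the list; patching, Java version mismatches, and disk usage need their own checks.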

There are reasons some things are put in place. It's not just a sunk cost. Is insurance just a sunk cost? There's always a risk factor to it.

Platform doesn't need to be a huge team or lots of complexity. Lots of things can be automated but you still need to cater for important essentials.

kodah · 3y ago

If your whole company knows infrastructure well enough to build their own stuff and integrate it to existing solutions that are more or less islands of their own then you don't need a platform engineering team.

PE teams are more necessary when your development teams grow to include people who don't know infrastructure or when your compliance and security requirements need to scale past most developers' knowledge. At that point the overhead and abstractions are worth it.

> Now I need to write a dockerfile, loads of manifests, know about ingresses, use whatever custom tooling and abstractions they've built on top of the cloud provider.

Generally speaking, on every platform team I've worked on we've been able to maintain the ability of developers to continue to interact with the raw infrastructure as code. That's a careful dance done solely for the benefit of power users. Not every PE team knows this lesson though.

jcpst · 3y ago

It’s hard. I’ve been on one of these kind of teams for a few years now.

We have probably 50-70 teams and well over 2000 deployable products.

There are good things, for sure. But of five teams, we’re the only one of them that is focused on the ‘customer’ (application developers).

The devops/infra teams provide ways for AppDevs to build what they need, but there seem to be no good abstractions being made.

Our team is named and presented as a team that provides common libraries, templates, ‘golden paths’, etc. But then the reality is we have barely any time for that. Instead we get tasked with projects that are indeed important from a $$$$ perspective, but that don’t fit well into an existing capability team.

Which is fine, but it feels dishonest to the rest of the engineers that are using our products thinking that’s the main thing we do.

yard2010 · 3y ago

It's fun for hacking but it won't scale gracefully. It's like saying all you need is create-react-app, usually a few months into the project you understand that it won't scale well and then the shit show begins

matsemann · 3y ago

But our 6x ec2.small instances could serve the whole country buying public transport tickets every day. The k8s setup at a different place serves like 1/100th the amount of daily purchases, but has over 150 python pods to handle different stuff. Yeah, python is slow, but the complexity of the infrastructure is just insane. Yes, it's infinitely scalable, but we would never need that.


darkr · 3y ago

> Now I need to write a dockerfile, loads of manifests, know about ingresses, use whatever custom tooling and abstractions they've built on top of the cloud provider.

Doesn’t sound like there’s a whole lot of platform engineering here.

The general aim for a platform engineering team should be a UI/CLI that allows a dev to get a new service into production in minutes. Metrics, tracing, monitoring, alert routing, logging, DB, CI/CD, service/RPC stubs etc all done for you so you can get to writing code fast and not worry about the tower of complexity underlying all of this.

nyrikki · 3y ago

If "stability and velocity" are anything but metrics with empirical data as to their validity compared to historical metrics, perhaps you are misinformed about the reasoning behind them.

Adoption of products that enable one particular flavor of org structure that is shown to be useful won't help you if you don't adjust your org structure.

Without adoption of practices that take advantage of those shifts in complexity you will never receive the full benefits from them.

I highly encourage you to research the reasons behind these shifts in complexity, particularly the ways they are intended to increase independence of teams to increase organizational scaling ability.

Empathy is another area to perhaps work on. Because I was that 'single guy' for years and guess what, I missed every family graduation, wedding and funeral for a decade; had months where I had to wake up and restart your jboss instances every 45min 24 hours a day and then still had to be in the office at 9am for meetings where you would punt a ticket to next sprint to fix it.

Platform engineering done right is like an internal SaaS provider, and you should have embeds to help you with interfacing with them. Abstraction to allow for vendor mitigation using tools like Terraform is a good practice but not super custom.

But you can be bitter and complain, working for a company who chooses solutions on the golf course; or you can figure out what your org needs, find sponsors and allies and make positive changes.

If you feel the platform group is adding hurdles, you aren't working at a place that is doing platform engineering or you aren't understanding why some requirements need to be implemented.

That said as all cloud providers are SoA based, large amounts of custom tooling is a red flag that your org is not SoA based and you are going to have a bad time anyway.

As for manifests and egress, if you didn't care about them before, you probably were releasing insecure, unreliable balls of mud. So yes there will be an adjustment to becoming more of a software engineer. But that is just the reality of working in larger systems on more professional teams.

If you dislike a SoA model, ITIL and ITSM would have been true hell.

The company wide meetings to add basic services were way more subject to bike shedding and blocking in those days.

Anyways, being proactive and helping a company align with modern practices doesn't happen by itself. If you are passionate about this, run with it.

For some services, copying a jar file is a completely valid pattern in SoA FWIW, and adding complexity without value is an anti-pattern.

Just like with programming, it is easy to forget you aren't the customer and to implement features that your customers don't want or need. Assuming that you don't have a reputation for being a Karen, try communicating with the platform team and let them know what your pain points are. If they won't let you in the room, make your case to someone who will sponsor you to get in the room.

But first realize that having a single person be a single point of failure, then expecting them to sacrifice their entire life to be on call for an entire organization simply isn't realistic.

I wish I hadn't lost two long term relationships and decades of family time doing it. I didn't do myself or the companies I was at any favors.

anotherhue · 3y ago · 5 in thread

You'd be amazed how little platform engineering is needed if you can violently constrain technology choices.

Give me a blobstore and some VMs and I'll be happy, and if someone raises the slightest fuss I will point out that it's still years better than helm.

CyberDildonics · 3y ago

> Give me a blobstore

Isn't this just a file system?

lacksconfidence · 3y ago

Typically networked. Perhaps file system like, but not POSIX. Something much simpler and with fewer edges.

hhh · 3y ago

What’s wrong with helm?

pirates · 3y ago

i think helm is “fine”, but the more i use it and get deep into some more advanced uses the more i dislike it.

helm has a tendency to sprawl really fast. the ability to template any part of a manifest can lead to “if there is even a chance the default will ever be changed we should make it a template” (at my company at least). when your entire chart is “if values.thing X, else Y”, it makes it difficult to be confident about what exactly will be deployed. especially if you only have a chart and values can be inserted in many places (helmfile). without helm template it’s sometimes impossible to reason with all the meta-templates.
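To illustrate why heavily templated charts are hard to reason about: every independent two-way toggle doubles the number of manifests a chart can render. The sketch below is a toy stand-in, not Helm's template engine, and the toggle names (`ha`, `public`, `debug`) are made up for the example.

```python
from itertools import product


def render(values):
    """Toy stand-in for a chart template: each toggle picks one of two fragments."""
    lines = [
        "replicas: 3" if values["ha"] else "replicas: 1",
        "type: LoadBalancer" if values["public"] else "type: ClusterIP",
        "logLevel: debug" if values["debug"] else "logLevel: info",
    ]
    return "\n".join(lines)


def all_renderings(toggle_names):
    """Render every on/off combination and collect the distinct outputs."""
    outputs = set()
    for combo in product([False, True], repeat=len(toggle_names)):
        outputs.add(render(dict(zip(toggle_names, combo))))
    return outputs


if __name__ == "__main__":
    variants = all_renderings(["ha", "public", "debug"])
    print(len(variants))  # 8: three two-way toggles give 2**3 distinct outputs
```

With ten such toggles the count is already 1024, which is why rendering the chart with the exact values in use is often the only practical way to see what will be deployed.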

speaking of debugging, helm template sometimes has very awful debug output and traces of where your error is. in more simple charts it’s not so bad. with library charts you can spend hours debugging, say, your deployment.yaml because it’s saying there’s an error on line 4 of the template, but line 4 is just some label or benign key, and oh finally i figured out that the deployment expects the configmap to be rendered, and the typo is in line 4 of the CM, not the deployment

helm lint doesn’t catch duplicate keys? i found that one weird

tools like argocd don’t like helm too much. yeah you can make it work, but flat yaml is so much simpler. and if you use helm + argo, it turns the helm into flat yaml anyway

i have heavy dislike of the syntax of ‘and values.thing1 values.thing2’ in a templated line. (same with ‘or’, though this might just be a personal hangup)

finally, to be positive, the one thing i really like about helm actually is the checksum function that changes a label on the deploy/sts/ds when you change config, so pods automatically roll over.
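The mechanism behind that rollover can be sketched in a few lines: hash the rendered config and attach the digest to the pod template, so any config change also changes the pod template and triggers a rollout. Helm's documentation shows this with an annotation computed by `sha256sum`; the Python below only mimics the idea, and the `checksum/config` key and function names are used here as illustrative assumptions.

```python
import hashlib


def config_checksum(config_text):
    """Hex SHA-256 digest of the rendered config text."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def pod_template_annotations(config_text):
    """Annotations to place on the pod template; the value changes with the config."""
    return {"checksum/config": config_checksum(config_text)}


if __name__ == "__main__":
    before = pod_template_annotations("log_level: info\n")
    after = pod_template_annotations("log_level: debug\n")
    # Different config text gives a different annotation value, so the
    # pod template differs and the workload controller rolls the pods.
    print(before != after)  # True
```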

nforgerit · 3y ago

IMO lots of accidental complexity with no real value (in most use cases). I found myself frequently disagreeing with pre-defined charts, and creating my own chart just for one use case is better done with kustomize. It might make sense though for a bigger org with at least several dozens of engineers to ensure certain internal standards (in which platform engineering might make sense as well).

techperson20007 · 3y ago · 3 in thread

Ahh yes. The bi-monthly shit on ops comment threads.

Ops folks can barely reverse a string. We should not exist as a field. Every company regardless of scale should just use 4 manually provisioned EC2 instances. If that doesn’t work your code is shit and you aren’t using Elixir / Phoenix! Anyone even attempting to use Kubernetes should be branded with the Hetzner logo and forced to work at McDonalds.

Fear not HN. McDonalds here I come.

matsemann · 3y ago

That's a lot of straw people.. There is a lot of valid criticism, and dismissing all of that is arrogant and counterproductive.

slyall · 3y ago

The largest thread here is people saying "deploying mostly meant getting my jar file or whatever running on some ec2 instance" and talking about manually setting up ec2 instances and each team somehow copying them into production.

Just last week there was a thread about a company with 250 Petabytes in s3 and every comment from a developer was how it should be moved in-house. People were linking to one piece of s3-compatible software whose only example of a live deploy was single-digit gigabytes.

deterministic · 2y ago

Mostly justified in my experience. Every time an “ops” team gets involved, complexity and cost go up, with zero or negative benefits.

Your experience might be different of course, but that doesn’t invalidate the negative experiences of others.

BlackFly · 3y ago · 2 in thread

What does it mean to be greater than the sum of your parts? It is more than just alignment, everybody moving in the same direction: that would just be a well directed sum. Once upon a time, management talked about alignment because just being the sum of your parts would be great if it was all in the same direction. Now, for everybody obsessed with being a force multiplier, it needs to be more than just that.

You should find yourself more capable in such an organization, but that is a tall order in today's world. Almost every problem you will face has been solved and help is easy to find even on your own. So being in such an organization should make you more capable of solving problems than your base ability. Someone must have already solved that problem for you and the solution must be in easy reach. That's still a tall order, most solutions you find described on the internet are also in easy reach nowadays.

Your platform can also make people less capable. It is easier to be a force multiplier with a coefficient less than one. Make it more difficult to solve problems. Require them to communicate with many teams in order to accomplish anything. Make the solution difficult to integrate and operate and highly non-standard. If the solution is non-standard and the documentation is anything short of impeccable, it will basically be unlearnable.

The book Accelerate and the research backing it indicate that the most productive teams are able to redesign their entire system without talking to anyone and to complete their work without communication and coordination with other teams. Effective platforms advance that. They make it easy for people to redesign their systems to implement good designs. Bad platforms reduce that capability beyond what a person would ordinarily have and user complaints are brushed off with a tautological, "You can't make everyone happy." Bad platforms narrow possibilities and then require product teams to beg platform teams to allow ordinary designs that were cut off.

I'm on a platform team and I struggle sometimes trying to figure out if we are helping people or hurting them, hence this philosophical rant.

mkl95 · 3y ago

> What does it mean to be greater than the sum of your parts?

In my experience it's mostly automation and observability. Delegating as much stuff as possible on tools as opposed to humans, empowering the latter to fix things.

BlackFly · 3y ago

Yes, I agree. Empowering humans to fix problems. It seems to me attempting to prevent people from making problems just creates more problems than it solves. Computers do things quickly with terrible (non-existent) judgement.

layer8 · 3y ago · 1 in thread

After reading the article I still don’t know what “platform engineering” is meant to refer to.

KineticLensman · 3y ago

> After reading the article I still don’t know what “platform engineering” is meant to refer to.

It wasn't obvious to me either, so I found the following definitions. I still think that the article doesn't say 'how PE works' as opposed to providing a bunch of tips.

"The engineering platform is created and maintained by a dedicated product team, designed to support the needs of software developers and others by providing common, reusable tools and capabilities, and interfacing to complex infrastructure." [0]

"Platform engineering is the process of designing and implementing toolchains that improve the software delivery experience. Platform engineers set up automated infrastructure and self-service controls that allow developers to work more efficiently." [1]

"Platform engineering is the “discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. Platform engineers provide an integrated product most often referred to as an ‘Internal Developer Platform’ covering the operational necessities of the entire lifecycle of an application." [2]

[0] https://www.gartner.com/en/articles/what-is-platform-enginee...

[1] https://www.howtogeek.com/devops/platform-engineering-vs-dev...

[2] https://humanitec.com/blog/sre-vs-devops-vs-platform-enginee...

oofnik · 3y ago

Here's the thing about buzzwords - they are a useful abstraction for grouping together a set of concepts for non-tech people to make decisions with tech impact.

The ideas presented here are a good summary of the day-to-day issues I experience as a "DevOps engineer." I wish I had the pull to instigate a shift in mindset towards a more platform-oriented mode of work, but unfortunately for me, that is not the case. Maybe it's wishful thinking, but forwarding articles and presentations like this to senior directors can be a way to influence the direction of their thinking to align more with what you believe to be in the best interest of the organization.


ofrzeta · 3y ago

In my opinion a lot of Devops comes down to improving "developer experience" (DX - another buzzword, but still). It starts with project setup and choosing tools that enable this such as JS build tools that are fast. Then you have easy (and again: fast) means of running tests, continuous integration, deployment for staging (or sharing with less technical team members). This has to be as frictionless as possible so developers can actually enjoy their work. A "by-product" is that you can also more easily deploy for production.

leetrout · 3y ago

I really like that outcomes over outputs is called out. I think this is important whenever programming / development meets The Real World(tm) _ESPECIALLY_ with internal tooling work.

deterministic · 2y ago

The whole point of “DevOps” was to not have a separate platform/deployment/operations team. Each development team should handle everything needed to build, test, deploy, and monitor production code.

Every time I have been on a team where that was not the case, the result was bloated complexity and more failures in production. Usually because the platform/deployment/operations team was composed of not very skilled developers trying to justify their salary to management.

If you have worked with a platform/deployment/operations team that was competent and reduced complexity then great for you. I have yet to experience such a phenomenon. Please let me know in comments why you think your experience was different.

swyx · 3y ago

> Platform Engineering is the application of a Product Mindset to supporting your engineering organization's software delivery velocity and system stability.

so… move fast and don’t break things?

must be nice to never have tradeoffs

CSMastermind · 3y ago

I really dislike the co-opting of the phrase "Platform Engineering" as a replacement for DevOps.

imwillofficial · 3y ago

I despise this buzzword.
