I've been around long enough to know that any "no code" style interface or GUI is typically the _problem_, not the solution. Regardless of the code they export, you end up with fat fingers, misclicks, forgotten UI paths to follow... Taking a software engineering approach to shipping infra is a stable, known process that both the infra team and the software teams can understand, no specialized GUI tool knowledge required.
I've been using the same basic terraform modules, jenkins pipelines, and infra architecture for nearly 7 years across multiple companies and numerous cloud deployments. It's not fancy but it justworks.jpg. Every time I re-use that code for a new deployment or account I save TONS of time.
Devops doesn't have to be hard. Infrastructure doesn't have to be complex. Deploying every day isn't _that_ difficult. The KISS method is key, especially when you're looking for speed. Using _fewer_ tools from the CNCF, not adding new ones, is better and will let you move faster.
That's simply not true for anything larger than a few services and a small dev team. The cloud is very complex to do right when you focus on security, performance, and scalability. And Terraform invariably devolves into a nightmare when you have a ton of resources with dependencies between them.
However, I do think that this is mostly essential complexity rather than accidental complexity. We're now building systems that are way more secure and/or scalable than before. Least-privilege network access and permissions everywhere already add a bunch of complexity. Pushing complexity from our code to managed cloud offerings does its part, too. But all of this can be tamed very well with modules and reusable components.
That said, if you're scaling Terraform, I do recommend checking out the tools that have sprung up in recent years to manage it. I'll personally recommend Spacelift[0] (see disclaimer). It can help you orchestrate your state files once you start having many of them (even tens or hundreds of state files in a single workflow are no problem) using stack dependencies, help team members self-serve through blueprints, automate all the things through OPA policies, and generally help you scale your Terraform usage to a larger team.
[0]: https://spacelift.io
Disclaimer: Software Engineering Team Lead at Spacelift, so take the recommendation with a fair grain of salt; I do legitimately think it's a great product though. If you'd like to reach out, feel free to do so through the website or the contact details in my profile.
At Netflix our goal was always to build tools where the majority of devs just check into source control and click a few buttons, but could go as far as configuring kernel tunables if necessary (but also making that as unnecessary as possible).
In software, there are at least three models I can think of immediately: code, configuration and user data.
Why do I separate configuration? Isn't configuration just code or data? I don't think it is. It is data _about_ a particular system, as opposed to a particular user.
Why the distinction here? The code of a system can be designed, developed, and tested against a set of supported configurations. At that point, the system might only run under one configuration at a time, but can be trusted from a requirements perspective to operate under other configurations without needing to go through the whole software development lifecycle again.
Why not just store this in user data, then? Different requirements. Three off the top of my head: configuration data wants much better change management than most user data does. That management wants to be exportable and importable. It wants different access controls.
Historically, configuration data change management has been done in SCM, such as git. The reason why git isn't a big deal in development is because it is not a point of particularly high friction relative to the other parts of the software development lifecycle. It is a _much_ bigger point of friction in configuration changes.
Hence, three models.
We can argue about whether or not configuration changes _ought_ to go through the full cycle, and whether I am wrong to trust _any_ change to a system with anything less. My practical experience suggests that most of the time, the damage done is less than the cost of enforcing a strict lifecycle on everything.
I define a resource, and provide a whole set of knobs on that resource. That's the code part. I test that code against a variety of configurations, the same way I might unit test application code against a variety of app configurations. I also verify that changing knobs from one setting to another behaves. With automated testing, this actually isn't all that hard to do. Once I've verified things work right, I deploy.
At this point, I will default to trusting that things will work. This is the configuration part. Set these knobs to whatever permitted value you want, and the system will update behavior based on those new values. Most of the time, things like this work. That is good enough for me.
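The "code defines the knobs, configuration stays within tested values" idea above can be sketched as a small validation gate. This is a hypothetical illustration, not any real system's API; the knob names and ranges are made up:

```python
# A minimal sketch: the "code" side declares which knobs exist and which
# values were tested; a proposed configuration change is accepted only if
# every knob stays inside its supported set. Knob names are illustrative.

SUPPORTED_KNOBS = {
    "replicas": range(1, 11),           # tested with 1..10 instances
    "log_level": {"debug", "info", "warn", "error"},
    "cache_ttl_seconds": range(0, 3601),
}

def validate_config(config: dict) -> list[str]:
    """Return a list of violations; an empty list means safe to apply."""
    errors = []
    for knob, value in config.items():
        if knob not in SUPPORTED_KNOBS:
            errors.append(f"unknown knob: {knob}")
        elif value not in SUPPORTED_KNOBS[knob]:
            errors.append(f"{knob}={value!r} is outside the tested range")
    return errors

# A permitted change sails through; an untested one is rejected up front.
assert validate_config({"replicas": 3, "log_level": "info"}) == []
assert validate_config({"replicas": 50}) == ["replicas=50 is outside the tested range"]
```

The point is that the expensive verification happens once, in the code's test suite; after that, any value inside the permitted set can be trusted without rerunning the whole lifecycle.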
Just don't ever ask to roll back...
k8s is also KISS, but it brings even more out of the box, like logging and monitoring. I'd highly recommend taking a look; perhaps you'll like it.
Terraform's state management is bad, and a lot of people don't realize that secrets get stored in state files. Bootstrapping this securely already requires infrastructure like remote state stores.
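To make the secrets point concrete: Terraform state is plain JSON, and resource attributes (including sensitive ones) sit in it verbatim. The sketch below parses an abbreviated, invented state file following the real on-disk shape (`resources[].instances[].attributes`):

```python
import json

# Abbreviated sketch of what a terraform.tfstate file looks like on disk.
# Real state files follow this shape; this resource and its values are invented.
tfstate = json.loads("""
{
  "version": 4,
  "resources": [
    {
      "type": "aws_db_instance",
      "name": "main",
      "instances": [
        {"attributes": {"username": "app", "password": "hunter2"}}
      ]
    }
  ]
}
""")

# Sensitive attributes are stored as plain JSON, readable by anyone with
# access to the file -- hence the need for an encrypted remote backend
# with access controls, rather than state committed to a repo or laptop.
attrs = tfstate["resources"][0]["instances"][0]["attributes"]
print(attrs["password"])  # the DB password, in plaintext
```

Marking an attribute `sensitive` in Terraform only hides it from CLI output; it does not encrypt it in the state file itself.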
Jenkins is fine, I would say, but with ArgoCD you actually gain real insight. ArgoCD is also IaC, and you can manage ArgoCD through ArgoCD.
Adoption of ArgoCD on the platforms I have built has been great. Developer teams love it, get used to it very fast, and don't need cluster access (in your case, VM access).
With k8s you also get zero-downtime deployment, blue/green, basically for free.
I dunno, usually I find databases and migrations to be the hard part. At this point, I have enough examples of app deploys that I can have a new app up and running on a pair of VMs with a robust blue/green deploy and backups inside of an hour or two, with deploy by Github Actions responding to pushes to prod branch.
Even if you don't have my company's half-decade worth of example devops, you can do something easier, like a single instance on a Digital Ocean machine with deploy by "ssh -A server 'cd yourapp && git pull && sudo systemctl restart yourapp'". Sure, you'll have a few seconds of downtime, and you'll expose your SSH keys to anyone on that box for those few seconds, but if you know some Linux and nginx, you can get this working inside of an hour from scratch.
They historically took on the reliability role, if nobody else did, but they were implementing reliability on top of a house of cards, which is a kind of hypocrisy that makes even mediocre devs bristle. Don’t lecture me on robust software, boyo. Your tools are made of string cheese and staples.
I don't know why you'd blame ops for the crappiness of the tools they have at their disposal.
Yes, Ansible, Salt, Puppet, and Chef are spaghetti-code inducing congealed messes of design. So are large collections of complex shell scripts.
So what's the alternative? What spherical cow of a configuration management tool from Platonic dev heaven shall be foisted on us this time?
I'm sure it'll be super clean and elegant this time, unlike the last thousand shitty tools they made.
And don't get me started on devs who think they're qualified to do ops when all they know is their language of choice (if even that) and have never thought about the network, security, capacity, redundancy, failover, reliability, hardware, backups, the rest of the company, or other users.
Terraform and Ansible look like gyroscopes compared to the build process of any modern software stack. We offered our dev teams a whole ass pizza party every time they had 10 green builds (on main) in a row. In three years we've paid it out once.
The problem comes from the management/business side. They hire devs and tell them to ship features as fast as they can. Then they hire ops folks and tell them the whole thing has to be super reliable, that they can't afford a minute of downtime.
In my opinion this is why DevOps is mostly pointless. We are trying to fix with tooling, processes, new tech, and fancy roles the fact that business people don’t want to make compromises or choose between the pace of delivery and reliability.
If you are open to it, try configuring an azure function app to use GitHub in the deployment center. I heard actual gasps from certain team members when they saw it automatically push the GH action workflow file into master and kick off the job without any additional bullshit beyond the GH authentication ceremony and org/repo/branch selection.
For me, the following tools make that a joy:
- Postgres as the database, which is very predictable and extremely reliable
- Migrations with Ruby on Rails, which strike just the right balance between a convenient DSL and letting you write SQL when necessary
- The strong_migrations gem, which catches unsafe commands in development before they run in production and explains how to make them safe
I say bring it on; more variance and more disruption in this space as people try new approaches might be what we need to get us out of the rut we've been stuck in for too long. No idea if it will work, but good luck to Adam and his team.
My current setup is "get a k8s cluster spun up and configured properly as fast and easily as possible" and then just use ArgoCD. ArgoCD is by far the best tool I have used in the last 15 years: it does exactly what it should do (syncing and showing me the k8s state versus my git repo), it can manage itself through the same mechanism (IaC), and people from different backgrounds pick it up very fast.
This tool might bridge the gap for some people and potentially solve problems, but I do have to say: ArgoCD.
Even if you think you want to start small and just use kubectl: start with ArgoCD.
I seriously want to know which places these are! I've been at 5 different companies, and I've never been at a place where people don't look at me like I'm speaking French when I suggest dark launching a feature or introducing feature toggles. I've yet to experience a place that actually integrates continuously, as opposed to merely having a CI pipeline without actually doing continuous integration.
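For readers who got the French-speaking look: a feature toggle for dark launching can be tiny. This is a generic sketch, not any particular flag library; the flag name and rollout scheme are invented:

```python
import hashlib

# Minimal feature-toggle sketch. A flag maps to a rollout percentage;
# each user hashes into a stable bucket in 0..99, so the same user gets
# the same answer on every request as the percentage ramps up.

FLAGS = {"new_checkout": 20}  # dark-launch to roughly 20% of users

def is_enabled(flag: str, user_id: str) -> bool:
    rollout = FLAGS.get(flag, 0)  # unknown flags default to off
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout

# Deterministic: the same user sees the same result on every request,
# so you can ramp 1% -> 20% -> 100% and roll back by editing one number.
assert is_enabled("new_checkout", "user-42") == is_enabled("new_checkout", "user-42")
assert is_enabled("some_unknown_flag", "user-42") is False
```

Dark launching is then just calling the new code path behind `is_enabled(...)` while still serving the old path's response, which decouples deploying code from releasing it.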
Didn't know the differences could be that huge.
Yes, the tooling is bad, all tooling is bad, but hearing the same old 'Infrastructure should "just work"' trope is getting old. Such developers should stop grandstanding and roll up their sleeves. Learning about TCP isn't beneath you.
It's literally beneath you as an application developer in the TCP/IP stack :budumptss:
No tool in the world is going to convince an executive to trust their people, to take risks with uptime and stability, and to break production as a necessary part of organizational learning. No, that requires executives to feel supported by other executives. Tools do not create collaboration and trust; people do.
Then came the great YAML plague, and we had to give up our title and general-purpose languages in favor of silly names, templates, and DSLs. But you still have to understand the OS, the hardware, and have real-world experience with availability, redundancy, etc. So the new generation of "DevOps" engineers, raised directly on Terraform and k8s, failed miserably at achieving any results.
Anybody seen any junior Ops (DevOps, SRE, GitOps, whatever-ops) job openings in the last few years? No? I wonder why. The tools are better, no? It should be easier than ever to deploy. We have all this microservice orchestration and all those beautiful public clouds. All the conferences and the marketing are saying it's a breeze. You don't have to worry about HA, replication, IOPS, and so on; we'll put that on your bill, thank you very much.
So here comes Dev again with a solution: click-click-drag Ops. Surely this will fix things, surely you can now hire right from the street and train to deploy. Or will my old admin ass get even pricier as the demand ever rises and the supply is dwindling? Stay tuned to find out.
Perhaps the reason your organization is only deploying once a month is the same reason it takes it a month to make simple code changes: you haven't hired sufficiently capable engineers, not that you're missing some magic.
IaC is a really powerful concept, and System Initiative does not need to be in conflict with that paradigm; it's just another layer of abstraction that still allows IaC.
The main issue is how to combine UI state and manual state.
The worst thing you can do, IMO, is to use a common representation, e.g. having the UI try to edit your manually written declarations. That is just a recipe for disaster.
The answer to this type of mixed editing is a layered approach, e.g. what is done in the USD format (https://openusd.org/release/index.html).
Each authoring "instance" has full control of its layer, and composition semantics define how the layers compose to the final declarative structure.
I've been working as a devops engineer for 6 years. I'm done. I quit and never gonna do this job again. Good luck with creating more complexity with tools.
I am not worried about the UI representation of the model like many comments are; it is not the main point of this project as I understand it. The UI is just that, a representation; the same relationships might as well be coded in HCL or the like.
the only mitigation to this is less. less tooling, less infra, less abstractions. you want that delta approaching zero as uptime goes to infinity.
i’m not sure how replacing walls of code with walls of yaml or walls of gui graphs changes anything at all.
i suppose it’s possible some paradigm leap in infra understandability is hiding in crazy nextgen ui/ux, but i’m not holding my breath.
the last leap i encountered was moving from the python sdk to the go sdk for manipulating aws. this was significant, but still more of a qol improvement than something fundamental to the solution space.
Have been following progress, excited for where it goes.