Other than the fact it seems to be an industry standard so it's good for your job prospects, what are the benefits to Terraform over CloudFormation/CDK or whatever the equivalent is for your particular cloud provider?
Most companies/people pick a provider and then stick with it and it doesn't seem like there's much portability between configurations if you do decide to switch providers later down the line so I'm not sure what the benefits are. I haven't delved into Terraform yet but I tried doing a project in Pulumi once and felt by the end of it that I might as well have just written it in AWS CDK directly.
Just this week, I wrote a Terraform module that uses the GCP, Kubernetes, and Cloudflare providers to allow us to bring up a single business-need (that will be needed, hopefully, many times in the future) that spans those three layers of the stack. 200 lines of Terraform written in an afternoon replaced a janky 2,000 line over-engineered Python microservice (including much retry and failure-handling logic that Terraform gives you for free) whose original author (not DevOps) moved on to better pastures.
CDK is fine if you're all-in on AWS. It has its tradeoffs compared to Terraform. Pick the right tool for the job.
A big plus is that Terraform works outside of AWS land.
CDK is a nightmare to work with. You're writing with programming-language syntax, which tempts you to write dynamic stuff - but everything still compiles down to declarative CFN, which just makes the ergonomics feel limited. The L2 and L3 constructs have a lot of implicit defaults that came back to bite us later.
With CDK you get synth and deploy, which felt like a black box. Minor changes would do the same 8 minute long deploy process as large infrastructure refactors. Switching to TF significantly sped up our builds for minor commits. There might be a better way to do this with CDK (maybe deploying separate apps for each part of our infrastructure) and we may have just missed it.
Also recently I was forced to use Google Cloud Deployment Manager scripts for some legacy project we were migrating to Terraform, and I was shocked at how buggy and useless it was. Failed to create resources for no discernible reason, couldn't update existing resources with dependencies, couldn't delete resources, was just unfathomably shit all around. Finished the Terraform migration earlier this morning and everything went off without a hitch, plus we got more coverage for stuff Deployment Manager doesn't support. It's also organized much nicer now, with versioned modules and what-have-you.
CloudFormation is ugly and again, surprisingly isn't well supported by AWS. I don't understand how it's possible, but Terraform providers seem to be more up to date with products and APIs. Maybe that's just me but I've seen others complain about the same thing.
- Not all cloud providers have an infra-as-code offering of their own in the first place (especially true with traditional server hosts), whereas pretty much every provider with some sort of API most likely has a Terraform provider implemented for it.
- Terraform providers include more than just PaaS/IaaS providers / server hosts; for example, my current job includes provisioning Datadog metrics and PagerDuty alerts alongside applications' AWS infra in the same per-app Terraform codebase, and a previous job entailed configuring Keycloak instances/realms/roles/etc. via Terraform.
Venturing away from opinions, the provider ecosystem with terraform enables some wonderful design options. For example, I have a module template that takes some basic container configs (e.g. ports, healthchecks) and a GitHub URL, then stands the service up on ECS and configures CI in the linked repo. CF can't do that.
Also some big corps run their own internal datacenters and have cloud-like interfaces with them. You can write TF providers for that (it's not going to be as nice as the public cloud ones, but still nice). Then you can utilize Terraform's multi-provider functionality to have 1 project manage deployments on multiple clouds that include on-prem.
Terraform's multi-provider functionality is also useful for providers outside AWS/Azure/GCP, such as Cloudflare. As far as I know CDK does not support that.
For me the killer feature is that both plan and apply show the actual diff of changes vs running infrastructure. It makes understanding effects of changes much easier.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...
https://blog.mikaeels.com/what-does-the-aws-cdk-diff-command...
AWS GitHub Opsgenie Okta Scalr TLS DNS
This smells like kubernetes
> Terraform over CloudFormation/CDK
They both work. It's more about which providers you need.
[0] https://buildkite.com/blog/manage-your-ci-cd-resources-as-co...
[1] https://registry.terraform.io/providers/integrations/github/...
I think HCL is an under appreciated aspect of Terraform. It was kinda awful for a while, but it's gotten a lot better and much easier to work with. It hits a sweet spot between data languages like JSON and YAML and fully-general programming languages like Python.
Take CloudFormation. The "native" language is JSON, and they've added YAML support for better ergonomics. But JSON is just not expressive enough. You end up with "pseudoparameters" and "function calls" layered on top. Attribute names doubling as type declarations, deeply nested layers of structure and incredible amounts of repetitious complexity just to be able to express all the details needed to handle even moderate amounts of infrastructure.
So, ok, AWS recognizes this and they provide CDK so you can wring out all the repetition using a real programming language - pick your favourite one, a bunch are supported. That helps some, but now you've got the worst of both worlds. It's not "just JSON" anymore. You need a full programming environment. The CDK, let's say the Python version, has to run on the right interpreter. It has a lot of package dependencies, and you'll probably want to run it in a virtualenv, or maybe a container. And it's got the full power of Python, so you might have sources of non-determinism that give you subtle errors and bugs. Maybe it's daylight saving gotchas or hidden dependencies on data that it pulls in from the net. This can sound paranoid, but these things do start to bite if you have enough scale and enough time.
And then, all that Python code is just a front end to the JSON, so you get some insulation from it, but sometimes you're going to have to reason about the JSON it's producing.
HCL, despite its warts, avoids the problems with these extremes. It's enough of a programming language that you can just use named, typed variables to deal with configuration, instead of all the { "Fn::GetAtt" : ["ObjectName", "AttName"] } nonsense that CloudFormation will put you through. And the ability to create modules that can call each other is sooo important for wringing out all the repetition that these configurations seem to generate.
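To make that concrete, here is a minimal sketch of the HCL side. The bucket and variable names are made up for the example; the point is only that a typed variable and a plain attribute reference stand in for the Fn::GetAtt lookup.

```hcl
# Hypothetical input: a named, typed variable with a default.
variable "log_bucket_name" {
  type        = string
  description = "Name of the bucket that receives access logs"
  default     = "example-access-logs"
}

resource "aws_s3_bucket" "logs" {
  bucket = var.log_bucket_name
}

# Plain attribute reference instead of { "Fn::GetAtt" : ["Logs", "Arn"] }.
output "log_bucket_arn" {
  value = aws_s3_bucket.logs.arn
}
```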
On the other hand, it's not fully general, so you don't have to deal with things like loops, recursion, and so on. This lack of power in the language enables more power in the tools. Things like the plan/apply distinction, automatically tracking dependencies between resources, targeting specific resources, moved blocks etc. would be difficult or impossible with a language as powerful as Python.
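As one small illustration of tooling that leans on the declarative model, a rename can be recorded in the config itself so that plan shows a move rather than a destroy and recreate. The resource names and the AMI variable below are hypothetical.

```hcl
variable "ami_id" {
  type = string
}

# The resource used to be declared as aws_instance.web.
resource "aws_instance" "frontend" {
  ami           = var.ami_id
  instance_type = "t3.micro"
}

# Tells Terraform the existing object in state now lives at the new address.
moved {
  from = aws_instance.web
  to   = aws_instance.frontend
}
```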
HCL isn't the only language in this space - see CUE and Dhall, for example - but it's undoubtedly the most widely used. And it makes a real difference in practice.
Try to keep your stateful resources / services in different "stacks" than your stateful things.
Absolutely 100% completely obvious, maybe too obvious? Because none of these guides ever mention it.
If you have state in a stack it becomes 10x more expensive and difficult to "replace your way out of trouble" (aka destroy and recreate as last resort). You want as much as possible in stateless, disposable stacks. DONT put that customer exposed bucket or DB in the same state/stack as app server resources.
I don't care about your folder structure, I care about what % of the infra I can reliably burn to the ground and replace using pipelines and no manual actions.
Everyone else seems to be reading over the typo or I'm more confused than I thought.
And, so, you're saying: try to have a separate deployment (stack then?) that contains the state, so you can wipe away everything else if you want to, without having to manage the state?
And yes, that's a succinct rephrasing.
When you first use iac it maybe seems logical to put your db and app server in the same "thing" (stack or state file) but now that thing is "pet like" and you have to take care of it forever. You can't safely have a "destroy" action in your pipeline as a last resort.
If you put the stateful stuff in a separate stack you can freely modify the things in the stateless one with much less worry.
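A rough sketch of that split, assuming two separate root modules with their own state. The directory names, bucket name and AMI variable are placeholders, not anything from a real setup.

```hcl
# stateful/main.tf (hypothetical): data that must survive a rebuild
resource "aws_s3_bucket" "customer_uploads" {
  bucket = "example-customer-uploads"

  lifecycle {
    prevent_destroy = true # any plan that would delete the bucket fails
  }
}

# stateless/main.tf (hypothetical): safe to destroy and recreate from a pipeline
variable "ami_id" {
  type = string
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  tags = {
    UploadsBucket = "example-customer-uploads" # referenced by name, not owned here
  }
}
```

With this layout a "terraform destroy" in the stateless directory never touches the bucket, because the bucket is simply not in that state file.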
At the end of the day I don't care what other people call them.
A CDK stack (assuming that's what is used here) would be loosely equivalent to a Terraform HCL module.
It is reasonable to assume that if you are using Terraform to manage your infra, then your infra likely has access to a secrets manager from your infra vendor, e.g., AWS. Instead I'd recommend using a Terraform data source to pull a credential from the secret manager by name -- and the name doesn't even necessarily have to be communicated through Terraform state. Then the credentials could be fed directly into where they are needed, e.g., a resource like a Kubernetes Secret. One can even skip this whole thing if the service can use the secret manager API itself. Finally, access to the credentials themselves would be locked down with IAM/RBAC.
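Something along these lines, as a sketch assuming AWS Secrets Manager and the Kubernetes provider. The secret name, namespace and key are invented for the example.

```hcl
# Look the secret up by name; "prod/app/db-password" is a hypothetical name.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/app/db-password"
}

resource "kubernetes_secret" "db" {
  metadata {
    name      = "db-credentials"
    namespace = "default"
  }

  # Plain values go in here; the provider handles the base64 encoding.
  data = {
    password = data.aws_secretsmanager_secret_version.db_password.secret_string
  }
}
```

One thing to keep in mind: as far as I know, a value read through a data source like this is still recorded in the state of the configuration that reads it, so the state backend needs the same access controls as the secret.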
The root module can have outputs just like any other module. These outputs can be accessed from other stacks from the backend.
And if you use CDKTF the references are handled transparently.
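In plain HCL, reading another stack's root outputs looks roughly like this. The bucket, key, AMI ID and the output name "private_subnet_id" are assumptions made for the sketch.

```hcl
# Points at wherever the network stack stores its state (hypothetical bucket/key).
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"

  # Only outputs declared in the other stack's root module are visible here.
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
```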
But, what if that Lambda depends on a VPC with some specific networking config to allow it to connect to some partner company private network? And, it's difficult to recreate that VPC without service disruption for a variety of reasons that are out of your control. Well, now you have state because you need to track which existing VPC the Lambda needs if you tear the Lambda down and recreate it.
But I’m coming at this from a GCP lens and got half way through the article about how the recommended unit of isolation in the AWS environment is entirely different AWS accounts and I’m kind of hung up on that. Is that really a thing people tend to do often? Doesn’t it get super unwieldy? How does billing work? What about identity? I have so many questions.
EDIT: Despite the fact that the root resource in both GCP and AWS is an organization, when I heard “account” I mistook that to be AWS terminology for an organization.
At the top level you have an organization account, which is where billing occurs.
From this org account you create accounts for the following (typically):
1. Security - AKA the account your USERS are in
2. Ops - The account your monitoring, etc. are in
From here where a lot of people seem to deviate (I've been interviewing level 2-3 SREs for the last 3 weeks and have heard all about different AWS structures that I don't like) is how to break up your applications into their own accounts for a low blast radius.
What I DO, and is well known as being the best practice, is to create an AWS account for each environment of each application.
App1-sandbox
App1-staging
App1-production
Then your terraform is also structured by application/environment/service. Each environment and application has its own state in s3 and dynamodb.
And so on.
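For one application/environment pair, the backend might look like the sketch below. The path, bucket and table names are hypothetical; the DynamoDB table is what provides state locking.

```hcl
# app1/production/backend.tf (hypothetical path and names)
terraform {
  backend "s3" {
    bucket         = "example-app1-production-tfstate"
    key            = "app1/production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-app1-production-tflock" # state locking
    encrypt        = true
  }
}
```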
Is this unwieldy? I have 40-50 AWS accounts and no it's not unwieldy at all IMO. Cross account IAM and trust relationships are set up very early on and they don't need to be modified much if any at all until you create another AWS account. Creating a new AWS account is kind of annoying, though. I need to automate that process better.
https://aws.amazon.com/organizations/getting-started/best-pr...
Right now we’re using one account per env but also see downsides and thought of going the next step to do one account per env and tribe/team.
You can do that with terraform...
There are AWS systems above the account level for managing it (Organizations), so it’s not quite as bad as it might naively seem, but, yes, it’s more unwieldy than GCP’s projects.
AWS IAM doesn’t do capability inheritance; if I can write a policy at all I could grant any privilege to any resource in that policy, even privileges I don’t personally have. It’s easier for each group of admins to have a separate AWS account than to put everyone in security boundaries that try to wall them off.
The same thing happens with the Kubernetes provider when you try to use it with multiple GKE clusters.
Billing works by having a billing aws account that all other accounts are in a sense "children" of.
I'll never want to do this without Terragrunt again. The suggested method of referencing remote states, and writing out the backends will fall apart instantly at that scale. It's just way too brittle and unwieldy.
Terragrunt with some good defaults that will be included, and separated states for modules (which makes partial applies a breeze) as well as autogenerated backend configs (let Terragrunt inject it for you, with templated values) is the way to go.
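For readers who haven't seen it, the autogenerated backend part is roughly the following in a root terragrunt.hcl. The bucket and table names are placeholders; each child module gets its own state key derived from its path.

```hcl
# Root terragrunt.hcl (names are hypothetical)
remote_state {
  backend = "s3"

  # Terragrunt writes this file into each module directory before running Terraform.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "example-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-terraform-locks"
  }
}
```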
The problems I have personally experienced with this approach are:
- if you update one of the root Terraform states, you need to execute a Terraform apply for every repo that depends on that Terraform state; developers do not do that because either they forget or they do know but are too lazy and subsequently are surprised that things are broken
- if you use workspaces for maintaining the infra in different environments, and certain components are only needed in specific environments, then the Terraform code becomes pretty ugly (using count which makes a single thing suddenly a list of things, which you then have to account for in the outputs which becomes very verbose)
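To make the count complaint concrete, here is a small sketch with a hypothetical bastion host that only exists in some environments. Once count is set, the resource is a list and the output has to cope with it being empty.

```hcl
variable "enable_bastion" {
  type    = bool
  default = false
}

variable "ami_id" {
  type = string
}

resource "aws_instance" "bastion" {
  count         = var.enable_bastion ? 1 : 0
  ami           = var.ami_id
  instance_type = "t3.micro"
}

# one() returns the single element, or null when the list is empty.
output "bastion_id" {
  value = one(aws_instance.bastion[*].id)
}
```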
Is Terragrunt something that would help us? I do not know Terragrunt, and a quick look at the website did not make that clear for me.
I've kind of found Terraform is dying and encourages a lot of bad practices, but everyone goes along with them because of HCL and because it is transferable, as most companies are just using TF.
I don't think it's dying. The hype has worn off. Everybody uses it. It's very mature. There's a module for everything.
It's just not new and sexy anymore IMO.
After using it for a few months, I found that all of the features found in Terragrunt are in Terraform.
Terraform still does not let you do this.
It becomes very problematic when using providers that are region specific, amongst other scenarios.
That being said I don’t like the extra complexity terragrunt adds and instead choose to adopt a hierarchical structure that solves most of the problems being able to dynamically render providers would solve.
Each module is stored in its own git repo.
Top layer or root module contains one tf file that is ONLY imports with no parameters.
The modules being imported are called “tenant modules”. A tenant module contains instantiations of providers and modules with parameters.
The modules imported by the tenant modules are the ones that actually stand up the infrastructure.
Variables are used, but no external parameters files are used at any level (except for testing).
All of the modules are versioned with git tagged releases so the correct version can easily be imported.
Couple this with a single remote state provider in the root module and throw it in a CI/CD pipeline and you have a gitops driven infrastructure as code pipeline.
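A stripped-down sketch of the layering described above, with two files shown in one block. The repo URLs, module names, versions and the cidr_block input are all invented for illustration.

```hcl
# root/main.tf: module calls only, no arguments (repo URLs are hypothetical)
module "tenant_eu" {
  source = "git::https://example.com/infra/tenant-eu.git?ref=v2.3.0"
}

# tenant-eu/main.tf: providers and parameters live here
provider "aws" {
  region = "eu-west-1"
}

module "network" {
  source     = "git::https://example.com/infra/network.git?ref=v1.0.2"
  cidr_block = "10.20.0.0/16"
}
```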
Managing multiple environments is much easier in TG. State management in TF is kneecapped by the lack of variable support in backend blocks. I can only assume it's to encourage people into using terraform cloud for state management.
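For comparison, the plain-Terraform workaround I'm aware of is partial backend configuration: leave the block empty and supply the values at init time. The file name and values below are hypothetical.

```hcl
# backend.tf: variables are not allowed in here, so the block is left partial.
terraform {
  backend "s3" {}
}

# envs/staging.s3.tfbackend (hypothetical file), passed in with:
#   terraform init -backend-config=envs/staging.s3.tfbackend
#
# bucket = "example-staging-tfstate"
# key    = "app/terraform.tfstate"
# region = "us-east-1"
```

It works, but every environment needs its own file and the right flag on every init, which is the kind of bookkeeping TG takes off your hands.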
For example, say you're using Terraform to manage AWS resources, and you've provisioned an Active Directory forest that you in turn want to manage with Terraform via the AD provider. Terraform providers can't dynamically pull things like needed credentials from existing state, so you end up needing two separate Terraform states: one for AWS (which outputs the management credentials for the AD servers you've provisioned) and one for AD (which accepts those credentials as config during 'terraform init').
Terragrunt can do this in an automated way within a single codebase, redefining providers and handling dependency/dependent relationships. I don't know of a way to do it in pure Terraform that doesn't entail manual intervention.
I agree with the splitting, but based on many home-grown automation systems I've seen around this I'd really recommend you to use one of the specialized CI/CD systems that are built around automating these kinds of workflows. Once you reach the "many state files" phase, you'll save a lot of engineering time this way.
They'll take care of, among others, running the right state files, in the right order, with the right parameters. But they'll also take care of many other things you need to run Terraform at scale and with large numbers of engineers (happy to expand but don't want to kitchen-sink this comment).
Disclaimer: Take this with a sensible grain of salt, as I work at Spacelift[0] - one of the TACOS (and of course the one I'll shamelessly link and recommend!).
But really, don't use tools like Jenkins for this as you scale, it'll likely hurt you in the long run.
[0]: https://spacelift.io
Great product though from what I've experienced.
For Terrateam[0], we have probably 70% of the enterprise offering but at around 1/10th the price. If there are any features that are deal breaker, feel free to reach out to me and we'll see what we can do. That being said, Spacelift is a much more luxurious piece of software than us. We are very utilitarian, but we have to rationalize that low price-point somehow.
If you haven't yet, please try talking to our sales team. There's usually a way to make all sides happy with some custom agreements - after all, we'd love for you to be able to use our product as much as you need.
To be fair, I haven't used terraform -chdir yet.
As sibling said, use data source to read remote state outputs.
Do you use the same security group for all of your instances?
I usually create a security group per group of related EC2 instances.
Gruntwork is a really cool company that makes other tools in this space like Terratest [1]. Every module I write comes with Terratest powered integration tests. Nothing more satisfying than pushing a change, watching the pipeline run the test, and then automatically release a new version that I know works (or at least what I tested works).
Some minor examples:
- calling a module multiple times using `for_each` to iterate over data works, except if the module contains a "provider" block
- if you are deploying two sets of resources by iterating over data, terraform can detect dependency cycles where there are not any
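For the first point, the arrangement that does work is to keep provider blocks in the root module and pass them into the module call. A sketch, assuming a hypothetical local module "./modules/service" that takes a "name" input and has no provider block of its own:

```hcl
provider "aws" {
  alias  = "west"
  region = "us-west-2"
}

module "service" {
  source   = "./modules/service" # hypothetical module without its own provider block
  for_each = toset(["api", "worker"])

  name = each.key

  # Hand the root module's aliased provider to every instance of the module.
  providers = {
    aws = aws.west
  }
}
```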
I’ve gone back and forth on workspaces versus more root modules. On balance, I like having more root modules because I can orient myself just by my working directory instead of both my working directory and workspace. Plus, I feel better about stuffing more dimensions of separation into a directory tree than into workspace names. YMMV.
Why not put them in separate repos that can be tagged and versioned and then referenced like below?
source = "git::https://bitbucket.org/foocompany/module_name.git?ref=v1.2"
The comments in here kind of made me think I was going to hop in and take away some huge wins I hadn't considered. But I have been working with Terraform and AWS for a very long time.
If you're unfamiliar with AWS multi-account best practices this is a good read.
https://aws.amazon.com/organizations/getting-started/best-pr...
I don't know how integrating this into an environment where you already have tons of AWS accounts would go but it's interesting. Thankfully I only have to make new accounts when we greenfield a service and that's maybe a yearly thing.
I kind of think using a language with native JSON support and structural type system would be best.
HCL also just works.
Hashicorp is focused on CI/CD/cloud/workspace driven workflows over monorepo `chdir` driven.
// shout out to AWS CAB alums
What would be interesting here would be to see how they actually reference the outputs from one layer onto the next layer. That is something that is not even solved nicely in Terragrunt and one of the major annoyances for me there. Using dependencies and the mock_outputs option creates lots of noise in the plan outputs, as the dependencies are only completely resolved when Terragrunt applies all the modules.
But it seems I also missed a few additions to terraform - so probably there are better ways to take outputs from one terraform run into another one.
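For context, the Terragrunt pattern being described looks roughly like this in a module's terragrunt.hcl. The path and the vpc_id output are hypothetical; the mock value is what shows up in plans until the dependency has actually been applied.

```hcl
dependency "vpc" {
  config_path = "../vpc" # hypothetical sibling module

  # Placeholder used while the vpc module has no real outputs yet.
  mock_outputs = {
    vpc_id = "vpc-00000000"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  vpc_id = dependency.vpc.outputs.vpc_id
}
```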
1000% agree. I put together my version of standing up remote state in AWS into a Github repo. https://github.com/aryounce/terraform-aws-bootstrap
Our use of Terraform splits state exactly as described primarily to keep the state refresh times reasonable.
We've been using -target for years, and while I understand very well the reasons it's discouraged, it is pretty much the only way you can have your cake and eat it too with respect to having "one large terraform project" and not running into terraform refreshes that make your eyes bleed.
You end up having to really understand your module structure to use it, but it let us develop some very elegant workflows around tasks like patching.
We developed a ruby code base that utilizes rake and the hcl2json tool to automate terraform-based infrastructure workflows, using various libraries to handle and validate that applications are happy with what terraform is doing to their servers while it works.
This gives us flexibility to run automations safely against a terraform code base that has been evolving since version 0.3 or so, before most of the mistakes were made often enough to come up with the best practices we have today.
The point of Terraform is to have configuration in version control, not to have a giant unmanageable state file.
Under one state, Terraform will spot issues here during `plan`, but with many states issues will only appear after `apply`.