Terraform best practices for reliability at any scale

(substrate.tools)

322 points · holoway · 2y ago · 152 comments

102 comments · 24 top-level

rcarr · 2y ago · 22 in thread

Genuine question for DevOps people:

Other than the fact it seems to be an industry standard so it's good for your job prospects, what are the benefits to Terraform over CloudFormation/CDK or whatever the equivalent is for your particular cloud provider?

Most companies/people pick a provider and then stick with it and it doesn't seem like there's much portability between configurations if you do decide to switch providers later down the line so I'm not sure what the benefits are. I haven't delved into Terraform yet but I tried doing a project in Pulumi once and felt by the end of it that I might as well have just written it in AWS CDK directly.

abrookewood · 2y ago

After you have waited 20 minutes for CloudFormation to fail and tell you that it can't delete a resource (but won't tell you why), and this is the third time it has happened in a week, you start looking at alternatives.

androidbishop · 2y ago

This 1000%. Also recently discovered Google Deployment Manager is shit for the exact same reasons. I honestly don't get it.

solatic · 2y ago

> your particular cloud provider

Just this week, I wrote a Terraform module that uses the GCP, Kubernetes, and Cloudflare providers to allow us to bring up a single business need (one that will be needed, hopefully, many times in the future) that spans those three layers of the stack. 200 lines of Terraform written in an afternoon replaced a janky, over-engineered 2,000-line Python microservice (including much retry and failure-handling logic that Terraform gives you for free) whose original author (not DevOps) moved on to better pastures.

CDK is fine if you're all-in on AWS. It has its tradeoffs compared to Terraform. Pick the right tool for the job.

Centigonal · 2y ago

I worked closely with the folks that wrote our platform's IaC, first in CDK, then in Terraform. I wrote a bit of CDK and zero TF myself, but here are some of the reasons we switched:

A big plus is that Terraform works outside of AWS land.

CDK is a nightmare to work with. You're writing with programming-language syntax, which tempts you to write dynamic stuff - but everything still compiles down to declarative CFN, which just makes the ergonomics feel limited. The L2 and L3 constructs have a lot of implicit defaults that came back to bite us later.

With CDK you get synth and deploy, which felt like a black box. Minor changes would go through the same 8-minute-long deploy process as large infrastructure refactors. Switching to TF significantly sped up our builds for minor commits. There might be a better way to do this with CDK (maybe deploying separate apps for each part of our infrastructure) and we may have just missed it.

androidbishop · 2y ago

Terraform, and by extension HCL, is more powerful and flexible. It can be used across clouds. It has providers for all kinds of things, like kubernetes. It can be abstracted and modularized. It supports cool features like workspaces and junk, depending on how you want to use it.

Also recently I was forced to use Google Cloud Deployment Manager scripts for some legacy project we were migrating to Terraform, and I was shocked at how buggy and useless it was. Failed to create resources for no discernible reason, couldn't update existing resources with dependencies, couldn't delete resources, was just unfathomably shit all around. Finished the Terraform migration earlier this morning and everything went off without a hitch, plus we got more coverage for stuff Deployment Manager doesn't support. It's also organized much nicer now, with versioned modules and what-have-you.

CloudFormation is ugly and, again, surprisingly isn't well supported by AWS. I don't understand how it's possible, but Terraform providers seem to be more up to date with products and APIs. Maybe that's just me, but I've seen others complain about the same thing.

wodenokoto · 2y ago

Isn’t Google Cloud Deployment Manager just bash calls to the Google Cloud CLI disguised as declarations by way of YAML?

yellowapple · 2y ago

- In the event that you are working with different cloud providers, Terraform is one thing to learn that then applies to all of them, as opposed to learning each provider's bespoke infra-as-code offering. Most companies stick to one PaaS/IaaS, but individual personnel ain't necessarily as limited over the courses of their careers.

- Not all cloud providers have an infra-as-code offering of their own in the first place (especially true with traditional server hosts), whereas pretty much every provider with some sort of API most likely has a Terraform provider implemented for it.

- Terraform providers include more than just PaaS/IaaS providers / server hosts; for example, my current job includes provisioning Datadog metrics and PagerDuty alerts alongside applications' AWS infra in the same per-app Terraform codebase, and a previous job entailed configuring Keycloak instances/realms/roles/etc. via Terraform.

androidbishop · 2y ago

Also pretty neat that there's a Terraform provider for Kubernetes native resources.

RulerOf · 2y ago

I've got a lot of opinions here, but the only one I'll share is that HCL knocks the socks off of JSON and YAML. JSON is too rigid. YAML is too nested. HCL gets this just right.

Venturing away from opinions, the provider ecosystem with terraform enables some wonderful design options. For example, I have a module template that takes some basic container configs (e.g. ports, healthchecks) and a GitHub URL, then stands the service up on ECS and configures CI in the linked repo. CF can't do that.

nuker · 2y ago

I've been working with AWS for 10 years. I strongly prefer CloudFormation; just separate things smartly between stacks. It has export/import for stack outputs too. Just look at the “root module” mess in this discussion and you’ll get why.

raffraffraff · 2y ago

For me personally, I chose Terraform because it can work with AWS and a heap of other third-party services and software (Cloudflare, PostgreSQL, Keycloak, Kubernetes/Helm, GitHub, Azure).

x3n0ph3n3 · 2y ago

I have used both terraform and cloudformation substantially and they each have pros and cons. One thing terraform has over cloudformation is its rapid support for new services and features. AWS has done an awful job ensuring that cloudformation support is part of each team's definition of "done" for each release. It just doesn't get the support it really needs from AWS.

koolba · 2y ago

CloudFormation is the ugly stepchild of AWS. It has bugs that have languished for years.

dgrin91 · 2y ago

Companies choose providers and tend to stick with them, but people don't always stick with companies. If I know TF there is a decent chance my skills will be applicable when I change companies.

Also, some big corps run their own internal datacenters and have cloud-like interfaces to them. You can write TF providers for those (it's not going to be as nice as the public cloud ones, but still nice). Then you can use Terraform's multi-provider functionality to have one project manage deployments on multiple clouds, including on-prem.

Terraform's multi-provider functionality is also useful for non-AWS/Azure/GCP services such as Cloudflare. As far as I know, CDK does not support that.

Illotus · 2y ago

> Other than the fact it seems to be an industry standard so it's good for your job prospects, what are the benefits to Terraform over CloudFormation/CDK or whatever the equivalent is for your particular cloud provider?

For me the killer feature is that both plan and apply show the actual diff of changes vs running infrastructure. It makes understanding effects of changes much easier.

Bellyache5 · 2y ago

Agreed, Terraform does a good job of this. But CloudFormation & CDK can also do this via Change Sets and CDK diff.

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...

https://blog.mikaeels.com/what-does-the-aws-cdk-diff-command...

thedougd · 2y ago

Providers I regularly use, even mixed in a single project. There are others I could use if they were available.

AWS, GitHub, Opsgenie, Okta, Scalr, TLS, DNS

koolba · 2y ago

You forgot the greatest escape hatch of all: null

jahsome · 2y ago

Third-party integrations and the universality/reusability across multiple products and familiarity of HCL are big for me.

333throwaway342 · 2y ago

> Most companies/people pick a provider and then stick with it and it doesn't seem like there's much portability between configurations if you do decide to switch providers later down the line so I'm not sure what the benefits are.

This smells like kubernetes

> Terraform over CloudFormation/CDK

They both work. It's more about which providers you need.

maccard · 2y ago

We don't just have AWS resources. Our CI pipelines are managed by terraform [0], they communicate with GitHub [1]. I like that it's declarative and limited, it stops people trying to do "clever shit" with our infra, which is complicated enough as it is.

[0] https://buildkite.com/blog/manage-your-ci-cd-resources-as-co...

[1] https://registry.terraform.io/providers/integrations/github/...

cwp · 2y ago

It's subtle and so difficult to see the differences at smaller scales. If you're going to provision a handful of EC2 instances, all the tools work fine.

I think HCL is an underappreciated aspect of Terraform. It was kinda awful for a while, but it's gotten a lot better and much easier to work with. It hits a sweet spot between data languages like JSON and YAML and fully-general programming languages like Python.

Take CloudFormation. The "native" language is JSON, and they've added YAML support for better ergonomics. But JSON is just not expressive enough. You end up with "pseudoparameters" and "function calls" layered on top. Attribute names doubling as type declarations, deeply nested layers of structure and incredible amounts of repetitious complexity just to be able to express all the details needed to handle even moderate amounts of infrastructure.

So, ok, AWS recognizes this and they provide CDK so you can wring out all the repetition using a real programming language - pick your favourite one, a bunch are supported. That helps some, but now you've got the worst of both worlds. It's not "just JSON" anymore. You need a full programming environment. The CDK, let's say the Python version, has to run on the right interpreter. It has a lot of package dependencies, and you'll probably want to run it in a virtualenv, or maybe a container. And it's got the full power of Python, so you might have sources of non-determinism that give you subtle errors and bugs. Maybe it's daylight saving gotchas or hidden dependencies on data that it pulls in from the net. This can sound paranoid, but these things do start to bite if you have enough scale and enough time.

And then, all that Python code is just a front end to the JSON, so you get some insulation from it, but sometimes you're going to have to reason about the JSON it's producing.

HCL, despite its warts, avoids the problems with these extremes. It's enough of a programming language that you can just use named, typed variables to deal with configuration, instead of all the { "Fn::GetAtt" : ["ObjectName", "AttName"] } nonsense that CloudFormation will put you through. And the ability to create modules that can call each other is sooo important for wringing out all the repetition that these configurations seem to generate.
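For a rough sense of the difference, here is a minimal sketch of the HCL side. The resource names (`web`, `app`) and the variable are made up for illustration, and the AMI ID is a placeholder, so treat it as a shape rather than something to apply as-is.

```hcl
# Hypothetical example: a typed input variable and a direct attribute reference.
variable "instance_type" {
  type    = string
  default = "t3.micro"
}

resource "aws_security_group" "web" {
  name = "web"
}

resource "aws_instance" "app" {
  ami           = "ami-00000000000000000" # placeholder, not a real AMI
  instance_type = var.instance_type

  # A plain reference instead of { "Fn::GetAtt": [...] }.
  # Terraform also infers the dependency on the security group from it.
  vpc_security_group_ids = [aws_security_group.web.id]
}
```

The reference on the last line doubles as the dependency declaration, which is what lets the tooling order resource creation without any extra annotation.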

On the other hand, it's not fully general, so you don't have to deal with things like loops, recursion, and so on. This lack of power in the language enables more power in the tools. Things like the plan/apply distinction, automatically tracking dependencies between resources, targeting specific resources, `moved` blocks etc. would be difficult or impossible with a language as powerful as Python.

HCL isn't the only language in this space - see CUE and Dhall, for example - but it's undoubtedly the most widely used. And it makes a real difference in practice.

xyzzy123 · 2y ago · 14 in thread

Here's my #1 tip, most important:

Try to keep your stateful resources / services in different "stacks" than your stateful things.

Absolutely 100% completely obvious, maybe too obvious? Because none of these guides ever mention it.

If you have state in a stack it becomes 10x more expensive and difficult to "replace your way out of trouble" (aka destroy and recreate as last resort). You want as much as possible in stateless, disposable stacks. DON'T put that customer-exposed bucket or DB in the same state/stack as app server resources.

I don't care about your folder structure, I care about what % of the infra I can reliably burn to the ground and replace using pipelines and no manual actions.

coryrc · 2y ago

You mean keep stateless separate from stateful?

Everyone else seems to be reading over the typo or I'm more confused than I thought.

wahnfrieden · 2y ago

Yes

sverhagen · 2y ago

Is a "stack" here a (root) folder on which you'd do a "terraform apply"? I've never known what to call those; surely they aren't "modules".

And, so, you're saying: try to have a separate deployment (stack then?) that contains the state, so you can wipe away everything else if you want to, without having to manage the state?

xyzzy123 · 2y ago

It's not exactly about the folder; the IaC from a single folder / project can be instantiated in multiple places. Each time you do that, it has a unique state file, so I usually hear it referred to as a "state". In CFN you can similarly deploy the same thing lots of times and each instantiation is called a "stack", so stack/state tend to get used interchangeably.

And yes, that's a succinct rephrasing.

When you first use IaC it maybe seems logical to put your DB and app server in the same "thing" (stack or state file), but now that thing is "pet-like" and you have to take care of it forever. You can't safely have a "destroy" action in your pipeline as a last resort.

If you put the stateful stuff in a separate stack you can freely modify the things in the stateless one with much less worry.
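A minimal sketch of that split, assuming two separate directories that each get their own state file. The paths, bucket name, and AMI are placeholders rather than anything from the article.

```hcl
# stateful/main.tf (hypothetical layout): long-lived data with its own state file.
resource "aws_s3_bucket" "customer_uploads" {
  bucket = "example-customer-uploads" # placeholder name

  lifecycle {
    prevent_destroy = true # make a stray "destroy" fail loudly
  }
}

output "uploads_bucket_name" {
  value = aws_s3_bucket.customer_uploads.bucket
}

# stateless/main.tf (separate directory, separate state): disposable compute.
variable "uploads_bucket_name" {
  type = string
}

resource "aws_instance" "app" {
  ami           = "ami-00000000000000000" # placeholder
  instance_type = "t3.micro"

  tags = {
    UploadsBucket = var.uploads_bucket_name
  }
}
```

With this shape, running `terraform destroy` in the stateless directory never has the bucket in scope, so it stays usable as a last resort.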

raffraffraff · 2y ago

I have the same issue. They're all modules, but the ones at the tip of the directory tree (right at the end of your env/region/stack) are called root modules. Which makes no sense because the term "root" always implies that they are at the beginning, not the tippy-toe end. So I call mine "stacks". But as another answer suggested, "states" is also fine. Even though the actual state isn't inside that directory, it's probably in an object store.

At the end of the day I don't care what other people call them.

GauntletWizard · 2y ago

I have adopted the term "Root Module" vs "Submodule" because those line up with terraform's own definitions, but I agree that they're terribly, terribly named.

thwway23432 · 2y ago

The "stack" nomenclature used here is jarring since it is unrepresented in Terraform HCL literature.

A CDK stack (assuming that's what is used here) would be loosely equivalent to a Terraform HCL module.

robertlagrant · 2y ago

Makes sense, but how do you connect the two so e.g. credentials from one are surfaced in the other?

dharmab · 2y ago

Use Data Sources to reference resources in a different state: https://developer.hashicorp.com/terraform/language/data-sour...

no_circuit · 2y ago

IMO you shouldn't be storing credentials in shared state, as suggested by the other comments, since that means that the principals able to read the state to deploy their service can also read the credentials for other services bundled in that state file. This could be the case if one had broken down the root modules into scopes/services like the linked page suggests.

It is reasonable to assume that if you are using Terraform to manage your infra, then your infra likely has access to a secrets manager from your infra vendor, e.g., AWS. Instead, I'd recommend using a Terraform data source to pull a credential from the secrets manager by name, and the name doesn't even necessarily have to be communicated through Terraform state. Then the credential can be fed directly into where it is needed, e.g., a resource like a Kubernetes Secret. One can even skip this whole thing if the service can use the secrets manager API itself. Finally, access to the credentials themselves would be locked down with IAM/RBAC.
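As a sketch of that pattern on AWS: the secret name, namespace, and resource labels below are hypothetical, and the `aws` and `kubernetes` providers are assumed to be configured elsewhere.

```hcl
# Assumes a secret named "app1/db-password" already exists in AWS Secrets Manager.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "app1/db-password" # hypothetical name, agreed on by convention
}

resource "kubernetes_secret" "db" {
  metadata {
    name      = "db-credentials"
    namespace = "app1"
  }

  data = {
    password = data.aws_secretsmanager_secret_version.db_password.secret_string
  }
}
```

One caveat: the value read by the data source is still recorded in the state of the root module that consumes it, so that state needs the same protection as the secret. Only the sharing through another module's state is avoided.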

paulddraper · 2y ago

terraform_remote_state

The root module can have outputs just like any other module. These outputs can be accessed from other stacks from the backend.

And if you use CDKTF the references are handled transparently.
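For reference, a minimal sketch of the `terraform_remote_state` approach, assuming an S3 backend. The bucket, key, and the `private_subnet_id` output are placeholders, and the other root module has to declare that output.

```hcl
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state" # placeholder
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-00000000000000000" # placeholder
  instance_type = "t3.micro"

  # Reads an output declared in the network root module.
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
```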

harha_ · 2y ago

I've never used terraform, but I have used CloudFormation and AWS CDK. It's been a while though, is there a clear indication on the major cloud provider docs which resources are stateful? Or is it always obvious?

tyingq · 2y ago

Difficult question, as people mean different things when they say state. One example might be a relatively simple AWS Lambda. Most people would say that's easily stateless.

But, what if that Lambda depends on a VPC with some specific networking config to allow it to connect to some partner company private network? And, it's difficult to recreate that VPC without service disruption for a variety of reasons that are out of your control. Well, now you have state because you need to track which existing VPC the Lambda needs if you tear the Lambda down and recreate it.

hansoolo · 2y ago

Just stopped here because I had said XYZZY way too often in the last three hours xD

mdhb · 2y ago · 12 in thread

Sorry if this comes across as weird or snotty; it’s not supposed to.

But I’m coming at this from a GCP lens and got half way through the article about how the recommended unit of isolation in the AWS environment is entirely different AWS accounts and I’m kind of hung up on that. Is that really a thing people tend to do often? Doesn’t it get super unwieldy? How does billing work? What about identity? I have so many questions.

EDIT: Despite the fact that the root resource in both GCP and AWS is an organization, when I heard “account” I mistook that to be AWS terminology for an organization.

swozey · 2y ago

The way this works with AWS is similar to you making a GCP project.

At the top level you have an organization account, which is where billing occurs.

From this org account you create accounts for the following (typically):

1. Security - AKA the account your USERS are in

2. Ops - the account your monitoring, etc. are in

From here where a lot of people seem to deviate (I've been interviewing level 2-3 SREs for the last 3 weeks and have heard all about different AWS structures that I don't like) is how to break up your applications into their own accounts for a low blast radius.

What I DO, and is well known as being the best practice, is to create an AWS account for each environment of each application.

- App1-sandbox

- App1-staging

- App1-production

Then your Terraform is also structured by application/environment/service. Each environment and application has its own state in S3 and DynamoDB.

And so on.
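A minimal sketch of what one of those per-application, per-environment state configurations might look like. The file path, bucket, and table names are invented for illustration.

```hcl
# app1/production/backend.tf (hypothetical path and names)
terraform {
  backend "s3" {
    bucket         = "example-app1-production-tfstate"
    key            = "service-a/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-app1-production-tflock" # used for state locking
    encrypt        = true
  }
}
```

Here S3 holds the state file itself and the DynamoDB table holds the lock, which is why the two are mentioned together.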

Is this unwieldy? I have 40-50 AWS accounts and no, it's not unwieldy at all IMO. Cross-account IAM and trust relationships are set up very early on and they don't need to be modified much if any at all until you create another AWS account. Creating a new AWS account is kind of annoying, though. I need to automate that process better.

https://aws.amazon.com/organizations/getting-started/best-pr...

mdhb · 2y ago

Cool, that was a genuinely fascinating window into AWS for me. Thank you for sharing

denvrede · 2y ago

How do you deploy your Apps? We exclusively use EKS and having one account per env and app seems like quite an overhead when I think about managing / updating EKS clusters for each one. It also comes with an overhead of base applications that need to run in each cluster by default (like cert-manager, externaldns etc).

Right now we’re using one account per env but also see downsides and thought of going the next step to do one account per env and tribe/team.

infogulch · 2y ago

> Creating a new AWS account is kind of annoying, though. I need to automate that process better.

You can do that with terraform...
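A minimal sketch, assuming it runs with credentials for the organization's management account. The account name and email are placeholders.

```hcl
# Creates a member account inside an existing AWS Organization.
resource "aws_organizations_account" "app1_staging" {
  name  = "app1-staging"
  email = "aws+app1-staging@example.com" # must be unique per AWS account

  # IAM role created in the new account so the management account can assume into it.
  role_name = "OrganizationAccountAccessRole"
}
```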

dragonwriter · 2y ago

> But I’m coming at this from a GCP lens and got half way through the article about how the recommended unit of isolation in the AWS environment is entirely different AWS accounts and I’m kind of hung up on that. Is that really a thing people tend to do often? Doesn’t it get super unwieldy?

There are AWS systems above the account level for managing it (Organizations), so it's not quite as bad as it might naively seem, but, yes, it's more unwieldy than GCP’s projects.

mdhb · 2y ago

Oh thank god, that’s much better than I naively thought. Thanks for the heads up.

stock_toaster · 2y ago

You can have sub-accounts that roll up billing to a main account. Still messy, but probably cleaner (security policy wise) and possibly safer (fewer production impacting accidental config changes?) than having a single giant account with _lots_ of things mixed together.

mdaniel · 2y ago

As the resident pedant, one cannot have "sub-accounts" in AWS. One can 100% have Accounts that are a member of an AWS Organization, which itself has a Management Account that does as you describe as the "main account", but there is no parent-child relationship between Accounts, only OUs and Accounts or OUs and other OUs (which the Organization, itself, counts as an OU)

erik_seaberg · 2y ago

Some AWS services have per-account (not per-resource) size and rate limits. Keeping resources in separate AWS accounts gets them out of each others’ blast radii.

AWS IAM doesn’t do capability inheritance; if I can write a policy at all I could grant any privilege to any resource in that policy, even privileges I don’t personally have. It’s easier for each group of admins to have a separate AWS account than to put everyone in security boundaries that try to wall them off.

linuxdude314 · 2y ago

This doesn’t have anything to do with AWS, this has to do with terraform not allowing you to dynamically instantiate providers.

The same thing happens with the Kubernetes provider when you try to use it with multiple GKE clusters.
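To illustrate the limitation: in the Terraform versions discussed here, each provider configuration has to be written out statically and handed to modules by alias. A minimal sketch, with the module path and aliases as assumptions:

```hcl
# One provider block per region (or account); these cannot be generated in a loop.
provider "aws" {
  alias  = "use1"
  region = "us-east-1"
}

provider "aws" {
  alias  = "euw1"
  region = "eu-west-1"
}

# Each module call names the provider configuration it should use.
module "app_use1" {
  source    = "./modules/app" # hypothetical local module
  providers = { aws = aws.use1 }
}

module "app_euw1" {
  source    = "./modules/app"
  providers = { aws = aws.euw1 }
}
```

Adding a third region or account means adding another provider block and another module call by hand, which is the repetition the comment is pointing at.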

lgreiv · 2y ago

You should be able to organize accounts hierarchically using AWS Organizations, which allows you to have cost centers and centralized billing (and some imposed policies over all accounts).

cube2222 · 2y ago

It's extremely common and recommended.

Billing works by having a billing AWS account that all other accounts are, in a sense, "children" of.

thunfisch · 2y ago · 5 in thread

We're using Terragrunt with hundreds of AWS accounts and thousands of Terraform deployments/states.

I'll never want to do this without Terragrunt again. The suggested method of referencing remote states, and writing out the backends will fall apart instantly at that scale. It's just way too brittle and unwieldy.

Terragrunt with some good defaults that will be included, and separated states for modules (which makes partial applies a breeze) as well as autogenerated backend configs (let Terragrunt inject it for you, with templated values) is the way to go.

rvdginste · 2y ago

We use a setup where we have multiple repos with Terraform configuration and thus multiple Terraform states. We then use Terraform remote state to link everything together. I am talking about 10-20 repos and states. Orthogonal to that, we use multiple workspaces to describe the infra in different environments.

The problems I have personally experienced with this approach are:

- if you update one of the root Terraform states, you need to execute a Terraform apply for every repo that depends on that Terraform state; developers do not do that because either they forget or they do know but are too lazy and subsequently are surprised that things are broken

- if you use workspaces for maintaining the infra in different environments, and certain components are only needed in specific environments, then the Terraform code becomes pretty ugly (using count which makes a single thing suddenly a list of things, which you then have to account for in the outputs which becomes very verbose)
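The second point can be sketched as follows, assuming a workspace named "production" and a made-up bastion host. The `one()` function is one way to trim the output boilerplate that `count` introduces.

```hcl
locals {
  # Assumption: only the "production" workspace gets a bastion host.
  enable_bastion = terraform.workspace == "production"
}

resource "aws_instance" "bastion" {
  count         = local.enable_bastion ? 1 : 0
  ami           = "ami-00000000000000000" # placeholder
  instance_type = "t3.micro"
}

output "bastion_id" {
  # With count, aws_instance.bastion is a list of zero or one instances.
  # one() returns the single element, or null when the list is empty.
  value = one(aws_instance.bastion[*].id)
}
```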

Is Terragrunt something that would help us? I do not know Terragrunt, and a quick look at the website did not make that clear for me.

ckdarby · 2y ago

Have you spent any time with Pulumi?

I've kind of found terraform is dying and encourages a lot of bad practices, but everyone goes along with them because of HCL and because it is transferable, as most companies are just using TF.

RulerOf · 2y ago

> I've kind of found terraform is dying

I don't think it's dying. The hype has worn off. Everybody uses it. It's very mature. There's a module for everything.

It's just not new and sexy anymore IMO.

DelightOne · 2y ago

Do you need to chain multiple Terragrunt executions to first bring the Kubernetes cluster up and then the containers, or does Terragrunt fix that?

miduil · 2y ago

Yes, with terragrunt you can do a `terragrunt run-all apply` and based on `output` to `variable` in each module data can be passed from one state/module to the next one, terragrunt knows how to run them in the right order so you can bootstrap your EKS cluster by having one module which bootstraps the account, then another one which bootstraps EKS, then one that configures the cluster, installs your "base pods" and then later everything else.
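A minimal sketch of how one unit consumes another's outputs in Terragrunt. The directory layout, module path, and output names are assumptions.

```hcl
# cluster-config/terragrunt.hcl (hypothetical layout; the sibling directory
# "eks" holds the unit that creates the cluster)
terraform {
  source = "../../modules/cluster-config"
}

dependency "eks" {
  config_path = "../eks"
}

# Outputs of the eks unit become input variables of this one.
inputs = {
  cluster_name     = dependency.eks.outputs.cluster_name
  cluster_endpoint = dependency.eks.outputs.cluster_endpoint
}
```

The `dependency` block is what lets Terragrunt work out that the EKS unit must be applied before this one.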

spicyusername · 2y ago · 5 in thread

Everybody in here is recommending Terragrunt, but I'm not sure what value it provides over regular Terraform.

After using it for a few months, my impression is that all of the features found in Terragrunt are in Terraform.

jbjohns · 2y ago

This is my impression as well. As far as I've understood, terragrunt was made back when terraform was missing a lot of key features (I think it maybe didn't even have modules yet) but when I was asked to evaluate it recently for a client I couldn't find a single reason to justify adding another tool.

linuxdude314 · 2y ago

The primary thing terragrunt was designed to do was let you dynamically render providers.

Terraform still does not let you do this.

It becomes very problematic when using providers that are region specific, amongst other scenarios.

That being said I don’t like the extra complexity terragrunt adds and instead choose to adopt a hierarchical structure that solves most of the problems being able to dynamically render providers would solve.

Each module is stored in its own git repo.

Top layer or root module contains one tf file that is ONLY imports with no parameters.

The modules being imported are called “tenant modules”. A tenant module contains instantiations of providers and modules with parameters.

The modules imported by the tenant modules are the ones that actually stand up the infrastructure.

Variables are used, but no external parameters files are used at any level (except for testing).

All of the modules are versioned with git tagged releases so the correct version can easily be imported.

Couple this with a single remote state provider in the root module and throw it in a CI/CD pipeline and you have a gitops driven infrastructure as code pipeline.
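Pinning a module to a tagged release, as described above, might look like this. The repository URL, tag, and input variable are hypothetical.

```hcl
module "network" {
  # ?ref= pins the module to a specific git tag.
  source = "git::https://example.com/infra/terraform-network.git?ref=v1.4.0"

  cidr_block = "10.0.0.0/16" # hypothetical input variable of that module
}
```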

maccard · 2y ago

We migrated from Terragrunt to Terraform as we thought the same thing. I have half a mind to go back.

Managing multiple environments is much easier in TG. State management in TF is kneecapped by the lack of variable support in backend blocks. I can only assume it's to encourage people into using terraform cloud for state management.
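For context, the usual workaround in plain Terraform is a partial backend configuration, with the per-environment values supplied at init time. A minimal sketch, with file names and values as assumptions:

```hcl
# backend.tf: the backend block cannot reference var.* or local.*,
# so the environment-specific settings are left out here.
terraform {
  backend "s3" {}
}

# envs/staging.s3.tfbackend (hypothetical file), supplied with:
#   terraform init -backend-config=envs/staging.s3.tfbackend
#
# bucket = "example-staging-tfstate"
# key    = "app1/terraform.tfstate"
# region = "us-east-1"
```

It works, but the environment selection moves out of the configuration and into whatever wraps `terraform init`, which is the gap Terragrunt fills by generating the backend config for you.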

yellowapple · 2y ago

Terragrunt shines in cases where you have independent sets of Terraform state, especially if they are dependencies/dependents of one another.

For example, say you're using Terraform to manage AWS resources, and you've provisioned an Active Directory forest that you in turn want to manage with Terraform via the AD provider. Terraform providers can't dynamically pull things like needed credentials from existing state, so you end up needing two separate Terraform states: one for AWS (which outputs the management credentials for the AD servers you've provisioned) and one for AD (which accepts those credentials as config during 'terraform init').

Terragrunt can do this in an automated way within a single codebase, redefining providers and handling dependency/dependent relationships. I don't know of a way to do it in pure Terraform that doesn't entail manual intervention.

nwmcsween · 2y ago

Ideally you decouple this and store the creds in a key vault or whatever; this way you have to explicitly grant the service principal access to the KV secret. Decoupling usually fixes other issues as well: for example, refreshing expiring creds between service A and service B then gets coded into Terraform.

cube2222 · 2y ago · 4 in thread

The article recommends splitting up your state files for various advantages, but also expands into how to manage them later in a custom way.

I agree with the splitting, but based on the many home-grown automation systems I've seen around this, I'd really recommend you use one of the specialized CI/CD systems that are built around automating these kinds of workflows. Once you reach the "many state files" phase, you'll save a lot of engineering time this way.

They'll take care of, among other things, running the right state files, in the right order, with the right parameters. But they'll also take care of many other things you need to run Terraform at scale and with large numbers of engineers (happy to expand but don't want to kitchen-sink this comment).

Disclaimer: Take this with a sensible grain of salt, as I work at Spacelift[0] - one of the TACOS (and of course the one I'll shamelessly link and recommend!).

But really, don't use tools like Jenkins for this as you scale, it'll likely hurt you in the long run.

[0]: https://spacelift.io

swozey · 2y ago

I'm sure that you have no control over this but I really wish Spacelift would increase the cost of its cloud tier and lower the cost of Ent. I'm in the anti-goldilocks zone. Ent seems priced for large teams when I practically fit into the cloud offering sans missing a few required features.

Great product though from what I've experienced.

sausagefeet · 2y ago

Disclaimer: Co-founder of Terrateam.

For Terrateam[0], we have probably 70% of the enterprise offering but at around 1/10th the price. If there are any features that are deal breaker, feel free to reach out to me and we'll see what we can do. That being said, Spacelift is a much more luxurious piece of software than us. We are very utilitarian, but we have to rationalize that low price-point somehow.

[0] https://terrateam.io

carty7 · 2y ago

Hi swozey. Spacelift sales leader here. Let's have a conversation and I'll work with you to find the goldilocks zone that you are looking for. Grab a demo with us and mention this post and my name "Ryan". We can dive into the features you require.

cube2222 · 2y ago

Sorry to hear that! Pricing is hard.

If you haven't yet, please try talking to our sales team. There's usually a way to make all sides happy with some custom agreements - after all, we'd love for you to be able to use our product as much as you need.

pezh0re · 2y ago · 4 in thread

This is a great read, but I always seem to run into cases where I need to define something like a security group and then reference it when deploying ec2 instances. I'd love to decouple to reduce my plan time, but I haven't figured a way out as of yet.

To be fair, I haven't used terraform -chdir yet.

c0Re69 · 2y ago

Try Terragrunt https://terragrunt.gruntwork.io/docs/

OJFord · 2y ago

-chdir is useful, but nothing to do with this (it's literally just cd before running command).

As sibling said, use data source to read remote state outputs.

36chamber2y ago

not sure what you mean here.

Do you use the same security group for all of your instances?

I usually create a security group per group of related EC2 instances.

JohnMakin2y ago

you can pull it in via a data source, but then of course this creates a coupling between multiple modules/state files.
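A rough sketch of that, assuming the security group was created elsewhere and tagged `Name = "app"` (the tag value and the `ami_id` variable are hypothetical):

```hcl
variable "ami_id" {
  type = string
}

# Look up a security group that another stack owns, by tag.
# The plan fails if no group (or more than one) matches, which is
# exactly the coupling mentioned above.
data "aws_security_group" "app" {
  tags = {
    Name = "app"
  }
}

resource "aws_instance" "app" {
  ami                    = var.ami_id
  instance_type          = "t3.micro"
  vpc_security_group_ids = [data.aws_security_group.app.id]
}
```

Compared with reading remote state, this couples the two stacks through the cloud API rather than through the state backend, so the consumer needs read access to EC2 but not to the other stack's state file.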

time0ut2y ago· 3 in thread

I’ve been using Terragrunt [0] for the past three years to manage loosely coupled stacks of Terraform configurations. It allows you to compose separate configurations almost as easily as you compose modules within a configuration. It's got its own learning curve, but it's a solid tool to have in the toolbox.

Gruntwork is a really cool company that makes other tools in this space like Terratest [1]. Every module I write comes with Terratest powered integration tests. Nothing more satisfying than pushing a change, watching the pipeline run the test, and then automatically release a new version that I know works (or at least what I tested works).

[0] https://terragrunt.gruntwork.io/

[1] https://terratest.gruntwork.io/

mike_d2y ago

They seem very insistent on keeping things DRY but not explaining why. Does Terraform tend to cause water leaks?

raffraffraff2y ago

Terraform is supposed to let you write modular, reusable code. But because it's a limited DSL that lacks many "proper language" features (and occasionally breaks the rule of least surprise), there are several major impediments to fully data-driven Terraform. These ultimately result in copy/paste code, or tools like Terragrunt, which essentially wrap Terraform and perform the copy/pasta behind your back by generating that code for you.

Some minor examples:

- calling a module multiple times using `for_each` to iterate over data works, except if the module contains a "provider" block

- if you are deploying two sets of resources by iterating over data, terraform can detect dependency cycles where there are not any
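To illustrate the first point, here is a sketch of the usual workaround: keep the `provider` block in the root module and hand it to the child module through `providers`, so the child has no provider block of its own and `for_each` is allowed. The module path and its `name` input are hypothetical.

```hcl
# Provider configuration lives in the root module only.
provider "aws" {
  alias  = "use1"
  region = "us-east-1"
}

module "bucket" {
  source   = "./modules/bucket" # hypothetical local module
  for_each = toset(["logs", "assets"])

  # Static mapping: as far as I know, this cannot vary per each.key,
  # which is where the copy/paste tends to creep back in.
  providers = {
    aws = aws.use1
  }

  name = each.key # assumed input variable of the hypothetical module
}
```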

SgtBastard2y ago

DRY = Don’t Repeat Yourself.

badblock2y ago· 3 in thread

Some of this seems like old advice. Instead of having directories per environment, you should be using workspaces to keep your environments consistent so you don't forget to add your new service to prod.
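Roughly what I mean, as a sketch: one configuration, with per-environment differences keyed off `terraform.workspace`. The workspace names, sizes, and the `ami_id` variable are assumptions for the example.

```hcl
variable "ami_id" {
  type = string
}

locals {
  # One entry per workspace; names and values are illustrative.
  settings_by_workspace = {
    staging = { instance_type = "t3.small", instance_count = 1 }
    prod    = { instance_type = "m5.large", instance_count = 3 }
  }

  # Fails at plan time in any workspace not listed above (including
  # "default"), which guards against applying to an unknown environment.
  settings = local.settings_by_workspace[terraform.workspace]
}

resource "aws_instance" "app" {
  count         = local.settings.instance_count
  ami           = var.ami_id
  instance_type = local.settings.instance_type

  tags = {
    Environment = terraform.workspace
  }
}
```

You would then switch with `terraform workspace select prod` before planning; every workspace shares the same resources by construction.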

rcrowley2y ago

(Hi, I’m one of the authors of the article at the root of this thread.)

I’ve gone back and forth on workspaces versus more root modules. On balance, I like having more root modules because I can orient myself just by my working directory instead of both my working directory and workspace. Plus, I feel better about stuffing more dimensions of separation into a directory tree than into workspace names. YMMV.

36chamber2y ago

Do you always store modules in the same repo as the terraform itself?

Why not put them in separate repos that can be tagged and versioned and then referenced like below?

source = "git::https://bitbucket.org/foocompany/module_name.git?ref=v1.2"

dharmab2y ago

What do you think about multiple backends? It seems to be working well for me to have a single root module but with a separate backend configuration per environment.
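For context, a sketch of what that can look like with Terraform's partial backend configuration. The file names, bucket names, and keys are hypothetical.

```hcl
# main.tf: the backend block is left empty on purpose (partial configuration);
# the real settings are supplied at init time.
terraform {
  backend "s3" {}
}

# --- separate file: backends/staging.s3.tfbackend (hypothetical values) ---
# bucket = "example-terraform-state-staging"
# key    = "app/terraform.tfstate"
# region = "us-east-1"

# --- separate file: backends/prod.s3.tfbackend (hypothetical values) ---
# bucket = "example-terraform-state-prod"
# key    = "app/terraform.tfstate"
# region = "us-east-1"

# Pick the environment when initializing:
#   terraform init -reconfigure -backend-config=backends/staging.s3.tfbackend
```

The trade-off versus workspaces is that the selected environment lives in the local `.terraform` directory after `init`, so it pays to re-run `init -reconfigure` whenever you switch.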

swozey2y ago· 2 in thread

This was a good read but really if you already follow the common best practices of IAC/terraform/aws multi-account I don't think you're going to learn much.

The comments in here kind of made me think I was going to hop in and take away some huge wins I hadn't considered. But I have been working with Terraform and AWS for a very long time.

If you're unfamiliar with AWS multi-account best practices this is a good read.

https://aws.amazon.com/organizations/getting-started/best-pr...

bcjordan2y ago

I remember periodically coming across services/platforms that purport to make setting up secure AWS accounts / infra configuration easier and default secure — anyone know what I may be thinking of?

swozey2y ago

Actually the article here is one of those options - https://substrate.tools/

I don't know how integrating this into an environment where you already have tons of AWS accounts would go but it's interesting. Thankfully I only have to make new accounts when we greenfield a service and that's maybe a yearly thing.

aloknnikhil2y ago· 1 in thread

It seems to me, this is trying really hard to shoehorn Terraform into managing at scale. For multi-account, multi-org, multi-region, multi-cloud deployments is Terraform really supposed to be the state of the art? How do you even get visibility into the various deployment workflows?

Too2y ago

What’s the alternative?

bickfordb2y ago· 1 in thread

Aside from reducing the blast radius of any Terraform state (split by envs, then by teams as you grow), I highly recommend using cdktf with Python for Terraform projects. Huge timesaver

333throwaway3422y ago

I don't know.

I kind of think using a language with native JSON support and a structural type system would be best.

HCL also just works.

abledon2y ago· 1 in thread

Why doesn't Hashicorp provide official best practices like this?

333throwaway3422y ago

This isn't a rated E for everyone practice.

Hashicorp is focused on CI/CD/cloud/workspace driven workflows over monorepo `chdir` driven.

paulddraper2y ago· 1 in thread

What does the phrase "stamp out" mean in this context?

jbjohns2y ago

Rapidly create exact duplicates I think.

waffletower2y ago

While combining the word "best" with "Terraform" in a sentence is more than likely to result in an oxymoron, it is counter-productive not to attempt to organize and utilize Terraform as elegantly and DRY as possible. We interact with stacks (which we typically call projects) via Terragrunt and have a very large surface of modules, as we have a fair number of infrastructure pieces. But we also try to expose Terraform infrastructure changes by use of Atlantis; though bulky, GitHub does provide a reasonable means to discuss and manage changes made by multiple teams. The use of modules also helps us encapsulate infrastructure, and state problems are rare with these approaches, but the data sprawl inherent to Terraform is very unwieldy regardless of so-called "best" practices. The language features are weak, awkward and directly encourage repetition and specification bloat. We have had some success using Data Sources to move logic outside of Terraform and provide much needed sanity when interacting with very verbose infrastructure such as Lake Formation.

Terretta2y ago

This should be mandatory reading for anyone doing IaC, using TF and AWS or not, less for how you do it, more for what and why.

// shout out to AWS CAB alums

gerl1ng2y ago

The solution at the end almost looks like the manual setup of terragrunt which we are using to manage lots of base infra in many different accounts.

What would be interesting here would be to see how they actually reference the outputs from one layer in the next layer. That is something that is not even solved nicely in terragrunt and one of the major annoyances for me there. Using dependencies and the mock_outputs option creates lots of noise in the plan outputs, as the dependencies are only completely resolved when terragrunt applies all the modules.
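For readers who haven't seen it, this is roughly the Terragrunt pattern I mean, as a sketch of a `terragrunt.hcl` for an app layer that depends on a VPC layer. The paths, the `vpc_id` output name, and the mock value are placeholders.

```hcl
# terragrunt.hcl for the "app" layer (paths and names are hypothetical).
dependency "vpc" {
  config_path = "../vpc"

  # Used only while the vpc layer has no real outputs yet, e.g. on a
  # first plan across all layers. These placeholder values are what
  # shows up as noise in the plan.
  mock_outputs = {
    vpc_id = "vpc-00000000"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  # Passed to the Terraform module as var.vpc_id.
  vpc_id = dependency.vpc.outputs.vpc_id
}
```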

But it seems I also missed a few additions to terraform - so probably there are better ways to take outputs from one terraform run into another one.

ary2y ago

> At scale, many Terraform state files are better than one. But how do you draw the boundaries and decide which resources belong in which state files? What are the best practices for organizing Terraform state files to maximize reliability, minimize the blast-radius of changes, and align with the design of cloud providers?

1000% agree. I put together my version of standing up remote state in AWS into a GitHub repo. https://github.com/aryounce/terraform-aws-bootstrap

Our use of Terraform splits state exactly as described primarily to keep the state refresh times reasonable.

RulerOf2y ago

> Using the -target option to terraform plan is discouraged (the Terraform documentation says, “this is for exceptional use only”). Anyway, it’s likely to lead to confusing infrastructure states if changes are applied incrementally with ad hoc, situational boundaries.

We've been using -target for years, and while I understand very well the reasons it's discouraged, it is pretty much the only way you can have your cake and eat it too with respect to having "one large terraform project" and not running into terraform refreshes that make your eyes bleed.

You end up having to really understand your module structure to use it, but it let us develop some very elegant workflows around tasks like patching.

We developed a ruby code base that utilizes rake and the hcl2json tool to automate terraform-based infrastructure workflows, using various libraries to handle and validate that applications are happy with what terraform is doing to their servers while it works.

This gives us flexibility to run automations safely against a terraform code base that has been evolving since version 0.3 or so, before most of the mistakes were made often enough to come up with the best practices we have today.

datadeft2y ago

I settled on 1 subfolder, 1 “stack” (stage/app, for example dev/login/frontend). This gives us fast deploy times and an easy, painless way to destroy/re-create. Databases could go in a separate folder and state, if we had any.

The point of Terraform is to have configuration in version control, not to have a giant unmanageable state file.

36chamber2y ago

I'm surprised the blog advocates embedded modules instead of remote ones stored in separate git repos. Remote modules let you tag and version them, and therefore progressively update modules.
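As a sketch of the progressive update: each environment's root module pins its own tag, so staging can try a new release while prod stays put. The `v1.3` tag and the directory layout are hypothetical; the URL is the same placeholder used earlier in the thread.

```hcl
# --- environments/staging/main.tf (separate root module) ---
# Staging picks up the newer tag first.
module "module_name" {
  source = "git::https://bitbucket.org/foocompany/module_name.git?ref=v1.3"
}

# --- environments/prod/main.tf (separate root module) ---
# Prod stays on the known-good tag until staging has soaked:
#
# module "module_name" {
#   source = "git::https://bitbucket.org/foocompany/module_name.git?ref=v1.2"
# }
```

Promoting the change to prod is then a one-line diff to the `ref`, which is easy to review and easy to revert.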

smetj2y ago

The problem with TF is that it lures people into trying to be smart and beat the system, after which things often become a personal challenge instead of a business requirement. A true nightmare for the next person in line. Also ... every declarative language dreams of becoming a programming language ...

de_keyboard2y ago

If many different "states" are used for one big architecture, how are the boundaries between them managed?

Under one state, Terraform will spot issues here during `plan`, but with many states issues will only appear after `apply`.

ochoseis2y ago

We’ve had a pretty good experience with Terraspace at work, which is an opinionated framework/layout for Terraform. It supports hooks and splitting state between regions and accounts.
