Introduction

Terraform is an infrastructure as code tool that you can run as SaaS or self-managed. It is designed to be pluggable via providers, and many cloud vendors offer providers for Terraform, such as AWS, GCP and Azure. But it is not limited to them: Kubernetes and Cloudflare have providers for Terraform too. See the full list here.

Terraform state

Terraform only knows about what is in the Terraform state. If resources are not in the state file, Terraform will try to create them. If they already exist, the provider will let you know, and Terraform fails.

State is stored either locally, in the directory where you ran terraform init, or remotely, depending on your backend configuration. For both Terraform Cloud and self-managed setups you must specify the backend.
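
As a minimal sketch, this is what a remote backend configuration can look like, here using the S3 backend; the bucket name and key are hypothetical:

    terraform {
      backend "s3" {
        # Hypothetical bucket and key; use your own secured bucket.
        bucket = "my-terraform-states"
        key    = "network/terraform.tfstate"
        region = "eu-west-1"
      }
    }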

It is best practice to store the state somewhere remote and secure, and secure means really secure: state files can contain secrets in plain text. Only Terraform should modify the state. Performing manual edits, or edits via terraform state ..., is a dangerous game, and should only be done if you know exactly what you are doing.

Big vs. small

Big state

The easiest way to use Terraform is with one big state file. You have everything you need close by: you can reference resources across providers, projects, accounts and whatnot. But this poses a challenge of its own. Things slow down, since Terraform has to refresh every resource in the state on each plan.

Pros:

  • Ease of use
  • Ease of access to other resources

Cons:

  • Larger states have lower performance
  • Blast radius is huge in case of mishaps

Small states

A more complex way to use Terraform is with many small states, where one state file manages only a handful of resources. This reduces the blast radius of potential mishaps and improves your Terraform performance. But it also limits your reach to resources in other states that you might depend on. One way to solve this is the remote state data source, though a potential drawback is that it does not always behave as you expect.
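
As a sketch, assuming the network state lives in the S3 bucket from earlier and exposes a vpc_id output, reading it from another state looks roughly like this:

    data "terraform_remote_state" "network" {
      backend = "s3"

      config = {
        bucket = "my-terraform-states"
        key    = "network/terraform.tfstate"
        region = "eu-west-1"
      }
    }

    # Only root-level outputs of the other state are reachable,
    # which is one way this can surprise you.
    locals {
      vpc_id = data.terraform_remote_state.network.outputs.vpc_id
    }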

Pros:

  • Performance
  • Blast radius is small in case of mishaps

Cons:

  • Complex to use
  • Lack of access to other resources
  • Backend configuration potentially complex

The in-between

During my use of Terraform I've seen both approaches: some go with one large state, others with many small ones. And for both approaches I saw that neither extreme was quite right.

One camp uses and abuses the "one state" because it is simple, but is it? The other camp goes "Yea... I.. We... We have everything in loosely coupled components", without a clear understanding of why they have so many small states.

Today I think we should use the best of both worlds: keep your baseline in one "bigger" state, and your additions in separate states. The baseline is your core infrastructure, and the additions are the various components you run on top of it.

Example:

Baseline:

  • VPC
  • Subnet
  • Gateways
  • Hosted zones (DNS)

Additions:

  • Kubernetes
  • Databases
  • Caching clusters
  • Serverless Applications

Your DNS records would then be managed via external-dns in Kubernetes, or from within the addition itself.
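
For the additions to reach the baseline, the baseline state would expose its core resources as outputs; a sketch with hypothetical resource names:

    # Baseline state: expose core infrastructure for the additions,
    # for example via the remote state data source shown earlier.
    output "vpc_id" {
      value = aws_vpc.main.id
    }

    output "private_subnet_ids" {
      value = aws_subnet.private[*].id
    }

    output "hosted_zone_id" {
      value = aws_route53_zone.main.zone_id
    }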

Note that I deliberately avoided the word "module" here, since it could be confused with Terraform modules.

Migrating resources between states

Migrating Terraform resources from state A to state B is a tedious process, and even the smallest mistake can have dire consequences, no matter how experienced you are. As I shall point out in a bit.
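
To give an idea of the mechanics, here is a minimal sketch of moving a single resource between two states with the terraform CLI; the stack directories and resource address are hypothetical:

    # Pull both remote states down to local files.
    terraform -chdir=stack-a state pull > a.tfstate
    terraform -chdir=stack-b state pull > b.tfstate

    # Move one resource between the local state files.
    terraform state mv \
      -state=a.tfstate \
      -state-out=b.tfstate \
      aws_route53_zone.main aws_route53_zone.main

    # Push the modified states back to their backends.
    terraform -chdir=stack-a state push a.tfstate
    terraform -chdir=stack-b state push b.tfstate

Move the matching resource block in your code as well, and run terraform plan on both stacks afterwards: both should report no changes.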

I've done several migrations of Terraform resources, almost all of them successful. The biggest one was consolidating the baseline architecture of an AWS account creation tool for one of my past clients. That setup started out with about 20 Terraform states, and the CI/CD pipeline orchestrating it all was failing multiple times a day. Most of these failures were related to cross-state dependencies: not the remote state data sources themselves, but the CI process failing on one particular account due to some timing issue. Each account was separated by use of Terraform workspaces.

In the end we moved many resources for a hundred accounts without losses, going from many states to one big state per account. Although this increased the blast radius, it also simplified the code, and the CI/CD pipeline actually improved in performance and stability. And that was the goal.

The most recent one was tearing apart my own Terraform state. I went from one big state containing all my AWS, GCP and Cloudflare resources, as described in the post My Cloud Infrastructure, to a many-state setup where each state contains the resources of one particular account. And during that migration I screwed up 1 of the 5 migrations I did that day.

If you make a mistake and you lose a resource, keep in mind that this can happen. You can always import it back into your new state, if the resource allows you to do so. If it doesn't, well, good luck... It is then up to your infrastructure stack to allow recovery from such an incident.
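
Importing means pointing Terraform at the real-world object; a hypothetical example for an S3 bucket, where the ID is simply the bucket name:

    # Hypothetical: re-attach an existing S3 bucket to the new state.
    terraform import aws_s3_bucket.logs my-log-bucket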

In my case the resource google_service_account_key does not allow an import. Luckily my CI/CD process takes care of key rotation, so a lost key resource is easy to recover from.

Tips

  • Always take your time with migrations. Do not rush it. Double-check your work.
  • Script / code your migration process. Moving resources manually is error-prone.
  • Do not do premature optimisations. They’ll bite you later.
  • Be sensible in your state splitting.
  • One big state or many small ones: neither is a bad choice!
  • Do not take my word for it, I’m no expert. This is just what I learned.