Terraform
Terraform is one of the most widely used provisioning tools and has great community support. We chose it over the alternatives for several reasons: its popularity, its declarative language, the fact that it doesn't need any additional element such as an agent, and its immutable infrastructure paradigm.
Over the years, the community has been improving projects developed with this tool. New techniques have been added and the project structure has changed to improve both maintainability and usability. In particular, ArCo's architecture follows the principles that Yevgeniy Brikman describes in his articles about this tool, as well as in his book "Terraform: Up & Running".
The main purpose of this guide is to describe the most common problems when working with Terraform and how we have solved them. It is not a guide to learn Terraform; if you are looking for that, we strongly recommend reading "Terraform: Up & Running" and the official Terraform documentation.
Naming conventions
General conventions
- Use _ (underscore) instead of - (dash) in all: resource names, data source names, variable names, outputs.
- Beware that actual cloud resources have many hidden restrictions in their naming conventions. Some cannot contain dashes, some must be camel cased. These conventions refer to Terraform names themselves.
- Only use lowercase letters and numbers.
Resource and data source arguments
- Do not repeat the resource type in the resource name (neither partially nor completely):
  - Good: resource "aws_route_table" "public" {}
  - Bad: resource "aws_route_table" "public_route_table" {}
  - Bad: resource "aws_route_table" "public_aws_route_table" {}
- Name a resource this if there is no more descriptive and general name available, or if the resource module creates a single resource of this type (e.g., there is a single resource of type aws_nat_gateway and multiple resources of type aws_route_table, so aws_nat_gateway should be named this and the aws_route_table resources should have more descriptive names, like private, public, database).
- Always use singular nouns for names.
- Use - inside argument values and in places where the value will be exposed to a human (e.g., inside the DNS name of an RDS instance).
- Include the count argument inside resource blocks as the first argument at the top, and separate it from the rest with an empty line. See example.
- Include the tags argument, if supported by the resource, as the last real argument, followed by depends_on and lifecycle, if necessary. All of these should be separated by a single empty line. See example.
- When using a condition in the count argument, use a boolean value if it makes sense; otherwise use length or another interpolation. See example.
- To invert a condition, don't introduce another variable unless really necessary; use 1 - boolean value instead. For example, count = "${1 - var.create_public_subnets}".
Code examples of resource
Usage of count
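The original snippet is not preserved here, so the following is a minimal sketch of the convention (the resource and the private_subnets variable are illustrative): count goes first, separated from the rest by an empty line.

resource "aws_route_table" "private" {
  # count is the first argument, followed by an empty line
  count = length(var.private_subnets)

  vpc_id = var.vpc_id
}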
Placement of tags
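Again a minimal sketch (the internet gateway reference and the tag values are illustrative): tags is the last real argument, followed by depends_on and lifecycle, each separated by an empty line.

resource "aws_route_table" "public" {
  count = length(var.public_subnets)

  vpc_id = var.vpc_id

  # tags is the last "real" argument of the resource
  tags = {
    Name = "public"
  }

  # meta-arguments come after tags, separated by empty lines
  depends_on = [aws_internet_gateway.this]

  lifecycle {
    create_before_destroy = true
  }
}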
Variables
- Don't reinvent the wheel in resource modules: use the same variable names, descriptions and defaults as defined in the "Argument Reference" section of the resource you are working on.
- Omit the type = "list" declaration if there is also a default = [].
- Omit the type = "map" declaration if there is also a default = {}.
- Use the plural form in the names of variables of type list and map.
- When defining variables, order the keys: description, type, default.
- Always include a description for all variables, even if you think it is obvious.
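As a small sketch of these conventions (the variable itself is hypothetical): the name is plural because the value is a list, the type declaration is omitted because there is a default = [], and the description is always present.

variable "private_subnets" {
  description = "List of CIDR blocks for the private subnets"
  default     = []
}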
Outputs
The names of the outputs are important to make them consistent and understandable outside of their scope (when a user is using a module, it should be obvious what type and attribute of the value is returned).
- The general recommendation for output names is that they should be descriptive of the value they contain and less free-form than you would normally want.
- A good structure for output names is {name}_{type}_{attribute}, where:
  - {name} is the resource or data source name without the provider prefix: {name} for aws_subnet is subnet, for aws_vpc it is vpc.
  - {type} is the type of the resource or data source.
  - {attribute} is the attribute returned by the output.
- If an output returns a value built with interpolation functions and multiple resources, {name} and {type} should be as generic as possible (this is often the most generic and should be preferred). See example.
- If the returned value is a list, it should have a plural name. See example.
- Always include a description for all outputs, even if you think it is obvious.
Code examples of output
Return at most one security group ID:
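The original snippet is missing, so here is a minimal sketch of such an output, assuming a security group resource named this that is created with count (and may therefore not exist):

output "this_security_group_id" {
  description = "The ID of the security group"
  # returns the single ID, or an empty string if the resource was not created
  value       = element(concat(aws_security_group.this.*.id, [""]), 0)
}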
When there are multiple resources of the same type, this should still be preferred for the main one, and the resource name should be part of the output name for the others; for example, an output called another_security_group_id should instead be named web_security_group_id:
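A sketch under the same assumptions, this time for a security group resource named web:

output "web_security_group_id" {
  description = "The ID of the web security group"
  value       = element(concat(aws_security_group.web.*.id, [""]), 0)
}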
Use a plural name if the returned value is a list:
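An illustrative sketch (the aws_rds_cluster_instance resources are hypothetical); the output name is plural because a list is returned:

output "rds_cluster_instance_endpoints" {
  description = "List of all DB instance endpoints"
  value       = aws_rds_cluster_instance.this.*.endpoint
}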
Conditions in output
There are two resources of type aws_db_instance, named this and this_mssql, where at most one of them can be created at the same time.
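The original snippet is missing; a sketch of such a conditional output, assuming both resources use count so that at most one of the two lists is non-empty:

output "this_db_instance_id" {
  description = "The ID of the DB instance"
  # only one of the two lists will contain an element
  value       = element(concat(aws_db_instance.this.*.id, aws_db_instance.this_mssql.*.id, [""]), 0)
}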
Structure
The structure used in Terraform projects is very important in order to apply the good practices we will talk about later. Any project should follow the file layout below. It is a top-down structure: environments → kind of resource → resource.
global (1)
    iam
    s3
    dlm
live (2)
    vpc (4)
        bastion
        dns
        network
    data-stores (5)
        postgres
        redis
    services (6)
        frontend-app
        microservice-1
        microservice-2
nlv (2)
    ci-cd (3)
        vpc (4)
            bastion
            dns
            network
        services (6)
            jenkins
            nexus
            sonar
    sublive (3)
        vpc (4)
            bastion
            dns
            network
        data-stores (5)
            postgres
            redis
        services (6)
            frontend-app
            microservice-1
            microservice-2
    prelive (3)
        vpc (4)
            bastion
            dns
            network
        data-stores (5)
            postgres
            redis
        services (6)
            frontend-app
            microservice-1
            microservice-2
| 1 | global stores the common resources used in every environment, e.g. S3, IAM… |
| 2 | For each environment we create a folder to place all its particular resources. We usually have two: live and nlv (non-live). |
| 3 | For those environments that contain multiple subenvironments we create a new working group for each of them. In our architecture we have created three subenvironments in nlv: ci-cd, sublive and prelive. |
| 4 | vpc stores all the resources of the network topology for a given environment. |
| 5 | If a given environment uses data stores like Postgres or Redis, a new folder called data-stores will be created to hold all of them. |
| 6 | Applications or microservices deployed in the environment will be created inside the services folder. |
Terraform state
Terraform is able to know the state of the infrastructure at any time because it keeps a file with the state applied in the last execution, i.e. whenever we create, update or delete a resource. When Terraform deploys a new configuration it recovers the previous state, compares it with the desired state and updates the environment taking only the steps needed to reach that result. We only have to write what we want to have deployed and Terraform will decide whether it needs to create, update or delete any resource to achieve it.
| Terraform is ignorant of any external modification we force on a resource. It is highly recommended that any desired change is performed using Terraform. However, if we absolutely have no other choice than creating a resource externally, we have to import it into the state by using the terraform import command. |
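As a hypothetical example of such an import (the resource address and bucket name are illustrative):

terraform import aws_s3_bucket.logs my-existing-logs-bucket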
We have to be extra careful when managing the Terraform state because it is the cornerstone of everything. There are some problems when multiple developers use Terraform at the same time:
- The state has to be shared and kept up to date with the latest version by everyone in the team.
- Race conditions can occur if the state is not locked to prevent parallel executions.
Terraform solves these problems through a remote state storage shared by all team members and by configuring state locking, as we are doing in the following example:
terraform {
  backend "s3" {
    # Replace this with your bucket name!
    bucket         = "terraform-example-state"
    key            = "global/s3/terraform.tfstate"
    region         = "us-east-2"
    dynamodb_table = "terraform-example-locks"
    encrypt        = true
  }
}
| The storage used to hold the Terraform state should have encryption enabled to prevent any malicious attacker from getting access to the information. It should also have S3 versioning enabled in order to keep a history of changes and allow a rollback in case we need it. |
With this configuration we solve all the collaboration problems. However, there are still problems due to the lack of isolation. If the state of every environment is stored in the same file, any error in one of them can compromise the others. For example, someone deploying a new version of a microservice in sublive could affect the production environment or, even worse, the state file could get corrupted and break the whole infrastructure of all environments.
There are two options we can follow to avoid this:
- isolation using Terraform workspaces
- isolation using file layout
In ArCo we are using the second approach. Even though Terraform allows using multiple workspaces, where each of them would hold the state of just one environment, we have decided not to use them for various reasons:
- All states are stored in the same bucket, which means that all of them can be accessed using the same credentials.
- Workspaces are not visible in the code or in the terminal, so we can't automate anything and we are prone to manual errors such as deploying or deleting resources in an undesired environment.
Isolation using file layout is closely related to the structure we have defined previously. Thanks to this approach we get:
- physical isolation, by creating a different folder per environment.
- a way to configure each state store differently, so each of them uses its own authentication and access control system. For example, each environment could have a different AWS account with its own S3 bucket.
By using file layout we have a cleaner folder structure that allows us to easily know what is deployed in each environment and reduces the configuration shared between them.
State isolation is not only applied to environments but also at lower levels. States are managed at the component level. Each deployed component is a set of resources that interact with each other, and they should be understood as a single, isolated element, so the implicit risk of any deployment is reduced. If, for example, a new microservice is deployed and there is any problem with it, no other component of the infrastructure will be affected. Rollback policies are a lot simpler this way.
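As an illustrative sketch of component-level state (reusing the backend values of the earlier example, which are themselves placeholders), each component points its backend at its own state file:

# live/services/microservice-1
terraform {
  backend "s3" {
    bucket         = "terraform-example-state"
    key            = "live/services/microservice-1/terraform.tfstate"
    region         = "us-east-2"
    dynamodb_table = "terraform-example-locks"
    encrypt        = true
  }
}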
DRY (Don't Repeat Yourself)
Most of the time the infrastructure will have multiple instances of the same element, or even different components that share a lot of common code. Let's suppose we have the following structure:
live
    services
        microservice-1
            outputs.tf
            main.tf
            locals.tf
            vars.tf
        microservice-2
            outputs.tf
            main.tf
            locals.tf
            vars.tf
99% of the deployment configuration of both microservices will be exactly the same. Only a small part of the configuration, such as which AMI to use or the autoscaling settings, may differ.
IaC is code and it should be treated as such. The way to apply DRY with Terraform is by using modules: small libraries that receive input parameters and generate output parameters.
In the following example we have created a module that generates a microservice, receiving as parameters only the small part of the configuration that isn't common.
modules
    services
        microservice
            outputs.tf
            main.tf
            locals.tf
            vars.tf
live
    services
        microservice-1
            main.tf
        microservice-2
            main.tf
This way, each service uses the microservice module, configuring the specific values for that service.
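A minimal sketch of what one of those main.tf files could look like (the module path and the input variables are hypothetical):

# live/services/microservice-1/main.tf
module "microservice_1" {
  source = "../../../modules/services/microservice"

  # only the non-common part of the configuration is passed in
  name          = "microservice-1"
  ami_id        = "ami-0123456789abcdef0"
  instance_type = "t3.micro"
  min_size      = 1
  max_size      = 3
}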
Modules
Module repository structure
As best practice for repository structure, each repository containing Terraform code should be a manageable chunk of infrastructure, such as an application, service, or specific type of infrastructure (like common networking infrastructure).
In ArCo we are using the following module repository structure:
modules
    aws (1)
        networking (2)
            module-base-vpc (3)
            module-base-sg
            module-base-alb
            module-base-nlb
        compute
            module-base-asg
            module-base-gitlab-runner-manager
    azure
        networking
            module-base-vnet
| 1 | We have a first-level categorization of the modules by provider. These will mainly be cloud providers like aws or azure. Each of them will be a repository subgroup in the modules subgroup. |
| 2 | We have a second-level categorization of the modules within each provider. The major categories are networking, compute, database and security. These categories will be repository subgroups in the specific provider subgroup. |
| 3 | The module repository will be created inside the most specific category subgroup. Examples of modules are module-base-vpc or module-base-alb. |
Module structure
Just as in the rest of the Terraform project, ArCo defines a standard structure when creating modules.
Terraform files
Inside each module we can find the following Terraform files:
networking
    module-base-vpc
        locals.tf (1)
        variables.tf (2)
        data.tf (3)
        main.tf (4)
        outputs.tf (5)
compute
    module-base-gitlab-runner-manager
        01-iam.tf
        02-security-groups.tf
        locals.tf
        variables.tf
        data.tf
        main.tf
        outputs.tf
| 1 | locals.tf: file that holds all the local variables. |
| 2 | variables.tf: file that defines the input parameters. |
| 3 | data.tf: file that defines the information that needs to be retrieved from AWS via the data sources provided by Terraform. |
| 4 | main.tf: file where we can find the creation of the resources needed by the module. |
| 5 | outputs.tf: file that defines the output parameters. |
Sometimes we can split the main.tf file if it is too large. This happens, for example, in module-base-gitlab-runner-manager, where we have extracted the IAM and Security Group resource configuration into separate files.
Non-Terraform files
There are also non-Terraform files and directories in the module structure.
networking
    module-base-vpc
        examples/ (1)
            nonprod/
            prod/
        tests/ (2)
            module_test.go
        go.mod (3)
        ...
| 1 | examples/: directory that holds module implementations that will be used as test cases in the E2E testing step. It is recommended to define at least two test cases: one with a basic module usage, and another with more complex functionality. |
| 2 | tests/: directory that holds the module_test.go file, in which the execution and validation of our E2E test cases is defined. See module testing. |
| 3 | go.mod: file that initializes the project as a Go module. It contains the module name and its dependencies. |
Module versioning
Using modules helps with reuse and maintainability, but we need to deal with the isolation of environments. If a given module is being used in both the prelive and live environments, any change will affect both of them and every deployment made from that moment on. This makes the testing process difficult because we can't try new changes just in prelive.
The approach chosen in ArCo is to resolve these situations using versioning. This way, each environment uses the version it needs. We can have the prelive environment using version 0.2.0 and the production environment using version 0.1.0. If we encounter a bug, it will only affect the environments using that version and no others.
Terraform allows downloading modules remotely and is compatible with Git, Mercurial and HTTP URLs. We only need to use the source attribute to indicate where Terraform can find the desired version of the module.
module "example" {
source = "git@{git-url}:iac/terraform/modules/foo/bar/module.git?ref=v0.0.1"
...
instance_type = "t2.mciro"
...
}
Module dependencies
When working with Terraform it is quite common for modules to have dependencies between themselves. E.g. if we want to deploy a microservice, we must have already created all the network infrastructure where the microservice will live. In this case the microservice module needs the VPC ID and the subnet IDs.
The way to share information between modules is by using the components' state and the output parameters. Let's see this with an example. We said that we need information from the VPC module, so we have to expose it in the output parameters.
# networking/module-base-vpc/outputs.tf
output "private_subnets" {
description = "List of IDs of private subnets"
value = aws_subnet.private.*.id
}
Now the microservice module needs access to the state of the VPC module.
# compute/ec2/data.tf
data "terraform_remote_state" "vector_vpc" {
backend = "s3"
config = {
bucket = "example_bucket"
key = "nvl/sublive/vpc/terraform.tfstate"
region = "eu-central-1"
}
}
With this we can now get any of the output parameters that are defined there:
# compute/ec2/main.tf
resource "aws_instance" "vector_instance" {
...
subnet_id = data.terraform_remote_state.vector_vpc.outputs.vector_subnet_id
...
}
| It is important to note that the remote state data sources only have read permissions, so even though a module can query the state of another, it can't modify it. This way the immutability of the modules is guaranteed. |
Module testing
Why testing?
The DevOps world is full of fear: fear of outages, security breaches, data loss, fear of change…
"Fear leads to anger. Anger leads to hate. Hate leads to suffering.
Many DevOps teams deal with this fear deploying less frequently, and these just make the problem worse. There’s a better way to deal with this fear: automated tests. Automated tests give you the confidence to make changes, so we can fight fear with confidence.
Static analysis
In this testing phase we’ll test the code without deploying it in order to find syntactic and structural issues, and catch common errors.
In ArCo we are using the following tools for static analysis:
- Go testing:
  - go fmt: to check that every Go file in the tests folder has been previously formatted
  - go vet: to examine Go source code and report suspicious constructs, such as Printf calls whose arguments do not align with the format string
- Terraform testing:
  - terraform fmt: used to rewrite Terraform configuration files to a canonical format and style
  - terraform validate: to verify that a configuration is syntactically valid and internally consistent, regardless of any provided variables or existing state
  - tflint: helps to find provider-specific invalid states in your code, like incorrect instance_type attributes and so on
Unit tests
Unit testing refers to tests that verify the functionality of a specific section of code, in this case, to test an isolated Terraform module.
In order to help us in the process of automating Terraform unit tests, ArCo uses Terratest, a Go library that makes it easier to write automated tests for infrastructure code. It provides a variety of helper functions and patterns for common infrastructure testing tasks.
See writing unit tests.
Module development
Software requirements
- Terraform (an open source tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned)
  - Tools:
    - Terragrunt (a thin wrapper for Terraform that provides extra tools for working with multiple Terraform modules)
    - tfenv (a version manager, used to quickly change between Terraform versions)
    - tflint (a Terraform linter, focused on possible errors, best practices, etc.)
    - Terraform Docs (a utility to generate documentation from Terraform modules in various output formats)
- Golang (an open source programming language that makes it easy to build simple, reliable, and efficient software, used mainly in our projects for running tests)
- aws-nuke (for fully deleting all the resources of an AWS account, useful for cleaning up messy test accounts after failed Terraform runs)
CI/CD
In cloud and IaC projects, continuous integration and continuous delivery (CI/CD) are vital to create an agile environment where new incremental changes are added automatically while maintaining quality. This means that every code change uploaded to the repository will be built, tested and deployed immediately to the infrastructure. There are several technologies to achieve this.
The tool used to execute the CI/CD pipeline in ArCo Terraform module projects is Gitlab CI. We have defined a set of stages with the purpose of being used by projects of a different nature.
This pipeline is built on top of three small flows.
The first of these flows is triggered by any commit made on any branch:
- Static analysis: in this stage several tests are executed in order to comply with the required module quality. See static analysis in the module testing docs.
The second flow is triggered whenever a new commit is pushed to any branch with an open Merge Request:
- Unit tests: in this stage, the Go test files in the tests folder are executed and a full module unit test is performed in order to ensure that the module works as expected. See unit tests in the module testing docs.
The last flow runs only upon manual interaction, when a Merge Request is merged into the master branch and the Static analysis and Unit tests pipeline jobs have ended successfully:
- Release: in this stage we start the release process using the Semantic Release tool and a custom Terraform Modules Semantic Release shareable config package:
  - Analyze the commits and determine the next module release version based on the Conventional Commits messages.
  - Replace the current module version in the locals.tf file with the detected next release version.
  - Update or create the CHANGELOG.md, adding the release notes based on the commit messages.
  - Create the updated module documentation by invoking the terraform-docs tool.
  - Commit and push the modified locals.tf, CHANGELOG.md and README.adoc files.
  - Tag and release the repository using the detected next module release version.
| Any change in the master branch that starts the pipeline should only be performed through the Merge Request process, once it has been validated by those with the Maintainer role. |