Terraform

Terraform is one of the most widely used provisioning tools and has strong community support. We chose it over the alternatives for several reasons: its popularity, its declarative language, the fact that it needs no additional components such as an agent, and its immutable infrastructure paradigm.

Over the years, the community has been improving projects developed with this tool. Techniques have been added and project structure has changed to improve both maintainability and usability. In particular, ArCo’s architecture follows the principles that Yevgeniy Brikman describes in his many articles about this tool, as well as in his book "Terraform: Up & Running".

The main purpose of this guide is to describe the most common problems encountered when working with Terraform and how we have solved them. It is not a guide for learning Terraform; if that is what you are looking for, we strongly recommend the book "Terraform: Up & Running" and the official Terraform documentation.

Naming conventions

General conventions

  1. Use _ (underscore) instead of - (dash) everywhere: in resource names, data source names, variable names and outputs.

    • Beware that actual cloud resources have many hidden restrictions in their naming conventions. Some cannot contain dashes, some must be camel cased. These conventions refer to Terraform names themselves.

  2. Only use lowercase letters and numbers.

Resource and data source arguments

  1. Do not repeat the resource type in the resource name (neither partially nor completely):

    • Good: resource "aws_route_table" "public" {}

    • Bad: resource "aws_route_table" "public_route_table" {}

    • Bad: resource "aws_route_table" "public_aws_route_table" {}

  2. A resource should be named this if there is no more descriptive and general name available, or if the resource module creates a single resource of this type (e.g., if there is a single resource of type aws_nat_gateway and multiple resources of type aws_route_table, aws_nat_gateway should be named this and the aws_route_table resources should have more descriptive names, like private, public, database).

  3. Always use singular nouns for names.

  4. Use - (dash) inside argument values and in places where the value will be exposed to a human (e.g., inside the DNS name of an RDS instance).

  5. Include the count argument inside resource blocks as the first argument at the top, and separate it from the other arguments with a newline. See the examples below.

  6. Include the tags argument, if supported by the resource, as the last real argument, followed by depends_on and lifecycle if necessary. All of these should be separated by a single empty line. See the examples below.

  7. When using a condition in the count argument, use a boolean value if it makes sense; otherwise use length or another interpolation. See the examples below.

  8. To invert a condition, don’t introduce another variable unless it is really necessary; use 1 - boolean value instead. For example, count = "${1 - var.create_public_subnets}".

Code examples of resource

Usage of count

resource "aws_route_table" "public" {
  count  = "2"

  vpc_id = "vpc-12345678"
  # ... remaining arguments omitted
}
resource "aws_route_table" "public" {
  vpc_id = "vpc-12345678"
  count  = "2"

  # ... remaining arguments omitted
}

Placement of tags

resource "aws_nat_gateway" "this" {
  count         = "1"

  allocation_id = "..."
  subnet_id     = "..."

  tags = {
    Name = "..."
  }

  depends_on = ["aws_internet_gateway.this"]

  lifecycle {
    create_before_destroy = true
  }
}
resource "aws_nat_gateway" "this" {
  count = "1"

  tags = "..."

  depends_on = ["aws_internet_gateway.this"]

  lifecycle {
    create_before_destroy = true
  }

  allocation_id = "..."
  subnet_id     = "..."
}

Conditions in count

  • count = "${length(var.public_subnets) > 0 ? 1 : 0}"

  • count = "${var.create_public_subnets}"

Variables

  1. Don’t reinvent the wheel in resource modules - use the same variable names, descriptions and defaults as defined in the “Argument Reference” section for the resource you are working on.

  2. Omit type = "list" declaration if there is default = [] also.

  3. Omit type = "map" declaration if there is default = {} also.

  4. Use the plural form in the names of variables of type list and map.

  5. When defining variables, order the keys: description, type, default.

  6. Always include a description for all variables, even if you think it is obvious.
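
As a reference, a variables.tf fragment following these conventions might look like the sketch below. The variable names and defaults are made up for illustration, and the quoted type syntax follows the pre-0.12 style used in the other examples of this guide:

variable "instance_type" {
  description = "The type of instance to start"
  type        = "string"
  default     = "t2.micro"
}

variable "private_subnets" {
  # type = "list" is omitted because default = [] already implies it
  description = "A list of private subnets inside the VPC"
  default     = []
}

variable "tags" {
  # type = "map" is omitted because default = {} already implies it
  description = "A map of tags to add to all resources"
  default     = {}
}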

Outputs

Output names are important to keep them consistent and understandable outside of their scope (when a user is using a module, it should be obvious what type and attribute of the value is returned).

  1. The general recommendation for output names is that they should describe the value they contain and be less free-form than you would normally want.

  2. A good structure for output names looks like {name}_{type}_{attribute}, where:

    1. {name} is the name of the resource or data source (for example this or web).

    2. {type} is the type of the resource or data source without the provider prefix; {type} for aws_subnet is subnet, for aws_vpc it is vpc.

    3. {attribute} is the attribute returned by the output.

    4. See the examples below.

  3. If an output returns a value built with interpolation functions from multiple resources, the {name} and {type} there should be as generic as possible (the name this is often the most generic and should be preferred). See the examples below.

  4. If the returned value is a list, it should have a plural name. See the example below.

  5. Always include a description for all outputs, even if you think it is obvious.

Code examples of output

Return at most one security group ID:

output "this_security_group_id" {
  description = "The ID of the security group"
  value       = "${element(concat(coalescelist(aws_security_group.this.*.id, aws_security_group.this_name_prefix.*.id), list("")), 0)}"
}

When there are multiple resources of the same type, the one named this should be the preferred source for the generic output (security_group_id below), while outputs tied to a specific resource should carry that resource’s name; for example, another_security_group_id should be named web_security_group_id:

output "security_group_id" {
  description = "The ID of the security group"
  value       = "${element(concat(coalescelist(aws_security_group.this.*.id, aws_security_group.web.*.id), list("")), 0)}"
}

output "another_security_group_id" {
  description = "The ID of web security group"
  value       = "${element(concat(aws_security_group.web.*.id, list("")), 0)}"
}

Use a plural name if the returned value is a list:

output "this_rds_cluster_instance_endpoints" {
  description = "A list of all cluster instance endpoints"
  value       = ["${aws_rds_cluster_instance.this.*.endpoint}"]
}

Conditions in output

There are two resources of type aws_db_instance, named this and this_mssql, of which at most one can be created at the same time.

output "this_db_instance_id" {
  description = "The RDS instance ID"
  value       = "${element(concat(coalescelist(aws_db_instance.this_mssql.*.id, aws_db_instance.this.*.id), list("")), 0)}"
}

Structure

The structure used in Terraform projects is very important in order to apply the good practices we will talk about later. Any project should follow the file layout below. It is a top-down structure: environments → kind of resource → resource.

global (1)
    iam
    s3
    dlm
live (2)
    vpc (4)
        bastion
        dns
        network
    data-stores (5)
        postgres
        redis
    services (6)
        frontend-app
        microservice-1
        microservice-2
nlv (2)
    ci-cd (3)
        vpc (4)
            bastion
            dns
            network
        services (6)
            jenkins
            nexus
            sonar
    sublive (3)
        vpc (4)
            bastion
            dns
            network
        data-stores (5)
            postgres
            redis
        services (6)
            frontend-app
            microservice-1
            microservice-2
    prelive (3)
        vpc (4)
            bastion
            dns
            network
        data-stores (5)
            postgres
            redis
        services (6)
            frontend-app
            microservice-1
            microservice-2
1 global stores the common resources used in every environment, e.g. S3, IAM…​
2 For each environment we create a folder to hold all of its particular resources. We usually have two: live and nlv (non-live).
3 For those environments that contain multiple subenvironments we create a new working group for each of them. In our architecture we have created three subenvironments in nlv: ci-cd, sublive and prelive.
4 vpc stores all the resources of the network topology for a given environment.
5 If a given environment uses data stores like Postgres or Redis, a new folder called data-stores is created to hold all of them.
6 Applications or microservices deployed in the environment are created inside the services folder.

Terraform state

Terraform always knows the state of the infrastructure because it keeps a file with the state applied in the last execution, i.e. whenever we create, update or delete a resource. When Terraform deploys a new configuration it recovers the previous state, compares it with the desired state and updates the environment, taking only the steps needed to reach that result. We only have to declare what we want deployed and Terraform will decide whether it needs to create, update or delete any resource to achieve it.
Terraform is unaware of any modification made to a resource outside of it, so it is highly recommended that any desired change is performed using Terraform. However, if we have no choice but to create a resource externally, we have to import it into the state using the terraform import command.

We have to be extra careful when managing the Terraform state because it is the cornerstone of the whole setup. Some problems arise when multiple developers use Terraform at the same time:

  • The state has to be shared and kept up to date by everyone on the team.

  • Race conditions can occur if the state is not locked to prevent parallel executions.

Terraform solves these problems through a remote state storage shared by all team members, combined with state locking, as we are doing in the following example:

terraform {
  backend "s3" {
    # Replace this with your bucket name!
    bucket         = "terraform-example-state"
    key            = "global/s3/terraform.tfstate"
    region         = "us-east-2"

    dynamodb_table = "terraform-example-locks"
    encrypt        = true
  }
}
The storage used to hold the Terraform state should have encryption enabled to prevent a malicious attacker from getting access to the information. S3 versioning should also be enabled in order to keep a change history and be able to roll back if we ever need to.
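
Both requirements can be provisioned with Terraform itself. The following is only a sketch, reusing the example names from the backend block above, and it assumes an AWS provider version that still supports the inline versioning and server_side_encryption_configuration blocks:

resource "aws_s3_bucket" "terraform_state" {
  bucket = "terraform-example-state"

  # Keep a full history of the state file so we can roll back if needed
  versioning {
    enabled = true
  }

  # Encrypt the state at rest
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  # Table used by the S3 backend for state locking
  name         = "terraform-example-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}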

With this configuration we solve the collaboration problems. However, other problems remain due to the lack of isolation. If the state of every environment is stored in the same file, any error in one of them can compromise the others. For example, someone deploying a new version of a microservice in sublive could affect the production environment, or, even worse, the state file could become corrupted and break the whole infrastructure of every environment.

There are two options we can follow to avoid this:

  • isolation using the Terraform workspaces

  • isolation using file layout

In ArCo we are using the second approach. Even though Terraform allows the use of multiple workspaces, each holding the state of just one environment, we have decided not to use them for various reasons:

  • All states are stored in the same bucket. That means we can access all of them using the same credentials.

  • Workspaces are not visible in the code or in the terminal, so we cannot automate around them and we are exposed to manual errors such as deploying or deleting resources in an unintended environment.

Isolation using file layout is closely related to the structure we have defined previously. Thanks to this approach we get:

  • a physical isolation by creating a different folder per environment.

  • a way to configure each state store differently, so that each of them uses its own authentication and access control system. For example, each environment could have a different AWS account with its own S3 bucket.

By using the file layout we get a cleaner folder structure that makes it easy to know what is deployed in each environment and reduces the configuration shared between them.
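
In practice, this means each environment folder carries its own backend configuration pointing to its own bucket. A minimal sketch, with hypothetical bucket and table names:

# live/services/frontend-app/main.tf
terraform {
  backend "s3" {
    bucket         = "terraform-live-state"
    key            = "live/services/frontend-app/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "terraform-live-locks"
    encrypt        = true
  }
}

# nlv/sublive/services/frontend-app/main.tf
terraform {
  backend "s3" {
    bucket         = "terraform-nlv-state"
    key            = "sublive/services/frontend-app/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "terraform-nlv-locks"
    encrypt        = true
  }
}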

State isolation is applied not only to environments but also at lower levels: states are managed at the component level. Each deployed component is a set of resources that interact with each other and should be understood as a single, isolated element, so the risk implicit in any deployment is reduced. If, for example, a new microservice is deployed and there is a problem with it, no other component of the infrastructure is affected. Rollback policies are a lot simpler this way.

DRY (Don’t Repeat Yourself)

Most of the time the infrastructure will have multiple instances of the same element, or even different components that share a lot of common code. Let’s suppose we have the following structure:

live
    services
        microservice-1
            outputs.tf
            main.tf
            locals.tf
            vars.tf
        microservice-2
            outputs.tf
            main.tf
            locals.tf
            vars.tf

99% of the deployment configuration of both microservices will be exactly the same. Only a small part, such as which AMI to use or the autoscaling settings, may differ.

IaC is code and should be treated as such. The way to apply DRY with Terraform is by using modules: small libraries that receive input parameters and can generate output parameters.

In the following example we have created a module that generates a microservice, receiving as parameters the small part of the configuration that isn’t common.

modules
    services
        microservice
            outputs.tf
            main.tf
            locals.tf
            vars.tf
live
    services
        microservice-1
            main.tf
        microservice-2
            main.tf

This way, each service uses the microservice module and sets only the values that are specific to that service, as sketched below.
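
A hypothetical call from one of the services could look like the following sketch. The input names (ami_id, instance_type, min_size, max_size) are made up and depend on what the module actually exposes:

# live/services/microservice-1/main.tf

module "microservice" {
  source = "../../../modules/services/microservice"

  # Only the service-specific configuration lives here
  name          = "microservice-1"
  ami_id        = "ami-12345678"
  instance_type = "t2.micro"
  min_size      = 2
  max_size      = 4
}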

Modules

Module repository structure

As a best practice for repository structure, each repository containing Terraform code should be a manageable chunk of infrastructure, such as an application, a service, or a specific type of infrastructure (like common networking infrastructure).

In ArCo we are using the following module repository structure:

modules
    aws (1)
        networking (2)
            module-base-vpc (3)
            module-base-sg
            module-base-alb
            module-base-nlb
        compute
            module-base-asg
            module-base-gitlab-runner-manager
    azure
        networking
            module-base-vnet
1 The first level categorizes modules by provider, mainly cloud providers like aws or azure. Each provider is a repository subgroup inside the modules subgroup.
2 The second level categorizes the modules of a given provider. The major categories are networking, compute, database and security. These categories are repository subgroups inside the specific provider subgroup.
3 The module repository is created inside the most specific category subgroup. Examples of modules are module-base-vpc and module-base-alb.

Module structure

Just like it is done in the whole Terraform project, ArCo also defines a standard structure when creating modules.

Terraform files

Inside each module we can find the following Terraform files:

networking
    module-base-vpc
        locals.tf (1)
        variables.tf (2)
        data.tf (3)
        main.tf (4)
        outputs.tf (5)
compute
    module-base-gitlab-runner-manager
        01-iam.tf
        02-security-groups.tf
        locals.tf
        variables.tf
        data.tf
        main.tf
        outputs.tf
1 locals.tf: file that holds all the local variables.
2 variables.tf: file that defines the input parameters.
3 data.tf: file that defines the information that needs to be retrieved from AWS via the data sources provided by Terraform.
4 main.tf: file where the resources needed by the module are created.
5 outputs.tf: file that defines the output parameters.

Sometimes we split the main.tf file when it is too large. This happens, for example, in module-base-gitlab-runner-manager, where we have extracted the IAM and Security Group resource configuration into separate files.

Non-Terraform files

There are also non-Terraform files and directories in the module structure.

networking
    module-base-vpc
        examples/ (1)
            nonprod/
            prod/
        tests/ (2)
            module_test.go
        go.mod (3)
        ...
1 examples/: directory that holds module implementations that will be used as test cases in the E2E testing step. It is recommended to define at least two test cases: one with basic module usage and another with more complex functionality.
2 tests/: directory that holds the module_test.go file, which defines the execution and validation of our E2E test cases. See module testing.
3 go.mod: file that initializes the project as a Go module. It contains the module name and its dependencies.

Module versioning

Using modules helps with reuse and maintainability, but we still need to deal with environment isolation. If a given module is used in both the prelive and live environments, any change to it will affect both of them and every deployment from that moment on. This makes testing difficult because we can’t try new changes in prelive only.

The approach chosen in ArCo is to resolve these situations with versioning. This way each environment uses the version it needs: we can have the prelive environment using version 0.2.0 and the production environment using version 0.1.0. If we encounter a bug, it will only affect the environments using that version and no others.

Terraform can download modules remotely and is compatible with Git, Mercurial and HTTP URLs. We only need to use the source attribute, indicating where Terraform can find the desired version of the module.

module "example" {
  source = "git@{git-url}:iac/terraform/modules/foo/bar/module.git?ref=v0.0.1"
...

  instance_type = "t2.mciro"

...
}

Module dependencies

When working with Terraform it is quite common for modules to have dependencies between themselves. For example, if we want to deploy a microservice, the network infrastructure where it will live must already have been created. In this case the microservice module needs the VPC ID and the subnet ID.

The way to share information between modules is through the components’ state and the output parameters. Let’s see this with an example. We said that we need information from the VPC module, so we have to expose it in its output parameters.

# networking/module-base-vpc/outputs.tf

output "private_subnets" {
  description = "List of IDs of private subnets"
  value       = aws_subnet.private.*.id
}

Now the microservice module needs access to the state of the VPC module.

# compute/ec2/data.tf

data "terraform_remote_state" "vector_vpc" {
  backend = "s3"
  config = {
    bucket = "example_bucket"
    key    = "nvl/sublive/vpc/terraform.tfstate"
    region = "eu-central-1"
  }
}

With this we can now read any of the output parameters defined there:

# compute/ec2/main.tf

resource "aws_instance" "vector_instance" {
...
  subnet_id                   = data.terraform_remote_state.vector_vpc.outputs.vector_subnet_id
...
}
It is important to note that the terraform_remote_state data source only has read permissions, so a module can query the state of another module but cannot modify it. This way the immutability of the modules is guaranteed.

Module testing

Why testing?

The DevOps world is full of fear: fear of outages, security breaches, data loss, fear of change…​

"Fear leads to anger. Anger leads to hate. Hate leads to suffering.

Many DevOps teams deal with this fear by deploying less frequently, which just makes the problem worse. There is a better way to deal with this fear: automated tests. Automated tests give us the confidence to make changes, so we can fight fear with confidence.

Static analysis

In this testing phase we’ll test the code without deploying it in order to find syntactic and structural issues, and catch common errors.

In ArCo we are using the following tools for static analysis:

  • Go testing:

    • go fmt: to check if every Go file in the tests folder has been previously formatted

    • go vet: to examine Go source code and report suspicious constructs, such as Printf calls whose arguments do not align with the format string

  • Terraform testing:

    • terraform fmt: used to rewrite Terraform configuration files to a canonical format and style

    • terraform validate: to verify that a configuration is syntactically valid and internally consistent, regardless of any provided variables or existing state

    • tflint: helps to find provider-specific invalid states in your code, like incorrect instance_type attributes and so on

Unit tests

Unit testing refers to tests that verify the functionality of a specific section of code; in this case, an isolated Terraform module.

In order to help us in the process of automating Terraform unit tests, ArCo uses Terratest, a Go library that makes it easier to write automated tests for infrastructure code. It provides a variety of helper functions and patterns for common infrastructure testing tasks.

Integration tests

TODO

E2E Tests

TODO

Module development

Software requirements

  • Terraform (open source tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned)

    • Tools:

      • Terragrunt (thin wrapper for Terraform that provides extra tools for working with multiple Terraform modules)

      • tfenv (version manager, used to quickly change between Terraform versions)

      • tflint (Terraform linter, focused on possible errors, best practices, etc…​)

      • Terraform Docs (a utility to generate documentation from Terraform modules in various output formats)

  • Golang (open source programming language that makes it easy to build simple, reliable, and efficient software, used mainly in our projects for running tests)

  • aws-nuke (for fully deleting a whole AWS account resources, useful for cleaning messy test accounts after failed Terraform runs)

CI/CD

In cloud and IaC projects, continuous integration and continuous delivery (CI/CD) are vital to create an agile environment where new incremental changes are added automatically while maintaining quality. This means that every code change uploaded to the repository is built, tested and deployed to the infrastructure immediately. There are several technologies to achieve this.

The tool used to execute the CI/CD pipeline in ArCo Terraform module projects is GitLab CI. We have defined a set of stages intended to be used by projects of different natures.

This pipeline is built on top of three small flows.

The first one of those small flows will be triggered by any commit made in any branch:

  • Static analysis: In this stage there are several tests to be executed in order to comply with the required module quality. See static analysis in module testing docs.

The second one of those small flows will be triggered whenever a new commit is pushed to any branch with an open Merge Request:

  • Unit tests: In this stage, the Go test files in the tests folder will be executed and a full module unit test will be performed in order to ensure that the module works as expected. See unit tests in module testing docs.

The last flow runs only upon manual interaction, when a Merge Request is merged into the master branch and the Static analysis and Unit tests pipeline jobs have ended successfully:

  • Release: In this stage we start the release process using the Semantic Release tool and a custom Terraform Modules Semantic Release shareable config package.

    • Analyze the commits and determine the next module release version based on the Conventional Commits messages.

    • Replace the current module version in the locals.tf file with the detected next release version.

    • Update or create the CHANGELOG.md, adding release notes based on the commit messages.

    • Create the updated module documentation by invoking the terraform-docs tool.

    • Commit and push the modified locals.tf, CHANGELOG.md and README.adoc files.

    • Tag and release the repository using the detected next module release version.

Any change to the master branch, which starts the pipeline, should only be performed through the Merge Request process once it has been validated by those with the Maintainer role.