Account wide terraform

This document is intended for engineers who want to learn the basics of what Account-wide Terraform (AWT) is, how to use it to make infra changes and how to use it to debug issues. It assumes you know Terraform and GDS CLI basics.

AWT acts as a bootstrap, deploying the initial infra that other foundational Notify infra depends on. It does not deploy the Notify application or the infra the application requires; those are deployed from notifications-aws.

Why do we need a separate account-wide Terraform repo?

  1. Improved security - IAM config for Concourse workers is stored in this repo. If it were stored in notifications-aws, a Concourse worker might be able to access this config.
  2. Deployment ordering - notifications-aws uses roles and other infra declared in this repo, so this repo must be deployed first.
  3. Application of environment-specific config - Infra changes may need to be applied to a specific env (e.g. for testing new IAM permissions), and they can be tested by applying them to a chosen environment from this repo. There are 2 types of envs you can apply AWT changes to:
    1. notify-env - dev[a-f], preview, staging, production
    2. notify-deploy-env - notify-deploy, notify-deploy-staging (the AWS accounts hosting Concourse)

How to apply AWT changes

Terraform commands are run via make commands in the root of the repo.

make <env name> <init | plan | apply | destroy>
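For example, to plan changes against the preview environment (this assumes your gds-cli admin role for that account follows the notify-<env name>-admin pattern used later in this document):

gds aws notify-preview-admin -- make preview init
gds aws notify-preview-admin -- make preview plan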

In rare cases you may need to bootstrap an environment.

make <env name> bootstrap

NOTE: Changes to the production environment must have a cyberthumb before being applied.

When to use this

Whenever you are applying any changes to the Terraform. There is no pipeline for deploying AWT changes, so they must be applied locally.

To test or apply changes

For example, if you wanted to move away from using Parameter Store (SSM) for storing secrets and move to Secrets Manager instead, you would need to change the permissions granted to the manipulate_dev_secrets role so that it can still access secrets.

This role currently has these permissions:

# notify-deploy-env/roles.tf

data "aws_iam_policy_document" "manipulate_dev_secrets" {
  # Other statements omitted for brevity

  statement {
    effect = "Allow"
    actions = ["ssm:DescribeParameters"]
    resources = ["*"]
  }
}

We currently cannot list secrets in Secrets Manager with the devsecrets role:

gds aws notify-deploy-staging-devsecrets -s

wesley.hindle@GDS13716 notifications-aws-account-wide-terraform % aws secretsmanager list-secrets

An error occurred (AccessDeniedException) when calling the ListSecrets operation: User: arn:aws:sts::390844751771:assumed-role/wesley.hindle-devsecrets/1753170992424991000 is not authorized to perform: secretsmanager:ListSecrets because no identity-based policy allows the secretsmanager:ListSecrets action
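If you want to confirm which managed policies the devsecrets role currently has attached, you can inspect it with the AWS CLI from a role that has IAM read access, e.g. the admin role (the role name below is taken from the error above; yours will differ):

# List the managed policies attached to the devsecrets role
gds aws notify-deploy-staging-admin -- aws iam list-attached-role-policies --role-name wesley.hindle-devsecrets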

To test whether this role can access data in Secrets Manager, we must first grant it access to Secrets Manager.

# notify-deploy-env/roles.tf

data "aws_iam_policy_document" "manipulate_dev_secrets" {
  # Other statements omitted for brevity

  statement {
    effect = "Allow"
    actions = [
      "ssm:DescribeParameters",
      "secretsmanager:ListSecrets"
    ]
    resources = ["*"]
  }
}

We then run a terraform plan to see the changes. In our case:

gds aws notify-deploy-staging-admin -- make notify-deploy-staging plan

Terraform will perform the following actions:

  # aws_iam_policy.manipulate_dev_secrets will be updated in-place
  ~ resource "aws_iam_policy" "manipulate_dev_secrets" {
        id               = "arn:aws:iam::390844751771:policy/ManipulateDevSecrets"
        name             = "ManipulateDevSecrets"
      ~ policy           = jsonencode(
      # Omitted for brevity
          ~ Action   = "ssm:DescribeParameters" -> [
              + "ssm:DescribeParameters",
              + "secretsmanager:ListSecrets",
            ]
			
Plan: 0 to add, 1 to change, 0 to destroy.

And then apply it:

gds aws notify-deploy-staging-admin -- make notify-deploy-staging apply

When we try to list the secrets again, we can now see them:

wesley.hindle@GDS13716 notifications-aws-account-wide-terraform % gds aws notify-deploy-staging-devsecrets -s                            
wesley.hindle@GDS13716 notifications-aws-account-wide-terraform % aws secretsmanager list-secrets
{
    "SecretList": [
		"Name": "ecr-pullthroughcache/test123/ecr/dockerhub_credentials",
		# Omitted
}

To check config drift

Occasionally manual changes are made in the AWS console, which can result in new errors. A quick way to check whether this is the case is to run a plan and review the changes it reports:

gds aws notify-<env name>-admin -- make <env name> <init | plan | apply | destroy>

Terraform will perform the following actions:

  # module.readonly_users["wesley.hindle"].aws_iam_role.gds_user_role will be updated in-place
  ~ resource "aws_iam_role" "gds_user_role" {
        id                    = "wesley.hindle-readonly"
        name                  = "wesley.hindle-readonly"
      ~ tags                  = {
          - "Manually-Added-Tag" = "Says Hello" -> null
        }
      ~ tags_all              = {
          - "Manually-Added-Tag" = "Says Hello" -> null
            # (2 unchanged elements hidden)
        }
        # (8 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.
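Note that the plan above mixes drift with any changes on your local branch. If you only want to see drift, run the plan from a checkout of main, or, if you invoke Terraform directly in the relevant environment directory rather than via the make targets (you may need to pass the same backend/var configuration the make targets use), a refresh-only plan limits the output to differences between state and the real infrastructure:

# Requires Terraform 0.15.4 or later, after init has been run
terraform plan -refresh-only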

To help debug

This example uses a notify-env environment (dev-a).

I want to look through some CloudWatch logs, but I'm now unable to, whereas previously I was.

We suspect an IAM permissions issue, but rather than messing about comparing what's on main against your branch, we can instead run a terraform plan to see what the changes are.

# Note apply runs a plan first before applying any changes
gds aws notify-dev-a-admin -- make dev-a apply

Terraform will perform the following actions:

  # module.readonly_users["wesley.hindle"].aws_iam_role_policy_attachment.gds_user_role_policy_attachments[0] will be created
  + resource "aws_iam_role_policy_attachment" "gds_user_role_policy_attachments" {
      + id         = (known after apply)
      + policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
      + role       = "wesley.hindle-readonly"
    }

Plan: 1 to add, 0 to change, 0 to destroy.

So we know there is config drift, but some inspection is required before blindly applying the change, to check that it will actually fix our problem. arn:aws:iam::aws:policy/ReadOnlyAccess is an AWS-managed policy, so we can easily search online for the permissions this policy grants.

  "Version" : "2012-10-17",
  "Statement" : [
    {
      "Sid" : "ReadOnlyActionsGroup1",
      "Effect" : "Allow",
      "Action" : [
        "logs:Describe*",
        # Other actions omitted for brevity
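If you would rather not search online, the same policy document can be fetched with the AWS CLI from any role with IAM read access (these are standard AWS CLI commands; the version id comes from the first call):

# Find the current default version of the managed policy
aws iam get-policy --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess --query 'Policy.DefaultVersionId'

# Fetch the policy document for that version
aws iam get-policy-version --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess --version-id <version id from above>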

We now have confidence that re-adding this policy to the role will grant the logs:DescribeLogGroups action required to see log groups. After applying the change we can verify that the log groups are visible again.
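For example (assuming the gds-cli readonly role name follows the same notify-<env name>-<role> pattern as the admin and devsecrets roles used above):

gds aws notify-dev-a-readonly -- aws logs describe-log-groups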

How to deploy changes

NOTE: This section refers to how changes have been rolled out in the past and is not best practice.

Once your changes have been merged via the PR process, they will need to be applied to each environment locally, as there is no pipeline to apply changes to this repo. This process will take a few days to complete.

Instructions on how to apply changes can be found in the How to apply AWT changes section above.

Dev envs

Changes should first be applied to all unoccupied dev environments. You should then communicate on the #govuk-notify-infrastructure-team channel that anyone using a dev env will need to apply these changes to their environment themselves, or that you will do it for them if they wish. You must make clear that this will overwrite any manual changes they have made to AWT's infra and that it may alter the behaviour of the feature they're working on in their environment.

Staging env

Once the changes have been applied to all dev envs, you can roll them out to the staging environment. After running an apply locally, manually trigger a deploy on the staging environment; this will run the automated tests and flag any issues the new changes have introduced.

If these tests fail you do not need to worry about pinning the production pack bag, as the changes have only been applied to staging at this point.

If successful, you should wait a few days, until at least one other deploy to production has successfully rolled out, at which point it can be assumed that the changes have not broken anything.

Production env

After running an apply locally you can then manually trigger a deploy on the production environment.
