# Concourse

*alphagov/notifications-manuals GitHub Wiki*
This documentation is for running and maintaining Concourse. If you want to deploy a change to production, see the guidance on merging and deploying.
All our apps are deployed, and our smoke tests run, through our own self-managed Concourse instance. This can be found at https://concourse.notify.tools/ (requires VPN to access).
## Teams
There are a few teams within Concourse.
- "Notify"
  - Pipelines for building Notify images and ancillary services (also updating those pipelines themselves). Will eventually be replaced by the "final-image-builds" and "pull-requests" teams.
  - Managed in https://github.com/alphagov/notifications-concourse-deployment/blob/main/terraform/deployments/concourse/team-notify.tf
- "Main"
  - Pipelines for deploying Concourse itself (e.g. to change the boxes that Concourse runs on, or change the list of users who are allowed to access Concourse)
  - Managed via the `main_team_github_users` (superadmins of the entire Concourse) and `main_team_pipeline_operator_github_users` vars in https://github.com/alphagov/notifications-concourse-deployment/blob/main/terraform/deployments/concourse/site.tf
- "Final-image-builds"
  - Pipelines for building production docker images ready to be deployed.
- "Pull-requests"
  - Pipelines for running pull request checks and building demo images (for testing in the dev environments)
- "Aws-pr"
  - Pipelines for pull request checks (terraform plan) for the notifications-aws repo
- "Preview", "Staging", "Production"
  - Pipelines for deploying Notify images across the three environments.
- "Dev-[a-d]" and "Dev-livemail"
  - Pipelines for deploying Notify to the dev environments (to test changes to infra, etc, without impacting the main pipelines) and to the dev-livemail environment (an environment which is accessible without VPN, used for prototype testing, user research, accessibility audits etc)
  - Managed via https://github.com/alphagov/notifications-concourse-deployment/blob/main/terraform/deployments/concourse/team-dev.tf
There are a few user classes (strictly ordered, so for example members can also operate pipelines and view them as well):

- owner - can update user auth, create teams, etc. - do anything
- member - can update pipelines etc via the `fly` CLI
- pipeline_operator - can trigger pipelines, pause pipelines, pin resources, and see progress, but can't update via the `fly` CLI
- viewer - can only view pipeline progress
## Viewing / making changes to Notify's concourse pipelines

### Using fly
You can use the `fly` CLI to see and modify pipelines for the Notify team.

```shell
brew install fly
fly login -c https://concourse.notify.tools/ -n notify -t notify
```
Authentication is handled via Github.
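Once logged in, day-to-day usage looks something like the sketch below. The pipeline and file names are illustrative, not taken from our pipelines; the `fly` subcommands themselves are standard:

```shell
# list the pipelines visible to the notify team
fly -t notify pipelines

# download a pipeline's current config for inspection (names are illustrative)
fly -t notify get-pipeline -p some-pipeline

# update a pipeline from a local config file (names are illustrative)
fly -t notify set-pipeline -p some-pipeline -c pipeline.yml
```

Note `-t notify` refers to the target saved by `fly login`, not the Concourse team.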
## Working with secrets
When Concourse needs access to secrets it gets them in two ways:

1. Concourse will access our credentials repo and retrieve secrets from it. This is generally used as part of a pipeline task.
2. Concourse will access secrets that we have stored in AWS SSM. This is generally used as part of resource configuration, because we are unable to get secrets from our credentials repo whilst not in a task. Secrets can then be referenced in resources by using the `((double-bracket syntax))`.
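As a sketch, a resource definition pulling a value from SSM via the double-bracket syntax might look like this (the resource name and parameter name are illustrative, not taken from our pipelines):

```yaml
resources:
  - name: notifications-api-git
    type: git
    source:
      uri: git@github.com:alphagov/notifications-api.git
      branch: main
      # resolved from AWS SSM when the pipeline runs; the name is illustrative
      private_key: ((github_deploy_key))
```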
To put secrets from our credentials repo into AWS SSM for use outside of tasks, we have a concourse-secrets pipeline. This is configured in https://github.com/alphagov/notifications-aws/blob/master/concourse/concourse-secrets-pipeline.yml.j2.

Some secrets are separately put into AWS SSM as part of the creation of Concourse, for example names of S3 buckets that are created for pipelines to put files into. Secrets created in this way start with `readonly`.
## Monitoring our concourse instance
You can view metrics around our concourse CPU usage, worker count, etc at https://grafana.monitoring.concourse.notify.tools/. Sign in via your github account.
## Making changes to our concourse instance
Our concourse instance is defined in two terraform repositories. They're split for legacy reasons.
Changes can be tested inside the concourse-staging environment. This will normally be in a "destroyed" state. Instructions on how to use the staging environment can be found here.
Once changes are merged to either of these repos, you will need to trigger the deploy from concourse via the "deploy" pipeline. This will take 20 mins or so and may interrupt running jobs as the worker instances rotate, but is otherwise zero-downtime.
Concourse runs within the `notify-deploy` AWS environment, and the role can be assumed using the `gds` CLI by senior developers.
### Updating the concourse version
The concourse version is fixed here. Also check to see if the resource has been manually pinned.
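If a new version isn't being picked up, checking for a manual pin from the `fly` CLI looks roughly like this - the target, pipeline and resource names here are illustrative, not our actual ones:

```shell
# list the versions the resource has detected (names are illustrative)
fly -t main resource-versions -r deploy/concourse-release

# remove a manual pin so the pipeline picks up the latest version again
fly -t main unpin-resource -r deploy/concourse-release
```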
### notifications-concourse-deployment
This repo defines some of the variables that you might expect to change, such as the definition of the info pipeline, how many AWS instances concourse has (and of what instance type), which github users have permission to view/edit the pipelines, the GDS IP addresses to allow access from and other similar variables.
This repo also contains instructions for how we created the concourse from scratch and thoughts from Reliability Engineering on how to manage it.
Concourse and concourse-staging use separate terraform projects. However, they should be kept roughly in line with each other.
### notifications-concourse
This repo contains terraform that defines how Concourse is hosted and how it interacts with itself, e.g. EC2 instances, security groups, Route 53 DNS records, IAM roles, etc.
## Troubleshooting

### If a concourse deploy gets stuck
When applying terraform changes, Concourse sometimes gets into a race condition, e.g.

```
no workers satisfying: resource type 'git', version: '2.5'
```

We think this is because all the existing workers have been killed as part of the deployment. It's worth waiting a few minutes to see if new workers become available - try manually starting a new run of the job.
Otherwise, rotating the EC2 workers may have failed. Devs can log in to the AWS console (`gds aws notify-deploy-admin -l`) and manually start an instance refresh on the autoscaling groups.

If this becomes a more common issue, GOV.UK Pay have implemented some changes to make the pipeline more robust that we might want to look into:
### If a worker gets stuck
You can restart all the notify workers here:
https://concourse.notify.tools/teams/notify/pipelines/info/jobs/start-worker-refresh/
That job requires a notify worker to function - if it doesn't work, you can restart from the "main" pipeline:
If that doesn't work, then devs can log into AWS from the VPN (`gds aws notify-deploy-admin -l`) and manually initiate an instance refresh for the worker instances in the EC2 autoscaling groups.
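The same instance refresh can be started from the AWS CLI once the role is assumed. This is a sketch - the autoscaling group name is illustrative, so look up the real one first:

```shell
# list the autoscaling group names to find the worker group
aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[].AutoScalingGroupName"

# kick off a rolling replacement of the worker instances (name is illustrative)
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name concourse-worker
```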