# Concourse

*alphagov/notifications-manuals GitHub Wiki*
This documentation is for running and maintaining Concourse. If you want to deploy a change to production, see the guidance on merging and deploying.
All our apps are deployed, and our smoke tests run, through our own self-managed Concourse instance. This can be found at https://concourse.notify.tools/ (requires VPN to access).
## Teams
There are a few teams within Concourse.
- "Notify"
  - Pipelines for building Notify images and ancillary services (also updating those pipelines themselves). Will eventually be replaced by the "final-image-builds" and "pull-requests" teams.
  - Managed in https://github.com/alphagov/notifications-concourse-deployment/blob/main/terraform/deployments/concourse/team-notify.tf
- "Main"
  - Pipelines for deploying Concourse itself (e.g. to change the boxes that Concourse runs on, or change the list of users who are allowed to access Concourse)
  - Managed via the `main_team_github_users` (superadmins of the entire Concourse) and `main_team_pipeline_operator_github_users` vars in https://github.com/alphagov/notifications-concourse-deployment/blob/main/terraform/deployments/concourse/site.tf
- "Final-image-builds"
  - Pipelines for building production docker images ready to be deployed.
- "Pull-requests"
  - Pipelines for running pull request checks and building demo images (for testing in the dev environments)
- "Aws-pr"
  - Pipelines for pull request checks (terraform plan) for the notifications-aws repo
- "Preview", "Staging", "Production"
  - Pipelines for deploying Notify images across the three environments.
- "Dev-[a-d]" and "Dev-livemail"
  - Pipelines for deploying Notify to the dev environments (to test changes to infra, etc, without impacting the main pipelines) and to the dev-livemail environment (an environment which is accessible without VPN, used for prototype testing, user research, accessibility audits etc)
  - Managed via https://github.com/alphagov/notifications-concourse-deployment/blob/main/terraform/deployments/concourse/team-dev.tf
There are a few user classes (strictly ordered, so for example members can also operate pipelines and view them as well):

- owner - can update user auth, create teams, etc. - do anything
- member - can update pipelines etc via the `fly` CLI
- pipeline_operator - can trigger pipelines, pause pipelines, pin resources, and see progress, but can't update via the `fly` CLI
- viewer - can only view pipeline progress
## Viewing / making changes to Notify's concourse pipelines

### Using fly
You can use the `fly` CLI to see and modify pipelines for the Notify team.

```shell
brew install fly
fly login -c https://concourse.notify.tools/ -n notify -t notify
```
Authentication is handled via Github.
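Once logged in, day-to-day usage looks something like the sketch below. The pipeline and file names are illustrative, not taken from our pipelines; the `fly` subcommands themselves are standard:

```shell
# list the pipelines visible to the notify team
fly -t notify pipelines

# download a pipeline's current config for inspection (names are illustrative)
fly -t notify get-pipeline -p some-pipeline

# update a pipeline from a local config file (names are illustrative)
fly -t notify set-pipeline -p some-pipeline -c pipeline.yml
```

Note `-t notify` refers to the target saved by `fly login`, not the Concourse team.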
## Working with secrets
When Concourse needs access to secrets it gets them in two ways:

1. Concourse will access our credentials repo and retrieve secrets from it. This is generally used as part of a pipeline task.
2. Concourse will access secrets that we have stored in AWS SSM. This is generally used as part of resource configuration, because we are unable to get secrets from our credentials repo whilst not in a task. Secrets can then be referenced in resources by using the `((double-bracket syntax))`.
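As a sketch, a resource definition pulling a value from SSM via the double-bracket syntax might look like this (the resource name and parameter name are illustrative, not taken from our pipelines):

```yaml
resources:
  - name: notifications-api-git
    type: git
    source:
      uri: git@github.com:alphagov/notifications-api.git
      branch: main
      # resolved from AWS SSM when the pipeline runs; the name is illustrative
      private_key: ((github_deploy_key))
```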
To put secrets from our credentials repo into AWS SSM for use outside of tasks, we have a concourse-secrets pipeline. This is configured in https://github.com/alphagov/notifications-aws/blob/master/concourse/concourse-secrets-pipeline.yml.j2.

Some secrets are separately put into AWS SSM as part of the creation of Concourse, for example names of S3 buckets that are created for pipelines to put files into. Secrets created in this way start with `readonly`.
## Monitoring our concourse instance
You can view metrics around our concourse CPU usage, worker count, etc at https://grafana.monitoring.concourse.notify.tools/. Sign in via your github account.
## Making changes to our concourse instance
Our concourse instance is defined in two terraform repositories. They're split for legacy reasons.
Changes can be tested inside the concourse-staging environment. This will normally be in a "destroyed" state. Instructions on how to use the staging environment can be found here.
Once changes are merged to either of these repos, you will need to trigger the deploy from concourse via the "deploy" pipeline. This will take 20 mins or so and may interrupt running jobs as the worker instances rotate, but is otherwise zero-downtime.
Concourse runs within the `notify-deploy` AWS environment, and the role can be assumed using the `gds` CLI by senior developers.
### Updating the concourse version
The concourse version is fixed here. Also check to see if the resource has been manually pinned.
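If a new version isn't being picked up, checking for a manual pin from the `fly` CLI looks roughly like this - the target, pipeline and resource names here are illustrative, not our actual ones:

```shell
# list the versions the resource has detected (names are illustrative)
fly -t main resource-versions -r deploy/concourse-release

# remove a manual pin so the pipeline picks up the latest version again
fly -t main unpin-resource -r deploy/concourse-release
```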
### notifications-concourse-deployment
This repo defines some of the variables that you might expect to change, such as the definition of the info pipeline, how many AWS instances concourse has (and of what instance type), which github users have permission to view/edit the pipelines, the GDS IP addresses to allow access from and other similar variables.
This repo also contains instructions for how we created the concourse from scratch and thoughts from Reliability Engineering on how to manage it.
Concourse and concourse-staging use separate terraform projects. However, they should be kept roughly in line with each other.
### notifications-concourse
This repo contains terraform that defines how Concourse is hosted and how it interacts with itself, e.g. EC2 instances, security groups, Route 53 DNS records, IAM roles, etc.
## Troubleshooting

### If a concourse deploy gets stuck
When applying terraform changes, Concourse sometimes gets into a race condition, e.g.

```
no workers satisfying: resource type 'git', version: '2.5'
```

We think this is because all the existing workers have been killed as part of the deployment. It's worth waiting a few minutes to see if new workers become available - try manually starting a new run of the job.
Otherwise, rotating the EC2 workers may have failed. Devs can log in to the AWS console (`gds aws notify-deploy-admin -l`) and manually start an instance refresh on the autoscaling groups.

If this becomes a more common issue, GOV.UK Pay have implemented some changes to make the pipeline more robust that we might want to look into:
### If a worker gets stuck
You can restart all the notify workers here:
https://concourse.notify.tools/teams/notify/pipelines/info/jobs/start-worker-refresh/
That job requires a notify worker to function - if it doesn't work, you can restart from the "main" pipeline:
If that doesn't work, then devs can log into AWS from the VPN (`gds aws notify-deploy-admin -l`) and manually initiate an instance refresh for the worker instances in the EC2 autoscaling groups.
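The same instance refresh can be started from the AWS CLI once the role is assumed. This is a sketch - the autoscaling group name is illustrative, so look up the real one first:

```shell
# list the autoscaling group names to find the worker group
aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[].AutoScalingGroupName"

# kick off a rolling replacement of the worker instances (name is illustrative)
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name concourse-worker
```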