Concourse

This documentation is for running and maintaining Concourse. If you want to deploy a change to production, you should see the guidance on merging and deploying.

All our apps are deployed, and our smoke tests run, through our own self-managed Concourse instance.

This can be found at https://concourse.notify.tools/ (requires VPN to access)

User authentication

Authentication is handled via GitHub. There are a few teams within Concourse.

There are a few user classes. They're strictly ordered, so for example members can also do everything that pipeline operators and viewers can:

  • owner - can update user auth, create teams, and so on; can do anything
  • member - can update pipelines etc. via the fly CLI
  • pipeline_operator - can trigger pipelines, pause pipelines, pin resources, and see progress, but can't update via the fly CLI
  • viewer - can only view pipeline progress

Making changes to Notify's concourse pipelines

Using fly

You can use the fly CLI to see and modify pipelines for the Notify team.

brew install fly

fly login -c https://concourse.notify.tools/ -n notify -t notify
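
Once you're logged in, you can use the standard fly commands to inspect and update pipelines. A rough sketch (the pipeline name and file path below are placeholders, not real pipeline names):

```
# list the pipelines visible to the notify team
fly -t notify pipelines

# download the current configuration of a pipeline
fly -t notify get-pipeline -p <pipeline-name> > pipeline.yml

# upload a modified configuration
fly -t notify set-pipeline -p <pipeline-name> -c pipeline.yml
```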

Working with secrets

When Concourse needs access to secrets, it gets them in one of two ways.

  1. Concourse will access our credentials repo and retrieve secrets from it. This is generally used as part of a pipeline task.

  2. Concourse will access secrets that we have stored in AWS SSM. This is generally used as part of resource configuration, because we're unable to get secrets from our credentials repo when we're not in a task.

Secrets can then be referenced in resources using the ((double-bracket)) syntax.

To put secrets from our credentials repo into AWS SSM for use outside of tasks, we have a concourse-secrets pipeline. This is configured in https://github.com/alphagov/notifications-aws/blob/master/concourse/concourse-secrets-pipeline.yml.j2.

Some secrets are separately put into AWS SSM as part of the creation of Concourse, for example the names of S3 buckets that are created for pipelines to put files into. Secrets created in this way have names starting with readonly.
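
If you need to check what a pipeline will actually see for one of these SSM-backed secrets, one option is the AWS CLI after assuming a role in the notify-deploy account. A minimal sketch (the parameter name is made up - look up the real names in SSM):

```
# read a parameter that a resource might reference as ((readonly_example_bucket))
aws ssm get-parameter \
  --name "readonly_example_bucket" \
  --with-decryption \
  --query "Parameter.Value" \
  --output text
```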

Monitoring our concourse instance

You can view metrics on our Concourse CPU usage, worker count, etc. at https://grafana.monitoring.concourse.notify.tools/. Sign in with your GitHub account.

Making changes to our concourse instance

Our concourse instance is defined in two terraform repositories; they're split for legacy reasons. Once changes are merged to either of these repos, you'll need to trigger the deploy from Concourse via the "deploy" pipeline. This takes around 20 minutes and may interrupt running jobs as the worker instances rotate, but is otherwise zero-downtime.

Concourse runs within the notify-deploy AWS environment, and the role can be assumed by senior developers using the gds CLI.
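
For example, with the gds CLI set up (the form for running a one-off command after "--" is an assumption; the console login form is the one used elsewhere on this page):

```
# open the AWS console for the notify-deploy account
gds aws notify-deploy-admin -l

# or run a single command with the assumed role
gds aws notify-deploy-admin -- aws sts get-caller-identity
```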

Updating the concourse version

Concourse will update itself to the latest version if you unpin the resource here: https://concourse.notify.tools/teams/main/pipelines/deploy/resources/concourse-release (only notify admins can view and edit this pin).
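
If you're an admin on the main team, the pin can also be removed with fly rather than the web UI. A sketch, assuming a fly target logged in to the main team (the target name is a placeholder):

```
# unpin the concourse-release resource so the deploy pipeline picks up the latest version
fly -t <main-team-target> unpin-resource -r deploy/concourse-release
```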

notifications-concourse-deployment

This repo defines some of the variables you might expect to change, such as the definition of the info pipeline, how many AWS instances Concourse has (and of what instance type), which GitHub users have permission to view and edit the pipelines, the GDS IP addresses to allow access from, and other similar variables.

This repo also contains instructions for how we created the concourse from scratch and thoughts from Reliability Engineering on how to manage it.

notifications-concourse

This repo contains terraform that defines how concourse is hosted and how its components interact with each other, e.g. EC2 instances, security groups, Route 53 DNS records, IAM roles, etc.

Troubleshooting

If a concourse deploy gets stuck

When applying terraform changes, concourse sometimes gets into a race condition, e.g.:

no workers satisfying: resource type 'git', version: '2.3'

We think this is because all the existing workers have been killed as part of the deployment. It's worth waiting a few minutes to see if new workers become available, then trying to manually start a new run of the job.

Otherwise, rotating the EC2 workers may have failed. Devs can log in to the AWS console (gds aws notify-deploy-admin -l) and manually start an instance refresh on the autoscaling groups.

If this becomes an issue more commonly, GOV.UK Pay have implemented some changes to make the pipeline more robust that we might want to look into:

If a worker gets stuck

You can restart all the notify workers here:

https://concourse.notify.tools/teams/notify/pipelines/info/jobs/start-worker-refresh/
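
The same job can also be triggered from the CLI if that's easier, e.g.:

```
# trigger the worker refresh job from the info pipeline and follow its output
fly -t notify trigger-job -j info/start-worker-refresh --watch
```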

That job requires a notify worker to function - if it doesn't work, you can restart from the "main" pipeline:

https://concourse.notify.tools/teams/main/pipelines/roll-instances/jobs/roll-notify-concourse-workers/

If that doesn't work, devs can log into AWS from the VPN (gds aws notify-deploy-admin -l) and manually initiate an instance refresh for the worker instances in the EC2 autoscaling groups.
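
As a sketch of that last step (the autoscaling group name is illustrative - use the describe call below or the console to find the real one):

```
# find the worker autoscaling group's actual name
aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[].AutoScalingGroupName"

# start an instance refresh to replace the worker instances
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name <worker-asg-name>
```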