Deploying Notify

Target audience

This is a relatively non-technical overview of how to use the new deploy pipeline, aimed primarily at developers who want to make deployments to production. It explains the overall structure of the new pipeline, but does not go into detail about how the individual pieces work.

What’s new

A single pipeline for the whole of Notify, instead of separate pipelines for each application. This has several benefits:

  • It’s simpler and more maintainable
  • It only needs a single lock per environment (unless functional tests from outside the deploy-notify pipeline also need to use the environment)
  • It better captures the reality of how Notify is deployed "under the hood"
  • It removes the need for the special tricks that the old pipelines used to ensure that, for example:
    • the app pipelines won’t deploy a version of the infrastructure that hasn’t been deployed into the current environment yet
    • the app pipelines won’t deploy changes to other apps
  • It removes the risk of one pipeline running the functional tests while another pipeline is applying Terraform, which undermined the value of the functional tests

Each environment now has its own separate Concourse team, with its own copy of the deploy-notify pipeline in it. The primary benefit of this is security: having a separate team for each environment allows us to limit the permissions that the Concourse workers have, and means we can move away from the model where the global Concourse workers have permission to do everything to every resource in every environment.

Continuous deployment by default - changes that are merged to main will automatically be released all the way to production unless the pipeline is instructed differently.

Deployment and testing as separate jobs - if the functional tests fail due to general flakiness, they can be retried in isolation instead of needing an entire new deployment.

Deployment bags

The new deploy pipeline is built around the concept of "deployment bags", which represent point-in-time snapshots of the current versions of all of Notify’s services.

The dev environments and preview each have their own pack-bag job. When this job is triggered, the current versions of all of its inputs are captured in a new version of the deployment bag. The bag is then deployed as a single unit.

The primary benefit of this is that it allows us to think of versions of Notify as a whole, and ensures that only combinations of services that have been tested together can be released to production.

The staging and production environments lack pack-bag jobs, and instead deploy the most recent version of the deployment bag that passed the previous environment (this will be detailed further below).
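For intuition, the shape of a pack-bag job might look something like the sketch below. This is a minimal illustration only, not the real job definition: the resource names, the manifest format and the S3 storage backend are all assumptions.

```yaml
# Minimal sketch of the pack-bag idea; resource names, manifest format
# and storage backend are assumptions, not the real configuration.
resources:
  - name: notifications-api-image
    type: registry-image
    source: {repository: ((ecr_registry))/notifications-api}    # assumed
  - name: notifications-admin-image
    type: registry-image
    source: {repository: ((ecr_registry))/notifications-admin}  # assumed
  - name: deployment-bag
    type: s3                               # assumed storage backend
    source:
      bucket: notify-deployment-bags       # hypothetical bucket
      versioned_file: versions.yml

jobs:
  - name: pack-bag
    plan:
      - get: notifications-api-image       # one get per Notify service...
        trigger: true                      # ...so any new image packs a bag
      - get: notifications-admin-image
        trigger: true
      - task: write-bag
        config:
          platform: linux
          image_resource:
            type: registry-image
            source: {repository: alpine}
          inputs:
            - name: notifications-api-image
            - name: notifications-admin-image
          outputs:
            - name: bag
          run:
            path: sh
            args:
              - -c
              - |
                # record the exact digest of every input in one manifest
                for app in notifications-api notifications-admin; do
                  echo "$app: $(cat "$app-image/digest")" >> bag/versions.yml
                done
      - put: deployment-bag                # publish the snapshot as one unit
        params: {file: bag/versions.yml}
```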

[Screenshot: the pack-bag job and deployment bag in the Concourse UI]

The "meta-pipeline"

This is where the terminology gets slightly confusing. The below diagram shows a simplified overview of "the new deploy pipeline", which we sometimes choose to call the "meta-pipeline" to distinguish it from Concourse’s concept of "pipelines". The diagram shows the steps that a release goes through on its way to production.

The outer boxes in the diagram represent Concourse "teams", the boxes within those represent Concourse "pipelines", and the boxes within those are a simplified view of the Concourse "jobs" within each pipeline (in reality the "deploy" job in this diagram actually consists of several individual Concourse jobs, as described later).

[Diagram: simplified overview of the meta-pipeline]

Image building is handled by the pre-existing pipelines in the original notify Concourse team, as it always has been.

When an image building job pushes a new image to ECR, this automatically triggers a run of pack-bag in the preview team. This captures/freezes the current versions of all of Notify’s services, so they can be deployed as a single unit.

When the new bag has been packed, this automatically triggers a run of deploy-notify in the preview team, which deploys that bag to the preview environment. If the deployment is successful and the tests pass, the pipeline tags that release of the deployment bag with the tag passed-preview-<timestamp>.

The staging and production environments lack pack-bag jobs, and instead are triggered when the deployment bag from the previous stage is tagged with the success tag. In this way, we can "chain" as many or as few environments together as we like.
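This page doesn't document how the tag-based triggering is implemented, but one plausible way to picture it is that each downstream team's copy of the pipeline watches a deployment-bag resource that only matches versions carrying the previous stage's success tag. A purely illustrative sketch:

```yaml
# Hypothetical illustration of the chaining: staging only "sees" bag
# versions that preview has tagged. Resource type, bucket and key
# layout are all assumptions.
resources:
  - name: deployment-bag
    type: s3
    source:
      bucket: notify-deployment-bags              # hypothetical bucket
      regexp: passed-preview-(.*)/versions.yml    # only tagged versions match

jobs:
  - name: start-deploy
    plan:
      - get: deployment-bag
        trigger: true    # a new passed-preview tag kicks off the staging run
```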

The deploy-notify pipeline

deploy-notify is a Concourse pipeline that deploys the whole of Notify to a given environment.

Each environment has its own Concourse team (displayed in the sidebar), and each team has its own copy of the new deploy pipeline, named deploy-notify.

[Screenshot: the deploy-notify pipeline, with the environment teams in the Concourse sidebar]

The general structure of the new pipeline is as follows, though there are some environment-specific differences (a simplified sketch follows the list):

  • start-deploy acquires the lock for the current environment (among other things)
  • deploy performs the actual deployment, running Terraform and the db-setup script
  • test runs all of the appropriate test suites for the current environment
  • signal-deploy-completion releases the lock (among other things)
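Put together, a heavily simplified skeleton of deploy-notify might look like this. Only the job names come from this page; the gets, triggers and comments are assumptions, trimmed for illustration:

```yaml
# Simplified skeleton only; the real pipeline has more resources,
# steps and environment-specific differences.
jobs:
  - name: start-deploy
    plan:
      - get: deployment-bag
        trigger: true
      # ...acquire the environment lock here (see "Locking" below)
  - name: deploy
    plan:
      - get: deployment-bag
        passed: [start-deploy]
        trigger: true
      # ...run Terraform and the db-setup script
  - name: test
    plan:
      - get: deployment-bag
        passed: [deploy]
        trigger: true
      # ...run the environment's test suites; re-triggerable on its own
  - name: signal-deploy-completion
    plan:
      - get: deployment-bag
        passed: [test]
        trigger: true
      # ...release the lock and tag the bag as having passed this stage
```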

Testing

The test job runs all of the appropriate tests for the current environment. At the time of writing, this is:

  • Preview: Functional tests.
  • Staging: API client integration tests, smoke tests & provider tests.
    • This differs from the old world, in which the API client tests were handled by preview.
  • Production: Smoke tests & provider tests.

If the tests fail due to general flakiness, the test job may be retried via the normal Concourse mechanism, without requiring a new deployment.

If the tests fail due to a genuine issue with the new release, then the release will not be allowed to proceed to the next environment. In this case, you will need to roll back your changes, and release the deployment lock (both of these are discussed below).

The production pipeline additionally has a separate tab for the periodic smoke tests that run every 10 minutes.

[Screenshot: the production pipeline's periodic smoke tests tab]
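Periodic triggers like this are normally driven by Concourse's time resource. A generic sketch of the pattern, not the real smoke-test configuration:

```yaml
# Generic every-N-minutes pattern in Concourse; the actual smoke-test
# job and task definitions are not shown on this page.
resources:
  - name: every-10m
    type: time
    source: {interval: 10m}

jobs:
  - name: periodic-smoke-tests
    plan:
      - get: every-10m
        trigger: true               # fires roughly every 10 minutes
      - task: run-smoke-tests
        config:
          platform: linux
          image_resource:
            type: registry-image
            source: {repository: alpine}
          run:
            path: sh
            args: [-c, "echo 'run the smoke-test suite here'"]  # placeholder
```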

Locking

When a deployment to a given environment begins, the start-deploy job first acquires a lock, to prevent concurrent deployments to the same environment. Assuming the deployment is successful, the signal-deploy-completion job will then release the lock.

If a deployment is unsuccessful, either because the deployment itself failed or because the tests failed, the lock will not be released. In the vast majority of cases (particularly in preview/staging/production), the correct thing to do is to re-trigger the failed job, giving the release a chance to reach the final pipeline-unlocking job and release the lock the normal, intended way.

In some cases - where you know the current deployment will never pass as-is and will require another release to fix it - you will need to manually trigger the force-unlock-pipeline job in the operator tab before the next deployment can take place. This should be a last resort in preview/staging/production, where we should hopefully rarely encounter broken deployments anyway.
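The page doesn't show how the lock is implemented; a common way to build it is Concourse's pool-resource pattern. A sketch under that assumption, in which the lock repository, pool name and job wiring are all illustrative:

```yaml
# Sketch of the common pool-resource locking pattern; the real lock
# repository, pool name and job wiring may differ.
resource_types:
  - name: pool
    type: registry-image
    source: {repository: concourse/pool-resource}

resources:
  - name: environment-lock
    type: pool
    source:
      uri: git@github.com:example/deploy-locks.git   # hypothetical repo
      branch: main
      pool: preview
      private_key: ((lock_repo_private_key))

jobs:
  - name: start-deploy
    plan:
      - put: environment-lock
        params: {acquire: true}       # blocks until the lock is free

  - name: signal-deploy-completion
    plan:
      - get: environment-lock
        passed: [start-deploy]        # carry this run's claimed lock through
      - put: environment-lock
        params: {release: environment-lock}   # hand the lock back
```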

Warning

Terraform itself also locks the state file during a deployment, so it is important not to interrupt or cancel the deploy job (though it is perfectly safe to interrupt the test job, if desired). In the event of the deploy job being stopped without releasing its lock, the release-terraform-lock job can be run from the operator tab to unlock it.
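Clearing a stuck Terraform state lock is done with terraform force-unlock; presumably the release-terraform-lock job wraps something along these lines. The task shape and input names are assumptions; the lock ID comes from the "Lock Info" block in Terraform's error output:

```yaml
# Hypothetical sketch of what release-terraform-lock might run;
# terraform force-unlock is a real command, the wrapping is assumed.
- task: release-terraform-lock
  config:
    platform: linux
    image_resource:
      type: registry-image
      source: {repository: hashicorp/terraform}
    inputs:
      - name: notify-terraform        # hypothetical repo with the TF config
    params:
      LOCK_ID: ((lock_id))            # from Terraform's "Lock Info" error
    run:
      path: sh
      dir: notify-terraform
      args:
        - -c
        - |
          terraform init -input=false
          terraform force-unlock -force "$LOCK_ID"
```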

Note

Pipelines can also be configured to protect the functional-tests job with an externally-visible lock. The preview stage uses this to let the pull-request CI for the notifications-functional-tests repo run against the preview environment without potentially interfering with a deployment that may be taking place at the same time. This lock should rarely get stuck in a locked state, as all uses of the lock try to release it automatically on failure; as a last resort, the force-unlock-functional-tests-lock job can be run from the operator tab to unlock it - though do check first that it isn't simply locked because notifications-functional-tests CI is running.

Runbook/common tasks

Reverting/rolling back a problematic app version from production

There are several approaches that can be taken if a problematic app release has made it to production, and each one is a trade-off between expediency and the level of disruption it causes to the release process for the rest of the team.

Warning

The same caveats around reverting deployments that have always applied continue to apply: be very careful when rolling back changes that involve database migrations. The deploy pipeline currently provides no mechanism to roll back a migration.

Ideal scenario: revert problematic PR in app's git repo

This doesn't actually involve the pipeline, but is "ideal" in the sense that it doesn't cause any pipeline blockage.

The disadvantage is that it can take a significant amount of time for the reverted app version to reach production: as long as the app's image build, plus three app deployments (preview, staging and production), plus the time each of those stages' tests take. If the nature of the problem is too severe to wait this long, another option is to pin a specific previous app version in pack-bag.

Pinning a specific version of an app

You can pin a specific version of an app, or mark a version of an app as "bad/broken" using the normal Concourse mechanisms, by selecting the appropriate input of the pack-bag job in the preview environment.

For example, to pin an older release of notifications-api, navigate to the deploy-notify pipeline in the preview team, select the pack-bag tab, select the app you want to pin/deselect, and use the normal Concourse ticks and pins. You can then manually trigger the pack-bag job to kick off a new deployment.

[Screenshot: pinning an input of the pack-bag job in Concourse]
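For completeness, Concourse can also express a pin in the pipeline configuration itself via a resource-level version field, though the UI pins described above are the intended workflow here. The names and digest below are illustrative:

```yaml
# Illustrative only - the workflow on this page uses the Concourse UI pins.
resources:
  - name: notifications-api-image
    type: registry-image
    source:
      repository: ((ecr_registry))/notifications-api   # assumed
    version:
      digest: sha256:0123abcd   # hypothetical digest of the known-good image
```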

This approach has the disadvantage that it "blocks" the pipeline for the pinned app as long as it is pinned. It should generally be used as a temporary measure to speed up roll-back while a PR reversion is being prepared (see above).

Deployment then takes as long as three app deployments (preview, staging and production) plus the time each of those stages' tests take. If the nature of the problem is too severe to wait this long, another option is to roll back the whole deployment in staging or production.

Rolling back a deployment

You can also pin or deselect a specific version of the entire deployment bag, using the normal Concourse mechanisms, by selecting the deployment-bag resource in staging or production.

For example, to roll back a broken deployment in the production environment, navigate to the deploy-notify pipeline in the production team, select the deployment-bag resource, and de-select the broken release by clicking the check mark next to it. You can then manually trigger the start-deploy job to perform the rollback.

[Screenshot: de-selecting a deployment-bag version in Concourse]

You can also freeze any environment at a specific version of the deployment bag by selecting the pin next to it, and manually triggering the start-deploy job.

[Screenshot: pinning a deployment-bag version in Concourse]

Tip

If a deployment is already in progress on a particular pipeline, it's a good idea to let it finish before pinning the deployment-bag: pinning prevents any other deployment-bag version from progressing to its next job and releasing the pipeline lock, which must happen before the now-pinned version can start.

This approach has the disadvantage that it blocks the whole pipeline for all apps and components for as long as the pinning is in place. It should be used as a temporary measure to speed up a rollback while one of the above, less-severe reversion methods is being prepared or is making its way through the pipeline. Resist the temptation to "fail forward" by preparing a fix while this pinning is in place - rushed fixes have a poor success record, and unhurried fixes can stop the whole team from working while they're being prepared.

The advantage of this approach is that it only takes as long as a single deployment run (largely because this exact deployment-bag version has already been through the pipeline, so it doesn't need testing again).

Once a stage has had its deployment bag pinned using this method, it is often a good idea to propagate the same pinning/unticking back through previous stages of the meta-pipeline so that subsequent deployments to those stages more closely simulate what will happen when that deployment reaches the problematic stage.
