Emergency rollback of an app deployment - alphagov/notifications-manuals GitHub Wiki

As long as we have the old-style separate-app-oriented pipelines (originally designed around the PaaS deployment model) deploying our new ECS-based apps, some of the more complex corners of pipeline use are going to be awkward. Rolling deployments back is one of those awkward operations.

Not all deployments of a bad app are going to need this sort of action - apps which are so broken that they don't "come up" properly (that is, respond to their healthchecks) should prevent the deployment from proceeding to the point where it starts tearing down the old "working" applications. A subsequent deployment of a fixed application should allow the deployment to complete.

The long-term remedy to a broken deployment is to revert the problematic PR and allow that change to propagate through the stages like any other merged PR. This approach is preferred if the app is only slightly broken or is only preventing a minor feature from working properly.

In some extraordinary cases an app release will make it through to production and respond successfully to healthchecks, yet be unable to serve traffic for some reason. Because the healthchecks succeed, the deployment of this app will continue through to shutting down the old (working) app instances, leaving you with only the broken instances. If this is breaking a major Notify feature, waiting for a PR reversion to propagate through the release pipeline might not be an option.

In such a case it may be tempting to trigger an app pipeline's "deploy-production" job using a previous build's "Re-run with same inputs" button - but this is not a safe thing to do. That button will pick all of that previous build's input resource versions, including the version of notifications-aws used in that previous deployment, which may be hours, days or even weeks out of date. If significant changes have been made to the underlying infrastructure in that time, such a deployment would attempt to partially revert them, which could be disastrous. This is why deployments of apps must always be done with the latest version of notifications-aws that is tagged as having been deployed to that stage by the notify-infra pipeline.

Instead, the most straightforward way to roll back such an app without a full new app release is probably to log in to the AWS console as admin and roll back the Task Definition of the ECS Service corresponding to the app in question via clickops. This is similar to the procedure described in "Deploying a demo app image to preview before merging". To start, you can list Task Definitions from the top-level ECS page.

When choosing the Task Definition revision, you will probably need to change the listing's "Filter status" dropdown to show "Inactive" revisions, otherwise the most recent revisions won't be visible.
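The same "Inactive" listing can also be sketched in Python with boto3, for anyone who would rather check from a terminal than the console. This is only an illustration, not part of the documented procedure: the family name `example-app` is hypothetical, and it assumes AWS credentials with permission to list Task Definitions are already configured.

```python
def newest_revisions(task_definition_arns, family, limit=5):
    """Return the `limit` highest-numbered revision ARNs of exactly `family`."""
    def parse(arn):
        # ARNs end in "task-definition/<family>:<revision>"
        name, revision = arn.rsplit("/", 1)[1].rsplit(":", 1)
        return name, int(revision)

    # familyPrefix is a prefix match, so filter down to the exact family here.
    matching = [arn for arn in task_definition_arns if parse(arn)[0] == family]
    return sorted(matching, key=lambda arn: parse(arn)[1], reverse=True)[:limit]


def list_inactive_revisions(family, limit=5):
    """List the most recent INACTIVE Task Definition revisions for `family`."""
    import boto3  # imported lazily so newest_revisions has no AWS dependency

    ecs = boto3.client("ecs")
    arns = []
    paginator = ecs.get_paginator("list_task_definitions")
    for page in paginator.paginate(familyPrefix=family, status="INACTIVE"):
        arns.extend(page["taskDefinitionArns"])
    return newest_revisions(arns, family, limit)


if __name__ == "__main__":
    # "example-app" is a placeholder family name, not a real Notify app.
    for arn in list_inactive_revisions("example-app"):
        print(arn)
```

Sorting on the numeric revision rather than the ARN string keeps revision 10 ahead of revision 9.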

Once you've chosen a revision to roll back to, you'll need to "Create a new revision" based on it.

Once you've created the new revision, the procedure to deploy it will be the same as in "Deploying a demo app image to preview before merging".
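For reference, the "Create a new revision" step roughly corresponds to describing the old revision and registering its contents again. The sketch below shows that idea with boto3. It is an untested illustration rather than the team's procedure: the ARN is a placeholder, the list of read-only fields reflects the ECS API as understood at the time of writing, and whether an old INACTIVE revision can still be described is up to ECS. It deliberately stops before deployment, which should follow the linked procedure.

```python
# Fields that describe_task_definition returns but register_task_definition
# does not accept as input (an assumption based on the current ECS API).
READ_ONLY_FIELDS = {
    "taskDefinitionArn",
    "revision",
    "status",
    "requiresAttributes",
    "compatibilities",
    "registeredAt",
    "registeredBy",
    "deregisteredAt",
}


def registration_args(described):
    """Strip read-only fields from a described Task Definition."""
    return {k: v for k, v in described.items() if k not in READ_ONLY_FIELDS}


def clone_revision(task_definition):
    """Register a new revision with the same contents as `task_definition`.

    `task_definition` is a "family:revision" string or a full ARN.
    Returns the ARN of the newly registered revision.
    """
    import boto3  # imported lazily so registration_args has no AWS dependency

    ecs = boto3.client("ecs")
    described = ecs.describe_task_definition(taskDefinition=task_definition)
    response = ecs.register_task_definition(
        **registration_args(described["taskDefinition"])
    )
    return response["taskDefinition"]["taskDefinitionArn"]


if __name__ == "__main__":
    # Placeholder identifier; substitute the revision chosen for rollback.
    print(clone_revision("example-app:41"))
```

Tags are not copied by this sketch, so compare the new revision against the old one in the console before deploying it.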