Databases (PostgreSQL) - alphagov/notifications-manuals GitHub Wiki

We run a Postgres 15 database using Amazon RDS.

Configuration

The RDS instance itself is configured in our Terraform code.

Postgres users and extensions are configured using our DB setup script. This script is idempotent and is run by Concourse every time we deploy our Terraform code.

We recommend using pgcli as your database client. Run these commands in your terminal to install it:

brew tap dbcli/tap
brew install pgcli

Connecting to a database

To install the db-connect script, clone the notifications-aws repo and follow the install instructions.

See a list of all the roles you can use by running:

gds-cli aws | grep notify

You can then connect to the database with an appropriate role:

Read only access

Developers should use this role by default. You'll only be able to read data, not write it. The postgres readonly user is automatically assumed by using the readonly AWS role.

gds aws notify-preview -- db-connect.sh notifydb -- pgcli

Write access

Developers should only use this role if they need to write data. When doing so, you should get any queries you run reviewed by another developer. The postgres write user is automatically assumed by using the -admin AWS role.

gds aws notify-preview-admin -- db-connect.sh notifydb -- pgcli

Admin access

Developers should only use this role in emergencies, when no other role has sufficient permissions. An example would be for running a vacuum statement.

gds aws notify-preview-admin -- db-connect.sh --role admin notifydb -- pgcli
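For example, a vacuum run as the admin user might look like this (the table name here is purely illustrative):

```sql
-- Run as the admin user; 'notifications' is an illustrative table name.
-- VERBOSE prints progress; ANALYZE refreshes planner statistics afterwards.
VACUUM (VERBOSE, ANALYZE) notifications;
```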

Postgres users

We have 5 different Postgres users.

  • readonly - Can only read data, can't write it. This user should only be used by developers running manual queries
  • write - Can read and write. This user should only be used by developers running manual queries.
  • app-username-here - This is used by all our apps (including our db-migration app, which runs our schema changes) and can read and write data. We may decide to split up our apps' Postgres users in the future to give them more granular permissions. Developers shouldn't need to use this user.
  • notify_schema_owner - You don't log in as this user, but when you log in as the write or app-username-here user you will automatically assume this role. This means that any schema changes done by the write or app-username-here user will all be owned by a single owner - notify_schema_owner. This avoids any problems with schema changes being owned by different Postgres users.
  • notifydb - AWS creates this user when it creates the database. We also call this the admin user. Developers can log in as this user if they need to, but should use the write user instead if possible.

Apart from the notifydb role, we create the rest ourselves using our DB setup script.
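As a sketch, you can check which user you are connected as, and which roles your session inherits, with standard catalog queries:

```sql
-- Who am I, and which roles do I inherit?
SELECT current_user, session_user;

SELECT r.rolname
FROM pg_roles r
WHERE pg_has_role(current_user, r.oid, 'member');

-- Confirm that tables created via the write or app user end up owned by
-- notify_schema_owner rather than the logged-in user
SELECT tablename, tableowner
FROM pg_tables
WHERE schemaname = 'public'
LIMIT 5;
```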

Logging

Logs are found in AWS Cloudwatch, in the /aws/rds/instance/notifydb/postgresql log group.

You can query them by going to the log group and clicking 'Search all log streams'. RDS writes logs in parallel, so there are 4 separate log streams.

The general Postgres logs are included.

All queries run by the readonly, write and admin (the notifydb Postgres) users are logged.
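Per-user statement logging like this is typically configured in Postgres via a per-role log_statement setting; as a sketch, you can inspect it from a database session:

```sql
-- Statement logging for the current session
SHOW log_statement;

-- Per-role configuration overrides (e.g. "log_statement=all"), if any
SELECT rolname, rolconfig
FROM pg_roles
WHERE rolconfig IS NOT NULL;
```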

TODO: Document slow query logging

Monitoring

RDS graphs

These are available in the AWS RDS dashboard in Grafana. You can also see the full set of RDS metrics in the AWS console.

RDS Performance Insights

We have enabled Performance Insights, a performance tuning and monitoring feature that helps you quickly assess the load on a database. It can show you where the bottlenecks on our database are and which queries may be causing them.

Alerting

We have warning and critical level alerting for CPU, memory and database connections. Warning level alerts will be sent via email. Critical alerts will set off Pagerduty.

Backups

Notify backups are stored for 31 days. All transaction logs are kept over this time, meaning that you can restore the database to any point in time between 31 days and 5 minutes ago.
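The valid restore window can be expressed with simple interval arithmetic, for example:

```sql
-- The point-in-time restore window, relative to now
SELECT now() - interval '31 days'   AS earliest_restore_point,
       now() - interval '5 minutes' AS latest_restore_point;
```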

Restoring a database

For some incidents (such as where a specific record has been deleted, or a single table corrupted) you may wish to restore a copy of the database so you can extract the lost data and re-import it into the main notifydb. In this case you can follow the instructions below to restore the database, but do not need to fail over. In the event of a catastrophic database issue, you may need to restore the database and fail over.

Before running a restore consider:
Do we need to treat this as an incident?

To restore a database you need to create a new RDS instance from a snapshot. This can be done using the AWS console or via Terraform. It is recommended to use Terraform if you think you might need to cut over.

Terraform restore

This is the preferred approach to restoring a database, especially if there is any possibility of cutting over to use it. If you do not do it via Terraform, you will need to correct the Terraform state before you can deploy again.

To restore a database from terraform you need to:

  • Create a new branch of notifications-aws.
  • Open the terraform/notify-infra/tfvars/<environment>.tfvars file
  • Find the section that looks like:
notifydb_rds_instances = {
  "notifydb" = {}
  # Uncomment the following to create a restore instance.
  # This is safe to do as it will not be used by any service.
  # "notifydb-2" = {
  #   "source_db_instance_identifier" = "notifydb",
  #   "restore_time"                  = "2025-05-30T12:00:00Z"
  # }
}
  • Uncomment the second database entry and replace the restore time with the one you want, e.g.:
notifydb_rds_instances = {
  "notifydb" = {}
  # Uncomment the following to create a restore instance.
  # This is safe to do as it will not be used by any service.
  "notifydb-2" = {
    "source_db_instance_identifier" = "notifydb",
    "restore_time"                  = "2025-06-02T12:00:00Z"
  }
}

Note: this database name cannot be changed without downtime.

  • Commit the code change and push your branch.
  • Create a PR and closely check the PR checks for the environment you are using.
  • Merge the PR, or:

If you need to roll out faster than the pipeline will allow you can run the following from notifications-aws:

cd terraform/notify-infra
gds aws notify-<environment>-admin -- make <environment> apply

Console restore

Before you start the restore process from the console you need to:

  1. Log in to the AWS console as an admin user
  2. Go to the Aurora and RDS console
  3. Click on Databases in the left hand menu
  4. Select the notifydb from the databases list
  5. From the 'Connectivity & security' tab find the 'Connectivity & security' panel and make a note of:
    5.1 Networking - VPC
    5.2 Security - VPC security group
  6. From the 'Maintenance & backups' tab find the 'Maintenance' panel and make a note of:
    6.1 Maintenance window (you will need to look at a calendar to see which day of the week this is)

Keep this tab open, so you can compare settings later.

To restore a database from the console open a new browser tab and:

  1. Go to the Aurora and RDS console
  2. Click on Databases in the left hand menu
  3. Select the notifydb from the databases list
  4. Click Actions -> Restore to point in time
  5. From the 'Restore time' panel:
    5.1 Select 'Latest restorable time' or the custom time you want to restore to
  6. From the 'Settings' panel:
    6.1 Set the 'DB instance identifier', for example notifydb-restore
  7. From the 'Availability and durability' panel:
    7.1 Choose Multi-AZ DB instance deployment (2 instances)
  8. From the 'Connectivity' panel:
    8.1 Make sure the 'VPC' setting is set to notify-vpc
    8.2 From the 'DB subnet group' choose notifydb[digits]
    8.3 Make sure 'Public access' is set to No
    8.4 Set the 'Existing VPC security group' to the value recorded in step 5.2 of the preparation steps above. This is important. Without it, you cannot connect via the bastion and apps will not be able to connect.
    8.5 Change 'Availability Zone' to No Preference
  9. From the 'Additional configuration' panel (database options, encryption, monitoring, maintenance):
    9.1 Set the 'DB parameter group' to notifydb-pg
    9.2 From the 'Log exports' options choose: PostgreSQL log and Upgrade log
    9.3 From the 'Maintenance window' options choose 'Choose a window' and apply the values from step 6.1 of the preparation steps above
  10. Review the rest of the settings to ensure they are the same as the notifydb database
  11. Click Restore DB Instance or Restore to point in time from the bottom right

Get our applications to start using a restored database

This step is optional and should only be used in the worst case, such as when there is serious data loss or corruption in the database. In most cases you should be able to use pg_dump to extract the data you need from the restored database to restore to the original database.

Currently the application connects to the database via a CNAME. This means that we can change the CNAME to point to the new database and the application will automatically connect to the new database.

However, be warned that changing the DNS will not disconnect any existing connections to the old database. You'll need to ensure that all connections to the old database are closed before switching over. This can be achieved by stopping the old database. If you choose to stop the application instead, you will need to wait for the DNS TTL before starting the application again.
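As a sketch, you can check from a session on the old database that nothing else is still connected before switching over:

```sql
-- Connections still open against the old instance; ideally this returns
-- no rows other than your own session (which we filter out here)
SELECT pid, usename, application_name, client_addr, state
FROM pg_stat_activity
WHERE datname = current_database()
  AND pid <> pg_backend_pid();
```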

Before starting the failover you should pause all Concourse deployments to the environment you are failing over. Until the terraform is updated, a deployment will likely modify or recreate the notifydb.

Please note that a restored database cannot be renamed once it is created (and we cannot reuse the notifydb database name). Terraform will need to be updated to reflect the new DB name.

This can be updated via the AWS console or via Terraform.

Failover using terraform

This is the preferred approach to cutting over the database. This should only be done if the database restore was done via terraform as well.

  • Create a new branch of notifications-aws.
  • Open the terraform/notify-infra/tfvars/<environment>.tfvars
  • Find the section that looks like:
live_notifydb_rds_instance = "notifydb"
  • Update this to be the new database name you created. E.g.
live_notifydb_rds_instance = "notifydb-2"
  • Commit the code change and push your branch.
  • Create a PR and closely check the PR checks for the environment you are using.
  • Merge the PR, or:

If you need to roll out faster than the pipeline will allow you can run the following from notifications-aws:

cd terraform/notify-infra
gds aws notify-<environment>-admin -- make <environment> apply

Failover using the AWS console

To update via the aws console:

  1. Pause all deployments to the environment you are working in. The terraform state will need to be manually fixed before deployments can resume. The terraform state manipulation will be non-trivial and that is why the terraform approach is preferred.
  2. Log in to the AWS console as an admin user
  3. Go to the Aurora and RDS console
  4. Click on Databases in the left hand menu
  5. Select the notifydb from the databases list
  6. Click Actions followed by Stop temporarily (This is to ensure no existing connections are open)
  7. Click on 'Databases' in the left hand menu
  8. Select the restored DB.
  9. From the 'Connectivity & security' tab find the 'Connectivity & security' panel and copy the value for Endpoint & port - Endpoint
  10. Go to the route53 console
  11. Click on 'Hosted zones' in the left hand menu
  12. Select the zone with the description public zone. In prod this zone is called notifications.service.gov.uk, but the name of the zone is different in each environment.
  13. Use the filter to search for db and select the tickbox next to CNAME record. The record name should be db.[name_of_zone], in prod the record name is db.notifications.service.gov.uk.
  14. Click the edit record button and change the 'value' setting to the new database endpoint copied in step 9.
  15. The application will automatically connect to the new database after the DNS TTL has expired
  16. Update the terraform state to reflect the new database.
  17. Perform a terraform plan to ensure a fail back will not happen before unpausing deployments

Post failover

Check the environment, prepare comms to customers (explaining what has been restored and what might have been lost), and update status page.

Cleanup

If you have made changes via the AWS console, all changes will need to be imported into the Terraform state. This can be done via tfmigrate or manually. Once this is done, you can consider re-enabling the deployment pipeline.

Deleting RDS instances

To ensure our Disaster Recovery plan is effective, we have implemented a policy that prevents Admins from deleting RDS instances in the production account. This policy is in place to prevent accidental deletion of RDS instances, as well as to prevent compromised user accounts from deleting them. The policy is enforced at the top-level AWS account, and you can find more information about it at the following link:

https://github.com/alphagov/aws-billing-account/blob/master/account_terraform/service_control_policies.tf#L90

If you need to delete an RDS instance, you will need to have this service control policy removed first.

Upgrading our Postgres minor version

AWS will force us to upgrade our Postgres minor version occasionally. We've done this once before, so you can use the following as a guide (but it is OK to deviate as you see fit).

Test the upgrade against preview

  1. Change the value of auto_minor_version_upgrade value from false to true. Make sure this is just for the preview database.
  2. Wait for the maintenance window to happen, confirm that the upgrade was successful and preview is still working.
  3. Change the value of auto_minor_version_upgrade back to false for preview. Also change the RDS engine version to the new minor version for preview (otherwise Terraform will error).

Test the upgrade against staging

This test will be useful for checking roughly how much downtime we might expect. Rather than waiting for the maintenance window, we will manually trigger the upgrade using the AWS user interface whilst running a basic load test to see what happens.

  1. Start a low volume load test running against the notify API status endpoint (consider using something like vegeta from your local machine)
  2. Manually click the upgrade button in AWS to upgrade staging
  3. Once the upgrade has finished, you can end the load test and see how much downtime there was by looking at the error logs in Logit staging. We expect less than a minute of downtime for minor upgrades. If there is more, then discuss with the team what to do
  4. Change the RDS engine version to the new minor version for staging (otherwise Terraform will error).

Put out comms and upgrade production

  1. If preview and staging have upgraded successfully, pick which week you want the upgrade to happen for production. It should be at least 2 weeks in the future so we can give our users at least 2 weeks notice. You should let the team know and anyone who is on support the night of the upgrade.
  2. Set up a scheduled maintenance on statuspage for the production upgrade. You should be able to copy the message from https://status.notifications.service.gov.uk/incidents/cw4c73pz8hd5. Automate as much of the scheduling as possible using statuspage so you don't need to manually update anything when the maintenance window starts. You should send notifications to announce the scheduled maintenance but the in progress and completed notifications are optional.
  3. A few days before the expected day of the production upgrade, check in the AWS interface that pending maintenance is set for the production database, and it is upgrading to the version you expect. Then you should change the value for auto_minor_version_upgrade to be true for production.
  4. Production will then upgrade automatically in our maintenance window. We expect this to page the person on call via PagerDuty, who will need to make sure that it recovers successfully.
  5. Change the value of auto_minor_version_upgrade back to false for production. Also change the RDS engine version to the new minor version for production (otherwise Terraform will error). This can be done the morning after the upgrade.

Using a copy of the staging database locally

You might want to use a copy of one of our environment's databases locally, for example to test a database migration.

You can grab this data using pg_dump. This command takes at least half an hour to run.

gds aws notify-staging -- db-connect.sh notifydb -- pg_dump \
    -Fc \
    -f staging-db-dump \
    --exclude-schema=hint_plan \
    --exclude-schema=pglogical \
    --exclude-table=notifications \
    --exclude-table=notification_history \
    --exclude-table=ft_billing \
    --exclude-table=ft_notification_status

This grabs a copy of the tables, excluding the two schemas (which produce errors as pg_dump does not have permission to read them; they're related to plugins so not critical) and the four mentioned tables (our four biggest, which make up 99% of the 440GB staging DB size; without them, the dump file is 400MB).

Note that excluding these tables excludes their schema as well as their data! You'll need to run an appropriate create table command later if you want to use and repopulate those tables (pg_dump's --schema-only and --table options can generate the necessary statements).

You can restore it to a local copy by running:

psql -c "create database local_test"
pg_restore -v -d local_test staging-db-dump
export SQLALCHEMY_DATABASE_URI="postgresql://localhost/local_test"
# you can now run commands like `flask db upgrade` against this db