Databases (PostgreSQL) - alphagov/notifications-manuals GitHub Wiki
We run a Postgres 15 database using Amazon RDS.
Configuration
The RDS instance itself is configured in our Terraform code.
Postgres users and extensions are configured using our DB setup script. This script is idempotent and is run by Concourse every time we deploy our Terraform code.
We recommend using pgcli as your database client. Run these commands in your terminal to install it:
brew tap dbcli/tap
brew install pgcli
Connecting to a database
To install the db-connect script, clone the notifications-aws repo and follow the install instructions.
See a list of all the roles you can use by running:
gds-cli aws | grep notify
You can then connect to the database with an appropriate role:
Read only access
Developers should use this role by default. You'll only be able to read data, not write it. The postgres readonly user is automatically assumed by using the readonly AWS role.
gds aws notify-preview -- db-connect.sh notifydb -- pgcli
Write access
Developers should only use this role if they need to write data. When doing so, they should get any queries they run reviewed by another developer. The postgres write user is automatically assumed by using the -admin AWS role.
gds aws notify-preview-admin -- db-connect.sh notifydb -- pgcli
Admin access
Developers should only use this role in emergencies, when no other role has sufficient permissions. An example would be for running a vacuum statement.
gds aws notify-preview-admin -- db-connect.sh --role admin notifydb -- pgcli
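Once connected, it can be worth confirming which Postgres user and role the session has assumed before running anything. A hedged sketch, assuming db-connect.sh passes an arbitrary trailing command through (as it does with pgcli and pg_dump elsewhere on this page) and using psql rather than pgcli for its -c flag:

```shell
# session_user is the login user; current_user reflects any role the
# session has automatically assumed (see "Postgres users" below).
gds aws notify-preview -- db-connect.sh notifydb -- psql -c "SELECT session_user, current_user;"
```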
Postgres users
We have 5 different Postgres users.
- readonly - Can only read data, can't write it. This user should only be used by developers running manual queries.
- write - Can read and write. This user should only be used by developers running manual queries.
- app-username-here - This is used by all our apps (including our db-migration app, which runs our schema changes) and can read and write data. We may decide to split up our apps' Postgres users in the future to give them more granular permissions. Developers shouldn't need to use this user.
- notify_schema_owner - You don't log in as this user, but when you log in as the write or app-username-here user you will automatically assume this role. This means that any schema changes made by the write or app-username-here user will all be owned by a single owner - notify_schema_owner. This avoids any problems with schema changes being owned by different Postgres users.
- notifydb - AWS creates this user when it creates the database. We also call this the admin user. Developers can log in as this user if they need to, but should use the write user instead if possible.
Apart from the notifydb role, we create the rest ourselves using our DB setup script.
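The automatic role assumption described above can be implemented with a per-role login-time setting. This is a hypothetical sketch of the pattern, not our actual DB setup script; only the role names come from the list above:

```sql
-- notify_schema_owner cannot log in; write is granted membership of it.
CREATE ROLE notify_schema_owner NOLOGIN;
CREATE ROLE write LOGIN;
GRANT notify_schema_owner TO write;

-- Make every new session for "write" assume the owner role automatically,
-- so any objects it creates are owned by notify_schema_owner.
ALTER ROLE write SET role TO notify_schema_owner;
```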
Logging
Logs are found in AWS Cloudwatch, in the /aws/rds/instance/notifydb/postgresql
log group.
You can query them by going to the log group and clicking 'Search all log streams', since RDS writes logs to 4 log streams in parallel.
The general Postgres logs are captured.
All queries run by the readonly, write and admin (the notifydb Postgres user) users are logged.
TODO: Document slow query logging
Monitoring
RDS graphs
These are available in the AWS RDS dashboard in Grafana. You can also see the full set of RDS metrics in the AWS console.
RDS Performance Insights
We have enabled Performance Insights, a performance tuning and monitoring feature that helps you quickly assess the load on a database. It can show you the bottlenecks on our database and the queries that may be causing them.
Alerting
We have warning and critical level alerting for CPU, memory and database connections. Warning level alerts will be sent via email. Critical alerts will trigger PagerDuty.
Backups
Notify backups are stored for 31 days. All transaction logs are kept over this time, meaning that you can restore the database to any point in time between 31 days and 5 minutes ago.
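The restore window implied by this retention can be sanity-checked locally before picking a restore time. A minimal sketch, assuming GNU date (on macOS, use gdate from coreutils):

```shell
# Print the earliest and latest restorable times implied by the
# 31 day / 5 minute retention described above (GNU date syntax).
earliest=$(date -u -d '31 days ago' +%Y-%m-%dT%H:%M:%SZ)
latest=$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
echo "Restore window: $earliest to $latest"
```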
Restoring a database
For some incidents (such as where a specific record has been deleted, or a single table corrupted) you may wish to restore a copy of the database so you can extract the lost data and re-import it into the main notifydb. In this case you can follow the instructions below to restore the database, but do not need to fail over. In the event of a catastrophic database issue, you may need to restore the database and fail over.
Before running a restore consider:
Do we need to treat this as an incident?
- Following the incident process
- Keeping notes in an incident log
- Updating statuspage to show that there is an issue with Notify that we are working on
To restore a database you need to create a new RDS instance from a snapshot. This can be done using the AWS console or via Terraform. It is recommended to do this via Terraform if you think you might need to cut over.
Terraform restore
This is the preferred approach to restoring a database, especially if there is any possibility of cutting over to use the restored database. If you do not do it via Terraform, you will need to correct the Terraform state before you can deploy again.
To restore a database from terraform you need to:
- Create a new branch of notifications-aws.
- Open the terraform/notify-infra/tfvars/<environment>.tfvars file.
- Find the section that looks like:
notifydb_rds_instances = {
"notifydb" = {}
# Uncomment the following to create a restore instance.
# This is safe to do as it will not be used by any service.
# "notifydb-2" = {
# "source_db_instance_identifier" = "notifydb",
# "restore_time" = "2025-05-30T12:00:00Z"
# }
}
- Uncomment the second database and replace the restore time with the one you want, e.g.:
notifydb_rds_instances = {
"notifydb" = {}
# Uncomment the following to create a restore instance.
# This is safe to do as it will not be used by any service.
"notifydb-2" = {
"source_db_instance_identifier" = "notifydb",
"restore_time" = "2025-06-02T12:00:00Z"
}
}
Note: this database name cannot be changed without downtime.
- Commit the code change and push your branch.
- Create a PR and closely check the PR checks for the environment you are using.
- Merge the PR, or:
If you need to roll out faster than the pipeline will allow you can run the following from notifications-aws:
cd terraform/notify-infra
gds aws notify-<environment>-admin -- make <environment> apply
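Before committing, you can check locally that the restore_time you picked is a well-formed RFC 3339 UTC timestamp. A quick sketch using GNU date; the timestamp here is just an example value:

```shell
restore_time="2025-06-02T12:00:00Z"  # example value; use the time you actually want
# GNU date fails on a malformed timestamp, so this doubles as a format check.
date -u -d "$restore_time" +%s > /dev/null && echo "restore_time parses OK"
```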
Console restore
Before you start the restore process from the console you need to:
- Log in to the AWS console as an admin user.
- Go to the Aurora and RDS console.
- Click on Databases in the left hand menu.
- Select notifydb from the databases list.
- From the 'Connectivity & security' tab find the 'Connectivity & security' panel and make a note of:
  5.1 Networking - VPC
  5.2 Security - VPC security group
- From the 'Maintenance & backups' tab find the 'Maintenance' panel and make a note of:
  6.1 Maintenance window (you will need to look at a calendar to see which day of the week this is)
Keep this tab open, so you can compare settings later.
To restore a database from the console open a new browser tab and:
- Go to the Aurora and RDS console.
- Click on Databases in the left hand menu.
- Select notifydb from the databases list.
- Click Actions -> Restore to point in time.
- From the 'Restore time' panel:
  11.1 Select 'Latest restorable time' or the custom time you want to restore to
- From the 'Settings' panel:
  12.1 Set the 'DB instance identifier', for example notifydb-restore
- From the 'Availability and durability' panel:
  13.1 Choose Multi-AZ DB instance deployment (2 instances)
- From the 'Connectivity' panel:
  14.1 Make sure the 'VPC' setting is set to notify-vpc
  14.2 From the 'DB subnet group' choose notifydb[digits]
  14.3 Make sure 'Public access' is set to No
  14.4 Set the 'Existing VPC security group' to the value recorded in step 5.2. This is important. Without it, you cannot connect via the bastion and apps will not be able to connect.
  14.5 Change 'Availability Zone' to No Preference
- From the 'Additional configuration' panel (database options, encryption enabled, monitoring enabled, maintenance):
  15.1 Set the 'DB parameter group' to notifydb-pg
  15.2 From the 'Log exports' options choose: PostgreSQL log and Upgrade log
  15.3 From the 'Maintenance window' options choose 'Choose a window' and apply the values from step 6.1
- Review the rest of the settings to ensure they are the same as the notifydb database.
- Click Restore DB Instance or Restore to point in time from the bottom right.
Get our applications to start using a restored database
This step is optional and should only be used in the worst case, such as when there is serious data loss or corruption in the database. In most cases you should be able to use pg_dump to extract the data you need from the restored database to restore to the original database.
Currently the application connects to the database via a CNAME. This means that we can change the CNAME to point to the new database and the application will automatically connect to the new database.
However, be warned that changing the DNS will not disconnect any existing connections to the old database. You'll need to ensure that all connections to the old database are closed before switching over. This can be achieved by stopping the old database. If you choose to stop the application instead, you will need to wait for the DNS TTL before starting the application again.
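Stopping the old database is the simplest way to drop connections, but they can also be terminated from an admin session on the old database. A hedged SQL sketch, assuming a role with sufficient privileges (e.g. the notifydb admin user):

```sql
-- Run while connected to the target database: terminates every
-- other client session, leaving only your own connection.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = current_database()
  AND pid <> pg_backend_pid();
```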
Before starting the failover you should pause all Concourse deployments to the environment you are failing over. Until the Terraform is updated, a deployment will likely modify or recreate the notifydb.
Please note that a restored database cannot be renamed once it is created (and we cannot reuse the notifydb database name). Terraform will need to be updated to reflect the new db name. This can be updated via the AWS console or via Terraform.
Failover using terraform
This is the preferred approach to cutting over the database. This should only be done if the database restore was done via terraform as well.
- Create a new branch of notifications-aws.
- Open the terraform/notify-infra/tfvars/<environment>.tfvars file.
- Find the section that looks like:
live_notifydb_rds_instance = "notifydb"
- Update this to be the new database name you created, e.g.:
live_notifydb_rds_instance = "notifydb-2"
- Commit the code change and push your branch.
- Create a PR and closely check the PR checks for the environment you are using.
- Merge the PR, or:
If you need to roll out faster than the pipeline will allow you can run the following from notifications-aws:
cd terraform/notify-infra
gds aws notify-<environment>-admin -- make <environment> apply
Failover using the AWS console
To update via the AWS console:
- Pause all deployments to the environment you are working in. The Terraform state will need to be manually fixed before deployments can resume. The state manipulation is non-trivial, which is why the Terraform approach is preferred.
- Log in to the AWS console as an admin user.
- Go to the Aurora and RDS console.
- Click on Databases in the left hand menu.
- Select notifydb from the databases list.
- Click Actions followed by Stop temporarily (this is to ensure no existing connections are open).
- Click on 'Databases' in the left hand menu.
- Select the restored DB.
- From the 'Connectivity & security' tab find the 'Connectivity & security' panel and copy the value for Endpoint & port - Endpoint.
- Go to the Route 53 console.
- Click on 'Hosted zones' in the left hand menu.
- Select the zone with the description public zone. In prod this zone is called documents.service.gov.uk, but the name of the zone is different in each environment.
- Use the filter to search for db and select the tickbox next to the CNAME record. The record name should be db.[name_of_zone]; in prod the record name is db.notifications.service.gov.uk.
- Click the edit record button and change the 'value' setting to the new database endpoint copied in step 8.
- The application will automatically connect to the new database after the DNS TTL has expired.
- Update the Terraform state to reflect the new database.
- Perform a terraform plan to ensure a fail back will not happen before unpausing deployments.
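After editing the record you can verify the CNAME from your own machine. A sketch using dig; the record name shown is the prod one, and your environment's will differ:

```shell
# Should print the endpoint of the restored instance once the DNS TTL
# has expired and caches have refreshed.
dig +short CNAME db.notifications.service.gov.uk
```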
Post failover
Check the environment, prepare comms to customers (explaining what has been restored and what might have been lost), and update status page.
Cleanup
If you have made changes via the AWS console, all changes will need to be imported into the Terraform state. This can be done via tfmigrate or manually. Once this is done you can consider re-enabling the deployment pipeline.
Deleting RDS instances
To ensure our Disaster Recovery plan is effective, we have implemented a policy that restricts Admins from deleting RDS instances in the production account. This policy is in place to prevent accidental deletion of RDS instances by users, as well as to prevent compromised user accounts from deleting RDS instances. The policy is enforced at the top level AWS account.
If you need to delete an RDS instance, you will need to have this service control policy removed first.
Upgrading our Postgres minor version
AWS will force us to upgrade our Postgres minor version occasionally. We've done this once before, so you can use the following as a guide (but it is OK to deviate as you see fit).
Test the upgrade against preview
- Change the value of auto_minor_version_upgrade from false to true. Make sure this is just for the preview database.
- Wait for the maintenance window to happen, confirm that the upgrade was successful and preview is still working.
- Change the value of auto_minor_version_upgrade back to false for preview. Also change the RDS engine version to the new minor version for preview (otherwise Terraform will error).
Test the upgrade against staging
This test will be useful for checking roughly how much downtime we might expect. Rather than waiting for the maintenance window to happen, we will manually trigger the upgrade using the AWS user interface whilst doing a basic load test to see what happens.
- Start a low volume load test running against the notify API status endpoint (consider using something like vegeta from your local machine).
- Manually click the upgrade button in AWS to upgrade staging.
- Once the upgrade has finished, you can end the load test and see how much downtime there was by looking at the error logs in Logit staging. We expect there to be less than a minute of downtime for minor upgrades. If there is more, discuss with the team what to do.
- Change the RDS engine version to the new minor version for staging (otherwise Terraform will error).
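The load test in step 1 can be a single vegeta pipeline. A sketch, assuming vegeta is installed locally; the URL is a placeholder, not the real staging status endpoint:

```shell
# 5 requests/second for 30 minutes against the status endpoint
# (hypothetical URL), then print a latency/error-rate report.
echo "GET https://api.staging.example.invalid/_status" \
  | vegeta attack -rate=5 -duration=30m \
  | vegeta report
```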
Put out comms and upgrade production
- If preview and staging have upgraded successfully, pick which week you want the upgrade to happen for production. It should be at least 2 weeks in the future so we can give our users at least 2 weeks notice. You should let the team know and anyone who is on support the night of the upgrade.
- Set up a scheduled maintenance on statuspage for the production upgrade. You should be able to copy the message from https://status.notifications.service.gov.uk/incidents/cw4c73pz8hd5. Automate as much of the scheduling as possible using statuspage so you don't need to manually update anything when the maintenance window starts. You should send notifications to announce the scheduled maintenance but the in progress and completed notifications are optional.
- A few days before the expected day of the production upgrade, check in the AWS interface that pending maintenance is set for the production database, and that it is upgrading to the version you expect. Then change the value of auto_minor_version_upgrade to true for production.
- Production will then upgrade automatically in our maintenance window. We expect it to page the person on call via PagerDuty, who will need to make sure that it recovers successfully.
- Change the value of auto_minor_version_upgrade back to false for production. Also change the RDS engine version to the new minor version for production (otherwise Terraform will error). This can be done the morning after the upgrade.
Using a copy of the staging database locally
You might want to use a copy of one of our environment's databases locally, for example to test a database migration.
You can grab this data using pg_dump. This command takes at least half an hour to run.
gds aws notify-staging -- db-connect.sh notifydb -- pg_dump \
-Fc \
-f staging-db-dump \
--exclude-schema=hint_plan \
--exclude-schema=pglogical \
--exclude-table=notifications \
--exclude-table=notification_history \
--exclude-table=ft_billing \
--exclude-table=ft_notification_status
This grabs a copy of the tables, excluding the two schemas (which produce errors as pg_dump does not have permission to read them - they're related to plugins so not critical) and the four listed tables (our four biggest tables, which make up 99% of the 440GB staging db size - without them, the dump file is 400MB).
Note that excluding a table excludes its schema as well as its data! You'll need to run an appropriate create table command later if you want to use and repopulate those tables.
You can restore it to a local copy by running:
psql -c "create database local_test"
pg_restore -v -d local_test staging-db-dump
export SQLALCHEMY_DATABASE_URI="postgresql://localhost/local_test"
# you can now run commands like `flask db upgrade` against this db