Postgres databases - alphagov/notifications-manuals GitHub Wiki

We run a Postgres 15 database using Amazon RDS.

Configuration

The RDS instance itself is configured in our Terraform code.

For Postgres users and extensions, these are configured using our DB setup script. This script is idempotent and is run by Concourse every time we deploy our Terraform code.

We recommend using pgcli as your database client. Run these commands in your terminal to install it:

brew tap dbcli/tap
brew install pgcli

Connecting to a database

To install the db-connect script, clone the notifications-aws repo and follow the install instructions.

You can then connect to the database with an appropriate role:

Read only access

Developers should use this role by default. You'll only be able to read data, not write it. The postgres readonly user is automatically assumed by using the readonly AWS role.

gds aws notify-preview -- db-connect.sh notifydb -- pgcli

Write access

Developers should only use this role if you need to write data. When doing so, you should get any queries you run reviewed by another developer. The postgres write user is automatically assumed by using the -admin AWS role.

gds aws notify-preview-admin -- db-connect.sh notifydb -- pgcli

Admin access

Developers should only use this role in emergencies, when no other role has sufficient permissions. An example would be for running a vacuum statement.

gds aws notify-preview-admin -- db-connect.sh --role admin notifydb -- pgcli

Postgres users

We have 5 different Postgres users.

readonly - Can only read data, can't write it. This user should only be used by developers running manual queries
write - Can read and write. This user should only be used by developers running manual queries.
app-username-here - This is used by all our apps (including our db-migration app which runs our schema changes) and can read and write data. We may decided to split up our apps postgres users in the future to give them more granular permissions. Developers shouldn't need to use this user.
notify_schema_owner - You don't log in as this user, but when you log in as the write or app-username-here user you will automatically assume this role. This means that any schema changes done by the write or app-username-here user will all be owned by a single owner - notify_schema_owner. This avoids any problems with schema changes being owned by different Postgres users.
notifydb - AWS creates this user when it creates the database. We also call this the admin user. Developers can log in as this user if they need to, but should use the write user instead if possible.

Apart from the notifydb role, we create the rest ourselves using our DB setup script.

Logging

Logs are found in AWS Cloudwatch, in the /aws/rds/instance/notifydb/postgresql log group.

You can query them by going to the log group and clicking 'Search all log streams'. This is because there are 4 log streams because RDS writes logs in parallel.

The general postgres logs are logged.

All queries run by the readonly, write and "admin" (the notifydb postgres) user are logged.

TODO: Document slow query logging

Monitoring

RDS graphs

These are available in the AWS RDS dashboard in Grafana. You can also see the full set of RDS metrics in the AWS console.

RDS Performance Insights

We have enabled Performance Insights, which is a performance tuning and monitoring feature that helps you quickly assess the load on a database. It can show you why the bottlenecks on our database and the queries that may be causing them.

Alerting

We have warning and critical level alerting for CPU, memory and database connections. Warning level alerts will be sent via email. Critical alerts will set off Pagerduty.

Backups

Notify backups are stored for 31 days. All transaction logs are kept over this time, meaning that you can restore the database to any point in time between 31 days and 5 minutes ago.

Restoring a database

To restore a database you need to create a new RDS instance from a snapshot. This can be done using the AWS console or via the terraform.

To restore a database from the console you need to:

Go to the RDS console
Select the database you want to restore
Click Actions -> Restore to point in time
Select 'Latest restorable time' or the custom time you want to restore to
Enter a new DB instance identifier, for example notifydb-restore
Set the Availability & durability setting as appropriate for the situation. If replacing notifydb then Multi-AZ, if to simply query then Don not create a standby instance
Change Availability Zone to No Preference
Set the VPC to Notify VPC
Set the subnet group to notifydb
Set the security group to be the same as set for notifydb. This is important. Without it, you cannot connect via the bastion and apps will not be able to connect.
Set the DB parameter group to notifydb-pg
Set the backup window to the time set for notifydb
Review the rest of the settings to ensure they are the same as the original database
Click Restore DB Instance

To restore a database from terraform you need to:

Go to the environment folder you want to restore. Open the new-migration-resources.tf file.
Copy the module "rds" block and paste it just below.
Add/Replace the following config to the new module

module "rds_restore" {
  ...
  db_identifier                = "notifydb-restore"
  dedicated_kms_key            = false
  external_kms_key             = module.rds.kms_id
  apply_immediately            = true

  restore_to_point_in_time = {
    source_db_instance_identifier = "notifydb"
    restore_time = "2023-10-02T10:00:00Z"
  }
}

Adjust the 'restore_time' to meet your needs. For the latest use use_latest_restorable_time: true.

Ensure you allow access from the bastion by adjusting the allowed_services_sg_ids in the bastion module.

module "bastion" {
  ...
  allowed_services_sg_ids = [
    ...
    { "name" : "rds_restore", "sg_id" : module.rds_restore.db_security_group_id, "port" : 5432 },
  ]
}

Use the concourse pipelines or run terraform manually to deploy.

Get our applications to start using a restored database

This step is optional and should only be used in the worst case, such as when there is serious data loss or corruption in the database. In most cases you should be able to use pg_dump to extract the data you need from the restored database to restore to the original database.

Currently the application connects to the database via a CNAME. This means that we can change the CNAME to point to the new database and the application will automatically connect to the new database.

However, be warned that changing the DNS will not disconnect any existing connections to the old database. You'll need to ensure that all connections to the old database are closed before switching over. This can be achieved by stopping the old database. If you choose to stop the application instead, you will need to wait for the DNS TTL before starting the application again.

This can be updated via the aws console or via terraform.

To update via the aws console:

Go to the RDS console
Select the database you want to move away from
Click Actions followed by Stop temporarily (This is to ensure no existing connections are open)
From the RDS dashboard, choose the restored DB. Copy the database endpoint URL.
Go to the route53 console
Select the hosted zone for the environment
Select the db CNAME.
Change the value to the new database endpoint copied in step 4.
The application will automatically connect to the new database, after the dns ttl has expired

To update via terraform:

Go to the environment folder you want to restore. Open the new-migration-resources.tf file.
Update the module "rds_route53" and update the record to point to the new module.rds_restore.host. E.g.

module "rds_route53" {
  ...
  subdomain = "db"

  records = [
    { "name" : "notify", "host" : module.rds_restore.host, "weight" : 100 }
  ]
}

Update the output security_group_ids and refer to the new rds module. E.g.

output "security_group_ids" {
  value = {
    ...
    "rds"                                     = module.rds_restore.db_security_group_id
  }
}

Run both the environment terraform and the application terraform, ideally via concourse.

Cleanup

If you have made changes via the aws console. All changes will need to be imported into the terraform state. This can be done via tfmigrate or manually.

Deleting RDS instances

To ensure our Disaster Recovery plan is effective, we have implemented a policy that restricts Admins from deleting RDS instances in the production account. This policy is in place to prevent accidental deletion of RDS instances by users, as well as to prevent compromised user accounts from deleting RDS instances. The policy is enforced at the top level AWS account, and you can find more information about it at the following link:

https://github.com/alphagov/aws-billing-account/blob/master/account_terraform/service_control_policies.tf#L90

If you need to delete an RDS instance, you will need to either have this service control policy removed to delete the RDS instance.

Upgrading our Postgres minor version

AWS will force us to upgrade our Postgres minor version occasionally. We've done this once before so you can use the following as a guide (but is OK to deviate as you see fit).

Test the upgrade against preview

Change the value of auto_minor_version_upgrade value from false to true. Make sure this is just for the preview database.
Wait for the maintenance window to happen, confirm that the upgrade was successful and preview is still working.
Change the value of auto_minor_version_upgrade back to false for preview. Also change the RDS engine version to the new minor version for preview (otherwise Terraform will error).

Test the upgrade against staging

This test will be useful for checking roughly how much downtime we might expect. Rather than waiting for the maintenance window to happen, we will manually trigger the upgrade using the AWS user interface whilst doing a basic load test to see what happens

Start a low volume load test running against the notify API status endpoint (consider using something like vegeta from your local machine)
Manually click the upgrade button in AWS to upgrade staging
Once the upgrade has finished, you can end the load test and see the results for how much downtime there was by looking at the error logs in Logit staging. We expect there to be less than a minute for minor upgrades. If there is more, then discuss with the team what to do
Change the RDS engine version to the new minor version for staging (otherwise Terraform will error).

Put out comms and upgrade production

If preview and staging have upgraded successfully, pick which week you want the upgrade to happen for production. It should be at least 2 weeks in the future so we can give our users at least 2 weeks notice. You should let the team know and anyone who is on support the night of the upgrade.
Set up a scheduled maintenance on statuspage for the production upgrade. You should be able to copy the message from https://status.notifications.service.gov.uk/incidents/cw4c73pz8hd5. Automate as much of the scheduling as possible using statuspage so you don't need to manually update anything when the maintenance window starts. You should send notifications to announce the scheduled maintenance but the in progress and completed notifications are optional.
A few days before the expected day of the production upgrade, check in the AWS interface that pending maintenance is set for the production database, and it is upgrading to the version you expect. Then you should change the value for auto_minor_version_upgrade to be true for production.
Production will then upgrade automatically in our maintenance window. We expect it to pagerduty the person on call who will need to make sure that it recovers successfully.
Change the value of auto_minor_version_upgrade back to false for production. Also change the RDS engine version to the new minor version for production (otherwise Terraform will error). This can be done the morning after the upgrade.

Using a copy of the staging database locally

You might want to use a copy of one of our environment's databases locally, for example to test a database migration.

You can grab this data using pg_dump. This command takes at least half an hour to run.

gds aws notify-staging -- db-connect.sh notifydb -- pg_dump \
    -Fc \
    -f staging-db-dump \
    --exclude-schema=hint_plan \
    --exclude-schema=pglogical \
    --exclude-table=notifications \
    --exclude-table=notification_history \
    --exclude-table=ft_billing \
    --exclude-table=ft_notification_status

This grabs a copy of the tables, excluding the two schemas (which produce errors as pg_dump does not have permission to read them - they're related to plugins so not critical) and four mentioned tables (our four biggest tables, which make up 99% of the 440GB staging db size - without them, the dump file is 400mb).

Note that by excluding them, this includes the schema as well as the data! You'll need to run an appropriate create table command later if you want to use and repopulate those tables

You can restore it to a local copy by running:

psql -c "create database local_test"
pg_restore -v -d local_test staging-db-dump
export SQLALCHEMY_DATABASE_URI="postgresql://localhost/local_test"
# you can now run commands like `flask db upgrade` against this db