Database Schema Rollback - HSLdevcom/jore4 GitHub Wiki

Database Rollback

This page describes how to back up and how to restore the JORE4 database's data and schema to a previous version.

Note that this how-to is only concerning the Azure Database for PostgreSQL (flexible server) instance. Other databases, like jore4-mapmatchingdb are not considered in this article (Anyway, jore4-mapmatchingdb is a stateless container and is filled up with data on start-up from the blob storage, thus no need to back up).

How to back up

Azure - automatically

Azure automatically creates a full snapshot of the database once a day. These snapshots by default are retained for a week (for production it's set to 30 days). Also the transaction logs are stored, so it's possible to restore a version of the database in any point of time (PITR). Read more about backups and restoring at azure

Azure - manually

In theory it's possible to manually create a snapshot of the current database in Azure with Backup Center. I tested it however and it only seemed to work with single-server PostgreSQL only. Anyway, the PITR restore should work for almost all of the cases.

pg_dump - manually

You could manually create a dump of the logical databases within the Azure PostgreSQL instance using the pg_dump tool. Remember that:

  • You should be connected to VPN (or from office IPs)
  • You should use the bastion host to create a tunnel to the database
  • Currently we have networkdb, timetablesdb, authdb and jore3importerdb logical databases. You probably want to back up at least networkdb and timetablesdb, but in some cases it's good to preserve the staging table fore jore3 importer or the auth session information.

How to fix/restore database

The instructions below uses the prod environment as an example. The same instructions work also with other environments.

First rule of doing anything major with the database (including restoring an older version) is to create a backup. In this case, you should:

  • jot down the exact UTC date and time before you start messing with the database, to be able to use the point in time restore (PITR)
    • if you are paranoid, you could also create backups of the logical databases with pg_dump (see above)
  • take note of the commit hash in jore4-flux repository for the prod branch to know that exactly which jore4-hasura version is deployed at the time. Hasura has the migrations that define that current database schema.
    • if you are paranoid and don't trust that the same Hasura version is deployed and running in Kubernetes as the one marked in jore4-flux, you should check what docker image is running from jore4-hasura (e.g. by checking hsl-jore4-prod-cluster / Workloads / jore4-hasura)

And important thing to note: as long as an X version of jore4-hasura container is running, every time it restarts it will run all the migrations and reach the X database state. If you manually make Y changes to the database, the jore4-hasura container might rerun the migrations and force the schema back to the X state.

Use-case 1: fix a broken SQL function, minor fix to broken SQL data

This is the most common and the easiest to fix. Some migration that you've run with Hasura has broken an SQL function or some data. If the error is minor, you could get away with making manual changes directly in the database. Remember that when jore4-hasura is restarted, it'll want to rerun the migrations, possibly breaking the said SQL function or data again. So you might wanna create a hotfix version of jore4-hasura and deploy that instead.

Use-case 2: major issue, want to roll back schema

(note on repeatable migrations and down migrations)

Use-case 3: everything is broken, full steam astern!

So hell broke loose and you need to revert to a previous state? My condolences... Here's what you need to do:

  1. Note the exact UTC time and the commit hash for the jore4-hasura image to be able come back to the current state in case you accidentally break even more things. (See above)

  2. Restore a previous version of the Azure PostgreSQL instance. See instructions here. Note that this will create a new instance of the original database with a previous state in it. Let's name it e.g. as hsl-jore4-prod-db-restored. This may take a long while if there were lots of transactions after the last snapshot. Leave the original hsl-jore4-prod-db instance still running, in parallel.

  3. Find the jore4-hasura image version that has a working state of migrations. This may either be a new "hotfix" version or an old version. Move the prod branch to the commit in jore4-flux repository that contains this working version of Hasura. This is necessary, because we want to make sure that when the restored version of the database is taken into use, the broken migrations won't break it again.

  4. Find the hsl-jore4-public-db-hostname secret in hsl-jore4-prod-vault and change it to the restored database instance's hostname (hsl-jore4-prod-db-restored.postgres.database.azure.com). This will make sure that when the Kubernetes containers are restarted, they will connect to the restored version of the database.

  5. Do restart the containers in Kubernetes to make sure they don't communicate with the old instance of the database. You may just kill the pods through Azure UI (hsl-jore4-prod-cluster / Workloads -> Delete!), new versions will respawn in a few seconds. Currently at least the following microservices should be restarted as they connect to the database directly: jore4-hasura, jore4-auth, jore4-jore3importer.

  6. Check that all the containers have started correctly and everything is functional. If not, you can try to restore an even earlier version of the database with the same steps. Or give up, go back to the original version before the restoration and try to fix the issues manually.

  7. Currently there are two versions of the database running, you may want to delete the original hsl-jore4-prod-db and rename hsl-jore4-prod-db-restore to hsl-jore4-prod-db (and refresh the secrets and the Kubernetes containers again to point at the correct hostname). Note that when the IaC scripts are sometimes rerun from the jore4-deploy repository, they expect the database to exist with the name hsl-jore4-prod-db.