Database Schema Rollback - HSLdevcom/jore4 GitHub Wiki
Database Rollback
This page describes how to back up and how to restore the JORE4 database's data and schema to a previous version.
Note that this how-to is only concerning the Azure Database for PostgreSQL (flexible server) instance. Other databases, like jore4-mapmatchingdb are not considered in this article (Anyway, jore4-mapmatchingdb is a stateless container and is filled up with data on start-up from the blob storage, thus no need to back up).
How to back up
Azure - automatically
Azure automatically creates a full snapshot of the database once a day. These snapshots by default are retained for a week (for production it's set to 30 days). Also the transaction logs are stored, so it's possible to restore a version of the database in any point of time (PITR). Read more about backups and restoring at azure
Azure - manually
In theory it's possible to manually create a snapshot of the current database in Azure with Backup Center. I tested it however and it only seemed to work with single-server PostgreSQL only. Anyway, the PITR restore should work for almost all of the cases.
pg_dump - manually
You could manually create a dump of the logical databases within the Azure PostgreSQL instance using the pg_dump tool. Remember that:
- You should be connected to VPN (or from office IPs)
- You should use the bastion host to create a tunnel to the database
- Currently we have networkdb, timetablesdb, authdb and jore3importerdb logical databases. You probably want to back up at least networkdb and timetablesdb, but in some cases it's good to preserve the staging table fore jore3 importer or the auth session information.
How to fix/restore database
The instructions below uses the prod
environment as an example. The same instructions work also with other environments.
First rule of doing anything major with the database (including restoring an older version) is to create a backup. In this case, you should:
- jot down the exact UTC date and time before you start messing with the database, to be able to use the point in time restore (PITR)
- if you are paranoid, you could also create backups of the logical databases with
pg_dump
(see above)
- if you are paranoid, you could also create backups of the logical databases with
- take note of the commit hash in
jore4-flux
repository for theprod
branch to know that exactly whichjore4-hasura
version is deployed at the time. Hasura has the migrations that define that current database schema.- if you are paranoid and don't trust that the same Hasura version is deployed and running in Kubernetes as the one marked in
jore4-flux
, you should check what docker image is running fromjore4-hasura
(e.g. by checkinghsl-jore4-prod-cluster
/ Workloads /jore4-hasura
)
- if you are paranoid and don't trust that the same Hasura version is deployed and running in Kubernetes as the one marked in
And important thing to note: as long as an X
version of jore4-hasura
container is running, every time it restarts it will run all the migrations and reach the X
database state. If you manually make Y
changes to the database, the jore4-hasura
container might rerun the migrations and force the schema back to the X
state.
Use-case 1: fix a broken SQL function, minor fix to broken SQL data
This is the most common and the easiest to fix. Some migration that you've run with Hasura has broken an SQL function or some data. If the error is minor, you could get away with making manual changes directly in the database. Remember that when jore4-hasura
is restarted, it'll want to rerun the migrations, possibly breaking the said SQL function or data again. So you might wanna create a hotfix version of jore4-hasura
and deploy that instead.
Use-case 2: major issue, want to roll back schema
(note on repeatable migrations and down migrations)
Use-case 3: everything is broken, full steam astern!
So hell broke loose and you need to revert to a previous state? My condolences... Here's what you need to do:
-
Note the exact UTC time and the commit hash for the
jore4-hasura
image to be able come back to the current state in case you accidentally break even more things. (See above) -
Restore a previous version of the Azure PostgreSQL instance. See instructions here. Note that this will create a new instance of the original database with a previous state in it. Let's name it e.g. as
hsl-jore4-prod-db-restored
. This may take a long while if there were lots of transactions after the last snapshot. Leave the originalhsl-jore4-prod-db
instance still running, in parallel. -
Find the
jore4-hasura
image version that has a working state of migrations. This may either be a new "hotfix" version or an old version. Move theprod
branch to the commit injore4-flux
repository that contains this working version of Hasura. This is necessary, because we want to make sure that when the restored version of the database is taken into use, the broken migrations won't break it again. -
Find the
hsl-jore4-public-db-hostname
secret inhsl-jore4-prod-vault
and change it to the restored database instance's hostname (hsl-jore4-prod-db-restored.postgres.database.azure.com
). This will make sure that when the Kubernetes containers are restarted, they will connect to the restored version of the database. -
Do restart the containers in Kubernetes to make sure they don't communicate with the old instance of the database. You may just kill the pods through Azure UI (
hsl-jore4-prod-cluster
/ Workloads -> Delete!), new versions will respawn in a few seconds. Currently at least the following microservices should be restarted as they connect to the database directly:jore4-hasura
,jore4-auth
,jore4-jore3importer
. -
Check that all the containers have started correctly and everything is functional. If not, you can try to restore an even earlier version of the database with the same steps. Or give up, go back to the original version before the restoration and try to fix the issues manually.
-
Currently there are two versions of the database running, you may want to delete the original
hsl-jore4-prod-db
and renamehsl-jore4-prod-db-restore
tohsl-jore4-prod-db
(and refresh the secrets and the Kubernetes containers again to point at the correct hostname). Note that when the IaC scripts are sometimes rerun from thejore4-deploy
repository, they expect the database to exist with the namehsl-jore4-prod-db
.