Database Backup (BCDevOps)
Note: read more here: Database-Backup
Production-level databases need some form of automated backup strategy in order to satisfy retention and reliability guarantees. One of the ways we can do this is by exporting or dumping a database periodically.
Below are some of the resources that were used in order to achieve our backup strategy:
The main goal is to be able to generate full database snapshot dumps on a regular schedule and store these files in a different location. To achieve this, we need an off-cluster space to store the exported files, and a scheduled job that will perform the database snapshots.
On OCP4, the storage class we normally use for backups is called `netapp-file-backup`. More background details can be found here.
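If you want to confirm that this storage class is available before provisioning anything, a quick check like the following should work (assuming your account is able to list cluster storage classes; this is a convenience check, not a required step):
# List storage classes and confirm netapp-file-backup is present
oc get storageclass | grep netapp-file-backup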
If your target namespace does not have a backup PVC yet, run the following template:
export NAMESPACE=YOURNAMESPACE
oc process -n $NAMESPACE -f https://raw.githubusercontent.com/wiki/bcgov/common-service-showcase/assets/backups-bcdevops/templates/backup.pvc.yaml -p NAMESPACE=$NAMESPACE -o yaml | oc -n $NAMESPACE apply -f -
You should expect to see a new PVC in your namespace with a name following the pattern `backup-YOURNAMESPACE`.
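To double-check that the PVC was created and bound, something like the following should suffice (assuming the backup-YOURNAMESPACE naming pattern above):
# Confirm the backup PVC exists and is Bound
oc get -n $NAMESPACE pvc backup-$NAMESPACE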
For managing the backup operations, we leverage BCDevOps/backup-container. The backup-container generates database dumps and cycles them based on a specified retention schedule. It supports two modes of operation: a legacy mode, where a long-standing deployment sleeps between job executions, and a CronJob mode, which is managed by OpenShift. After evaluating both approaches, we have elected to use the CronJob mode as it is less resource intensive on the OCP platform: a pod is only instantiated when the backup task needs to run.
Note: At the time of writing, we are using version 2.9.0 of the backup-container.
On OCP4, you will need to ensure that your backup-postgres container is allowed to connect with the Patroni database. To do that, run the following template:
export NAMESPACE=YOURNAMESPACE
export INSTANCE=YOURAPPDBINSTANCE
oc process -n $NAMESPACE -f https://raw.githubusercontent.com/wiki/bcgov/common-service-showcase/assets/backups-bcdevops/templates/backup.np.yaml -p INSTANCE=$INSTANCE -o yaml | oc -n $NAMESPACE apply -f -
Note: The above template assumes you are connecting to a Patroni cluster. If you are connecting to a plain Postgres database instead, make sure you manually update the podSelector labels to match your deployment configuration!
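If you do need to adjust the podSelector, a quick way to see which labels your database pods actually carry is something like the following; the grep pattern is only an example:
# Inspect database pod labels so the NetworkPolicy podSelector can be matched to them
oc -n $NAMESPACE get pods --show-labels | grep -i postgres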
If you haven't already done so, make sure you have set up dockerhub auth credentials in your namespace for the builder. More details can be found here.
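A quick way to confirm that some form of Docker registry credential already exists in the namespace is to list the secrets and look for docker-type entries; the exact secret name varies by setup:
# Look for Docker registry credential secrets in the namespace
oc -n $NAMESPACE get secrets | grep -i docker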
Note: As of backup-container v2.3.3, we no longer need to build the container images on the cluster. Instead, we now leverage the Docker images automatically generated by GitHub Actions.
If you do still need to build the backup-container image on the cluster, you can do so by running the following (replacing YOURNAMESPACE with the appropriate value):
export NAMESPACE=YOURNAMESPACE
export GIT_REF=2.6.1
oc process -n $NAMESPACE -f https://raw.githubusercontent.com/BCDevOps/backup-container/$GIT_REF/openshift/templates/backup/backup-build.json -p GIT_REF=$GIT_REF -o yaml | oc -n $NAMESPACE apply -f -
This will create a new build and imagestream named `backup-postgres`, which should be ready in a few minutes.
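To keep an eye on the build and confirm the resulting imagestream, something like the following can be used:
# Follow the build logs and confirm the imagestream exists
oc -n $NAMESPACE logs -f bc/backup-postgres
oc -n $NAMESPACE get is backup-postgres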
Once your NFS PVC has been provisioned and your Network Policies have been applied, you can then deploy the OpenShift CronJob which will manage your database backups. You can figure out what your NFS PVC is called with the following command:
export NAMESPACE=YOURNAMESPACE
oc get -n $NAMESPACE pvc
On OCP4, you will be looking for a PVC with a StorageClass of `netapp-file-backup` and a name following the pattern `backup-YOURNAMESPACE`.
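If the namespace has many PVCs, you can narrow the listing down, for example:
# Filter the PVC listing by the backup storage class
oc get -n $NAMESPACE pvc | grep netapp-file-backup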
As of June 2024, we are taking advantage of the backup container's ability to save a secondary copy of the backup to an S3 bucket. Because of this, we have modified the cronjob deployment template to include the S3 credentials.
Create a secret in the same namespace named `backup-s3` with the keys `S3_BUCKET`, `S3_ENDPOINT`, `S3_USER`, and `S3_PASSWORD`.
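A minimal sketch for creating that secret from the command line might look like the following; the bucket, endpoint, and credential values are placeholders you would replace with your own:
# Create the backup-s3 secret consumed by the cronjob template (placeholder values)
oc -n $NAMESPACE create secret generic backup-s3 \
  --from-literal=S3_BUCKET=<your-bucket> \
  --from-literal=S3_ENDPOINT=<your-s3-endpoint> \
  --from-literal=S3_USER=<your-access-key-id> \
  --from-literal=S3_PASSWORD=<your-secret-access-key>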
A copy of the template is here: backup-cronjob.yml. Run the following, with the template file either stored locally or referenced via a URL, updating the export variables as necessary:
export NAMESPACE=YOURNAMESPACE
export GIT_REF=2.9.0
export PVC=backup-$NAMESPACE
# DATABASE_NAME is read from the Patroni secret (e.g. app or ches)
export DBNAME=$(oc -n $NAMESPACE get secret patroni-master -o jsonpath='{.data.app-db-name}' | base64 -d)
oc process \
-n $NAMESPACE \
-f "./backup-cronjob.yaml" \
-p JOB_PERSISTENT_STORAGE_NAME=$PVC \
-p IMAGE_REGISTRY=docker.io \
-p IMAGE_NAMESPACE=bcgovimages \
-p SOURCE_IMAGE_NAME=backup-container \
-p TAG_NAME=$GIT_REF \
-p DATABASE_SERVICE_NAME=patroni-master \
-p DATABASE_NAME=$DBNAME \
-p DATABASE_DEPLOYMENT_NAME=patroni-master \
-p DATABASE_USER_KEY_NAME=app-db-username \
-p DATABASE_PASSWORD_KEY_NAME=app-db-password \
-p SCHEDULE="0 8 * * *" \
-p DAILY_BACKUPS=7 \
-p WEEKLY_BACKUPS=8 \
-p MONTHLY_BACKUPS=3 \
-o yaml | oc -n $NAMESPACE apply -f -
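After the template has been applied, you can confirm the CronJob object exists and check its schedule and suspend status with a quick look like this:
# Confirm the backup CronJob was created
oc -n $NAMESPACE get cronjob backup-postgres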
Note that for the purposes of our backup retention policy, we have elected to have our CronJob run daily at 1 AM (08:00 UTC), and we will be keeping 7 daily backups, 8 weekly backups, and 3 monthly backups. This is a continuous rolling window as shown below:
Figure 3 - Rolling Retention Schedule Diagram
With this schedule, we will be retaining at minimum the state of the database 90 days ago, and at most 119 days ago (4 months, or 120 days minus 1). We will also have more frequent resolution for recent database backups coming from the daily and weekly backups.
In the event the CronJob parameters need to be updated, you will not be able to simply `oc apply` over the existing CronJob object due to an int32 parsing error. Instead, you will need to know which part of the cronjob manifest you intend to patch over. For example, if you want to pause/suspend an existing cronjob, you could run the following:
export NAMESPACE=YOURNAMESPACE
oc patch -n $NAMESPACE cronjob backup-postgres -p '{"spec":{"suspend":true}}'
If you want to resume a paused cronjob, you could run the following:
export NAMESPACE=YOURNAMESPACE
oc patch -n $NAMESPACE cronjob backup-postgres -p '{"spec":{"suspend":false}}'
If you want to change the cronjob's schedule, you could run the following:
export NAMESPACE=YOURNAMESPACE
oc patch -n $NAMESPACE cronjob backup-postgres -p '{"spec":{"schedule":"0 8 * * *"}}'
If you wanted to change backup cronjobs across all of the namespaces you manage, you could run the following:
oc projects -q | xargs -t -n1 oc patch cronjob backup-postgres -p '{"spec":{"schedule":"0 8 * * *"}}' -n
Since there is an int32 parsing error when running `oc apply` on an existing cronjob, you will unfortunately need to delete the existing cronjob first before redeploying it. After the old cronjob is cleared, re-run the steps in the previous section.
To delete the CronJob object, you can run the following:
export NAMESPACE=YOURNAMESPACE
oc delete -n $NAMESPACE cronjob backup-postgres
Using the schedule `"0 8 * * *"`, the cronjob will run with a new pod at 1 AM daily. However, manual runs can also be done if you're looking to get an immediate snapshot of the state of the database (potentially for auditing or restore/migration operations). You can manually trigger a run of the cronjob with the following:
export NAMESPACE=<YOURNAMESPACE>
oc create -n $NAMESPACE job --from=cronjob/backup-postgres "backup-postgres-manual-$(date +%s)"
If the backup runs successfully, it will show as 'Completed', and you can inspect the logs.
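For example, to check the status of the manual run and view its logs (this assumes the backup-postgres-manual-* naming from the command above):
# Find the pod created by the manual job and inspect its logs
oc -n $NAMESPACE get pods | grep backup-postgres-manual
oc -n $NAMESPACE logs $(oc -n $NAMESPACE get pods -o name | grep backup-postgres-manual | tail -n1)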
Make sure to clean up your temporary pod when you are done with the following:
export NAMESPACE=<YOURNAMESPACE>
# Note: When there are no jobs to delete, you will get an error for oc delete.
oc get job -n $NAMESPACE -o name | grep -F -e '-manual-' | xargs oc delete -n $NAMESPACE
While there exists documentation on how to leverage the backup-container to perform a restore here, because we are running the backup in cron mode, there will not be any long-running pod containers that you would be able to `oc rsh` into. Because of this, we will need to manually connect to our database and perform the restore procedure ourselves.
If the database in question is being regularly backed up with the backup-container, we can leverage the dump generated by that cronjob to perform the restore. In order to get the backup files, we will need to debug the pod representing the latest run of the cronjob. The following command will figure out which is the most recent pod, and start up a debug instance of it:
export NAMESPACE=<YOURNAMESPACE>
oc debug -n $NAMESPACE $(oc get -n $NAMESPACE pods -o name | grep "backup-postgres" | tail -n1)
Once your debug pod is running, you can run the command `./backup.sh -l` to list all of the backups available on the mounted backup PVC. Make a note of which directory you want to get - normally you will be looking for the latest one under the `/backups/daily/<DATE>` directory, where DATE represents today in `YYYY-MM-DD` format.
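Alternatively, while inside the debug pod you can inspect the backup directory directly:
# List today's backup files on the mounted backup PVC (run inside the debug pod)
ls -lh /backups/daily/$(date +%Y-%m-%d)/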
To copy the latest backup folder to your local machine, run the following:
export NAMESPACE=<YOURNAMESPACE>
mkdir backups
oc rsync -n $NAMESPACE $(oc get -n $NAMESPACE pods -o name | grep "debug$" | tail -n1 | sed "s|pod/||g"):/backups/daily/$(date +%Y-%m-%d)/ ./backups
If the operation was successful, you should see at least one file in the local backups folder with an extension of `.sql.gz`. Make a note of this filename as you will need it later.
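A quick sanity check that the dump made it down, with the newest file listed first:
# Confirm the dump was copied locally
ls -lt ./backups/*.sql.gz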
To wipe and restore the DB, you will need a working connection to your target database by running `oc port-forward`. If you are connecting to a Patroni DB on OpenShift, you can run the following command:
export NAMESPACE=<YOURNAMESPACE>
export DBPORT=<YOURDBPORT>
oc -n $NAMESPACE port-forward $(oc -n $NAMESPACE get pods -o name --selector role=master,cluster-name=master) $DBPORT:5432
If successful, your local machine will forward any network traffic sent to `localhost:<YOURDBPORT>` (typically 15432) to the remote Patroni cluster named `master`. The subcommand will auto-select the correct Patroni replica which is currently serving as the master.
Before wiping the database, you must terminate ALL open network connections to the database. You should temporarily drop any dependent client application replica counts to 0. You may need to kill the master Patroni node if there are still residual connections preventing a database drop.
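As a sketch of how you might drop a client application to zero replicas and bring it back afterwards; the deployment name and original replica count are placeholders for your own application:
# Scale the client application down so it releases its database connections
oc -n $NAMESPACE scale deployment/<your-app> --replicas=0
# ...later, once the restore is complete, scale it back up
oc -n $NAMESPACE scale deployment/<your-app> --replicas=<original-count>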
Assuming you have a working connection to your target database, run the following commands to do a hard wipe of the database. This is a destructive operation: make sure you know what you are doing before proceeding!
# With the oc port-forward from the previous section running, DBHOST will typically be localhost
export DBHOST=<YOURDBHOST>
export DBPORT=<YOURDBPORT>
export DBNAME=$(oc -n $NAMESPACE get secret patroni-master-secret -o jsonpath='{.data.app-db-name}' | base64 -d)
export DBSUPERUSER=$(oc -n $NAMESPACE get secret patroni-master-secret -o jsonpath='{.data.superuser-username}' | base64 -d)
export DBUSER=$(oc -n $NAMESPACE get secret patroni-master-secret -o jsonpath='{.data.app-db-username}' | base64 -d)
export PGPASSWORD=$(oc -n $NAMESPACE get secret patroni-master-secret -o jsonpath='{.data.superuser-password}' | base64 -d)
# terminate current connections and drop database:
psql -h $DBHOST -p $DBPORT -U $DBSUPERUSER -ac "SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = '$DBNAME' AND pid <> pg_backend_pid()" -ac "DROP DATABASE $DBNAME;"
# create an empty database
psql -h $DBHOST -p $DBPORT -U $DBSUPERUSER -ac "CREATE DATABASE $DBNAME;"
psql -h $DBHOST -p $DBPORT -U $DBSUPERUSER -ac "GRANT ALL ON DATABASE $DBNAME TO $DBUSER;"
You can check to see if the database has been re-instantiated correctly with the following command:
psql -h $DBHOST -p $DBPORT -U $DBSUPERUSER -ac "\l"
Once you have verified you have a fresh and empty database from the previous section, you can leverage your backup file to repopulate the schema and restore your data. The following command will auto-select the most recently modified file in your current working directory and use that as the source to restore from, so make sure you run it from the directory containing the downloaded dump (for example, the local backups folder created earlier).
Note 1: The backup generated by backup-container will automatically assign each table to be owned by an app user. For most of our cases, the user will just be called `app`. If you get any errors related to the role `app` not being found, you can either hard redeploy your Patroni cluster, or manually jump into the DB and create the missing user (a sketch of this is shown after the restore command below).
Note 2: The roles `app`, `postgres`, and `replication` may already exist if Patroni is running. You can edit your SQL file to remove those CREATE ROLE statements.
Note 3: If you end up getting an error like `stdin is not a tty` on Windows MINGW64, replace `psql` with `psql.exe` instead (SO Reference).
Note 4: `DBPORT` can be any unused port on your system. However, using `15432` is usually a safe bet.
export DBHOST=<YOURDBHOST>
export DBPORT=<YOURDBPORT>
export DBNAME=$(oc -n $NAMESPACE get secret patroni-master-secret -o jsonpath='{.data.app-db-name}' | base64 -d)
export DBSUPERUSER=$(oc -n $NAMESPACE get secret patroni-master-secret -o jsonpath='{.data.superuser-username}' | base64 -d)
export DBUSER=$(oc -n $NAMESPACE get secret patroni-master-secret -o jsonpath='{.data.app-db-username}' | base64 -d)
export PGPASSWORD=$(oc -n $NAMESPACE get secret patroni-master-secret -o jsonpath='{.data.superuser-password}' | base64 -d)
gunzip -c $(ls -t | head -n1) | psql -v ON_ERROR_STOP=1 -x -h $DBHOST -p $DBPORT -U $DBSUPERUSER -d $DBNAME
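If the restore fails because the application role does not exist (see Note 1 above), a minimal sketch for recreating it and then re-running the restore could look like this; it assumes the app-db-password key in the Patroni secret holds the role's password:
# Recreate the application role using the values already exported above, then re-run the restore command
psql -h $DBHOST -p $DBPORT -U $DBSUPERUSER -d $DBNAME -c "CREATE ROLE $DBUSER LOGIN PASSWORD '$(oc -n $NAMESPACE get secret patroni-master-secret -o jsonpath='{.data.app-db-password}' | base64 -d)';"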
You can check to see if the database tables have been restored correctly with the following command:
PGPASSWORD=$(oc -n $NAMESPACE get secret patroni-master-secret -o jsonpath='{.data.app-db-password}' | base64 -d) psql -h $DBHOST -p $DBPORT -U $DBUSER -d $DBNAME -c "\d"
If you temporarily set your client application replica count to 0, you should revert it back to the original count. Your application should now be functional assuming the database restore was successful.