Incident: data loss 2023 05 10 - RopeWiki/app GitHub Wiki
Update: the root cause and resolution can be found here.
This page documents the data loss incident of 2023-05-10 and the subsequent investigation.
Events
On Thursday May 4, 2023, the RopeWiki site infrastructure was entirely replaced with a single VM running the containerized services in this repository. Directly following this transition, the two known outstanding TODOs were to fix email and to re-establish backups. Until this point, the db01 database server created nightly backups of the database contents in the form of .sql files, and Dav had an automated job that synchronized these files to an off-site location. The transition was performed using the May 4 backup after confirming there were no relevant changes since the backup was made.
On Wednesday May 10, Ben merged #21 and then attempted to deploy that change by pulling the master branch changes to the production VM and running python3 deploy_tool.py prod redeploy db. The intent of the infrastructure was that containers themselves do not contain any persistent data and are therefore cheap and disposable; persistent data is stored in explicit volumes. The named, Docker-managed volume holding the MySQL data for the database container was ropewiki_database_storage, and backups were written to an external volume mount. The redeploy db command was intended to remove the database container and then bring a new container up in its place without affecting any of the persistent storage. It is similar to the redeploy webserver command, which had already been used successfully on the new deployment.
However, following the execution of the redeploy db command, the production ropewiki.com site displayed an error message indicating a problem connecting to the database. The db container logs showed an error about the User column being in the wrong format, which apparently occurred because the database had not been initialized, since the ropewiki_database_storage persistent volume had been deleted and recreated. The VM folder mounted to contain nightly backups held only the May 4 backup .sql file, meaning that nightly backups had not been created as intended. Since no on-site backups had been generated, no backups had been synchronized off-site either.
Ben then forcefully removed the database container and followed the site deployment instructions starting at create_db and ending at restore_db to restore the May 4 backup. Following the successful execution of these instructions, the site appeared to be restored to its May 4 state and Ben posted a notice in the Facebook group.
After investigating the possibility that the Docker-managed MySQL volume might still be available somewhere to restore the 6 days of lost data, Ben concluded that it was likely no longer available. Named Docker volumes are stored in /var/lib/docker/volumes/, and while there was content in the subfolder named ropewiki_database_storage, Ben presumed it was the overwritten, blank content at the time of inspection, since the redeploy attempt had already occurred and the new database container was already reading as blank.
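One way to test the presumption that the volume folder held only freshly initialized content is to compare file modification times against the time of the redeploy. The following is a minimal sketch, not something that was run during the incident. The function names are hypothetical, and it assumes the default local volume driver, which keeps a volume's data under a _data subfolder of the volume's folder. If any file predates the redeploy, the folder was presumably not recreated from scratch.

```python
from datetime import datetime, timezone
from pathlib import Path


def oldest_file_time(data_dir):
    """Return the earliest modification time (UTC) of any file under data_dir.

    Returns None when the folder contains no files.
    """
    mtimes = [p.stat().st_mtime for p in Path(data_dir).rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return datetime.fromtimestamp(min(mtimes), tz=timezone.utc)


def predates(data_dir, event_time):
    """True if at least one file under data_dir was last modified before event_time.

    event_time must be a timezone-aware datetime.
    """
    oldest = oldest_file_time(data_dir)
    return oldest is not None and oldest < event_time
```

For example, predates("/var/lib/docker/volumes/ropewiki_database_storage/_data", redeploy_time) could be called with redeploy_time set to a timezone-aware datetime for the redeploy. Reading that path normally requires root on the VM.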
Causes
The data loss appears to have been caused by the combination of two main failures:
- The redeploy db command, which used docker-compose rm -f -s db, apparently removed/replaced the Docker-managed MySQL volume when that behavior was not intended by the redeploy db command
  - Otherwise, the May 10 database state in the Docker-managed MySQL volume would not have been affected
- Nightly backups were not being created successfully
  - Otherwise, the May 10 backup could have been used to restore the database, losing only the changes from the day of May 10
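A guard around the redeploy could make an unexpected volume replacement visible right away. The sketch below is an illustration rather than part of deploy_tool.py: it records the creation time Docker reports for the named volume before and after a redeploy and compares the two. The function names and the injectable run parameter are hypothetical; the only external call is docker volume inspect with a --format template.

```python
import subprocess


def volume_created_at(name, run=subprocess.run):
    """Return the CreatedAt string Docker reports for a named volume.

    Returns None if the inspect command fails, e.g. because the volume is missing.
    The run parameter exists only so the call can be replaced in tests.
    """
    result = run(
        ["docker", "volume", "inspect", "--format", "{{ .CreatedAt }}", name],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def volume_survived(before, after):
    """True only if the volume existed before and has the same creation time after."""
    return before is not None and before == after


# Intended use (not executed here):
#   before = volume_created_at("ropewiki_database_storage")
#   ... run the redeploy ...
#   after = volume_created_at("ropewiki_database_storage")
#   if not volume_survived(before, after): stop and investigate
```

Comparing creation times rather than only checking that the volume exists matters here, because in this incident a volume with the expected name was present after the redeploy but appeared to be a new, blank one.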
Plan
Ben intends to follow, and recommend following, the practices below to mitigate the risk of a similar situation in the future:
- Before making changes to the production site, make sure an off-site backup from within the last 24 hours is available
  - This will limit impact to a 24-hour period in the majority of future incidents
- Perform first-of-kind changes (redeploy db had never been used before) on the dev site instance before production
  - This will detect this kind of catastrophic failure with no impact to production
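The 24-hour backup check in the first practice could be automated as a pre-deploy step. The following is a minimal sketch under stated assumptions: the function names are hypothetical, backups are assumed to be .sql files in a single folder (for the off-site check, a locally accessible copy of the off-site location), and file modification time is used as a stand-in for when the backup was taken.

```python
import time
from pathlib import Path


def newest_backup(backup_dir, pattern="*.sql"):
    """Return the most recently modified backup file in backup_dir, or None."""
    files = [p for p in Path(backup_dir).glob(pattern) if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def has_recent_backup(backup_dir, max_age_hours=24, now=None):
    """True if the newest backup was modified within the last max_age_hours.

    now is a Unix timestamp; it defaults to the current time.
    """
    now = time.time() if now is None else now
    newest = newest_backup(backup_dir)
    if newest is None:
        return False
    return now - newest.stat().st_mtime <= max_age_hours * 3600
```

Run against the backup folder on May 10, a check like this would have returned False, since the only file present was the May 4 backup.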
To resolve the specific issues identified in this incident, Ben plans to:
- Verify backups can be created manually
- Fix cron job installation in database container
- Fix redeploy db command to not remove named volume
  - Possibly consider using a mount to the VM file system directly, though this will move us away from the eventual goal of using a deployment manager like Kubernetes
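When verifying that backups can be created, it may also help to confirm that each resulting file is a complete dump rather than an empty or truncated one. The sketch below assumes the .sql backups are produced by mysqldump with its default comment output, in which case the file ends with a "-- Dump completed" line. That tool choice is an assumption, not something this page confirms, and the function name is hypothetical.

```python
from pathlib import Path


def dump_looks_complete(sql_path, tail_bytes=4096):
    """Heuristic check that a .sql backup is non-empty and ends with the
    completion marker that mysqldump writes by default.

    Assumption: the backup was written by mysqldump without options that
    suppress comments; other tools will not produce this marker.
    """
    path = Path(sql_path)
    size = path.stat().st_size
    if size == 0:
        return False
    with path.open("rb") as f:
        # Only read the end of the file; dumps can be large.
        f.seek(max(0, size - tail_bytes))
        tail = f.read().decode("utf-8", errors="replace")
    return "-- Dump completed" in tail
```

This is only a sanity check on the file's shape. The stronger test remains restoring the backup on the dev site instance.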
Second loss 2023-05-12
On May 12, the database container in the production site was not accepting SSH connection requests at the externally-exposed port of 22001. Upon investigation, the container would not accept SSH connections directly from the VM itself either. Furthermore, an attempt to run an interactive shell in the container to diagnose further, using the command docker container exec -it prod_ropewiki_db_1 /bin/bash, failed as well. Ben removed the container and then restarted the site to create a new container, but this erased the database information for unknown reasons (the expected behavior was that the named Docker volume would retain the persistent information through a container refresh). Ben deleted and re-created the database component from scratch and restored it using a site backup from May 11 by Coops.