Resolving the repeated post migration database problem - RopeWiki/app GitHub Wiki
There were repeated problems with the database since moving it to the dockerized setup.
The incidents which results in data lose are outlined here. Each time a backup restore was needed to resolve the issue, and became the defacto response.
A third incident happen as I (coops) was investigating the issue. However this time I managed to resolve the issue without needing to use backups, and in the process resolve the underlying problem such that it shouldn't reoccur.
Spoiler: the mysql.user
table was corrupt.
Reproduction work flow
These were the steps to trigger the issue originally:
./deploy_tool.py dev create_db
./deploy_tool.py dev restore schema
./deploy_tool.py dev start_site
[dev site works]
run mysql `flush privileges`
[dev site works]
./deploy_tools.py dev restore_db
[dev site works (now full of content)]
run mysql `flush privileges`
[dev site broken!]
Ergo - there's something in the backup (and by extension the production database) which breaks authentication.
Investigation
I stared examining the mysql.users
database and comparing post-create_db
, post-restore_schema
, and post-restore_db
. There was clearly differences in the schema (some had a password
field, while others had authentication_string
field).
But one annoying issue was that as soon as you restored the backup and flushed privileges you lost all access. Therefore I attempted to create a user which was not part of production stack, with the hope that it would remain functional after a restore occurred.
This what I ran, and the result it returned:
mysql> create user 'coops'@'%' identified by 'zvU6sUPmCbs9pR_f';
ERROR 1805 (HY000): Column count of mysql.user is wrong. Expected 45, found 42. The table is probably corrupted
This clearly showed the problem. So I went to my dev setup to run the same command, and it also failed. I hadn't realize my initial attempt had broken the production site until it was mentioned in Slack.
Recovery
tl;dr - blow away mysql.user table, create an empty one and copy it into place, recreate root & ropewiki users.
Get docker-mysql into a recovery mode
Get access to inaccessible database:
# Stop the production mysql container
docker stop prod_ropewiki_db_1
# Start a new mysql container, with access to the same volume, but with a shell entrypoint
docker run -it --rm --name=mysql-reset-pass --entrypoint=/bin/bash -v ropewiki_database_storage:/var/lib/mysql ropewiki/database
# Enter the container
docker exec -it mysql-reset-pass bash
# Start mysql, but tell it to ignore authentication
root@ropewiki_db:/# mysqld_safe --skip-grant-tables
# Access mysql, no authentication needed
root@ropewiki_db:/# mysql
mysql>
At this point I tried various things like: deleting entries in mysql.user, create user...
, update...
. None were successful, mainly because the schema of the table was still wrong, and that in safe mode mysql won't process create user
queries.
I then tried to delete the table from /var/lib/mysql/mysql/user.*
and restart mysql, hoping it would recreate it, but it didn't.
My next idea was to trick mysql into creating a new initial setup, and copy that table across.
# Edit /etc/mysql/mysql.conf.d/mysqld.cnf to point it away from the production data
datadir = /tmp/coops
# Create a new empty mysql instance
root@ropewiki_db:/# mysqld --initialize
[ note the root password it gives you ]
# Copy the user table to the production database
root@ropewiki_db:/# cp -a /tmp/coops/mysql/user.* /var/lib/mysql/mysql/
# Revert the datadir in /etc/mysql/mysql.conf.d/mysqld.cnf
datadir = /var/lib/mysql
# Restart mysql
# Login with the new root password
root@ropewiki_db:/# mysql -uroot -p
# Fix the root password
mysql> update user set authentication_string=password('REAL_ROOT_PASSWD') where user='root';
# Create the ropewiki user
mysql> CREATE USER 'ropewiki'@'%' IDENTIFIED BY '$RW_DB_PASSWORD';
mysql> GRANT ALL PRIVILEGES ON *.ropewiki TO 'ropewiki'@'%';
mysql> FLUSH PRIVILEGES;
Where did the root issue come from?
My guess is that during the migration the mysql versions also changed, which changes the schema of the mysql.user
table. When deploy_tool.py create_db
created new users they would be added in an empty database using the new format, but then restore_db
would overwrite the table with what was taken from the old database. The problem wouldn't surface until mysql re-reads the mysql.user
table. This would occur on container restart, running flush privileges
manually, or when modify the table (e.g. adding a new user).
Prevention
My top suggestion to prevent this kind of problem in future is to restore only the ropewiki
database during a migration. Any mysql specific data (e.g. users) should be recreated separately.