Resolving the repeated post migration database problem - RopeWiki/app GitHub Wiki

There have been repeated problems with the database since moving it to the dockerized setup.

The incidents which resulted in data loss are outlined here. Each time, a backup restore was needed to resolve the issue, and this became the de facto response.

A third incident happened as I (coops) was investigating the issue. This time, however, I managed to resolve it without needing to use backups, and in the process fixed the underlying problem so that it shouldn't reoccur.

Spoiler: the mysql.user table was corrupt.

Reproduction workflow

These were the steps to trigger the issue originally:

./deploy_tool.py dev create_db
./deploy_tool.py dev restore_schema
./deploy_tool.py dev start_site
  [dev site works]
run mysql `flush privileges`
  [dev site works]
./deploy_tool.py dev restore_db
  [dev site works (now full of content)]
run mysql `flush privileges`
  [dev site broken!]

Ergo - there's something in the backup (and by extension the production database) which breaks authentication.

Investigation

I started examining the mysql.user table, comparing its state post-create_db, post-restore_schema, and post-restore_db. There were clear differences in the schema (some versions had a password field, while others had an authentication_string field).
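
A quick way to spot such differences is to diff the column lists from each state. A minimal Python sketch of the approach (the column names below are illustrative subsets, not the full mysql.user schema):

```python
# Illustrative subsets of mysql.user columns from two MySQL versions.
# (Not the complete schema; just enough to show the diff approach.)
old_columns = {"Host", "User", "Password", "Select_priv"}
new_columns = {"Host", "User", "authentication_string", "Select_priv", "password_expired"}

# Columns present in only one version are the migration hazard.
removed = sorted(old_columns - new_columns)
added = sorted(new_columns - old_columns)

print("removed:", removed)  # columns only the old-format table has
print("added:", added)      # columns only the new server expects
```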

But one annoying issue was that as soon as you restored the backup and flushed privileges, you lost all access. I therefore attempted to create a user which was not part of the production stack, in the hope that it would remain functional after a restore occurred.

This is what I ran, and the result it returned:

mysql> create user 'coops'@'%' identified by 'zvU6sUPmCbs9pR_f';
ERROR 1805 (HY000): Column count of mysql.user is wrong. Expected 45, found 42. The table is probably corrupted

This clearly showed the problem. So I went to my dev setup to run the same command, and it failed there too. I hadn't realized my initial attempt had broken the production site until it was mentioned in Slack.
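
The error message itself encodes the mismatch. A small helper (my own sketch, not part of deploy_tool.py) to pull the expected/found counts out of an ERROR 1805 message:

```python
import re

def parse_column_count_error(message):
    """Extract (expected, found) column counts from a MySQL 1805 error."""
    m = re.search(r"Expected (\d+), found (\d+)", message)
    return (int(m.group(1)), int(m.group(2))) if m else None

err = ("ERROR 1805 (HY000): Column count of mysql.user is wrong. "
       "Expected 45, found 42. The table is probably corrupted")
expected, found = parse_column_count_error(err)
print(expected - found)  # the restored table is missing 3 columns
```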

Recovery

tl;dr - blow away the corrupt mysql.user table, create an empty one from a fresh initialization and copy it into place, then recreate the root & ropewiki users.

Get docker-mysql into recovery mode

Get access to the inaccessible database:

# Stop the production mysql container
docker stop prod_ropewiki_db_1

# Start a new mysql container with access to the same volume, but with a shell
# entrypoint (-d keeps it running in the background so we can exec into it)
docker run -itd --rm --name=mysql-reset-pass --entrypoint=/bin/bash -v ropewiki_database_storage:/var/lib/mysql ropewiki/database

# Enter the container
docker exec -it mysql-reset-pass bash

# Start mysql, but tell it to ignore authentication
root@ropewiki_db:/# mysqld_safe --skip-grant-tables &

# Access mysql, no authentication needed
root@ropewiki_db:/# mysql
mysql>

At this point I tried various things: deleting entries in mysql.user, CREATE USER ..., UPDATE .... None were successful, mainly because the schema of the table was still wrong, and because in --skip-grant-tables mode mysql won't process CREATE USER statements.

I then tried deleting the table files (/var/lib/mysql/mysql/user.*) and restarting mysql, hoping it would recreate them, but it didn't.

My next idea was to trick mysql into creating a new initial setup, and copy that table across.

# Edit /etc/mysql/mysql.conf.d/mysqld.cnf to point it away from the production data
datadir         = /tmp/coops

# Create a new empty mysql instance
root@ropewiki_db:/# mysqld --initialize
[ note the root password it gives you ]

# Copy the user table to the production database
root@ropewiki_db:/# cp -a /tmp/coops/mysql/user.* /var/lib/mysql/mysql/

# Revert the datadir in /etc/mysql/mysql.conf.d/mysqld.cnf
datadir         = /var/lib/mysql

# Restart mysql

# Login with the new root password
root@ropewiki_db:/# mysql -uroot -p

# Fix the root password
mysql> use mysql;
mysql> update user set authentication_string=password('REAL_ROOT_PASSWD') where user='root';

# Create the ropewiki user
mysql> CREATE USER 'ropewiki'@'%' IDENTIFIED BY '$RW_DB_PASSWORD';
mysql> GRANT ALL PRIVILEGES ON ropewiki.* TO 'ropewiki'@'%';
mysql> FLUSH PRIVILEGES;

Where did the root issue come from?

My guess is that during the migration the mysql version also changed, which changed the schema of the mysql.user table. When deploy_tool.py create_db created new users, they were added to an empty database using the new format, but restore_db then overwrote the table with the one taken from the old database. The problem wouldn't surface until mysql re-read the mysql.user table, which happens on container restart, when running flush privileges manually, or when modifying the table (e.g. adding a new user).
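
The sequence can be modelled in a few lines. This is a toy model of the ordering only (not mysql internals): the new-format table works until restore_db swaps in the old-format one, and the breakage only shows up at the next re-read of the grant tables.

```python
NEW_FORMAT_COLUMNS = 45  # what this mysql build expects
OLD_FORMAT_COLUMNS = 42  # what the pre-migration backup contains

grant_table = {"columns": NEW_FORMAT_COLUMNS}  # state after create_db

def flush_privileges(table):
    # mysql re-reads mysql.user here; a column mismatch surfaces only now
    if table["columns"] != NEW_FORMAT_COLUMNS:
        raise RuntimeError("Column count of mysql.user is wrong")

flush_privileges(grant_table)                 # fine: table is new-format

grant_table["columns"] = OLD_FORMAT_COLUMNS   # restore_db overwrites mysql.user
# ...site keeps working here: nothing has re-read the grant tables yet...

try:
    flush_privileges(grant_table)             # restart / flush privileges / new user
except RuntimeError as e:
    print(e)
```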

Prevention

My top suggestion to prevent this kind of problem in the future is to restore only the ropewiki database during a migration. Any mysql-specific data (e.g. users) should be recreated separately.
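
For example, the backup and restore commands could target only the ropewiki database rather than --all-databases. A hedged sketch of what deploy_tool.py could build (command construction only; connection and credential flags are omitted as placeholders):

```python
import shlex

DB_NAME = "ropewiki"  # the only database worth carrying across mysql versions

def build_backup_cmd(db=DB_NAME):
    # --databases limits the dump to the wiki data; mysql.* (users, grants)
    # is deliberately excluded so it never crosses a version boundary
    return ["mysqldump", "--single-transaction", "--databases", db]

def build_restore_cmd(db=DB_NAME):
    return ["mysql", db]

print(shlex.join(build_backup_cmd()))
print(shlex.join(build_restore_cmd()))
```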