Clone Donor Implementation

The current version of this page is at https://github.com/facebook/mysql-5.6/wiki/Clone-Donor-Implementation

[Figure: donor data structures]

Multiple parallel clones (donor instances) are supported. Each has its own private RocksDB checkpoint, although checkpoints may share SST files through hard links. Parallel clones have different locator IDs, and a map from locator ID to donor instance tracks them. Each checkpoint lives in its own directory inside the RocksDB data directory, named with the .clone-checkpoint- prefix and a monotonic counter suffix. The checkpoints are created under rocksdb_datadir rather than under, e.g., mysql_tmpdir so that they sit on the same filesystem as the RocksDB data, because checkpointing depends on hard links. On server startup, the instance deletes any checkpoint directories it finds, as they are temporary.
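The naming and startup cleanup described above can be sketched as follows. This is a minimal Python model, not the MyRocks code: the CheckpointDirs class and its method names are hypothetical, and only the .clone-checkpoint- prefix and the counter suffix come from the text. The sketch only computes paths; in the server, creating the directory is part of creating the RocksDB checkpoint.

```python
import os
import shutil

CHECKPOINT_PREFIX = ".clone-checkpoint-"  # prefix named in the text


class CheckpointDirs:
    """Hypothetical helper: names and cleans up per-clone checkpoint dirs."""

    def __init__(self, rocksdb_datadir):
        self.datadir = rocksdb_datadir
        self.next_id = 0  # monotonic counter used as the directory suffix

    def new_path(self):
        # Each clone gets its own directory inside the RocksDB data directory,
        # so hard links to the SST files stay on the same filesystem.
        path = os.path.join(self.datadir, f"{CHECKPOINT_PREFIX}{self.next_id}")
        self.next_id += 1
        return path

    def remove_leftovers(self):
        # On startup, any checkpoint directory found is a stale temporary.
        removed = []
        for name in sorted(os.listdir(self.datadir)):
            full = os.path.join(self.datadir, name)
            if name.startswith(CHECKPOINT_PREFIX) and os.path.isdir(full):
                shutil.rmtree(full)
                removed.append(name)
        return removed
```

Calling remove_leftovers() once before any new_path() call mirrors the startup behavior: nothing created by a previous server run survives.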

Each clone file goes through three states: not started, in progress, and completed. The implementation keeps a separate map from file ID to file metadata for each state, and a state transition moves the metadata object from one map to another. In regular operation the files only move forward through the states, but a resumed clone resets them according to the restart locator, which may move some files backward, e.g. all the way from completed to not started.
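A minimal sketch of the three-map bookkeeping, in Python for brevity. The CloneFiles class is hypothetical, and so is the shape of the restart information: here it is assumed to be the set of file IDs the client has fully applied, which is a simplification of whatever the restart locator actually carries.

```python
class CloneFiles:
    """Hypothetical tracker: one dict per state, keyed by file ID."""

    def __init__(self, files):
        self.not_started = dict(files)  # file ID -> metadata
        self.in_progress = {}
        self.completed = {}

    def start_next(self):
        # Move an arbitrary not-started file to in-progress; None if none left.
        if not self.not_started:
            return None
        file_id, meta = self.not_started.popitem()
        self.in_progress[file_id] = meta
        return file_id

    def complete(self, file_id):
        # Forward transition: in-progress -> completed.
        self.completed[file_id] = self.in_progress.pop(file_id)

    def reset_for_restart(self, applied_ids):
        # applied_ids is an assumed stand-in for the restart locator contents.
        # Files not in it go back to not-started, even if previously completed.
        everything = {**self.not_started, **self.in_progress, **self.completed}
        self.completed = {i: m for i, m in everything.items() if i in applied_ids}
        self.in_progress = {}
        self.not_started = {
            i: m for i, m in everything.items() if i not in applied_ids
        }
```

Because a file's metadata object lives in exactly one map at a time, the file's state never has to be stored in the object itself.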

Each donor instance goes through a sequence of states. They are not "states" in the clone plugin sense: the client does not advance through them and does not need to acknowledge their changes to the donor.

Initial: the clone has been started but the RocksDB checkpoint has not been created yet. Note that clone_begin does not create the checkpoint; that is done later, at the precopy stage.
Rolling Checkpoint: the clone pre-copy is in progress: the checkpoint has been created and is being rolled as necessary. Rolling a checkpoint means deleting it and creating a new one. This releases old checkpoint files that have since been compacted away, so they can be deleted and their disk space reclaimed, and it reduces the eventual volume of WALs to copy, which shortens the cloned instance's startup time. The data copied from rolling checkpoints is not necessarily consistent; consistency is ensured by the following states.
Final Checkpoint: RocksDB file deletions have been disabled and the final RocksDB checkpoint has been created, which will not be rolled again.
Final Checkpoint with Logs: same as the previous state with the live WAL information from the engine-synchronizing performance_schema.log_status query added to the clone files.
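The four donor states can be modeled as an ordered enumeration. The sketch below is illustrative Python with hypothetical names (DonorState, Donor, advance). It assumes the donor moves forward exactly one state at a time, which the text implies but does not state outright.

```python
from enum import IntEnum


class DonorState(IntEnum):
    INITIAL = 0                     # clone started, no checkpoint yet
    ROLLING_CHECKPOINT = 1          # pre-copy; checkpoint may be rolled
    FINAL_CHECKPOINT = 2            # file deletions disabled, no more rolling
    FINAL_CHECKPOINT_WITH_LOGS = 3  # live WAL information added


class Donor:
    """Hypothetical donor instance tracking only its state."""

    def __init__(self):
        self.state = DonorState.INITIAL

    def advance(self, new_state):
        # Assumption: states only move forward, one step at a time.
        if new_state != self.state + 1:
            raise ValueError(
                f"invalid transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state
```

These transitions are driven entirely by the donor side; nothing in this model is sent to or acknowledged by the client.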

As in the Clone Client Implementation, the set of allocated task IDs is maintained so that the main thread finishes the clone only after all the worker threads have completed. It also has a secondary purpose: verifying the task IDs passed to the API entry points.
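One way to picture the task ID set is a small synchronized container, sketched here in Python. The TaskIds class and its methods are hypothetical; the only facts taken from the text are that the main thread has task ID 0, that it must wait for the workers before finishing, and that the set doubles as a validity check.

```python
import threading


class TaskIds:
    """Hypothetical set of allocated task IDs; ID 0 is the main thread."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = set()
        self._next = 0

    def allocate(self):
        with self._cond:
            task_id = self._next
            self._next += 1
            self._active.add(task_id)
            return task_id

    def check(self, task_id):
        # Secondary purpose: validate IDs passed to the API entry points.
        with self._cond:
            return task_id in self._active

    def release(self, task_id):
        with self._cond:
            self._active.remove(task_id)
            self._cond.notify_all()

    def wait_for_workers(self, timeout=None):
        # The main thread blocks until no worker ID remains in the set.
        # Returns False if the timeout expires first.
        with self._cond:
            return self._cond.wait_for(lambda: self._active <= {0}, timeout)
```

In this model the main thread would call wait_for_workers() before tearing the clone down, while each worker calls release() with its own ID when it is done.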

API Implementations

rocksdb_clone_begin(HA_CLONE_MODE_START): create or register a new locator, assign a new checkpoint ID, create the donor instance object in the Initial state, put it in the locator-to-object map, assign the task ID 0.
rocksdb_clone_begin(HA_CLONE_MODE_ADD_TASK): look up the donor instance by the locator, assign a new task ID.
rocksdb_clone_begin(HA_CLONE_MODE_RESTART): iterate over the applier state in the restart locator and reset file states accordingly. Assign the task ID 0.
rocksdb_clone_precopy: the first thread to call this creates the RocksDB checkpoint, advances the state to Rolling Checkpoint, and iterates over the checkpoint directory to collect .SST file information. Every thread then runs a copy loop: pick one not-started file and make it in-progress; prepare and send a FILE_NAME_MAP_V1 packet; then send the whole file in chunks with accompanying FILE_CHUNK_V1 metadata packets. After each completed file, check whether the checkpoint should be rolled, and roll it if so. Once the rolling checkpoint is exhausted (by fully copying it or by hitting the configurable rolling limits), worker threads finish. The main thread disables RocksDB file deletions and rolls the checkpoint one last time to the final checkpoint, advancing the donor state.
rocksdb_clone_set_log_stop: add the provided WAL file names and sizes to the not-started file set.
rocksdb_clone_copy: same copy loop as in rocksdb_clone_precopy with the difference that the checkpoint never rolls.
rocksdb_clone_ack: receive and report any clone error from the remote client. Never called in a successful clone.
rocksdb_clone_end: unregister the task ID. If called by the main thread, wait for possible reconnects in the case of network errors; otherwise finish the clone, delete the checkpoint, and re-enable file deletions.
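The copy loop shared by rocksdb_clone_precopy and rocksdb_clone_copy can be sketched as below. This is a simplified Python model: the packet names FILE_NAME_MAP_V1 and FILE_CHUNK_V1 come from the text, but their representation as tuples, the tiny CHUNK_SIZE, the open_file and send callbacks, and the choice to send the metadata packet before each chunk payload are all assumptions made for illustration.

```python
CHUNK_SIZE = 4  # tiny on purpose; the real chunk size is not given in the text


def copy_loop(not_started, in_progress, completed, open_file, send):
    """Copy files until none are left in not_started.

    not_started / in_progress / completed: dicts of file ID -> file name.
    open_file(name) returns a binary file object; send(packet) ships it.
    """
    while not_started:
        # Pick one not-started file and make it in-progress.
        file_id, name = not_started.popitem()
        in_progress[file_id] = name
        send(("FILE_NAME_MAP_V1", file_id, name))
        with open_file(name) as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                # Assumed order: metadata packet first, then the payload.
                send(("FILE_CHUNK_V1", file_id, len(chunk)))
                send(chunk)
        completed[file_id] = in_progress.pop(file_id)
        # rocksdb_clone_precopy would decide here whether to roll the
        # checkpoint; rocksdb_clone_copy never rolls.
```

With open_file set to a function returning an in-memory io.BytesIO and send set to a list's append method, the loop can be exercised without any network or disk access.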

With multiple clone threads, this design provides file-level parallelism, as different threads get different files on each rocksdb_clone_copy call. This is similar to how XtraBackup operates and different from InnoDB clone, which parallelizes at the chunk level, i.e. a single file can be copied in parallel by multiple threads. This might make sense for InnoDB as it copies more objects than just files (buffer pages, redo log). For MyRocks, file-level parallelism keeps the implementation simple and might be extended later should the need arise.
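File-level parallelism amounts to threads claiming whole files from a shared pool. The Python sketch below illustrates the idea with a hypothetical clone_in_parallel function; appending a name to a list stands in for copying the file, and the lock is an assumed synchronization mechanism, not a description of the actual one.

```python
import threading


def clone_in_parallel(file_names, num_threads):
    """Return {thread index: [files copied]}; each file goes to one thread."""
    lock = threading.Lock()
    pending = list(file_names)
    copied = {i: [] for i in range(num_threads)}

    def worker(index):
        while True:
            with lock:  # claiming a file is the only step needing the lock
                if not pending:
                    return
                name = pending.pop()
            copied[index].append(name)  # stand-in for copying the whole file

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return copied
```

Since a file is never split between threads, no per-chunk coordination is needed. The trade-off is that one very large file can leave the other threads idle at the end of the copy.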