WAL - VioletGiraffe/cpp-db GitHub Wiki

Disk-aware, SSD-friendly WAL operates in chunks equal in size to the underlying disk's block size.

Basic principles

  • All modifying operations are first logged in consecutive order.
  • Then the data is submitted to the storage, but not flushed immediately. When the storage is flushed, a "transaction complete" message is logged.
  • A block-sized buffer is accumulated for the WAL data. Once a block is full, it is flushed and it is at this point that the transaction can be confirmed to the client.
  • Records in the log need unique IDs / numbers to indicate that a previously pending operation has completed on the storage (or failed). It would be possible to patch operation result back into its original log entry, but that's write amplification and poor I/O performance.

TODO: does the WAL need to know if the transaction succeeded or failed?

Problem: a single writer, or 2-3-4 writers, may never be able to fill up the block if they're waiting for one write to succeed before commencing the next.
Solution: flush on timer even if the block is incomplete.

WAL storage format

There is no need to checksum each entry, only the block needs a checksum.

Field Size Offset from start
Complete entry SIZE (2 bytes max for 4K blocks) sizeof(EntrySizeType) bytes 0 bytes
Operation ID 4 bytes sizeof(EntrySizeType) bytes
Serialized operation DATA var. length sizeof(EntrySizeType) + 4 bytes

Block format:

  • Number of entries (2 bytes for 4K block size)
  • Entry 1
  • ...
  • Entry N
  • Checksum over the whole block (4 bytes)

Usable space: block size - 4 (checksum) - 2 (num of entries) bytes

WAL entry types

  • Create record
  • Update record
  • Delete record
  • Append to array
  • Transaction complete (success / failure)

Recovery

  1. Scan the log from the beginning upwards in [blocksize] blocks (e. g. 4K).
    If the file size is not a multiple of [blocksize] - assert and abort, this shouldn't happen.
  2. Verify the block checksum. If it doesn't match, discard the block. There should only be at most one faulting block - the last one. And probably not even that.
  3. Track the list of all pending operations. Remove an op from the list when "operation complete" record for it is encountered.
  4. When the end of the log is reached, replay all the pending operations in the same order.

TODO:

  • Checkpoints
  • Trimming the log online