WAL - VioletGiraffe/cpp-db GitHub Wiki
Disk-aware, SSD-friendly WAL operates in chunks equal in size to the underlying disk's block size.
Basic principles
- All modifying operations are first logged in consecutive order.
- Then the data is submitted to the storage, but not flushed immediately. When the storage is flushed, a "transaction complete" message is logged.
- A block-sized buffer is accumulated for the WAL data. Once a block is full, it is flushed and it is at this point that the transaction can be confirmed to the client.
- Records in the log need unique IDs / numbers to indicate that a previously pending operation has completed on the storage (or failed). It would be possible to patch operation result back into its original log entry, but that's write amplification and poor I/O performance.
TODO: does the WAL need to know if the transaction succeeded or failed?
Problem: a single writer, or 2-3-4 writers, may never be able to fill up the block if they're waiting for one write to succeed before commencing the next.
Solution: flush on timer even if the block is incomplete.
WAL storage format
There is no need to checksum each entry, only the block needs a checksum.
Field | Size | Offset from start |
---|---|---|
Complete entry SIZE (2 bytes max for 4K blocks) | sizeof(EntrySizeType) bytes | 0 bytes |
Operation ID | 4 bytes | sizeof(EntrySizeType) bytes |
Serialized operation DATA | var. length | sizeof(EntrySizeType) + 4 bytes |
Block format:
- Number of entries (2 bytes for 4K block size)
- Entry 1
- ...
- Entry N
- Checksum over the whole block (4 bytes)
Usable space: block size - 4 (checksum) - 2 (num of entries) bytes
WAL entry types
- Create record
- Update record
- Delete record
- Append to array
- Transaction complete (success / failure)
Recovery
- Scan the log from the beginning upwards in [blocksize] blocks (e. g. 4K).
If the file size is not a multiple of [blocksize] - assert and abort, this shouldn't happen. - Verify the block checksum. If it doesn't match, discard the block. There should only be at most one faulting block - the last one. And probably not even that.
- Track the list of all pending operations. Remove an op from the list when "operation complete" record for it is encountered.
- When the end of the log is reached, replay all the pending operations in the same order.
TODO:
- Checkpoints
- Trimming the log online