Disk aware, SSD friendly storage

  • Disk-aware: acknowledges that while disks are nominally byte-addressable, in reality there is a smallest unit of reading and writing, e.g. 4 KiB. A write that crosses a block boundary may be torn.
  • SSD-friendly: updating (writing out) less than the SSD block size causes write amplification.

In practice, the DB block size should be equal to the FS cluster size, which should be equal to the disk block size.

Assumptions

  • It is presumed that the hardware guarantees atomicity when writing exactly the block-sized amount of data. Paranoid checksumming should not be required.
  • IDs (keys) are unique. ATTEMPTING TO ADD A DUPLICATE KEY MUST ABORT!

Paged database storage

The DB block, from now on called a page to distinguish it from the physical storage block, is the smallest unit of disk I/O, but at the same time it is the largest atomic unit. Using smaller I/O requests is both inefficient and problematic for post-crash recovery (attempting to write a single byte will physically touch more than that, so a malfunction can unexpectedly corrupt the surrounding data). Using larger units is not atomic, which creates additional challenges for crash recovery.

  • TODO: Assert that the DB page size is equal to the FS cluster size; if possible, also check the hardware block size (see the sketch after this list).

  • The storage file consists of a whole number of pages. Each page can be populated with data records (tuples) completely or partially.
    Instead of referring to records by a physical byte-level file offset, a page number plus an offset within the page can be used.
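
A minimal sketch of the page-size assertion from the TODO above, assuming POSIX: statvfs() reports the filesystem's fundamental block (cluster) size. kDbPageSize and assertPageSizeMatchesFs are hypothetical names; checking the hardware block size is platform-specific and not shown.

```cpp
#include <sys/statvfs.h>

#include <cstdint>
#include <stdexcept>
#include <string>

constexpr uint64_t kDbPageSize = 4096; // hypothetical DB page size constant

// Verify that the DB page size matches the FS cluster size (POSIX only;
// on Windows, GetDiskFreeSpaceW() reports the cluster size instead).
inline void assertPageSizeMatchesFs(const std::string& dbPath)
{
	struct statvfs fsInfo {};
	if (::statvfs(dbPath.c_str(), &fsInfo) != 0)
		throw std::runtime_error("statvfs() failed for " + dbPath);

	// f_frsize is the fundamental block (cluster) size of the filesystem
	if (fsInfo.f_frsize != kDbPageSize)
		throw std::runtime_error("DB page size != FS cluster size");
}
```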

Page structure

  • The page offset (PgO) could either be a physical byte offset relative to the start of the page, or the number of a slot - a small area within the page itself that records the offset. At 4-byte alignment and a 4 KiB page, 10 bits are required to store the PgO. At 8 bits per slot entry, the alignment becomes 16 bytes (records smaller than 16 bytes will waste some space).
  • Compromise: 16 bits per slot allows addressing down to a single byte. Varlen records cannot be very small anyway, so spending 2 bytes per slot is acceptable.
  • Slots start at the first byte of the page and grow forward; records are added at the end of the page and grow inwards, toward the slots (see the sketch after this list).
  • A slot holds the value 0 if it's empty; the first empty slot is the last slot (it terminates the slot array).
  • Tables with fixed-size records do not require slots. Records are added starting from the beginning of the page + 2 bytes: the first 2 bytes of the page hold the number of records stored, and a specific record is found via linear search by key. TODO: use AVX for scanning?
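
A sketch of the slotted layout described above, under the 16-bits-per-slot compromise; SlottedPage and its members are illustrative names, not actual cpp-db types.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

constexpr size_t PageSize = 4096;

// Slotted page: 16-bit slots grow from the start of the page,
// record payloads are added from the end and grow toward the slots.
struct SlottedPage {
	uint8_t bytes[PageSize];

	// Slot i holds the byte offset of record i within this page; 0 = empty.
	uint16_t slot(size_t i) const {
		uint16_t offset;
		std::memcpy(&offset, bytes + i * sizeof(uint16_t), sizeof(offset));
		return offset;
	}

	// Returns a pointer to the record in slot i,
	// or nullopt if the slot is empty (i.e. past the last record).
	std::optional<const uint8_t*> record(size_t i) const {
		const uint16_t offset = slot(i);
		if (offset == 0) // an empty slot terminates the slot array
			return std::nullopt;
		return bytes + offset;
	}
};
```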

Mutable vs immutable storage

  • Mutable means that updating or deleting a record is done in-place.

    Pros:

    • Efficient use of disk space if there are many updates.
    • New records can be inserted in place of deleted ones - efficient use of disk space if there are many deletions.

    Cons:

    • Requires tracking the free space after deleting records - another data structure that needs to be reliably persisted.
    • Not SSD-friendly: updating records in-place requires flushing complete pages in which only a single record has changed. Same for creating new records in place of the deleted ones.
    • This also leads to poor I/O write performance in bytes per second: the whole page must be flushed to persist a change to a single record.

  • Immutable: updating = creating a new record and removing the old one from the index. New records do not reuse the space of deleted ones; this space can be reclaimed by an offline process from time to time (compacting the records and updating the indices accordingly).

    Pros:

    • New records can be buffered as they are created and only written out when the page is full - the best possible I/O throughput and no write amplification. It's even possible to buffer more than one page and write sequential chunks of multiple pages (see the sketch after this list).

    Cons:

    • The opposite of the mutable pros.
    • Deleted records must still be tracked and their list persisted for subsequent cleanup. Recovering this information through a full DB scan is possible, but very inefficient.
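
A sketch of the buffering idea from the immutable pros above: records accumulate in an in-memory page, and a full page is handed to a caller-supplied flush callback. All names are hypothetical; a real implementation would also have to handle records larger than a page and crash recovery via the WAL.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t PageSize = 4096;

// Buffers incoming records and hands out full pages for sequential writing -
// no page is ever rewritten in place, so there is no write amplification.
class AppendOnlyPageBuffer {
public:
	template <typename FlushFn>
	void append(const uint8_t* record, size_t size, FlushFn flush) {
		if (_used + size > PageSize) {
			flush(_page.data(), PageSize); // write the completed page out
			_used = 0;
		}
		std::memcpy(_page.data() + _used, record, size);
		_used += size;
	}

private:
	std::vector<uint8_t> _page = std::vector<uint8_t>(PageSize);
	size_t _used = 0;
};
```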

Indexing records in the storage

The index associates a key value with the number of the page that contains the record (or records) with that key. The required record can then be found either by scanning the page, or by storing the record's location within the page (offset or slot number) directly in the index.

Stable page numbers: it's possible to add a layer of indirection to eliminate the need to update the index in some cases. The page number may not directly correspond to a location on disk; instead, it can refer to an entry in another table that maps page numbers to physical locations. Then it's possible to move a page without changing the index (see the sketch below).
In practice, pages only have to move during compaction (reclaiming free space), so it should be OK to simply update the index, at least until the storage gets really huge.
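
A sketch of the indirection table described above, assuming an in-memory std::unordered_map; the persistent representation is left out and all names are hypothetical.

```cpp
#include <cstdint>
#include <unordered_map>

// Indirection layer: the index stores logical page numbers (PgN),
// and this table maps them to physical page locations in the file.
// Moving a page during compaction then only updates this table, not the index.
class PageTranslationTable {
public:
	uint64_t physicalPage(uint64_t logicalPgN) const {
		return _map.at(logicalPgN);
	}

	void movePage(uint64_t logicalPgN, uint64_t newPhysicalPgN) {
		_map[logicalPgN] = newPhysicalPgN;
	}

private:
	std::unordered_map<uint64_t /* logical */, uint64_t /* physical */> _map;
};
```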

  • Example: 1 PiB storage, 4 KiB pages. Number of bits needed to store a PgN: 50 (PiB) - 12 (4 KiB) = 38. A useful choice: 40 bits for the PgN (addresses up to 4 PiB) = 5 bytes.

Assuming 8 TiB worth of data = 2^31 pages, the storage required to index these pages with a memory-efficient flat hash map is roughly 2^31 * (8 bytes per record ID (key) + 5 bytes per page number + 1 byte of hash table overhead) = 2^31 * 14 bytes = 28 GiB :'(

  • As computed above, storing just the PgN in the index already requires 28 GiB :/ Might as well index complete record locations (see the packed-entry sketch below).
  • But without any metadata inside the page, it's impossible to tell where the record boundaries are if the record size is not fixed. Hence, a table of contents (the slot array) is required if the schema has variable-length fields.
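
A sketch of a packed index entry along the lines of the math above: an 8-byte key plus a 5-byte (40-bit) PgN, extended with a 16-bit slot number to index complete record locations. The layout and names are illustrative assumptions, not the actual cpp-db format.

```cpp
#include <cstdint>

// One index entry per record: an 8-byte key, a 40-bit page number and
// a 16-bit slot number. 40 bits of PgN address 2^40 pages * 4 KiB = 4 PiB.
#pragma pack(push, 1)
struct IndexEntry {
	uint64_t key;    // record ID
	uint8_t  pgN[5]; // 40-bit page number, stored little-endian
	uint16_t slot;   // slot within the page - the complete record location
};
#pragma pack(pop)
static_assert(sizeof(IndexEntry) == 15, "entry must be tightly packed");

inline uint64_t pageNumber(const IndexEntry& e)
{
	uint64_t n = 0;
	for (int i = 4; i >= 0; --i) // reassemble the 5 bytes into a 64-bit PgN
		n = (n << 8) | e.pgN[i];
	return n;
}
```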

Page cache

Pages are loaded into a RAM cache as they are referenced for reading. Referencing a record loads the whole page that contains that record. The hardware always transfers a full block anyway, so this does not hurt IOPS, and by caching the blocks (pages) there is at least a small chance of hitting the cache instead of the disk.

  • The cache has a max size cap. After that, a page has to be evicted for a new one to be loaded. Some of the possible eviction policies:

    • Random
    • LRU (sketched after this list)
  • Ordering the records in a way that promotes locality of reference would improve throughput (and latency) significantly!
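
A minimal sketch of the LRU policy mentioned above, tracking page numbers only (the page payloads would live elsewhere); LruPolicy and its members are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <optional>
#include <unordered_map>

// Minimal LRU eviction policy over page numbers: a referenced page moves
// to the front of the list; the page at the back is evicted when full.
class LruPolicy {
public:
	explicit LruPolicy(size_t maxPages) : _maxPages(maxPages) {}

	// Marks a page as referenced; returns the evicted PgN, if any.
	std::optional<uint64_t> touch(uint64_t pgN) {
		if (const auto it = _pos.find(pgN); it != _pos.end()) {
			_order.splice(_order.begin(), _order, it->second); // move to front
			return std::nullopt;
		}
		_order.push_front(pgN);
		_pos[pgN] = _order.begin();
		if (_order.size() <= _maxPages)
			return std::nullopt;

		const uint64_t victim = _order.back(); // least recently used page
		_pos.erase(victim);
		_order.pop_back();
		return victim;
	}

private:
	size_t _maxPages;
	std::list<uint64_t> _order; // front = most recently used
	std::unordered_map<uint64_t, std::list<uint64_t>::iterator> _pos;
};
```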

There is a separate cache for pages being filled with new records, pending a flush.

  • Ideally, a dirty page will be written out when it's full, but that may not always be feasible. Flush on a timeout?
  • Should reads be allowed to reference dirty pages? Yes, because the data is already in the WAL; but this might pessimize the code instead of optimizing it if a read is very unlikely to hit a dirty page (see the read-path sketch below).
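
An illustrative read path for the question above, assuming the dirty (pending-flush) cache is consulted before the read cache; the types and names are placeholders.

```cpp
#include <cstdint>
#include <unordered_map>

struct Page; // 4 KiB page contents, defined elsewhere

// Illustrative read path: check the dirty (pending-flush) cache first,
// then the read cache, and only then go to the disk.
inline const Page* findCachedPage(uint64_t pgN,
	const std::unordered_map<uint64_t, Page*>& dirtyCache,
	const std::unordered_map<uint64_t, Page*>& readCache)
{
	if (const auto it = dirtyCache.find(pgN); it != dirtyCache.end())
		return it->second; // not flushed yet, but already durable via the WAL
	if (const auto it = readCache.find(pgN); it != readCache.end())
		return it->second;
	return nullptr; // caller loads the page from disk and caches it
}
```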

Deleted records list (garbage map)

A simple list of all the deleted records (PgNs and record IDs), persisted for offline compaction.
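
A sketch of what such a list could look like, assuming plain structs; how it is persisted (e.g. appended to its own file) is an open design question and not shown.

```cpp
#include <cstdint>
#include <vector>

// One entry per deleted record: which record on which page is dead.
struct DeletedRecord {
	uint64_t pgN;      // page containing the dead record
	uint64_t recordId; // the deleted record's key
};

// Appended to on every delete and persisted, so that the offline compactor
// never has to recover this information via a full DB scan.
using GarbageMap = std::vector<DeletedRecord>;
```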