Storage implementation details - VioletGiraffe/cpp-db GitHub Wiki

General concepts

  • The index is in RAM and only serialized/deserialized on startup and graceful exit. This is fine because index can always be reconstructed by scanning the main data storage file (until it gets too large - then an index storage mechanism can be devised, but the important thing is that the index is always recoverable as long as the data is consistent).

  • Every record must have a unique primary key field.

  • Data is never modified or removed in-place on the disk - all items are persistent. New data is always appended to the end of the storage file. Whether it is insertion of the new item or update of an existing one ultimately makes no difference.

  • Deleted or updated data is not marked in-place in any specific way. DBMS knows a record is obsolete when it discovers another record with the same key further down the storage file.

  • A separate data structure might be used to track deleted items for cleaning it from the storage file, but this introduces another possibility for inconsistency after process crash.

  • The existence of an item can be tested by querying the index.

  • The index is not persistent, it is updated on-the-fly to reflect the true current state of the storage. When an item is updated the index only points to the new (most recent) value, and when it is deleted from the database it is deleted from the index.

  • Any write to disk is flushed right away!

  • WAL: only writes (append / update / delete) need to go through WAL. Only one write operation is in flight at any given moment.

Consequences:

  1. Write operations against the storage are executed one by one, i. e. at most one is ever in progress.
  2. Hence, write batching is not available - severe inefficiency, and horrible write amplification for SSDs.

At the same time, since new data is always appended to end of the storage block, write batching is very natural, but then WAL needs to have a number of pending write operations.

Create

Stages:

  1. Write to WAL.
  2. Write the new record to storage.
  3. Update the index.
  4. Mark the operation in WAL as successful.

2 and 3 are mutually independent; 4 depends on 2 and 3; 2 and 3 depend on 1.

Read

No WAL entry necessary since it's a non-modifying operation. Multiple reads can execute in parallel! But have to wait for writes.

Stages:

  1. Lookup the record's location in the index.
  2. Lookup the value at this location in the storage.

Delete

Stages:

  1. Lookup the record's location in the index. If it doesn't exist, abort and report an error.
  2. Write to WAL.
  3. Remove the entry from index.
  4. Mark the operation in WAL as successful.

2 and 3 can be done in parallel, all the other operations depend on the previous one.

Update (full)

Updates the whole record (optionally, inserting a new one if it doesn't exist). Stages:

  1. Lookup the record's location in the index. If it doesn't exist, and insert if not present == false, go to 6 and report an error.
  2. Write to WAL.
  3. Delete this record (see Delete). Fail if the record doesn't exist, unless "insert if doesn't exist is set".
  4. Update the index with the actual location of the new data.
  5. Write the new data to the storage.
  6. Mark the operation in WAL as successful.

4 and 5 may be concurrent? All the other stages depend on the previous one linearly.

Append to array

For a record with array, specifies only the new content to be appended to the array, since duplicating the complete data is wasteful. Stages:

  1. Lookup the record's location in the index. If it doesn't exist, and insert if not present == false, go to 5 and report an error.
  2. Write to WAL.
  3. Update the index with the actual location of the new data.
  4. Write the new data to the storage.
  5. Mark the operation in WAL as successful.

3 and 4 may be concurrent? All the other stages depend on the previous one linearly.

COMPLICATIONS: an array becomes sparse - scattered across the storage. The index needs to track all the parts. Updating the array (replacing old contents with new contents) requires scrubbing all the old chunks. It might be prudent to support differential update (only the chunks that changed), but this requires knowing (fetching?) all the current contents to build the diff. It also requires the index to not only know the list of chunks the array is scattered across, but which data is in which chunk. Not feasible?

Further development ideas

  • Not writing tombstone values can achieve MVCC.
  • Storage file can be split into blocks with metadata. Does that simplify anything?
    • Info about deleted and stale (updated records). Garbage collection could be multi-threaded.
    • Locking can be at the block level instead of the whole storage. But this doesn't necessarily require any metadata to be present in the storage itself.
    • Write batching. How it affects the WAL? Can reads return pending data from the write buffer? And should they?

Error handling

For a database, dealing with I/O errors has a single correct answer. Crash and then run recovery from scratch. If the I/O system has returned an error, there is no longer any way to know what the state of that is. See: fsync-gate. The only way to recover is to stop, reload everything (apply the WAL, run recovery, etc) and get back into a stable state. SIGBUS isn’t my cup of tea with regards to handling this, but error handling for I/O error isn’t actually something that you do, so just restarting the process ends up more acceptable than you might initially think.
Source: https://ravendb.net/articles/re-are-you-sure-you-want-to-use-mmap-in-your-database-management-system