RocksDB, ZenFS & F2FS Analysis - nicktehrany/notes Wiki

These are my notes on analyzing and understanding the details of each of these applications, their implementations, and how we can purposely trigger certain events (mainly GC). They also cover how we can extract information about GC, such as counters and how often and when it gets triggered.



RocksDB uses memtables, which buffer writes in memory before they are flushed to an SST file on the storage device. Any read or write goes to the memtable [1] first, before any SSTable [2]. Once a memtable is full, a background thread flushes it to an L0 SST file, and the memtable is then destroyed (it has already been replaced by a new one). For consistency, writes can optionally also be placed into a log file, the WAL (write-ahead log), which is a sequentially written file on the storage device.
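
To make the write path above concrete, here is a minimal toy model in Python. It is only a sketch of the idea, not how RocksDB implements it: the class `ToyDB`, the entry-count limit `memtable_limit`, and the in-memory lists standing in for the WAL and the SST files are all assumptions made for illustration.

```python
class ToyDB:
    """Toy write path: WAL append, memtable insert, flush when full."""

    def __init__(self, memtable_limit=3, use_wal=True):
        self.memtable_limit = memtable_limit  # hypothetical limit, in entries
        self.use_wal = use_wal                # the WAL is optional
        self.wal = []        # stands in for the sequentially written log file
        self.memtable = {}   # in-memory write buffer
        self.l0 = []         # flushed "SST files", newest last

    def put(self, key, value):
        if self.use_wal:
            self.wal.append((key, value))  # logged before the memtable insert
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SST is sorted by key; the memtable is replaced by a fresh one.
        self.l0.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal.clear()  # the logged writes now live in the SST

    def get(self, key):
        if key in self.memtable:        # the memtable is checked first
            return self.memtable[key]
        for sst in reversed(self.l0):   # then the SSTs, newest to oldest
            for k, v in sst:
                if k == key:
                    return v
        return None
```

With `memtable_limit=2`, two `put` calls fill the memtable and trigger a flush, and a later `put` for an existing key is served from the new memtable before the older SST is consulted.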

However, memtables can also be flushed before they are full (see here). When a memtable is being flushed, it is turned into an immutable memtable and inserted into the flush pipeline [3]. Recall that flushing, like compaction, is done by background threads, which is why the memtable is added to a pipeline: new writes can still be accepted and there is no stall. The WAL is flushed to the device after every user write, so that the state of the memtable can always be recovered after a crash. Duplicate or removed keys are also dropped when the memtable is written to an L0 SST.
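
The pipeline idea can be sketched as follows. This is a simplified model under my own assumptions, not RocksDB's code: `ToyPipeline`, `seal`, and `flush_one` are hypothetical names, the "background thread" is just a method called by hand, and only duplicate keys are handled (deletes are left out).

```python
from collections import deque


class ToyPipeline:
    """Toy flush pipeline: sealed memtables queue up while writes continue."""

    def __init__(self):
        self.active = []          # (seq, key, value) entries in write order
        self.immutable = deque()  # sealed memtables waiting to be flushed
        self.l0 = []              # flushed "SST files", newest last
        self.seq = 0              # sequence number of the latest write

    def put(self, key, value):
        self.seq += 1
        self.active.append((self.seq, key, value))

    def seal(self):
        """Make the active memtable immutable; writes go to a new one."""
        if self.active:
            self.immutable.append(tuple(self.active))
            self.active = []

    def flush_one(self):
        """What a background thread would do for the oldest sealed memtable."""
        if not self.immutable:
            return None
        newest = {}
        for _seq, key, value in self.immutable.popleft():
            newest[key] = value  # later entries overwrite earlier duplicates
        sst = sorted(newest.items())
        self.l0.append(sst)
        return sst
```

Because `seal` only moves the memtable into the queue, a `put` issued right after it succeeds immediately, which is the "no stall" property described above.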

Reads are served from an LRU cache of blocks (see here).
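
As a rough illustration of what an LRU cache of blocks does, here is a small Python sketch. It is not RocksDB's block cache: `LRUBlockCache`, the block-count `capacity`, and the `read_block` callback (standing in for a read from the storage device) are assumptions for this example. The hit and miss counters are there because this is the kind of information these notes are interested in extracting.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Toy LRU cache keyed by block id."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity      # max number of cached blocks
        self.read_block = read_block  # hypothetical device read callback
        self.blocks = OrderedDict()   # least recently used block comes first
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return self.blocks[block_id]
        self.misses += 1
        data = self.read_block(block_id)       # cache miss: go to the device
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used
        return data
```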

Garbage is removed the same way: background threads drop it during compaction.
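
A toy version of this garbage removal might look like the sketch below. It assumes that every SST holding a key takes part in the merge, which is what makes it safe to drop deletion markers here; the `TOMBSTONE` marker and the `compact` function are hypothetical names, and real compaction in RocksDB is considerably more involved (levels, snapshots, and so on).

```python
TOMBSTONE = "<deleted>"  # hypothetical deletion marker


def compact(ssts):
    """Merge SSTs (ordered oldest to newest) into one sorted run.

    Returns the merged SST and the number of garbage entries dropped,
    i.e. overwritten values and deletion markers.
    """
    merged = {}
    total = 0
    for sst in ssts:               # newer SSTs overwrite older ones
        for key, value in sst:
            merged[key] = value
            total += 1
    live = sorted((k, v) for k, v in merged.items() if v != TOMBSTONE)
    return live, total - len(live)
```

The second return value is a simple counter of how much garbage a compaction removed, the kind of number one would want to extract when studying GC behavior.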




[1] RocksDB - MemTable
[2] RocksDB - High Level Architecture
[3] RocksDB - Memtable Pipeline