SSTables and Compaction - wildcatdb/wildcat GitHub Wiki
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β L1 (64MB) L2 (640MB) L3 (6.4GB) L4 (64GB) β
β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β
β β SST_1 β β SST_5 β β SST_9 β β SST_15 β β
β β Range: β β Range: β β Range: β β Range: β β
β β[apple, β β[apple, β β[apple, β β[apple, β β
β β fish] β β cat] β β bird] β β ant] β β
β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β
β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β
β β SST_2 β β SST_6 β β SST_10 β β SST_16 β β
β β Range: β β Range: β β Range: β β Range: β β
β β[grape, β β[dog, β β[car, β β[bat, β β
β β mouse] β β house] β β garden] β β fox] β β
β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β
β βββββββββββ βββββββββββ βββββββββββ β
β β SST_3 β β SST_7 β β SST_11 β Size-Tiered (L1-L2) β
β β Range: β β Range: β β Range: β Leveled (L3+) β
β β[night, β β[ice, β β[hat, β β
β β stone] β β ocean] β β moon] β β
β βββββββββββ βββββββββββ βββββββββββ β
β βββββββββββ βββββββββββ βββββββββββ β
β β SST_4 β β SST_8 β β SST_12 β β
β β Range: β β Range: β β Range: β β
β β[tree, β β[paper, β β[nest, β β
β β zebra] β β zoo] β β zoo] β β
β βββββββββββ βββββββββββ βββββββββββ β
β β
β KEY CHARACTERISTICS: β
β β’ Each SSTable stores keys in sorted order internally β
β β’ Range = [smallest_key, largest_key] in that SSTable β
β β’ L1-L2: Overlapping ranges allowed (size-tiered compaction) β
β β’ L3+: Non-overlapping ranges enforced (leveled compaction) β
β β’ Bloom filters help skip SSTables during point queries β
β β
β Compaction Process β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Compaction Scheduler (250ms interval default) β β
β β β β
β β Score = (levelSize/capacity)*0.8 + (sstCount/threshold)*0.2 β β
β β β β
β β If score > 1.0: Schedule compaction job β β
β β - Merge overlapping key ranges β β
β β - Resolve duplicate keys (newest wins) β β
β β - Apply tombstones for deletions β β
β β - Last level partitioning when capacity exceeded β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Example Compaction: L1βL2 β β
β β β β
β β Input: SST_1[apple,fish] + SST_2[grape,mouse] β β
β β Output: SST_new[apple,mouse] (merged and sorted) β β
β β β β
β β Key merge process: β β
β β apple, fish, grape, mouse β apple, fish, grape, mouse β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
- Immutable SSTables are organized into levels.
- L1βL2 use size-tiered compaction.
- L3+ use leveled compaction by key range.
- Concurrent compaction with configurable maximum concurrency limits.
- Compaction score is calculated using a hybrid formula
score = (levelSize / capacity) * 0.8 + (sstableCount / threshold) * 0.2
compaction is triggered whenscore > 1.0
- Cooldown period enforced between compactions to prevent resource thrashing.
- Compaction filters out redundant tombstones based on timestamp and overlapping range.
- A tombstone is dropped if it's older than the oldest active read and no longer needed in higher levels.
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SSTable Structure β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β SSTable Metadata: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ID: 123 | Min: "apple" | Max: "zebra" | Size: 64MB β β
β β EntryCount: 50000 | Level: 1 | BloomFilter: Present β β
β β Timestamp: 1609459200000 (latest timestamp based on entries) β β
β β isMerging: atomic flag | isBeingRead: atomic flag β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β File Layout: β
β β
β sst_123.klog (BTree for Keys) sst_123.vlog (Values) β
β βββββββββββββββββββββββββββ βββββββββββββββββββββββββββ β
β β BTree Root β β Block Manager β β
β β (Block ID: 1) β β β β
β β β β Block 1: [value_data_1] β β
β β βββββββββββββββββββββββ β β Block 2: [value_data_2] β β
β β β Internal Node β β β Block 3: [value_data_3] β β
β β β β β β Block 4: [value_data_4] β β
β β β Keys: [m, s] β β β ... β β
β β β Children: [2,3,4] β β β Block N: [value_data_N] β β
β β βββββββββββββββββββββββ β βββββββββββββββββββββββββββ β
β β β β
β β βββββββββββββββββββββββ β KLog Entry Format: β
β β β Leaf Node (Block 2) β β βββββββββββββββββββββββββββββββββββ β
β β β β β β struct KLogEntry { β β
β β β apple β Block 1 β β β Key: []byte β β
β β β cat β Block 2 β β β Timestamp: int64 β β
β β β dog β Block 3 β ββββββββββΆβ ValueBlockID: int64 β β
β β β ... β β β } β β
β β βββββββββββββββββββββββ β β β β
β β β β Special: ValueBlockID = -1 β β
β β βββββββββββββββββββββββ β β indicates deletion tombstone β β
β β β Leaf Node (Block 3) β β βββββββββββββββββββββββββββββββββββ β
β β β β β β
β β β mouse β Block 4 β β Bloom Filter (Optional): β
β β β rat β Block 5 β β βββββββββββββββββββββββββββββββββββ β
β β β snake β Block 6 β β β Double Hashing (FNV-1a + FNV) β β
β β β ... β β β False Positive Rate: 0.01 β β
β β βββββββββββββββββββββββ β β Auto-sized based on entries β β
β βββββββββββββββββββββββββββ βββββββββββββββββββββββββββββββββββ β
β β
β Read Path: β
β 1. Entry Count Check (skip if empty) β
β 2. Range Check (Min/Max keys) β
β 3. Bloom Filter Check (if enabled) β
β 4. BTree Search in KLog β
β 5. Value Retrieval from VLog β
β 6. MVCC Timestamp Filtering β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
SSTable Metadata
Each SSTable tracks the following main meta details:
- Min and Max keys for fast range filtering
- EntryCount (total number of valid records) used for recreating filters if need be
- Size (approximate byte size) used for when reopening levels
- Optional BloomFilter for accelerated key lookups
- Level (mainly tells us during reopening and compaction)
We only list the main meta data but there is more for internal use.
SSTable Format
SSTables are prefix and range optimized immutable BTree's.
Structure
- KLog
.klog
Contains BTree with key metadata and offsets(block ids) to values within VLog - VLog
.vlog
Contains actual value data in append-only format.
During lookups
- Range check Min/Max key range validation (skipped if outside bounds)
- Bloom filter Consulted first if enabled (configurable false positive rate)
- BTree search Key lookup in KLog BTree structure
- Value retrieval Actual values retrieved from VLog using block IDs from KLog entries
- MVCC filtering Only versions visible to read timestamp are returned
Compaction Policy / Strategy
- L1-L2 Size-tiered compaction for similar-sized SSTables
- L3+ Leveled compaction based on key range overlaps
- Concurrent execution Configurable max concurrency with priority-based job scheduling
- Cooldown mechanism Prevents thrashing with configurable cooldown periods
- Last level partitioning When last level exceeds capacity, redistributes data to L-1 (70%) and L-2 (30%) by default
Compaction Scoring Formula
score = (levelSize / capacity) * sizeWeight + (sstableCount / threshold) * countWeight
- Default weight sizeWeight=0.8, countWeight=0.2
- Trigger threshold Compaction starts when score > 1.0
- Priority-based Higher scores get priority in the compaction queue
Compaction Process
- Job scheduling Background process evaluates all levels every
CompactorTickerInterval
- Priority queue Jobs sorted by compaction score (highest first)
- Concurrent execution Up to
MaxCompactionConcurrency
jobs run simultaneously - Atomic operations Source SSTables marked as merging, target level updated atomically
- Active read safety Waits for isBeingRead flag to clear before removing SSTables
- Cleanup Old SSTable files removed after successful compaction