Asynchronous IO

RocksDB can reduce the latency of certain IO-bound user queries by using asynchronous IO. This capability currently benefits long scans and MultiGet. It requires a FileSystem implementation that supports async IO. Currently, PosixFileSystem supports it on Linux kernel versions that support io_uring.

Scan

When async IO is enabled for an iterator, the iterator tries to parallelize operations as much as possible by not blocking on IO. This is accomplished in two ways:

Seek - A seek operation on an iterator positions all the child iterators at the given target key. There is a child iterator for each level in the LSM, and the MergingIterator calls seek on each child LevelIterator, which has to open the appropriate SST file and position the table iterator on the data block containing the closest key at or after the target key in the file. If the data block is not found in cache, it must be read from disk. This is typically a blocking read. However, with async IO, the table iterator seek happens in two stages. In the first stage, an async read is issued for the data block and a special status (Status::TryAgain()) is returned to the MergingIterator to indicate that a read is in progress. After a first pass through all the LevelIterators, the MergingIterator makes a second pass through the ones that returned Status::TryAgain(). In the second pass, each iterator waits for the read to complete, finishes positioning the iterator and then returns. This allows the data block cache misses and the resultant reads to be done in parallel.
Next - When the user calls Next on the iterator, one or more child iterators may need to advance to the next data block. If that block is not in the block cache, a file read may be triggered. The iterator has prefetching logic that reads ahead some amount of data beyond what was requested into a buffer. Subsequent Next operations can read data blocks from the buffer directly, until the end of the buffer is reached and another read is required. Async IO takes this a step further by initiating a prefetch read when the iterator is at the midpoint of the prefetch buffer. The async prefetch read is for data beyond the current prefetch buffer. As long as the prefetched data is useful, the iterator will keep asynchronously prefetching more data.
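The two-stage Seek described above can be illustrated with a small standalone model. This is only a sketch, not RocksDB code: ToyLevelIterator, SeekStatus and SeekAll are hypothetical names, std::async stands in for an io_uring read, and a short sleep simulates the disk latency.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <future>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for Status::OK() / Status::TryAgain().
enum class SeekStatus { kOk, kTryAgain };

// Hypothetical per-level child iterator over a sorted list of keys.
class ToyLevelIterator {
 public:
  ToyLevelIterator(std::vector<std::string> keys, bool block_cached)
      : keys_(std::move(keys)), block_cached_(block_cached) {}

  // Stage 1: position right away on a cache hit; otherwise start an
  // asynchronous read and report that the seek is still in progress.
  SeekStatus SeekAsync(const std::string& target) {
    target_ = target;
    if (block_cached_) {
      Position();
      return SeekStatus::kOk;
    }
    pending_ = std::async(std::launch::async, [] {
      // Simulated disk read.
      std::this_thread::sleep_for(std::chrono::milliseconds(20));
    });
    return SeekStatus::kTryAgain;
  }

  // Stage 2: wait for the read to complete, then finish positioning.
  void FinishSeek() {
    assert(pending_.valid());
    pending_.get();
    Position();
  }

  bool Valid() const { return valid_; }
  const std::string& key() const { return current_; }

 private:
  // Position on the first key at or after the target.
  void Position() {
    auto it = std::lower_bound(keys_.begin(), keys_.end(), target_);
    valid_ = it != keys_.end();
    if (valid_) current_ = *it;
  }

  std::vector<std::string> keys_;
  bool block_cached_;
  std::string target_;
  std::string current_;
  bool valid_ = false;
  std::future<void> pending_;
};

// Two passes over the children: issue every read first, then wait only on
// the children that reported kTryAgain. Returns the child positioned on the
// smallest key at or after the target, or nullptr if there is none.
const ToyLevelIterator* SeekAll(std::vector<ToyLevelIterator>& children,
                                const std::string& target) {
  std::vector<ToyLevelIterator*> retry;
  for (auto& child : children) {
    if (child.SeekAsync(target) == SeekStatus::kTryAgain) {
      retry.push_back(&child);
    }
  }
  for (auto* child : retry) child->FinishSeek();

  const ToyLevelIterator* smallest = nullptr;
  for (const auto& child : children) {
    if (child.Valid() && (!smallest || child.key() < smallest->key())) {
      smallest = &child;
    }
  }
  return smallest;
}
```

Because every simulated read is issued before any of them is waited on, the total wait in this model is roughly one read latency rather than one per level with a cache miss.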

This option applies to direct IO. If buffered IO is used, the iterator relies on the page cache readahead.
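The midpoint-triggered readahead described under Next can be modeled with simple bookkeeping. The sketch below is a simplification under stated assumptions: ToyPrefetcher is a hypothetical class, block numbers replace byte offsets, the readahead size is fixed, and an async prefetch is assumed to have completed by the time the scan reaches it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model of iterator readahead. Hypothetical; not RocksDB's prefetch code.
class ToyPrefetcher {
 public:
  ToyPrefetcher(std::size_t file_blocks, std::size_t buffer_blocks)
      : file_blocks_(file_blocks), buffer_blocks_(buffer_blocks) {
    assert(buffer_blocks > 0);
  }

  // Consume one block of a forward scan.
  void Read(std::size_t block) {
    if (block < start_ || block >= end_) {
      if (next_issued_ && block >= end_ && block < next_end_) {
        // The async prefetch already covers this block: switch buffers.
        start_ = end_;
        end_ = next_end_;
      } else {
        // Buffer miss with nothing usable in flight: blocking refill.
        start_ = block;
        end_ = std::min(block + buffer_blocks_, file_blocks_);
        ++blocking_reads_;
      }
      next_issued_ = false;
    }
    // At or past the midpoint, start fetching the range beyond the buffer.
    const std::size_t midpoint = start_ + (end_ - start_) / 2;
    if (!next_issued_ && block >= midpoint && end_ < file_blocks_) {
      next_end_ = std::min(end_ + buffer_blocks_, file_blocks_);
      next_issued_ = true;
      ++async_prefetches_;
    }
  }

  std::size_t blocking_reads() const { return blocking_reads_; }
  std::size_t async_prefetches() const { return async_prefetches_; }

 private:
  std::size_t file_blocks_;
  std::size_t buffer_blocks_;
  std::size_t start_ = 0;     // current buffer covers [start_, end_)
  std::size_t end_ = 0;
  std::size_t next_end_ = 0;  // in-flight prefetch covers [end_, next_end_)
  bool next_issued_ = false;
  std::size_t blocking_reads_ = 0;
  std::size_t async_prefetches_ = 0;
};
```

In this model, scanning a 16-block file with a 4-block buffer performs one blocking read for the first buffer and three async prefetches for the rest. Without the midpoint trigger, each of the four buffer refills would block.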

Known limitations:

Short scans may prefetch more data than necessary, compared to a scan without async_io.
It does not apply to the CompactionIterator at the moment.

MultiGet

The batched MultiGet API will use async IO where possible to read data blocks from multiple SST files in the same non-L0 level in parallel. A batch of MultiGet keys may overlap with many SST files in a level. By reading from these files in parallel using async IO, the overall MultiGet latency is reduced.
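The access pattern can be sketched with standard C++ alone. Note the substitution: RocksDB uses folly coroutines for this, whereas the sketch below uses std::async tasks purely to show the shape of the parallelism. ToyFile, ToyLevel, LookupInFile and ToyMultiGet are hypothetical names, and each "file" is an in-memory map.

```cpp
#include <algorithm>
#include <future>
#include <map>
#include <string>
#include <utility>
#include <vector>

using ToyFile = std::map<std::string, std::string>;  // stand-in for an SST file
using ToyLevel = std::vector<ToyFile>;               // files within one level
using Found = std::vector<std::pair<std::string, std::string>>;

// Look up every requested key in a single file.
Found LookupInFile(const ToyFile& file, const std::vector<std::string>& keys) {
  Found found;
  for (const auto& k : keys) {
    auto it = file.find(k);
    if (it != file.end()) found.emplace_back(k, it->second);
  }
  return found;
}

// Levels are visited one after another (newest first). Within a level, all
// files are read in parallel, and keys found there are not searched again.
std::map<std::string, std::string> ToyMultiGet(
    const std::vector<ToyLevel>& levels, std::vector<std::string> keys) {
  std::map<std::string, std::string> results;
  for (const auto& level : levels) {
    if (keys.empty()) break;
    std::vector<std::future<Found>> reads;
    for (const auto& file : level) {
      reads.push_back(std::async(std::launch::async, [&file, &keys] {
        return LookupInFile(file, keys);
      }));
    }
    // Wait for every read in this level before touching `keys`.
    for (auto& read : reads) {
      Found found = read.get();
      for (auto& kv : found) results.emplace(kv.first, std::move(kv.second));
    }
    keys.erase(std::remove_if(keys.begin(), keys.end(),
                              [&results](const std::string& k) {
                                return results.count(k) > 0;
                              }),
               keys.end());
  }
  return results;
}
```

The sketch waits for a whole level to finish before starting the next one, so a value found in a newer level is never overwritten by an older one.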

Known limitations:

This feature requires RocksDB to be compiled with folly using a compiler with C++20 support. It relies on coroutine support in folly. The integration with folly is currently experimental.
Metadata block reads are blocking reads.
No parallelism across levels. The lookup in each LSM level will happen only after the previous level is finished.
No parallelism in L0.
This works best with larger batch sizes in IO-bound workloads. CPU usage may increase due to the coroutine overhead.

Configuration

Asynchronous IO for scans and MultiGet can be enabled by setting the async_io option in ReadOptions. For MultiGet async IO, RocksDB has to be compiled with C++20 and the -DUSE_COROUTINES compiler flag, and linked with folly.
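As a minimal sketch, the option might be set as follows. The fragment assumes db points to an already-open database and that RocksDB was built as described above. The function name and the keys are hypothetical, and error handling is abbreviated.

```cpp
#include <memory>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Hypothetical helper; `db` is assumed to be an open rocksdb::DB.
void ReadWithAsyncIO(rocksdb::DB* db) {
  rocksdb::ReadOptions read_options;
  read_options.async_io = true;

  // Long scan with async IO enabled on the iterator.
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // Use it->key() and it->value() here.
  }
  rocksdb::Status scan_status = it->status();  // check for errors after the scan
  (void)scan_status;

  // Batched MultiGet with the same read options (placeholder keys).
  std::vector<rocksdb::Slice> keys = {"key1", "key2", "key3"};
  std::vector<rocksdb::PinnableSlice> values(keys.size());
  std::vector<rocksdb::Status> statuses(keys.size());
  db->MultiGet(read_options, db->DefaultColumnFamily(), keys.size(),
               keys.data(), values.data(), statuses.data());
}
```

Each entry in statuses should be checked individually, since a key that is absent is reported per key rather than for the whole batch.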
