TokuFT Engine Architecture - Tokutek/tokumxse GitHub Wiki

Upstream Storage Engine API

The mongo storage engine api has a FAQ here but it hasn't been kept up to date much (last update was 4 months ago as of this writing).

Roughly, the api has these components that need to be implemented:

  • RecordStore: responsible for storing BSON documents accessible by RecordId. May be capped.
  • SortedDataInterface: implements an index (including the _id index), which maps index keys to RecordIds. May be unique.
  • RecoveryUnit: manages a transaction for the storage engine, essentially. Lives inside an OperationContext which is passed around through most functions in the code. Arguably the most complicated part of the api because it has strange rules around atomicity and read-your-own-writes that don't always map directly to typical transactional semantics. However, we've got this mostly figured out, it shouldn't need much tweaking.
  • StorageEngine: just provides the initialization and glue that lets the server create the above three objects

You'll also be exposed to the following types of objects:

  • RecordId: usually just an autoincrement integer (except in the case of the oplog where you have to extract it from the optime here that you get to generate inside the RecordStore.
  • [KeyString'](https://github.com/mongodb/mongo/blob/master/src/mongo/db/storage/key_string.h): The serialized form of either a RecordIdor a BSON key, optionally concatenated with aRecordId. Once encoded, these can be compared with memcmp, which is a nice feature. They can be decoded back into RecordIds and BSON keys, but to decode an index key you also need TypeBitswhich are usually stored next to the encodedKeyString`.
  • RecordData: just a chunk of bytes representing a record as stored on disk. It's a serialized BSON object, but you don't need to know that, just store it and hand it back.

KVDictionary Abstraction Layer

We decided that having separate RecordStore and SortedDataInterface classes was a bit complicated, and a better API would match the fractal tree API more closely as just a sorted key/value dictionary. Ideally this API would also be suitable for RocksDB and WiredTiger, but it seems unlikely that they'll use this API at this point. In any case, the code in the KVDictionary layer implements most of the complicated operations in the RecordStore and SortedDataInterface (including logic around capped collections and tailable cursors, as well as persisting size and count statistics for engines which can't provide those simply themselves) in terms of a simpler API called KVDictionary.

We don't simplify the RecoveryUnit abstraction in this layer yet, but it wouldn't be a bad idea to try.

The KVDictionary layer requires the implementation of the following classes:

  • KVDictionary: this is the main sorted map data structure. You need to be able to do inserts, gets, and deletes, and provide a Cursor which can seek to a key and go next in a given direction. Optionally, you can implement optimized versions of updates and duplicate key checks for unique indexes.
  • KVEngineImpl: just needs to create new KVDictionarys and RecoveryUnits. You still need to implement RecoveryUnit yourself.

The KVDictionary layer makes use of a class Slice which is just a contiguous range of bytes with some logic around memory management and borrowing. These are used for keys and values, it's pretty straightforward but be careful about returning owned Slices.

Reference Implementation

A non-persistent reference implementation of the KVDictionary API using a std::map is available, called kv_heap. It's a good place to start to see what it takes to make a basic engine, there's a bit of glue code in kv_heap_init.cpp and in the SConscript that's worth checking, and you should see how the unit tests work.

TokuFT

The TokuFT engine lives in tokuft and implements the KVDictionary API, including its optional parts (updates, capped delete optimization, duplicate key checking) and makes use of some optional parts of KVDictionary (KVSizeStorer to maintain persistent stats, and VisibleIdTracker to track document visibility in capped collections).

RecoveryUnits

The RecoveryUnit is a bit weird. The way we've done it is to not use child transactions, and to use a counter to track depth. We commit when the outermost commit happens, and abort if the outermost transaction ends without being committed. When we create a transaction, we peek at the lock state to see whether the client intends to write something; if they do we create a DB_SERIALIZABLE transaction so that we take read locks, if not, we create a DB_TXN_SNAPSHOT | DB_TXN_READ_ONLY transaction. We also have a way to detect when we're a secondary, and in that case we pretend we're prelocked because we know there will be no conflicts (https://github.com/Tokutek/tokumxse/blob/b84f79c259a7db416e1b01f673c20b7e35a2ca7d/src/mongo/db/storage/tokuft/tokuft_dictionary.cpp#L82-L100).

ftcxx

The TokuFT engine uses a new API layer above the ydb called ftcxx, which provides a somewhat more modern, fairly opinionated C++ API for the fractal tree. By far the most complicated thing here is the cursor API, which makes heavy use of templates to be able to inline things like comparison functions for bounds checking, and to do zero-copy but inlined callbacks for result handling, as well as automatic buffering for the bulk fetch API. Unfortunately, the structure of the cursors in the mongo storage engine can't make great use of this because they need to copy out the data, but the goal was to have something useful for more than just TokuMXse in the future, and it doesn't slow down TokuMXse. The templates are nasty, but the result is actually quite nice to work with, which you can see in the example in cursor_test.cpp.