B-Trees
B-trees are used by many modern DBMSs to perform efficient data lookups via indexes. B-trees store pairs of data known as keys and values in a tree-like structure.
Every B-tree is made up of nodes and child pointers. The top-most node is the root, the bottom nodes are leaf nodes, and everything in between is an internal node.
A B-tree of order K is a tree structure with the following properties:
- Each node in the tree stores N key/value pairs, where N is at least 1 and at most K.
- Each internal node (a node that is neither a leaf nor the root) has at least K/2 key/value pairs.
- An internal node with N key/value pairs has N+1 children.
- The root node has at least one key/value pair and two children, unless it is the sole node.
- All leaves are on the same level.
The other defining characteristic of a B-tree is ordering. Within each node, the keys are kept in sorted order. The child subtree to the left of a key contains only keys less than it; the child subtree to the right contains only keys greater than it.
This enforced ordering means you can search for a key very efficiently. Starting at the root node, you repeat the following (see the sketch after this list):
- Check if the node contains the key you are looking for.
- If not, find the position in the node where your key would be inserted if you were adding it.
- Follow the child pointer at this spot down to the next level, and repeat the process.
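As a minimal sketch of this procedure in Python (the in-memory Node class is an illustrative assumption; a real database performs the same steps over fixed-size disk pages):

```python
# Minimal B-tree search sketch. Node is an illustrative in-memory stand-in
# for a disk page; keys are kept sorted within each node.
from bisect import bisect_left

class Node:
    def __init__(self, keys, values, children=None):
        self.keys = keys                  # sorted keys
        self.values = values              # values[i] pairs with keys[i]
        self.children = children or []    # empty for leaf nodes

def search(node, key):
    while node is not None:
        i = bisect_left(node.keys, key)   # where key would be inserted
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]         # found it in this node
        if not node.children:
            return None                   # reached a leaf: key is absent
        node = node.children[i]           # follow the child pointer down
```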
When searching in this way, you only need to visit one node at each level of the tree to search for one key. Therefore, the fewer levels the tree has (the shallower it is), the faster searching can be performed.
B-trees are well suited to situations where you have a very large quantity of data that also needs to be persisted to long-term storage. This is because each node uses a fixed number of bytes that can be tailored to play nicely with disk blocks.
Reading and writing data on HDDs and SSDs is done in units called blocks. These are typically byte sequences of length 4096, 8192, or 16384 (4k, 8k, 16k). A single disk will have a capacity of many millions or billions of blocks. RAM on the other hand is typically addressable on a per-byte level.
This is why B-trees work so well when we need to organize and store persistent data on disk. Each node of a B-tree can be sized to match the size of a disk block (or a multiple of this size).
The number of values each node of the tree can store is based on the number of bytes each node is allocated and the number of bytes consumed by each key/value pair. A toy node might store just 3 integer keys and 4 child pointers. More realistically, if our disk block and B-tree node are 16K, and our keys, values, and child pointers are all 8 bytes, we could store 682 key/value pairs with 683 child pointers per node. A 3-level tree of such nodes could store over 300 million key/value pairs (682 × 682 × 682).
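A quick back-of-the-envelope check of that arithmetic (using the sizes assumed above; real nodes also spend some bytes on headers and metadata):

```python
# Capacity of one 16K node holding 8-byte keys, values, and child pointers:
# n pairs need n*(key+value) bytes plus (n+1) child pointers.
NODE_BYTES = 16 * 1024
KEY = VAL = PTR = 8

n = (NODE_BYTES - PTR) // (KEY + VAL + PTR)
print(n, n + 1)     # 682 key/value pairs, 683 child pointers
print(682 ** 3)     # 317,214,568 pairs in a 3-level tree
```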
B+Tree
B+trees are similar to B-trees, but:
- Key/value pairs are stored only at the leaf nodes.
- Non-leaf nodes store only keys and the associated child pointers.
In MySQL:
- Non-leaf nodes store N child pointers instead of N+1.
- All nodes also contain next and previous pointers, allowing each level of the tree to also act as a doubly-linked list.
B+trees work better for databases because:
- Since inner nodes do not have to store values, we can fit more keys per inner node. This can help keep the tree shallower.
- All of the values can be stored at the same level, and traversed in-order via the bottom-level linked list.
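A rough sketch of the two node shapes and the leaf-level sibling links (the class names and in-memory layout here are illustrative, not InnoDB's actual page format):

```python
# B+tree node shapes: internal nodes route, leaves hold the data.
class InternalNode:
    def __init__(self, keys, children):
        self.keys = keys              # keys only, no values
        self.children = children      # child pointers

class LeafNode:
    def __init__(self, keys, values):
        self.keys = keys
        self.values = values          # key/value pairs live only here
        self.prev = None              # sibling links let each level act
        self.next = None              # as a doubly-linked list

def scan_in_order(leftmost_leaf):
    """Yield every key/value pair in key order via the leaf-level list."""
    leaf = leftmost_leaf
    while leaf is not None:
        yield from zip(leaf.keys, leaf.values)
        leaf = leaf.next
```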
Primary key choices
If we create a secondary index like this:
CREATE TABLE user (
user_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
username VARCHAR(256) NOT NULL,
email VARCHAR(256) NOT NULL,
PRIMARY KEY (user_id)
);
CREATE INDEX email_index ON user(email);
And execute a query like:
SELECT username FROM user WHERE email = '[email protected]';
This will first perform a lookup for [email protected] on the email_index B+tree. After it has found the associated user_id value, it will use that to perform another lookup into the primary key B+tree, and fetch the username from there.
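Conceptually, the two-step lookup looks something like this, reusing the search sketch from earlier (the function and argument names are hypothetical, and rows are modeled as plain dicts):

```python
# Two B+tree lookups: secondary index first, then the primary key tree.
def select_username_by_email(email_index_root, pk_root, email):
    user_id = search(email_index_root, email)   # email -> user_id
    if user_id is None:
        return None
    row = search(pk_root, user_id)              # user_id -> full row (a dict)
    return row["username"] if row else None
```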
Overall, we'd like to always minimize the number of blocks / nodes that need to be visited to fulfill a query. The fewer nodes we have to visit, the faster our query can go. The primary key you choose for a table is pivotal in minimizing the number of nodes we need to visit.
Insertion
The way your table's data is arranged in a B+tree depends on the key you choose. This means your choice of PRIMARY KEY will impact the layout on disk of all of the data in the table, and in turn performance.
Two common choices for a primary key are:
- An integer sequence (such as BIGINT UNSIGNED AUTO_INCREMENT)
- A UUID, of which there are many versions
Let's first consider the consequences of using a UUIDv4, a mostly random 128-bit value.
A few observations:
1. The nodes visited for an insert are unpredictable ahead of time.
2. The destination leaf node for an insert is unpredictable.
3. The values in the leaves are not in order.
Issues 1 and 2 are problematic because over the course of many insertions we'll have to visit many of the nodes (pages) in the tree. This excessive reading and writing leads to poor performance. Issue 3 is problematic if we intend to ever search for or view our data in the order it was inserted.
The same problem can arise (albeit in a less extreme way) with some other UUIDs as well. For example, UUID v3 and v5 are both generated via hashing, and therefore will not be sequential, behaving similarly to random inserts.
UUIDv7 actually does a good job of overcoming some of these challenges: its most significant bits are a millisecond timestamp, so values generated later generally sort later.
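A simplified illustration of why, building a UUIDv7-like value by hand (this ignores the version and variant bits of the real format):

```python
# The most significant 48 bits of a UUIDv7 are a millisecond Unix timestamp,
# so values created later compare greater under byte-wise ordering.
import os, time

def uuid7_like() -> bytes:
    ts_ms = time.time_ns() // 1_000_000
    return ts_ms.to_bytes(6, "big") + os.urandom(10)  # timestamp + random tail

a = uuid7_like()
time.sleep(0.002)           # ensure a later millisecond
b = uuid7_like()
assert a < b                # later key sorts after the earlier one
```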
Let's instead consider using a sequential BIGINT UNSIGNED AUTO_INCREMENT as our primary key, and think about what happens when we insert sequential values into the B+tree.
This mitigates all of the aforementioned problems:
1. We always follow the right-most path when inserting new values.
2. Leaves only get added on the right side of the tree.
3. At the leaf level, data is in sorted order based on when it was inserted.
Because of 1 and 2, many insertions happening in sequence will revisit the same path of pages, leading to fewer I/O requests when inserting a lot of key/value pairs.
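A quick way to see the difference in arrival order (uuid4 keys arrive effectively shuffled, while an auto-increment counter arrives already sorted, so each new key belongs after everything already in the tree):

```python
# Compare the arrival order of random vs sequential keys.
import uuid
from itertools import count, islice

random_keys = [uuid.uuid4().bytes for _ in range(1000)]
sequential_keys = list(islice(count(1), 1000))

print(random_keys == sorted(random_keys))          # False (almost surely)
print(sequential_keys == sorted(sequential_keys))  # True: right-most inserts
```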
Reading Data Order
It's common to search for data from a database in time-sequenced order. Consider viewing the timeline on X, or a chat history in Slack. We typically want to see the posts and chat messages in time (or reverse-time) order. This means we'll often read chunks of the database that are "near" each other in time. These queries take the form:
SELECT username, message_text, ...
FROM post
WHERE sent > $START_DATETIME
AND sent < $END_DATETIME
ORDER BY sent DESC;
With hash-based or random UUID keys, the values for such a time range are spread out across many non-adjacent leaf nodes.
With sequentially inserted values, on the other hand, all of the pages containing the search results will be next to each other; several matching rows may even sit together in a single page. For this kind of query pattern, a sequential primary key minimizes the number of pages that need to be read, as the sketch below illustrates.
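Here is a sketch of serving such a range query purely from the leaf level, reusing the LeafNode links from the B+tree sketch above (the root-to-leaf descent that finds the first qualifying leaf is omitted):

```python
# Walk sibling leaves left to right, yielding rows inside [start, end].
def range_scan(first_leaf, start, end):
    leaf = first_leaf
    while leaf is not None:
        for key, value in zip(leaf.keys, leaf.values):
            if key > end:
                return                  # past the range: stop scanning
            if key >= start:
                yield key, value
        leaf = leaf.next                # hop to the adjacent leaf page
```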
Size
Another important consideration is key size. We always want our primary keys to be:
- Big enough to never face exhaustion
- Small enough to not use excessive storage
For integer sequences, we can sometimes get away with a MEDIUMINT (16 million unique values) or INT (4 billion unique values) for smaller tables. For big tables, we often jump to BIGINT to be safe (18 quintillion possible values). BIGINTs are 64 bits (8 bytes). UUIDs are typically 128 bits (16 bytes), twice the size of even the largest integer type in MySQL. Since B+tree nodes are a fixed size, a BIGINT key allows us to fit more keys per node than a UUID. This results in shallower trees and faster lookups.
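A rough comparison of keys per internal node for the two key types (assuming a 16K node, 8-byte child pointers, and ignoring page headers and per-record overhead):

```python
# Internal B+tree nodes store only keys and child pointers:
# n keys need n*key_bytes + (n+1)*PTR bytes.
NODE_BYTES, PTR = 16 * 1024, 8

for name, key_bytes in [("BIGINT", 8), ("UUID", 16)]:
    n = (NODE_BYTES - PTR) // (key_bytes + PTR)
    print(name, n)   # BIGINT: 1023 keys, UUID: 682 keys
```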
Recall that one of the big benefits of a B+tree is the fact that we can set the node size to whatever we want. In InnoDB, the B+tree nodes are typically set to 16k, the size of an InnoDB page.
When fulfilling a query (and therefore traversing B+trees), InnoDB does not read individual rows and columns from disk. Whenever it needs to access a piece of data, it loads the entire associated page from disk.
InnoDB has some tricks up its sleeve to mitigate this, the main one being the buffer pool. The buffer pool is an in-memory cache for InnoDB pages, sitting between the pages on-disk and MySQL query execution. When MySQL needs to read a page, it first checks if it's already in the buffer pool. If so, it reads it from there, skipping the disk I/O operation. If not, it finds the page on-disk, adds it to the buffer pool, and then continues query execution.
Without it, we'd end up doing significantly more disk I/O operations to handle a query workload. Even with the buffer pool, minimizing the number of pages that need to be visited helps performance, because (1) there's still a (small) cost to looking up a page in the buffer pool, and (2) it reduces the number of buffer pool loads and evictions that need to take place.
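A minimal sketch of that read path, with a simple LRU policy standing in for InnoDB's far more sophisticated page replacement:

```python
# Check the in-memory cache first; fall back to disk and evict LRU when full.
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity, read_page_from_disk):
        self.pages = OrderedDict()    # page_id -> page bytes, in LRU order
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk

    def get(self, page_id):
        if page_id in self.pages:                 # hit: no disk I/O needed
            self.pages.move_to_end(page_id)
            return self.pages[page_id]
        page = self.read_page_from_disk(page_id)  # miss: load from disk
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)        # evict least-recently-used
        return page
```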