Low Level Design: B-Tree Database Index

⏱ 9 min read

B-Tree Node Structure

A B-tree of minimum degree t enforces a capacity invariant on every node except the root: each node holds between t-1 and 2t-1 keys. An internal node with k keys stores a parallel array of k+1 child pointers. The keys act as separators: every key in the subtree rooted at child[i] is strictly greater than key[i-1] and strictly less than key[i]. A leaf node stores keys alongside either full values or record pointers (page/offset pairs) into a heap file, but carries no child pointers. Node size is a deliberate engineering choice. Setting node size equal to the underlying storage block size — typically 4 KB for SSDs and spinning disks, or 16 KB as PostgreSQL uses — means a single disk read fetches exactly one full node. Larger pages increase the branching factor (more keys per node), reducing tree height and thus the number of I/Os per lookup at the cost of reading more bytes per node access. For a table with 100 million rows and 16-byte keys in a 4 KB node, tree height stays around 4, meaning at most 4 disk reads to reach any leaf.

Search Algorithm

Search starts at the root and descends one level per step. At each node the algorithm performs a binary search across the sorted key array to identify the correct child subtree: if the target key equals key[i] the search succeeds immediately; if the target is less than key[0] descend into child[0]; if it falls between key[i] and key[i+1] descend into child[i+1]; if it exceeds all keys descend into the rightmost child. Reaching a leaf without finding the key means it is absent. The dominant cost is disk I/O, not CPU. Each level requires one page read. Tree height is O(log_t N) where N is the total number of keys and t is the branching factor. With t = 200 (typical for integer keys in a 4 KB page) and N = 10^9, height is roughly 4. In-memory binary search within each node is negligible. Caching the root and upper levels in the buffer pool reduces average disk reads further — the root is almost always hot.

Insertion

Insertion locates the correct leaf via the same descent used for search, then inserts the new key in sorted position. If the leaf has fewer than 2t-1 keys it has room and the operation completes. If the leaf is full (exactly 2t-1 keys) it must be split before the key is inserted. The split-on-the-way-down strategy avoids a second upward pass: as the search descends, any full node encountered is split proactively, guaranteeing the parent will have room to receive the promoted median key. The algorithm thus makes a single top-down pass. When the root itself is full, a new empty root is created, the old root is split into two children, and the tree grows in height by exactly one. Tree height increases only at the root, keeping the tree perfectly balanced at all times.

Node Splitting

Splitting a full node containing 2t-1 keys produces two nodes of t-1 keys each and promotes the median key (at index t-1) to the parent. Concretely: allocate a new right-sibling node; copy keys t through 2t-2 into the new node (and the corresponding t child pointers if the node is internal); truncate the original node to keys 0 through t-2; insert the median key into the parent at the appropriate position alongside a pointer to the new right sibling. The cost is one new node allocation plus O(t) key and pointer copies — independent of N. For a 4 KB page with 200-way branching factor, splitting copies at most ~100 keys. The parent gains exactly one key and one child pointer. Because the parent was guaranteed not to be full (split-on-the-way-down), no cascading split is required in the single-pass variant, though a recursive variant can split bottom-up instead.

Deletion

Deletion handles three structural cases. Case 1: the key is in a leaf and the leaf has more than t-1 keys — remove it directly. Case 2: the key is in an internal node — replace it with its in-order predecessor (the largest key in the left subtree) or in-order successor (smallest in the right subtree), then delete that replacement key from the leaf it came from; the leaf deletion then reduces to Case 1 or Case 3. Case 3: the leaf after removal would have fewer than t-1 keys (underflow). Underflow resolution: first try to borrow from an immediate sibling via the parent — rotate one key from the sibling through the parent separator into the deficient node; this requires only local pointer updates. If both siblings are at minimum occupancy, merge the deficient node with one sibling and pull the parent separator key down into the merged node. The merged node has 2t-2 keys (within capacity). The parent loses one key and one child pointer; if the parent underflows, the same borrow/merge logic applies recursively. If merging collapses the root to zero keys, the single remaining child becomes the new root and height decreases by one.

Range Queries

A range query for keys in [lo, hi] begins with a point search for lo, descending to the leaf containing the smallest key >= lo. From that position the algorithm performs an in-order traversal, collecting keys until a key exceeding hi is encountered or the tree is exhausted. In a plain B-tree this traversal may require ascending to parent nodes to find the next sibling leaf, which can add pointer-chasing overhead. The B+ tree optimization addresses this directly: leaf nodes maintain a doubly-linked list connecting them in key order. After finding the start leaf via the tree descent, range traversal becomes a simple sequential scan along the leaf chain — no parent backtracking required. Sequential I/O is cache-friendly and maps efficiently to read-ahead prefetching in storage hardware. This is the primary reason B+ trees dominate database index implementations.

B+ Tree vs B-Tree

In a B+ tree, all data records (or pointers to records) live exclusively in leaf nodes. Internal nodes store only keys as routing information and carry no associated values. This distinction has concrete consequences. With values absent from internal nodes, more keys fit per internal page, increasing the branching factor and reducing tree height for the same workload. A tree storing 8-byte keys without values in internal nodes versus 8-byte keys plus 100-byte values might have branching factor 400 vs 30 respectively. Range scans benefit from the linked-leaf structure described above. Point lookups always reach a leaf regardless, so read I/O count is the same as a B-tree; the B+ tree simply has shorter paths due to higher branching factor. PostgreSQL’s B-tree index implementation follows B+ tree conventions. MySQL InnoDB’s clustered index is a B+ tree where leaf nodes store the full row data; secondary indexes store the primary key value at each leaf, requiring a second B+ tree lookup (index-to-primary-key) for non-covering queries.

WAL Integration

B-tree modifications never write directly to the in-place page without first recording the change in the write-ahead log (WAL). The WAL record for an operation includes the before-image and after-image of every page modified, or a logical description of the operation sufficient to redo it. The WAL entry is flushed to durable storage before the modified data pages — this is the “write-ahead” guarantee. Only after the WAL record is safely on disk does the buffer manager allow the dirty data page to be written. On crash recovery: the database replays WAL records from the last checkpoint forward, redoing all committed transactions’ changes and undoing all uncommitted transactions’ changes using the before-images. Node splits and merges touch multiple pages atomically — the WAL covers all affected pages in a single logical operation so partial splits cannot survive a crash. This logging discipline, combined with B-tree’s ability to absorb random writes by buffering in internal nodes and deferring leaf writes, makes B-tree indexes practical for transactional workloads with mixed read/write patterns.

What is the structure of a B-tree node?

A B-tree node of order m contains at most m-1 keys and m child pointers. All leaf nodes are at the same depth. Internal nodes act as routers: keys partition the key space among child subtrees. The minimum fill factor is ceil(m/2)-1 keys per node, ensuring the tree remains balanced after deletions.

How does B+ tree differ from B-tree for database indexing?

In a B+ tree, all data records are stored only at leaf nodes. Internal nodes contain only keys for routing. Leaf nodes are linked in a doubly linked list enabling efficient range scans. Most database engines (InnoDB, PostgreSQL) use B+ trees because sequential range queries are critical for SQL workloads.

What happens during a B-tree node split?

When an insertion causes a node to exceed m-1 keys, the node splits: the median key is promoted to the parent, and the node divides into two nodes each with half the keys. If the parent also overflows, the split propagates upward. In the worst case, a split propagates all the way to the root, increasing tree height by one.

How does WAL integrate with B-tree updates?

Write-Ahead Logging (WAL) records every B-tree modification (key insert, node split, page allocation) to a log before applying it to the data file. On crash recovery, the engine replays WAL entries to restore the B-tree to a consistent state. PostgreSQL uses a combination of WAL and page-level checksums to detect partial writes.

What is the fill factor in a B-tree index?

Fill factor specifies how full each leaf page is packed during initial index creation (e.g., 90% full). Leaving space (10%) reduces page splits on future insertions at the cost of slightly larger index size. Frequently updated indexes use lower fill factors to amortize split costs over more insertions.