B-Tree Index Low-Level Design: Node Structure, Split/Merge, Concurrency Control, and Bulk Loading

B-Tree Structure and Page Layout

A B-tree index is a balanced tree where every node is a fixed-size disk page (8KB in PostgreSQL, 16KB in MySQL InnoDB). Internal nodes store keys and child page pointers. Leaf nodes store keys and either row pointers (heap file TIDs in a secondary index) or full row data (clustered index, also called an index-organized table). Every leaf is at the same depth — this is what “balanced” means and what guarantees O(log N) search regardless of which key is accessed.

The branching factor is determined by page size and key size. An 8KB page with 16-byte keys and 6-byte pointers fits roughly 400 key-pointer pairs. A tree with 3 levels can index 400³ ≈ 64 million rows with at most 3 I/Os per lookup. Larger pages and smaller keys increase the branching factor and reduce tree height.
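The arithmetic above can be sanity-checked in a few lines (the page, key, and pointer sizes are the figures assumed in the text; page header overhead is ignored):

```python
PAGE_SIZE = 8192        # 8 KB page
KEY_SIZE, PTR_SIZE = 16, 6

# Rough branching factor: key-pointer pairs that fit in one page.
fanout = PAGE_SIZE // (KEY_SIZE + PTR_SIZE)
print(fanout)                    # 372 -- "roughly 400"

# Keys reachable with h page reads (one per level).
def reachable(fanout: int, height: int) -> int:
    return fanout ** height

print(reachable(400, 3))         # 64_000_000 keys in at most 3 I/Os
```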

The fill factor (e.g., 70–90%) leaves empty space in each page when the index is built or pages are split. This space absorbs future inserts without immediately triggering another split. Setting fill factor too low wastes space; too high causes frequent splits on write-heavy indexes.
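Using the roughly 400-slot page from above, the free slots each fill factor leaves per leaf can be tabulated directly:

```python
PAGE_CAPACITY = 400   # key slots per leaf at 100% fill

# Integer math avoids float rounding surprises (e.g., 400 * 0.7 != 280.0).
for pct in (70, 90, 100):
    free = PAGE_CAPACITY - PAGE_CAPACITY * pct // 100
    print(f"fillfactor {pct}%: {free} inserts absorbed before the first split")
```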

Search Operation

To find a key, start at the root page. Binary search within the root's key array finds the child pointer to follow. Repeat at each internal level until reaching a leaf. The leaf is scanned to find the matching key (or the first key in a range). For a heap-based secondary index, the row pointer (TID) is used to fetch the actual row from the heap file in a separate I/O.
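The descent can be shown on a toy two-level tree using the standard library's bisect, with one binary search per level (all data invented for the example):

```python
from bisect import bisect_right

separators = [100, 200]                       # root's separator keys
leaves = [[10, 50], [100, 150], [200, 250]]   # child pages, left to right

def lookup(key):
    child = bisect_right(separators, key)     # binary search inside the root
    leaf = leaves[child]                      # one page read per level
    return key if key in leaf else None       # scan the leaf

print(lookup(150))   # 150
print(lookup(60))    # None
```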

Search cost is O(log_B N) I/Os where B is the branching factor and N is the number of keys. With a branching factor of 400 and 64 million rows, this is 3 I/Os in the worst case — typically 1–2 because the root and upper levels are usually in the buffer pool cache.

Insert Operation and Page Split

To insert a key, search for the correct leaf page. If the leaf has free space (fill factor not exceeded), insert the key in sorted order in the page and mark it dirty. If the leaf is full, a split occurs:

  1. Allocate a new page.
  2. Redistribute keys: the lower half stays in the original page, the upper half moves to the new page.
  3. A separator key is promoted to the parent internal node. For a leaf split in a B+-tree, the separator is a copy of the new page's first key; for an internal split, the median key itself moves up.
  4. If the parent is also full, it splits recursively. In the worst case, the split propagates all the way to the root, increasing tree height by one.
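Steps 1–3 for a leaf split can be sketched on plain Python lists (B+-tree style, where the promoted separator is a copy of the new page's first key):

```python
def split_leaf(keys: list, values: list):
    """Lower half stays in the original page; upper half moves to a new one."""
    mid = len(keys) // 2
    new_keys, new_values = keys[mid:], values[mid:]   # step 2: move upper half
    del keys[mid:], values[mid:]
    return new_keys[0], (new_keys, new_values)        # step 3: separator up

keys, values = [10, 20, 30, 40], ["a", "b", "c", "d"]
sep, new_page = split_leaf(keys, values)
print(sep, keys, new_page)   # 30 [10, 20] ([30, 40], ['c', 'd'])
```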

Splits are logged to WAL before they are applied, so they are crash-safe. A split requires at least 3 page writes (original leaf, new leaf, parent) plus WAL records.

Delete Operation and Page Merge

To delete a key, find the leaf and remove the entry. If the leaf falls below the minimum occupancy threshold (typically 50%), the engine attempts to merge it with a sibling. If the sibling is also near minimum, the two pages are merged into one and the separator key is removed from the parent (which may trigger a merge up the tree). If the sibling has enough keys, a redistribute (rotation) moves one key from the sibling, avoiding the merge.

In practice, many systems (including PostgreSQL) are conservative about merges and simply mark underflowing pages as available for future reuse rather than immediately merging, trading some space for fewer write operations.
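The redistribute-or-merge decision can be sketched as a small function (names and the 50% threshold are illustrative, not any engine's actual API):

```python
def on_underflow(page_keys: list, sibling_keys: list, max_keys: int) -> str:
    """Pick a repair action for a page below minimum occupancy."""
    min_keys = max_keys // 2
    if len(sibling_keys) > min_keys:
        return "redistribute"   # rotate one key over from the sibling
    return "merge"              # both near minimum: combine, drop separator

print(on_underflow([5], [10, 20, 30, 40, 50], max_keys=8))   # redistribute
print(on_underflow([5], [10, 20], max_keys=8))               # merge
```

A PostgreSQL-style engine, as noted above, would often skip this entirely and leave the underflowing page for later reuse.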

Concurrency Control: Latch Crabbing

B-trees in concurrent databases use latches (short-duration mutex-like locks on pages) rather than transaction-level locks for structural integrity. A naïve approach of holding a latch on every page visited during a traversal would serialize all writers. Latch crabbing (also called latch coupling) allows concurrent access:

  • Read (shared) latch: acquire shared latch on child, release shared latch on parent. Only one page is latched at a time for reads.
  • Write (exclusive) latch: acquire exclusive latch on child, release exclusive latch on parent — only if the child page is safe (not full on insert, not at minimum on delete). If the child is unsafe (may need a split or merge that touches the parent), hold the parent latch while descending.

This allows multiple concurrent readers to traverse the tree simultaneously. Writers only hold latches on pages they are actively modifying, minimizing contention.
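The write-path descent above can be sketched with plain threading locks standing in for exclusive latches (the standard library has no shared/exclusive latch, and the always-safe check is a placeholder, so this is only a shape sketch):

```python
import threading

class Page:
    def __init__(self, page_id: int, children=()):
        self.page_id = page_id
        self.children = list(children)
        self.latch = threading.Lock()        # stands in for an exclusive latch

    def is_safe(self) -> bool:
        # "Safe" on insert means a split here cannot touch the parent;
        # always True here to keep the sketch short.
        return True

def crab_descend(root: Page, route) -> list:
    """Latch coupling: latch the child first, then, if the child is safe,
    release every ancestor latch still held."""
    held = [root]
    root.latch.acquire()
    node = root
    for idx in route:
        child = node.children[idx]
        child.latch.acquire()                # grab the next foothold...
        if child.is_safe():
            for p in held:                   # ...before letting the old ones go
                p.latch.release()
            held.clear()
        held.append(child)
        node = child
    return held                              # caller releases after modifying

leaf = Page(2)
inner = Page(1, [leaf])
root = Page(0, [inner])
held = crab_descend(root, [0, 0])
print([p.page_id for p in held])             # [2]: only the leaf stays latched
for p in held:
    p.latch.release()
```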

Bulk Loading via Bottom-Up Construction

Loading millions of rows one insert at a time causes many random splits and generates excessive WAL volume. Bulk loading is far more efficient:

  1. Sort all keys externally (external merge sort).
  2. Fill leaf pages sequentially to the fill factor, left to right.
  3. Build internal nodes by collecting the first key of each leaf page.
  4. Build the next level of internal nodes from the internal node keys.
  5. Continue up to the root.

The result is a perfectly packed B-tree built in O(N log N) time (dominated by sorting) with zero splits. PostgreSQL uses this approach for CREATE INDEX and CLUSTER operations.

WAL Integration and Buffer Pool

Every structural B-tree change (insert, delete, split, merge) is recorded in WAL before the page is written to disk. This ensures crash recovery can replay any partial operation. Pages are never written directly to the data file; they pass through the buffer pool (page cache), which manages which pages are in memory and handles dirty page eviction. B-tree pages that are frequently accessed (root, upper-level internal nodes) typically stay resident in the buffer pool permanently, reducing effective I/O depth to 1–2 for most lookups.
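A toy model of the write-ahead rule (all structures here are invented; a real WAL is an fsync'd append-only file addressed by log sequence numbers):

```python
wal: list[str] = []                  # append-only log, "durable" once flushed
wal_flushed_upto = 0                 # highest LSN already fsync'd
disk: dict[int, str] = {}            # the data file
buffer_pool: dict[int, str] = {}     # dirty pages not yet written back

def log(record: str) -> int:
    wal.append(record)
    return len(wal)                  # LSN of this record

def modify_page(page_id: int, contents: str) -> int:
    lsn = log(f"page {page_id} := {contents}")   # WAL record FIRST...
    buffer_pool[page_id] = contents              # ...then the in-memory change
    return lsn

def evict_page(page_id: int, page_lsn: int) -> None:
    global wal_flushed_upto
    # Write-ahead rule: WAL must be flushed through this page's LSN
    # before the page itself may reach the data file.
    wal_flushed_upto = max(wal_flushed_upto, page_lsn)
    disk[page_id] = buffer_pool.pop(page_id)

lsn = modify_page(7, "leaf after split")
evict_page(7, lsn)
print(wal_flushed_upto, disk[7])
```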

SQL Data Model

-- B-tree page metadata (the actual data is in the index file; this tracks page stats)
CREATE TABLE BTreePage (
    page_id     BIGINT NOT NULL,
    tree_id     INT NOT NULL,
    level       INT NOT NULL,          -- 0 = leaf, 1+ = internal
    num_keys    INT NOT NULL,
    free_space  INT NOT NULL,          -- bytes available in the page
    data        BYTEA,                 -- serialized page content (for logical representation)
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (tree_id, page_id)
);

-- Index-level statistics
CREATE TABLE BTreeStat (
    index_name      VARCHAR(255) PRIMARY KEY,
    height          INT NOT NULL,
    num_pages       BIGINT NOT NULL,
    num_keys        BIGINT NOT NULL,
    fill_factor     NUMERIC(4,2) NOT NULL DEFAULT 0.70,
    last_analyzed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_btreepage_level ON BTreePage(tree_id, level);

Python Implementation Sketch

from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional

PAGE_SIZE = 8192      # 8 KB
FILL_FACTOR = 0.70

@dataclass
class BTreeNode:
    page_id: int
    level: int          # 0 = leaf
    keys: list = field(default_factory=list)
    children: list[int] = field(default_factory=list)   # page_ids for internal nodes
    values: list = field(default_factory=list)           # row pointers for leaf nodes

    @property
    def is_leaf(self) -> bool:
        return self.level == 0

    def is_full(self, max_keys: int) -> bool:
        return len(self.keys) >= max_keys

class BTree:
    def __init__(self, order: int = 200):
        self.order = order          # max keys per node
        self.root: Optional[BTreeNode] = None
        self._next_page_id = 0
        self._pages: dict[int, BTreeNode] = {}   # stand-in for the buffer pool

    def _new_node(self, level: int) -> BTreeNode:
        node = BTreeNode(page_id=self._next_page_id, level=level)
        self._pages[node.page_id] = node
        self._next_page_id += 1
        return node

    def _load_page(self, page_id: int) -> BTreeNode:
        # A real engine would fetch through the buffer pool and latch the page.
        return self._pages[page_id]

    def search(self, key) -> Optional[object]:
        """Return value for key, or None if not found."""
        return self._search_node(self.root, key) if self.root else None

    def _search_node(self, node: BTreeNode, key) -> Optional[object]:
        i = self._binary_search(node.keys, key)
        if node.is_leaf:
            if i < len(node.keys) and node.keys[i] == key:
                return node.values[i]
            return None
        child_idx = self._child_index(node, key, i)
        return self._search_node(self._load_page(node.children[child_idx]), key)

    @staticmethod
    def _child_index(node: BTreeNode, key, i: int) -> int:
        # Separator keys[j] is the smallest key under children[j + 1], so a
        # key equal to the separator belongs in the right-hand child.
        return i + 1 if i < len(node.keys) and node.keys[i] == key else i

    def _binary_search(self, keys, key) -> int:
        """Lower bound: index of the first key >= the search key."""
        lo, hi = 0, len(keys)
        while lo < hi:
            mid = (lo + hi) // 2
            if keys[mid] < key:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def insert(self, key, value) -> None:
        if self.root is None:
            self.root = self._new_node(level=0)
        result = self._insert_node(self.root, key, value)
        if result:  # root was split: grow the tree by one level
            promoted_key, new_node = result
            new_root = self._new_node(level=self.root.level + 1)
            new_root.keys = [promoted_key]
            new_root.children = [self.root.page_id, new_node.page_id]
            self.root = new_root

    def _insert_node(self, node: BTreeNode, key, value):
        """Insert into the subtree; return (promoted_key, new_node) on split."""
        i = self._binary_search(node.keys, key)
        if node.is_leaf:
            node.keys.insert(i, key)
            node.values.insert(i, value)
        else:
            child_idx = self._child_index(node, key, i)
            child = self._load_page(node.children[child_idx])
            result = self._insert_node(child, key, value)
            if result is None:
                return None
            promoted_key, new_child = result
            node.keys.insert(child_idx, promoted_key)
            node.children.insert(child_idx + 1, new_child.page_id)
        if node.is_full(self.order):
            return self.split_page(node.page_id)
        return None

    def split_page(self, page_id: int):
        """Split a full page; return (promoted_key, new_node).
        A real engine would WAL-log both pages and the parent update."""
        node = self._load_page(page_id)
        new_node = self._new_node(level=node.level)
        mid = len(node.keys) // 2
        if node.is_leaf:
            # Leaf split: the separator is a copy of the new page's first key.
            new_node.keys, new_node.values = node.keys[mid:], node.values[mid:]
            node.keys, node.values = node.keys[:mid], node.values[:mid]
            promoted = new_node.keys[0]
        else:
            # Internal split: the median key moves up and joins neither half.
            promoted = node.keys[mid]
            new_node.keys = node.keys[mid + 1:]
            new_node.children = node.children[mid + 1:]
            node.keys = node.keys[:mid]
            node.children = node.children[:mid + 1]
        return promoted, new_node

    def delete(self, key) -> bool:
        """Remove key from its leaf; return True if found. Underflow is left
        for later reuse (deferred merge) rather than repaired eagerly."""
        if self.root is None:
            return False
        node = self.root
        while not node.is_leaf:
            i = self._binary_search(node.keys, key)
            node = self._load_page(node.children[self._child_index(node, key, i)])
        i = self._binary_search(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            node.keys.pop(i)
            node.values.pop(i)
            return True
        return False

    def bulk_load(self, sorted_pairs: list[tuple]) -> None:
        """Build the B-tree bottom-up from pre-sorted (key, value) pairs."""
        if not sorted_pairs:
            return
        max_leaf_keys = int(self.order * FILL_FACTOR)
        # 1. Fill leaf pages sequentially, left to right, to the fill factor.
        leaves: list[BTreeNode] = []
        for i in range(0, len(sorted_pairs), max_leaf_keys):
            chunk = sorted_pairs[i:i + max_leaf_keys]
            leaf = self._new_node(level=0)
            leaf.keys = [k for k, _ in chunk]
            leaf.values = [v for _, v in chunk]
            leaves.append(leaf)
        # 2. Build internal levels bottom-up. Track the smallest key under
        #    each node: an internal node's own first key is NOT the smallest
        #    key in its subtree, so using n.keys[0] as a separator would
        #    misroute lookups once the tree has three or more levels.
        current_level = leaves
        min_keys = [leaf.keys[0] for leaf in leaves]
        level_num = 1
        while len(current_level) > 1:
            parents: list[BTreeNode] = []
            parent_mins: list = []
            for i in range(0, len(current_level), self.order):
                group = current_level[i:i + self.order]
                parent = self._new_node(level=level_num)
                parent.keys = min_keys[i + 1:i + len(group)]
                parent.children = [n.page_id for n in group]
                parents.append(parent)
                parent_mins.append(min_keys[i])
            current_level = parents
            min_keys = parent_mins
            level_num += 1
        self.root = current_level[0]
        print(f"Bulk load complete: {len(sorted_pairs)} keys, height={level_num}")

Frequently Asked Questions

How expensive is a B-tree page split compared to a merge?

A split requires allocating a new page, redistributing keys, writing two modified pages, updating the parent (which may itself split), and logging all changes to WAL — typically 3 or more page writes plus WAL records. A merge requires reading two sibling pages, combining their contents, writing the merged page, deleting the empty page, and updating the parent separator key — a similar write cost. In practice, many databases defer or avoid merges entirely, marking underflowing pages for future reuse rather than merging immediately, because merges can cascade and are expensive to implement correctly.

What is latch crabbing and why does it matter?

Latch crabbing (latch coupling) is a concurrency technique for B-tree traversal. When descending the tree, you acquire a latch on the child page before releasing the latch on the parent page — like a crab walking, always holding one foothold before releasing another. For reads, a shared latch is acquired on the child and the parent latch is released immediately, allowing multiple concurrent readers. For writes, exclusive latches are held on the path only when a split or merge might propagate upward. Latch crabbing prevents a concurrent writer from restructuring a page you are currently reading, without serializing all access to the tree.

How do I choose the right fill factor?

Fill factor controls how full pages are packed during index creation or rebuilding. A lower fill factor (e.g., 70%) leaves more free space per page, reducing split frequency for insert-heavy workloads at the cost of more pages and slightly higher read I/O. A higher fill factor (e.g., 90%) uses space more efficiently and speeds up sequential scans but causes more frequent splits on heavily inserted indexes. For read-only or rarely updated indexes, use a 100% fill factor (fillfactor = 100 in PostgreSQL). For frequently inserted indexes, 70–80% is common. Monitor the insert and split rates from index statistics to tune over time.

Why is bulk loading faster than incremental inserts for building a B-tree index?

Incremental inserts trigger individual splits, generate WAL records per insert, and cause random I/O as each new key potentially touches a different leaf page. Bulk loading pre-sorts the keys, then fills leaf pages sequentially from left to right — no splits occur because pages are filled to the fill factor and no further. Internal nodes are built in a single bottom-up pass. The result is a perfectly packed tree built with O(N) sequential I/Os after the initial sort. PostgreSQL's CREATE INDEX uses this approach and is orders of magnitude faster than inserting rows one at a time.

How is fill factor used during bulk loading?

Bulk loading sorts the keys and fills leaf pages only to the fill factor (e.g., 70%), leaving the remaining space free. This reserves room for future inserts near the bulk-loaded keys, deferring splits until the reserved space is exhausted.

How does a clustered index differ from a secondary index in a B-tree?

A clustered (primary key) index stores the actual row data in its leaf nodes, physically ordering rows by the index key. A secondary index stores only the indexed key and a row pointer (a heap TID in PostgreSQL, the primary key value in InnoDB), so fetching the full row requires a second lookup.
