Low Level Design: Database Indexing Strategies

⏱ 7 min read

B-Tree Index Internals

The B-tree (balanced tree) is the default index type in PostgreSQL, MySQL, and most relational databases. Understanding its structure explains why indexes help certain queries and not others:

Structure: A balanced tree of pages (typically 8KB each). Internal nodes store separator keys and child pointers. Leaf nodes store index key values and pointers to heap (table) rows (TIDs: tuple identifiers = page number + offset).
Leaf node linked list: Leaf nodes are linked in sorted order. This makes range scans efficient: after finding the start key via tree traversal (O(log N)), the scan follows the linked list without re-traversing the tree.
Lookup complexity: O(log N) for a point lookup. For a table with 1 billion rows, a B-tree with 8KB pages and 8-byte keys has a height of ~4-5 levels — only 4-5 page reads to find any row.
Write cost: INSERT/UPDATE/DELETE must update the index. Page splits occur when a page is full, requiring a write to the parent node. This is why heavily indexed tables have slower writes.

Composite Index Column Order

A composite index on (a, b, c) sorts rows first by a, then by b within each a value, then by c. The leftmost prefix rule governs which queries can use this index:

Queries filtering on a alone: use the index (prefix scan).
Queries filtering on a AND b: use the index efficiently.
Queries filtering on a AND b AND c: use the index fully.
Queries filtering on b alone (skipping a): cannot use this index — the index is sorted by a first, so b values are scattered throughout.

Column ordering rules:

Equality columns before range columns: If your query is WHERE a = 5 AND b > 10, the index should be (a, b), not (b, a). After satisfying the equality on a, the index can scan the range on b within that a partition.
If both columns have range conditions: Only the first range column can be used for an index range scan; subsequent range conditions must be filtered row-by-row.
Most selective first is a rule of thumb for equality columns — but only when queries don’t always include all columns. Analyze actual query patterns, not just selectivity in isolation.

Covering Indexes

A covering index contains all columns needed to satisfy a query — the query planner performs an index-only scan and never touches the heap (table data pages). This eliminates the most expensive part of a typical index scan: the random I/O to fetch heap pages.

-- Query:
SELECT email, created_at FROM users WHERE tenant_id = 42;

-- Covering index (PostgreSQL INCLUDE syntax):
CREATE INDEX ON users(tenant_id) INCLUDE (email, created_at);
-- tenant_id is the search key; email and created_at are stored in leaf nodes
-- but not part of the sort key, so they don't bloat internal nodes.

The INCLUDE clause (PostgreSQL 11+) adds non-key columns to leaf nodes only. This avoids the overhead of sorting on those columns while still covering the query. Without INCLUDE, you would add all columns to the key — which forces the index to sort on them, changing which queries benefit from the index. Use covering indexes for high-frequency queries on large tables where the heap access is the bottleneck.

Partial Indexes

A partial index indexes only the rows matching a WHERE predicate. This produces a smaller index that fits in memory better and is faster to scan:

-- Index only pending orders (a small fraction of the orders table):
CREATE INDEX ON orders(user_id) WHERE status = 'pending';

-- Index only active (non-deleted) users:
CREATE INDEX ON users(email) WHERE deleted_at IS NULL;

For the query to use a partial index, the query’s WHERE clause must logically imply the index predicate. PostgreSQL’s query planner checks this: WHERE status = 'pending' AND user_id = 42 implies status = 'pending', so it can use the partial index above. Partial indexes are ideal for skewed distributions: if 95% of orders are in terminal states (shipped, cancelled) and only 5% are pending, a partial index on pending orders is 20x smaller than a full index and covers the most common access pattern (processing the work queue).

Expression Indexes

An expression index indexes the result of a function applied to a column. This allows indexes to support queries on computed values:

-- Case-insensitive email lookup:
CREATE INDEX ON users(lower(email));
-- Query must use the same expression:
SELECT * FROM users WHERE lower(email) = lower('Alice@Example.com');

-- Date extraction:
CREATE INDEX ON events(date_trunc('day', created_at));
SELECT * FROM events WHERE date_trunc('day', created_at) = '2026-04-18';

The critical requirement: the query must use the exact same expression as the index definition for the planner to recognize the match. Expression indexes are immutable — the indexed function must be deterministic (no now(), no random functions). They add write overhead: every insert/update must evaluate the function and update the index.

GIN Indexes for Arrays and JSONB

GIN (Generalized Inverted Index) is designed for data types that contain multiple values — arrays, JSONB, full-text search vectors (tsvector). A GIN index is an inverted index: it maps each element value to the set of rows (TIDs) containing it.

-- Index a JSONB column for key/value containment queries:
CREATE INDEX ON products USING GIN (attributes);
-- Efficient query:
SELECT * FROM products WHERE attributes @> '{"color": "red", "size": "L"}';

-- Index an integer array column:
CREATE INDEX ON posts USING GIN (tag_ids);
-- Efficient query (rows where tag_ids contains 42):
SELECT * FROM posts WHERE tag_ids @> ARRAY[42];

GIN index builds are slower and the index is larger than B-tree, but containment (@>), overlap (&&), and membership (@) queries on indexed columns are dramatically faster than sequential scans. GIN indexes support gin_pending_list_limit (a write buffer to batch updates) to reduce write amplification.

GiST Indexes for Geometric and Range Types

GiST (Generalized Search Tree) is a framework for building indexes on complex data types using a balanced tree structure with user-defined search strategies. It supports types where B-tree ordering doesn’t apply:

Geometric types: Point, polygon, circle — with operators like <-> (distance), && (overlap), @> (contains). PostGIS uses GiST for spatial indexing — essential for ST_DWithin, ST_Intersects, and bounding box queries.
Range types: daterange, tsrange, int4range — with operators for overlap and containment. A GiST index on a daterange column supports efficient overlap queries: find all reservations that overlap with a requested date range.
Full-text search: GiST can also index tsvector (slower to build but supports lossy storage via signature trees; GIN is usually preferred for text search).

Bitmap Index Scan

PostgreSQL can combine multiple indexes using bitmaps when a query has AND/OR conditions on different indexed columns:

Each index is scanned independently, producing a bitmap of matching heap TIDs (one bit per heap page, or exact TIDs in "exact mode").
Bitmaps are combined with AND (intersection) or OR (union) using bitwise operations.
The resulting bitmap is used to fetch heap pages once, in physical order — minimizing random I/O compared to two independent index scans each fetching heap pages.

Bitmap scans appear in EXPLAIN output as Bitmap Index Scan + BitmapAnd/BitmapOr + Bitmap Heap Scan. This allows the planner to use two specialized single-column indexes instead of requiring a composite index for every combination of query predicates — a useful technique for ad-hoc reporting queries with varying filter conditions.

Index Bloat and Maintenance

PostgreSQL uses MVCC (Multi-Version Concurrency Control): UPDATE and DELETE create new heap row versions rather than modifying rows in place. Old versions become dead tuples. Index entries continue pointing to dead heap rows until VACUUM cleans them up.

Index bloat: Dead index entries accumulate. The index grows larger than needed, consuming more memory in the buffer cache and slowing scans. On tables with frequent UPDATE/DELETE, indexes can become 2-5x their logical size.
VACUUM: Standard VACUUM reclaims dead heap tuples and marks their index entries as dead, allowing space reuse. autovacuum runs VACUUM automatically based on dead tuple thresholds. Monitor with pg_stat_user_tables.n_dead_tup.
REINDEX: For severe bloat, REINDEX INDEX CONCURRENTLY rebuilds the index from scratch without locking the table. Produces a compact index at the cost of a full table scan. Schedule during low-traffic windows.

Query bloat percentage: SELECT * FROM pgstattuple('index_name') shows dead leaf percentage. More than 20-30% dead tuples is typically a signal to reindex or tune autovacuum aggressiveness for that table.

EXPLAIN ANALYZE: Reading Query Plans

EXPLAIN ANALYZE executes the query and returns the actual query plan with both estimated and actual row counts and timing. Key things to look for:

Seq Scan vs Index Scan: A sequential scan reads all table pages. Appropriate for small tables or queries returning a large fraction of rows. An Index Scan follows the B-tree and fetches heap pages per matching row — best for selective queries. A Bitmap Heap Scan batches heap fetches — best for moderately selective queries.
Cost model: (cost=startup..total). The planner uses statistics (row counts, column histograms in pg_statistic) to estimate costs. Costs are in arbitrary units where sequential page read = 1.0, random page read = 4.0 (configurable).
Actual rows vs estimated rows: A large discrepancy (e.g., estimated 1 row, actual 50,000 rows) indicates a bad cardinality estimate — often due to correlated columns or outdated statistics. Run ANALYZE tablename to refresh statistics. For correlated columns (city, state), create extended statistics: CREATE STATISTICS ON orders(city, state).
Loops: Nested loop joins show actual rows=X loops=Y. The actual total rows processed is X * Y. High loop counts with index scans inside can still be expensive if Y is large — consider a hash join.

Use EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) to also see buffer cache hits vs disk reads (Buffers: shared hit=X read=Y). A high read count on a query that should be cached indicates memory pressure or a missing index forcing large heap fetches.