System Design Interview: Database Indexing and Query Optimization

Why Indexing Matters

Without an index, a SELECT with a WHERE clause requires a sequential scan — reading every row in the table (O(N)). For a table with 100 million rows, this can mean reading tens or hundreds of GBs of data per query. A B-tree index reduces the lookup to O(log N) — about 27 key comparisons for 100 million rows, spread across only 3-4 tree levels thanks to high node fanout. The difference between a 10-second sequential scan and a 1-millisecond index lookup on a single query is the difference between a working product and an unusable one. Understanding when and how indexes work is the most practical database skill for system design interviews.
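The arithmetic behind those numbers is easy to check. A quick back-of-envelope sketch (the 500-keys-per-page fanout is an assumed typical value, not tied to any particular database):

```python
import math

rows = 100_000_000

# Sequential scan: every row is examined.
seq_scan_reads = rows

# Binary search over the keyspace costs log2(N) key comparisons.
comparisons = math.ceil(math.log2(rows))

# But a B-tree page holds hundreds of keys, so the tree itself is
# shallow: depth is log base fanout, not log base 2.
fanout = 500
levels = math.ceil(math.log(rows, fanout))

print(comparisons, levels)  # → 27 3
```

The same ~27 comparisons happen either way; the fanout just packs them into ~3 page reads instead of 27.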

B-Tree Index Internals

PostgreSQL, MySQL InnoDB, and most OLTP databases use B-tree (B+ tree) indexes. The index is a balanced tree where: leaf nodes contain indexed values + pointers to heap rows; internal nodes guide the search; all leaf nodes are linked in sorted order (enabling range scans).


-- Creating an index:
CREATE INDEX idx_users_email ON users(email);

-- What the B-tree looks like (simplified):
Internal node: [alice@ | bob@ | carol@]
Leaf nodes:    [alice@example.com → row(5)]
               [bob@example.com → row(12)]
               [carol@example.com → row(3)]

-- Index lookup for exact match:
SELECT * FROM users WHERE email = 'bob@example.com'
-- 1. Binary search internal nodes: O(log N)
-- 2. Follow pointer to heap row: O(1)
-- Total: O(log N) — ~27 comparisons for 100M rows

-- Range scan:
SELECT * FROM users WHERE email BETWEEN 'a' AND 'c'
-- 1. Find starting leaf node
-- 2. Follow linked list through leaf nodes until end of range
-- Efficient even for large ranges
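The exact-match and range-scan behavior above can be mimicked with a sorted list and binary search — a toy stand-in for the leaf level of a B-tree (illustrative only; real leaves are linked disk pages, and the data here is made up):

```python
from bisect import bisect_left, bisect_right

# Leaf-level entries: (indexed value, pointer to heap row), kept sorted.
leaves = [
    ("alice@example.com", 5),
    ("bob@example.com", 12),
    ("carol@example.com", 3),
]
keys = [k for k, _ in leaves]

def lookup(email):
    """Exact match: binary search, O(log N) comparisons."""
    i = bisect_left(keys, email)
    if i < len(keys) and keys[i] == email:
        return leaves[i][1]          # heap row pointer
    return None

def range_scan(lo, hi):
    """Range scan: find the starting leaf, then walk forward in sorted order."""
    return leaves[bisect_left(keys, lo):bisect_right(keys, hi)]

print(lookup("bob@example.com"))     # → 12
print(range_scan("a", "c"))          # alice@ and bob@ ('carol@…' sorts after 'c')
```

The key property is that a range scan pays the binary-search cost once, then reads matches sequentially — which is why BETWEEN stays efficient even for large ranges.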

Composite Index Column Ordering

Composite indexes sort by the first column, then the second, and so on — like a phone book (sorted by last name, then first name). This means the leftmost prefix of the index must match your query for the index to be used.


CREATE INDEX idx_orders_user_status ON orders(user_id, status, created_at);

-- Uses index (leftmost prefix matches):
SELECT * FROM orders WHERE user_id = 42                        -- ✓ (user_id)
SELECT * FROM orders WHERE user_id = 42 AND status = 'paid'   -- ✓ (user_id, status)
SELECT * FROM orders WHERE user_id = 42 AND status = 'paid'
  AND created_at > '2024-01-01'                               -- ✓ (all three columns)

-- Cannot use the index efficiently (no leftmost prefix):
SELECT * FROM orders WHERE status = 'paid'                     -- ✗ (no user_id prefix)
SELECT * FROM orders WHERE created_at > '2024-01-01'           -- ✗ (no prefix)

-- Ordering rule: put equality predicates first, range predicates last
-- BAD: (created_at, user_id) — range on first column breaks prefix for user_id
-- GOOD: (user_id, status, created_at) — equality first, range last
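The leftmost-prefix rule falls out of how the entries are physically sorted. A small illustration with Python tuples (hypothetical data):

```python
from bisect import bisect_left

# Composite index entries, sorted like (user_id, status, created_at).
entries = sorted([
    (42, "paid",    "2024-02-01"),
    (7,  "paid",    "2024-03-01"),
    (42, "pending", "2024-01-15"),
    (99, "paid",    "2024-01-02"),
])

# Equality on the leading column: matches form one contiguous run,
# reachable by binary search — this is what "uses the index" means.
i = bisect_left(entries, (42,))
run = []
while i < len(entries) and entries[i][0] == 42:
    run.append(entries[i])
    i += 1

# Filtering on status alone: 'paid' rows are scattered across the whole
# sorted order, so every entry must be inspected (no prefix → scan).
paid = [e for e in entries if e[1] == "paid"]
print(run)   # both user-42 rows, adjacent
print(paid)  # rows from users 7, 42, 99 — not contiguous
```

Sorting by (user_id, …) puts all of user 42's rows next to each other; nothing groups the 'paid' rows, so a status-only predicate cannot avoid scanning.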

Covering Indexes

A covering index includes all columns needed by the query, so results can be served entirely from the index without touching the heap. Eliminating the heap access step is a significant win for queries that match many rows on large tables, where heap rows are scattered across disk and each fetch is a random I/O.


-- Query:
SELECT user_id, status, total FROM orders WHERE user_id = 42 ORDER BY created_at;

-- Standard index (user_id): must fetch row from heap for each result
CREATE INDEX idx_orders_user ON orders(user_id);
-- For 1000 matching rows: 1000 random heap reads (potentially 1000 I/O ops)

-- Covering index: all needed columns in the index
CREATE INDEX idx_orders_covering ON orders(user_id, created_at)
INCLUDE (status, total);
-- PostgreSQL INCLUDE: key columns + non-key columns stored in leaf
-- For 1000 matching rows: 0 heap reads — all data from index

-- Covering index in MySQL:
CREATE INDEX idx_covering ON orders(user_id, created_at, status, total);
-- MySQL InnoDB always includes primary key in secondary indexes

EXPLAIN ANALYZE: Reading Query Plans


EXPLAIN ANALYZE
SELECT o.*, u.name
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status = 'pending' AND o.created_at > NOW() - INTERVAL '7 days'
ORDER BY o.created_at DESC
LIMIT 100;

-- Sample output:
Limit (cost=150.45..150.70 rows=100) (actual time=12.3..12.4 rows=100)
  -> Sort (cost=... actual time=12.2..12.3 rows=100)
       Sort Key: o.created_at DESC
       Sort Method: top-N heapsort  Memory: 25kB
    -> Hash Join (cost=85..1200 actual time=2.1..11.5 rows=1847)
         Hash Cond: (o.user_id = u.id)
          -> Index Scan using idx_orders_status on orders o
               (cost=0.43..950 actual time=0.08..8.2 rows=1847)
               Index Cond: (status = 'pending')
               Filter: (created_at > (now() - '7 days'::interval))
               Rows Removed by Filter: 12450
         -> Hash (cost=45..45 rows=3200 actual time=1.8..1.8 rows=3200)
               Buckets: 4096 Batches: 1 Memory Usage: 215kB
               -> Seq Scan on users u (...)

-- Key things to look for:
-- "Seq Scan" on large tables: missing index
-- "Rows Removed by Filter": high number → index not selective enough
-- "Hash Join" vs "Nested Loop": hash join for large tables, nested loop for small
-- actual time vs estimated cost: large discrepancy → outdated statistics (run ANALYZE)
-- "Sort Method: external merge Disk": sort spilling to disk → increase work_mem

Index Types

  • Partial index: index only rows matching a condition. Smaller and faster for queries on a subset. CREATE INDEX idx_pending ON orders(created_at) WHERE status = 'pending'. Only useful if the query also includes the WHERE condition.
  • Expression index (functional index): index the result of a function. CREATE INDEX idx_lower_email ON users(lower(email)). Enables case-insensitive lookups without full table scans.
  • GIN index: generalized inverted index for composite types — JSONB, arrays, full-text search vectors. CREATE INDEX idx_tags ON posts USING GIN(tags). Enables WHERE tags @> ARRAY['go'] without a full scan (note: 'go' = ANY(tags) does not use a GIN index).
  • GiST index: generalized search tree for geometric types, range types, and full-text search. Used for PostGIS spatial queries.
  • BRIN index (Block Range Index): very small index for naturally ordered data (time-series). Stores min/max of each block range. 1/1000th the size of a B-tree, effective for timestamp columns where rows are appended in order.
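The BRIN idea from the last bullet is simple enough to sketch: store only a (min, max) summary per block of rows, then scan just the blocks whose range overlaps the query. A toy model (not PostgreSQL's actual on-disk format; block size and data are made up):

```python
# A naturally ordered column, e.g. append-only timestamps.
BLOCK = 4
timestamps = list(range(32))

# BRIN-style summaries: (block start offset, min value, max value).
ranges = []
for i in range(0, len(timestamps), BLOCK):
    blk = timestamps[i:i + BLOCK]
    ranges.append((i, min(blk), max(blk)))

def candidate_blocks(lo, hi):
    """Blocks whose [min, max] overlaps [lo, hi]; only these get scanned."""
    return [start for start, mn, mx in ranges if mx >= lo and mn <= hi]

print(candidate_blocks(10, 13))  # → [8, 12]  (two blocks out of eight)
```

The index holds one tuple per block instead of one entry per row — hence the tiny size — but it only prunes well when row order correlates with the column's values.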

Common Performance Anti-Patterns


-- 1. Function on indexed column (prevents index use)
-- BAD:
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- FIX: use expression index OR store email lowercased

-- 2. Leading wildcard (prevents index use)
-- BAD:
SELECT * FROM products WHERE name LIKE '%widget%';
-- FIX: use full-text search (GIN index on tsvector) or Elasticsearch

-- 3. OR with non-indexed columns
-- BAD:
SELECT * FROM orders WHERE status = 'paid' OR notes LIKE '%urgent%';
-- FIX: split into two queries and UNION, or add index on notes

-- 4. Implicit type conversion
-- BAD (user_id is INT, passing VARCHAR):
SELECT * FROM orders WHERE user_id = '42';
-- PostgreSQL usually coerces the quoted literal and still uses the index;
-- but implicit coercion of the COLUMN side (e.g. MySQL comparing a VARCHAR
-- column to a number) forces a per-row conversion that prevents index use
-- FIX: use the correct type: WHERE user_id = 42

-- 5. SELECT * (over-fetching columns)
-- Prevents covering index optimization, transfers unnecessary data over network
-- FIX: SELECT only needed columns

-- 6. N+1 query problem
-- BAD: fetch 100 orders, then query user for each
orders = db.query("SELECT * FROM orders LIMIT 100")
for order in orders:
    user = db.query("SELECT * FROM users WHERE id = %s", order.user_id)  # N+1!
-- FIX: JOIN in the original query or use DataLoader batching
orders = db.query("""
    SELECT o.*, u.name FROM orders o
    JOIN users u ON o.user_id = u.id
    LIMIT 100
""")

LSM Tree Indexes (NoSQL)

Cassandra, RocksDB, LevelDB, and ScyllaDB use LSM (Log-Structured Merge) trees instead of B-trees. LSM is optimized for write-heavy workloads: all writes go to an in-memory memtable (no random I/O). When the memtable is full, it is flushed to disk as an immutable SSTable (Sorted String Table). Reads must check the memtable and potentially multiple SSTables (expensive). Compaction periodically merges SSTables, removing deleted entries and reducing read amplification.


# LSM write path (fast):
1. Append to commit log (WAL) for durability: 1 sequential disk write
2. Insert into the in-memory memtable: O(log N) in a skiplist or balanced tree — no random I/O
3. Return success to client

# LSM read path (potentially slow):
1. Check memtable (most recent writes)
2. Check Bloom filter for each SSTable (probabilistic: is key in this SSTable?)
3. Binary search SSTable(s) where Bloom filter says yes
# Read amplification: may check many SSTables; mitigated by Bloom filters and tiered compaction

# B-tree vs LSM trade-offs:
# B-tree: O(log N) reads and writes, with random I/O on writes — good for OLTP
# LSM: cheap sequential writes (flush cost amortized across many keys), amplified reads — good for time-series, logging
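The write and read paths above can be condensed into a toy LSM (assumed simplifications: a plain dict memtable, no WAL, no Bloom filters, no compaction — just the memtable/SSTable layering):

```python
MEMTABLE_LIMIT = 2  # tiny, to force a flush in the demo

class LSM:
    def __init__(self):
        self.memtable = {}
        self.sstables = []          # newest first; each is an immutable snapshot

    def put(self, key, value):
        self.memtable[key] = value  # in-memory write: no random disk I/O
        if len(self.memtable) >= MEMTABLE_LIMIT:
            # Flush: memtable becomes an immutable, sorted SSTable.
            self.sstables.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:    # 1. memtable holds the most recent writes
            return self.memtable[key]
        for sst in self.sstables:   # 2. then SSTables, newest to oldest
            if key in sst:          #    (a Bloom filter would skip most of these)
                return sst[key]
        return None

db = LSM()
db.put("a", 1)
db.put("b", 2)                      # memtable full → flushed to an SSTable
db.put("a", 3)                      # newer value shadows the SSTable entry
print(db.get("a"), db.get("b"))     # → 3 2
```

Note that "a" is never updated in place: the old value sits in the SSTable until compaction would discard it — the read path's newest-first order is what makes the newer write win.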

Interview Questions

  • A query that was running in 10ms suddenly takes 30 seconds — walk me through your debugging process
  • How would you index a table to optimize both WHERE user_id = X and WHERE user_id = X AND status = 'active' queries?
  • What is a covering index and when does it help?
  • When would you choose an LSM-based database over a B-tree database?
  • How do you handle indexes in a write-heavy system where index maintenance is a bottleneck?

Frequently Asked Questions

Why does the order of columns in a composite index matter?

A composite index sorts data by the first column, then the second within ties, then the third, and so on — exactly like a phone book sorted by last name, then first name. The database can only use the index efficiently if the query's predicates match the index columns from left to right (the leftmost prefix rule). If you have an index on (user_id, status, created_at), a query filtering on user_id can use the index. A query filtering on user_id AND status can use the index for both columns. A query filtering only on status cannot use the index at all — there is no user_id prefix. This is why column ordering is critical: put equality columns first (columns used in WHERE col = value) and range columns last (WHERE col > value, WHERE col BETWEEN). A range predicate on a column stops the index from being useful for the columns after it. Example: for the query WHERE user_id = 42 AND status > 'n', the index (user_id, status) is the right shape — the equality on user_id narrows the search to one contiguous run, then the range on status scans within it; ordering the index (status, user_id) would put the range column first and break the prefix for user_id. The practical rule: for a query with multiple predicates, design the index in the order: (1) equality predicates, (2) sort-by column, (3) range predicate. This rule is sometimes called the Equality-Sort-Range (ESR) rule.

What is a covering index and when does it eliminate heap access?

A covering index is an index that contains all the columns needed to satisfy a query — the query planner can return results directly from the index without accessing the main table (heap). A standard index stores the indexed column values and pointers (row addresses) to the actual table rows. When a query needs columns beyond those indexed, the database follows each pointer to the heap to fetch the full row — this is called a "heap fetch" or "table access by rowid." For large tables with rows scattered across many pages, these random heap accesses are expensive I/O operations. A covering index eliminates this: if the index includes all columns the query selects and filters on, the planner uses an index-only scan (PostgreSQL) or covering index scan (MySQL). PostgreSQL's INCLUDE clause adds non-key columns to the index leaf pages without affecting ordering. Example: CREATE INDEX idx_covering ON orders(user_id, created_at) INCLUDE (status, total). A query SELECT status, total FROM orders WHERE user_id = 42 ORDER BY created_at can be served entirely from the index. The downside: covering indexes are larger and slower to write (more data to maintain on every INSERT/UPDATE/DELETE). The trade-off is worth it for frequently-executed, read-heavy queries on large tables where random heap I/O is measured to be the bottleneck.

How do you debug a suddenly slow database query?

A systematic approach: (1) Run EXPLAIN ANALYZE on the slow query. Look for: Sequential Scan on a large table (needs an index); high "Rows Removed by Filter" (index not selective enough); time discrepancy between estimated and actual rows (stale statistics — run ANALYZE table_name); "Sort Method: external merge Disk" (spilling to disk — increase work_mem). (2) Check if an index exists and is being used: run EXPLAIN and look for "Index Scan" vs "Seq Scan". If an index exists but is not used, the planner may have decided a sequential scan is cheaper (this can happen if the index is not selective enough — retrieving 50% of rows is faster with a seq scan). (3) Check table statistics freshness: SELECT last_analyze, n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'orders'. High n_dead_tup means the table needs VACUUM; stale statistics from autovacuum lag cause bad plans. Run ANALYZE manually if needed. (4) Check for lock contention: SELECT * FROM pg_locks JOIN pg_stat_activity USING (pid) WHERE wait_event IS NOT NULL. A query waiting on a lock appears slow but is not actually executing. (5) Check for index bloat: deleted rows leave dead index entries that must be scanned. REINDEX rebuilds the index; CREATE INDEX CONCURRENTLY creates a new index without locking. (6) If the query recently changed (new code path, different parameter values), check if the parameter value is leading to a different plan (parameter sniffing). PostgreSQL's plan cache can cache a plan optimized for one value that performs poorly for another.

