Low Level Design: Database Index Selection Strategy

⏱ 4 min read

Database indexes dramatically accelerate query performance by providing direct access paths to rows, avoiding full table scans. However, indexes consume storage, slow down writes (every INSERT/UPDATE/DELETE must maintain each index), and can mislead the query optimizer if poorly chosen. The goal of index design is to maximize read performance for the most frequent, most latency-sensitive queries while minimizing write overhead and storage cost.

When Indexes Help and When They Don’t

Indexes help when: a query filters by the indexed column with high selectivity (WHERE user_id = 42 — returns few rows from a large table), queries sort by an indexed column (ORDER BY created_at DESC — the index provides the sorted order), and range queries on the indexed column (WHERE created_at BETWEEN a AND b). Indexes do not help when: the query returns most of the table rows (the query planner skips the index and does a sequential scan — faster for large result fractions), the column has low cardinality (WHERE status = ‘active’ when 90% of rows are active — index lookup + heap fetch is slower than a sequential scan), and the indexed column is never used in WHERE, JOIN, or ORDER BY clauses.

Composite Index Column Order

Composite index column order determines which queries the index can serve. An index on (user_id, created_at) can satisfy: WHERE user_id = 42 (left prefix), WHERE user_id = 42 AND created_at > T (full index use), and ORDER BY user_id, created_at (sorted retrieval). It cannot satisfy: WHERE created_at > T without user_id (no leftmost prefix — index cannot be used efficiently). Rule: put the equality predicate columns first (user_id = ?), then the range predicate columns (created_at > ?). High-cardinality equality columns before low-cardinality — this narrows the search space most aggressively at each level of the index tree.

Covering Indexes

A covering index includes all columns referenced by a query — the query engine reads only the index without touching the heap (the actual table rows). Example: SELECT user_id, email FROM users WHERE tenant_id = 5 ORDER BY created_at DESC. Index on (tenant_id, created_at, user_id, email) — all four columns referenced by the query are in the index. The query is answered entirely from the index tree with no heap access (index-only scan in PostgreSQL). Covering indexes are particularly valuable for: high-read, frequently-executed queries, queries over large tables where heap access is expensive, and indexes that are frequently filtered through the visibility map (PostgreSQL index-only scan requires all rows to be visible in the visibility map).

Partial Indexes

A partial index indexes only a subset of rows matching a predicate: CREATE INDEX active_orders_user ON orders(user_id) WHERE status = ‘active’. Only active orders are in the index — for a table with 10M orders where 99% are historical (completed/cancelled), the partial index is 100x smaller than a full index. Queries like WHERE user_id = 42 AND status = ‘active’ use the partial index efficiently. Use cases: soft-deleted rows (WHERE deleted_at IS NULL — index only non-deleted rows), unprocessed job queue (WHERE processed = false — index only pending jobs), recent data (WHERE created_at > NOW() – INTERVAL ’30 days’ — index only recent rows). Partial indexes reduce storage and maintenance cost while maintaining full query performance for the targeted subset.

Index Bloat and Maintenance

B-tree indexes accumulate dead entries when rows are updated or deleted — VACUUM reclaims them. Index pages that are never fully reclaimed fragment over time (index bloat). Monitor index bloat: pgstattuple extension (SELECT * FROM pgstattuple(‘index_name’)); alert when dead_leaf_percent > 20%. REINDEX rebuilds a bloated index (requires a lock in PostgreSQL = 12 rebuilds without lock). Unused indexes: every index is a write overhead cost. Identify unused indexes: SELECT * FROM pg_stat_user_indexes WHERE idx_scan = 0; drop indexes with zero scans since the last stats reset. Duplicate indexes: two indexes on (a, b) and (a) — the (a, b) index satisfies all queries that (a) would serve; the (a) index is redundant.

Index Types: B-tree vs Hash vs GIN vs GiST

B-tree: the default; supports equality, range, sorting, and LIKE ‘prefix%’ queries. Best for most cases. Hash: O(1) equality lookup; no support for ranges or sorting; rarely better than B-tree in practice. GIN (Generalized Inverted Index): for multi-valued data — full-text search (tsvector), JSONB containment (@>), array @> operators. One index entry per element of the multi-valued column. GiST (Generalized Search Tree): for geometric types (PostGIS), range types, and full-text search. More flexible than GIN for geometric operations. BRIN (Block Range Index): compact index for naturally ordered large tables (time-series, insert-order data) — stores min/max per page range; 1000x smaller than B-tree but only useful for sorted data. Bloom index: space-efficient for multi-column equality queries (WHERE a = ? AND b = ? AND c = ?).

Index Selection Process

Systematic index selection: (1) Identify the top N slowest queries (pg_stat_statements, slow query log) and the top N highest-volume queries. (2) For each query, run EXPLAIN ANALYZE and look for sequential scans on large tables. (3) Identify the filter columns (WHERE, JOIN ON), sort columns (ORDER BY), and return columns (SELECT) of each query. (4) Design a composite index that satisfies the query: equality filters first, then range filters, then ORDER BY columns, then SELECT-only columns (for covering). (5) Measure: EXPLAIN ANALYZE before and after; compare execution time and rows examined. (6) Deploy CONCURRENTLY (CREATE INDEX CONCURRENTLY — no table lock). (7) Monitor: confirm pg_stat_user_indexes shows the new index is being used.