Database indexes are data structures that allow the database engine to find rows matching a query condition without scanning the entire table. Without indexes, every query performs a full table scan — O(n) where n is table size. With the right index, queries become O(log n) or O(1). PostgreSQL, MySQL, and other databases support multiple index types: B-tree (default, range queries), hash (equality only), GIN (full-text, arrays), GiST (geographic, range types), and BRIN (correlation-based, append-only tables). Designing indexes correctly is one of the most impactful performance optimizations available and a common system design interview topic.
B-Tree Index Internals
A B-tree index maintains a sorted tree structure where leaf nodes contain index key values and pointers (heap tuple IDs) to the corresponding table rows. Internal nodes contain separator keys for navigating the tree. B-trees are balanced — all leaf nodes are at the same depth, ensuring O(log n) search time regardless of data distribution. B-tree indexes support: equality lookups (WHERE id = 42), range queries (WHERE age BETWEEN 20 AND 30), prefix matching (WHERE name LIKE 'John%'), and ORDER BY without sorting (the index is already sorted). B+ trees (used by PostgreSQL, MySQL InnoDB): only leaf nodes store data pointers; leaf nodes are linked as a doubly-linked list enabling efficient range scans without backtracking up the tree.
-- Index design examples
-- Composite index: column order matters
-- (user_id, created_at) vs (created_at, user_id)
-- Good for: WHERE user_id = 123 ORDER BY created_at DESC → uses index fully
-- Good for: WHERE user_id = 123 → uses index (left prefix)
-- Bad for: WHERE created_at > NOW() - INTERVAL 1 day → cannot use left prefix
CREATE INDEX idx_posts_user_date ON posts (user_id, created_at DESC);
-- Covering index: index contains all columns needed by the query
-- Query reads index-only, never touches table heap (faster)
CREATE INDEX idx_orders_covering ON orders (user_id, status)
INCLUDE (amount, created_at);
-- SELECT amount, created_at FROM orders WHERE user_id=1 AND status='paid'
-- → Index-only scan, no heap access
-- Partial index: index only a subset of rows
-- 90% of orders are completed -- only index pending orders
CREATE INDEX idx_orders_pending ON orders (created_at)
WHERE status = 'pending';
-- Expression index: index on computed value
CREATE INDEX idx_users_lower_email ON users (LOWER(email));
-- WHERE LOWER(email) = LOWER('User@Example.COM') → uses index
Index Selection and Query Planning
The query planner decides whether to use an index or perform a table scan based on cost estimates. For low-selectivity predicates (e.g., WHERE status = 'active' when 95% of rows are active), a full table scan is cheaper than an index scan (index scan requires random heap access per matching row). For high-selectivity predicates (e.g., WHERE user_id = 12345), an index scan is much cheaper. EXPLAIN ANALYZE shows the actual execution plan and costs. Table statistics (collected by ANALYZE) inform the planner’s row count estimates — stale statistics cause poor plan choices. Key index anti-patterns: indexing a column used only in SELECT (not WHERE/JOIN/ORDER BY); creating duplicate indexes; over-indexing write-heavy tables (each index adds write overhead).
Specialized Index Types
Hash index: O(1) equality lookup but no range queries, no ordering. PostgreSQL hash indexes are now crash-safe (PG10+). Best for columns with very high cardinality used only in equality predicates. GIN (Generalized Inverted Index): indexes composite values — full-text search (stores per-word posting lists), JSONB key search (indexes all keys), array element search. Used by PostgreSQL full-text search and JSONB containment queries. GiST (Generalized Search Tree): supports geometric operators (PostGIS geometry intersects), range type overlaps, nearest-neighbor search. BRIN (Block Range Index): stores min/max per block of heap pages. Extremely small (a few KB for billions of rows) for naturally ordered data (append-only time-series). Efficiently skips blocks that cannot satisfy a range predicate.
Key Interview Discussion Points
- Left-prefix rule for composite indexes: a composite index (a, b, c) can be used for queries filtering on a, (a,b), or (a,b,c) — but NOT on b alone or c alone; the leftmost column must be present in the predicate for the index to be usable
- Index bloat: PostgreSQL B-tree indexes accumulate dead tuples from UPDATE and DELETE (MVCC); VACUUM reclaims space; REINDEX rebuilds the index from scratch to eliminate bloat
- Clustered index (InnoDB primary key): InnoDB stores table data sorted by primary key (clustered); secondary indexes store the primary key value as a pointer (double lookup for secondary index reads); choose the primary key carefully — it determines physical row order
- Index-only scan: if the index contains all columns needed by the query (covering index), PostgreSQL can return results without accessing the table heap — eliminates random I/O for each matching row
- Concurrent index creation: CREATE INDEX CONCURRENTLY builds the index without blocking writes (takes longer, uses more resources) — essential for production tables where downtime is unacceptable