System Design: Database Indexing Strategies — B-Tree, Hash, GIN, Partial, Composite, Covering Index, Query Performance

Database indexes are the single most impactful tool for query performance. A missing index can make a query 1000x slower, while the wrong index wastes storage and slows writes. This guide provides a comprehensive overview of indexing strategies — which index types exist, when to use each, how composite indexes work, and common indexing mistakes — essential knowledge for backend engineering interviews and production database optimization.

How B-Tree Indexes Work

B-tree is the default and most versatile index type. It maintains a balanced tree of sorted keys with O(log N) lookup time. Each internal node contains keys and pointers to child nodes. Leaf nodes contain keys and pointers to the actual table rows (or the row data itself in clustered indexes). B-tree supports: equality (WHERE id = 5), range (WHERE price BETWEEN 10 AND 50), prefix (WHERE name LIKE “John%”), and ORDER BY (the index is already sorted). A B-tree index on a 10-million-row table has approximately 3-4 levels. A lookup traverses 3-4 nodes (3-4 disk reads), compared to a full table scan of potentially thousands of pages. Write overhead: every INSERT, UPDATE (on indexed column), and DELETE must also update the index. A table with 5 indexes requires 6 writes per insert (1 table + 5 indexes). This is why adding too many indexes slows write performance. Index size: typically 10-30% of the table size, depending on the indexed column type and table row size.

Composite (Multi-Column) Indexes

A composite index covers multiple columns. CREATE INDEX idx ON orders(user_id, created_at). The index is sorted first by user_id, then by created_at within each user_id. The leftmost prefix rule: the index is useful for queries that filter on the leftmost columns. This index supports: WHERE user_id = 5 (uses the first column), WHERE user_id = 5 AND created_at > “2026-01-01” (uses both columns), and WHERE user_id = 5 ORDER BY created_at (index provides sorting for free). It does NOT support: WHERE created_at > “2026-01-01” alone (the first column is not constrained — the database cannot use the index efficiently). Column ordering matters: put the most selective (highest cardinality) column first, or the column used in equality conditions first. Equality before range: if you filter WHERE user_id = 5 AND created_at BETWEEN x AND y, put user_id first (equality), then created_at (range). The index efficiently narrows to user_id = 5, then range-scans created_at within that user.

Covering Indexes and Index-Only Scans

A covering index includes all columns needed by a query, allowing the database to answer the query from the index alone without accessing the table. Example: SELECT user_id, created_at FROM orders WHERE user_id = 5. If the index is on (user_id, created_at), both the filter column and the selected columns are in the index. PostgreSQL calls this an “Index Only Scan” — it reads only the index, skipping the table entirely. This is 2-10x faster than an Index Scan (which reads the index to find row pointers, then reads the table for the selected columns). CREATE INDEX idx ON orders(user_id, created_at) INCLUDE (total_amount). The INCLUDE clause adds total_amount to the index leaf nodes without affecting the sort order. Now SELECT user_id, created_at, total_amount FROM orders WHERE user_id = 5 ORDER BY created_at is an Index Only Scan. Trade-off: covering indexes are larger (more data in the index), which increases storage and write overhead. Use them for critical, frequently executed queries where the performance improvement justifies the cost.

Partial and Expression Indexes

A partial index indexes only a subset of rows matching a condition. CREATE INDEX idx_pending ON orders(created_at) WHERE status = “pending”. This index is much smaller than a full index on created_at (if only 5% of orders are pending, the index is 20x smaller). Queries that include WHERE status = “pending” use this smaller, faster index. Queries without the status filter cannot use it. Use partial indexes when: queries frequently filter on a specific condition (active users, pending orders, unprocessed events), and the matching rows are a small fraction of the total. Expression indexes (functional indexes): index on a function of a column. CREATE INDEX idx_lower_email ON users(LOWER(email)). Without this, WHERE LOWER(email) = “user@example.com” cannot use a regular index on email (the function prevents index usage). The expression index pre-computes and stores the lowered values. PostgreSQL and SQLite support expression indexes. MySQL supports generated columns with indexes (similar effect). Common use: case-insensitive search, date extraction (CREATE INDEX idx ON events(DATE(created_at))), and JSON field extraction.

GIN, GiST, and Specialized Indexes

GIN (Generalized Inverted Index): indexes elements within composite values. Use for: full-text search (tsvector columns), JSONB containment (WHERE data @> value), and array operations (WHERE tags @> ARRAY[“python”]). GIN creates an entry for each element (each word in a text, each key in a JSON object, each element in an array). Lookups are fast but writes are slower (updating the inverted index). GiST (Generalized Search Tree): for geometric, range, and nearest-neighbor queries. Use for: PostGIS spatial data (find restaurants within 5 km), range types (overlapping time ranges for scheduling), and full-text search (alternative to GIN, smaller but slower). BRIN (Block Range INdex): extremely small indexes for physically sorted data. If the orders table is naturally sorted by created_at (recent orders are at the end of the table), a BRIN index stores the min/max created_at per block of pages. A range query WHERE created_at > “2026-04-01” skips blocks where max < "2026-04-01". BRIN indexes are 100-1000x smaller than B-tree but only effective when data is physically ordered.

Common Indexing Mistakes

Mistakes that hurt performance: (1) Missing indexes on foreign keys — JOIN operations on unindexed foreign keys cause full table scans on the joined table. Always index foreign key columns. (2) Too many indexes — each index slows writes and consumes storage. A table with 10 indexes may have 10x write amplification. Index only columns used in WHERE, JOIN, and ORDER BY. (3) Wrong column order in composite indexes — WHERE status = “active” AND created_at > “2026-01-01” needs an index on (status, created_at), not (created_at, status). Equality columns first, range columns last. (4) Indexing low-cardinality columns alone — an index on a boolean column (active: true/false) is rarely useful. The database reads half the table either way. Combine it in a partial index or composite index. (5) Not using EXPLAIN ANALYZE — the query planner may not use your index due to stale statistics, type mismatches, or function calls on indexed columns. Always verify with EXPLAIN ANALYZE. (6) Over-relying on ORM-generated queries — ORMs may generate N+1 queries or SELECT * that defeat index optimization. Review the SQL your ORM generates for critical queries.

{“@context”:”https://schema.org”,”@type”:”FAQPage”,”mainEntity”:[{“@type”:”Question”,”name”:”How does a composite index work and why does column order matter?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”A composite index on (user_id, created_at) sorts data first by user_id, then by created_at within each user_id. The leftmost prefix rule: the index is useful only for queries that constrain the leftmost columns. This index supports: WHERE user_id = 5 (first column), WHERE user_id = 5 AND created_at > date (both columns), WHERE user_id = 5 ORDER BY created_at (index provides free sorting). It does NOT support: WHERE created_at > date alone (first column not constrained). Column ordering rule: put equality columns before range columns. For WHERE user_id = 5 AND created_at BETWEEN x AND y: index (user_id, created_at) efficiently narrows to user_id=5 then range-scans dates. Index (created_at, user_id) range-scans dates first (potentially huge range) then filters by user_id — much worse. Generally: highest-selectivity equality column first, then additional equality columns, then range/sort columns last.”}},{“@type”:”Question”,”name”:”What is a covering index and how does it improve performance?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”A covering index includes all columns needed by a query, allowing the database to answer entirely from the index without accessing the table rows. Example: SELECT user_id, created_at, status FROM orders WHERE user_id = 5 ORDER BY created_at. Index: CREATE INDEX ON orders(user_id, created_at) INCLUDE (status). The INCLUDE clause adds status to the index leaf nodes without affecting sort order. PostgreSQL uses an Index Only Scan — 2-10x faster than a regular Index Scan (which reads the index for row pointers, then fetches each row from the table). Trade-off: covering indexes are larger (more data stored in the index), increasing storage and write overhead. Use them for high-frequency critical queries where the speedup justifies the cost. Do not create covering indexes for rarely executed queries.”}},{“@type”:”Question”,”name”:”When should you use a partial index?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”A partial index covers only rows matching a condition. Example: CREATE INDEX ON orders(created_at) WHERE status = pending. If only 5% of orders are pending, this index is 20x smaller than a full index on created_at. Smaller indexes are faster to scan, consume less memory, and have lower write overhead. Use partial indexes when: (1) Queries frequently filter on a specific condition (active users, pending items, unprocessed events). (2) The matching rows are a small fraction of the total table. (3) You want to enforce a unique constraint on a subset of rows (CREATE UNIQUE INDEX ON users(email) WHERE deleted_at IS NULL — unique emails among non-deleted users). The query must include the partial index condition in its WHERE clause for the planner to use it. WHERE status = pending AND created_at > date uses the partial index. WHERE created_at > date alone cannot use it.”}},{“@type”:”Question”,”name”:”What are the most common database indexing mistakes?”,”acceptedAnswer”:{“@type”:”Answer”,”text”:”Mistakes that hurt performance: (1) Missing foreign key indexes — JOINs on unindexed foreign keys cause full table scans. Always index FK columns. (2) Too many indexes — each index adds write overhead. A table with 10 indexes may be 10x slower for inserts. Index only columns in WHERE, JOIN, and ORDER BY. (3) Wrong composite index column order — equality columns must come before range columns. (4) Indexing low-cardinality columns alone — an index on a boolean (true/false) reads half the table either way. Use in a composite or partial index instead. (5) Functions on indexed columns — WHERE LOWER(email) = value cannot use a regular index on email. Create a functional index: CREATE INDEX ON users(LOWER(email)). (6) Not verifying with EXPLAIN ANALYZE — the planner may ignore your index due to stale statistics or type mismatches. Always check. (7) SELECT * defeating Index Only Scans — selecting all columns forces the database to read the table even when the index covers the WHERE and ORDER BY.”}}]}
Scroll to Top