Low Level Design: Database Query Optimizer

Query Parsing

Query execution begins with parsing the raw SQL text. A lexer (scanner) tokenizes the input into keywords, identifiers, literals, and operators. The parser consumes the token stream and builds an Abstract Syntax Tree (AST) that represents the syntactic structure of the query. Each node in the AST corresponds to a clause: SELECT list, FROM clause with join trees, WHERE predicates, GROUP BY, HAVING, ORDER BY, LIMIT.

After building the AST, semantic analysis resolves names against the catalog: table names are mapped to their storage descriptors, column references are resolved and type-checked, function names are looked up, and permission checks are performed. Errors at this stage (unknown table, type mismatch) are returned to the client before any optimization or execution.
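The lexing step above can be sketched with a tiny tokenizer. This is a minimal illustration, not a real SQL lexer: the grammar (keywords, identifiers, numbers, a few operators) and all names here are assumptions for the example.

```python
import re

# Toy token grammar; order matters (NUMBER before IDENT is safe because
# identifiers cannot start with a digit).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP",     r"[=<>*,;()]"),
    ("SKIP",   r"\s+"),
]
KEYWORDS = {"SELECT", "FROM", "WHERE"}

def tokenize(sql):
    """Yield (kind, text) pairs; identifiers matching a keyword are retagged."""
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    for m in re.finditer(pattern, sql):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "IDENT" and text.upper() in KEYWORDS:
            kind = "KEYWORD"
        yield (kind, text)

tokens = list(tokenize("SELECT id FROM users WHERE age > 30"))
# tokens[0] == ("KEYWORD", "SELECT"); the parser would consume this stream
# to build the AST.
```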

Logical Plan

The validated AST is transformed into a logical plan — a tree of relational algebra operators. Tables become Scan nodes. Predicates become Filter nodes. Column selection becomes Project nodes. Joins, aggregations, and sorts each have corresponding operator nodes.

The planner applies equivalence-preserving rewrite rules to simplify the logical plan before cost estimation:

  • Predicate pushdown: Move filter conditions as close to the base table scans as possible. A filter applied before a join reduces the number of rows that participate in the (expensive) join.
  • Column pruning: Remove columns from scan operators that are not referenced by any ancestor node. Reduces I/O and memory for wide tables.
  • Subquery unnesting: Convert correlated subqueries (which execute once per outer row) into equivalent joins, enabling set-at-a-time execution instead of row-at-a-time.
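Predicate pushdown, the first rule above, can be sketched on a toy plan tree. This is a minimal illustration assuming single-table predicates and a two-table join; the class and field names are invented for the example, not any real optimizer's API.

```python
from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    pred_table: str   # table whose column the predicate references
    pred: str
    child: object

@dataclass
class Join:
    left: object
    right: object

def push_down(node):
    """Move a Filter below a Join when its predicate touches only one side."""
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        if isinstance(join.left, Scan) and join.left.table == node.pred_table:
            return Join(Filter(node.pred_table, node.pred, join.left), join.right)
        if isinstance(join.right, Scan) and join.right.table == node.pred_table:
            return Join(join.left, Filter(node.pred_table, node.pred, join.right))
    return node

plan = Filter("users", "age > 30", Join(Scan("users"), Scan("orders")))
rewritten = push_down(plan)   # the filter now runs before the join
```

A real rewriter applies such rules repeatedly until a fixpoint, and must split conjunctive predicates (a AND b) so each conjunct can be pushed independently.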

Statistics Collection

The cost model is only as good as its estimates, and estimates are only as good as the statistics. A database catalog stores per-table and per-column statistics maintained by commands like PostgreSQL’s ANALYZE or MySQL’s ANALYZE TABLE.

Typical statistics collected:

  • Table-level: row count (n_live_tup), page count, average row width.
  • Column-level: number of distinct values (n_distinct), null fraction, most common values with frequencies, histogram of value distribution across equal-frequency buckets.
  • Index statistics: number of leaf pages, correlation between physical and logical row order (affects index scan cost).
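To show how these statistics feed estimation, here is a sketch of equality-predicate selectivity in the PostgreSQL style: exact frequencies for most common values, with the leftover probability mass spread evenly across the remaining distinct values. The dictionary shape and field names are illustrative, not the actual catalog format.

```python
def eq_selectivity(stats, value):
    """Estimate the fraction of rows where column = value."""
    if value in stats["mcv"]:
        return stats["mcv"][value]        # MCV list stores exact frequencies
    # Non-MCV value: spread the remaining (non-null, non-MCV) frequency
    # evenly over the remaining distinct values.
    mcv_total = sum(stats["mcv"].values())
    remaining_distinct = stats["n_distinct"] - len(stats["mcv"])
    leftover = 1.0 - stats["null_frac"] - mcv_total
    return leftover / max(remaining_distinct, 1)

stats = {"n_distinct": 100, "null_frac": 0.1,
         "mcv": {"active": 0.5, "deleted": 0.2}}
sel = eq_selectivity(stats, "archived")   # (1 - 0.1 - 0.7) / 98
```

Multiplying this selectivity by the table's row count gives the cardinality estimate the cost model consumes.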

PostgreSQL’s extended statistics (CREATE STATISTICS) capture correlations between columns, improving estimates for multi-column predicates where per-column independence assumptions lead to severe under- or over-estimates.

Cost Model

The optimizer assigns a cost to every candidate plan node. Cost is a dimensionless number (PostgreSQL expresses it in units of sequential page fetches, with seq_page_cost = 1.0 as the baseline) combining I/O and CPU components:

cost = pages_read * seq_page_cost
     + index_pages_read * random_page_cost
     + rows_processed * cpu_tuple_cost
     + rows_filtered * cpu_operator_cost
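The formula above, written as a function with PostgreSQL's default cost constants (seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025). The four-term breakdown is itself a simplification of the real costing code.

```python
def plan_cost(pages_read, index_pages_read, rows_processed, rows_filtered,
              seq_page_cost=1.0, random_page_cost=4.0,
              cpu_tuple_cost=0.01, cpu_operator_cost=0.0025):
    """Combine I/O and CPU components into a single dimensionless cost."""
    return (pages_read * seq_page_cost
            + index_pages_read * random_page_cost
            + rows_processed * cpu_tuple_cost
            + rows_filtered * cpu_operator_cost)

# A sequential scan of 1,000 pages holding 100,000 rows, all passed
# through a filter: 1000 (I/O) + 1000 (tuple CPU) + 250 (operator CPU)
cost = plan_cost(1000, 0, 100_000, 100_000)
```

Note how random_page_cost is 4x the sequential cost by default: this single ratio drives most seq-scan-vs-index-scan decisions.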

Key operator costs:

  • Sequential scan: Read all table pages sequentially. Low per-page cost, but proportional to table size.
  • Index scan: Traverse B-tree to find matching tuples, then fetch heap pages. Each heap fetch is a random I/O. Wins when the predicate is highly selective (few rows match).
  • Nested loop join: For each row in the outer relation, probe the inner relation. Cost = outer_rows × inner_scan_cost. Efficient when inner scan uses an index and outer is small.
  • Hash join: Build a hash table from the smaller relation, probe with the larger. Cost = build_cost + probe_cost. Requires a memory budget; spills to disk if exceeded.
  • Merge join: Scan two pre-sorted relations in lockstep. Efficient when both inputs are already sorted (e.g., from an index scan on the join key).
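The rough cost shapes above can be compared numerically. This is a toy sketch, not PostgreSQL's actual costing code; the per-row constant and the inner-scan cost are invented for illustration.

```python
def nested_loop_cost(outer_rows, inner_scan_cost):
    """One inner-relation probe per outer row."""
    return outer_rows * inner_scan_cost

def hash_join_cost(build_rows, probe_rows, per_row_cost=0.01):
    """One pass over each side: build the hash table, then probe it."""
    return (build_rows + probe_rows) * per_row_cost

# Small outer + cheap indexed inner probe favors the nested loop:
nl = nested_loop_cost(outer_rows=10, inner_scan_cost=0.5)   # 5.0
# Two large inputs with an equality key favor the hash join's
# single pass per side:
hj = hash_join_cost(build_rows=1_000_000, probe_rows=5_000_000)
```

The crossover between the two is exactly what the optimizer searches for: nested loop cost grows with the product of the inputs, hash join cost with their sum.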

Join Ordering

For a query joining n tables, there are n! possible join orderings (and more if considering bushy vs. left-deep trees). Finding the optimal order is NP-hard in general.

The System R approach, used by PostgreSQL, applies dynamic programming:

  1. Enumerate all 2^n subsets of the n tables.
  2. For each subset S, compute the cheapest plan that joins exactly the tables in S.
  3. Build S by combining a smaller subset S’ with one additional table, using all three physical join algorithms.
  4. Keep only the cheapest plan for each (subset, interesting order) combination, where "interesting order" means a sort order useful to a parent operator (e.g., for a merge join or ORDER BY).

This runs in O(3^n) time. PostgreSQL uses DP for queries joining fewer than geqo_threshold tables (default 12), then falls back to GEQO (the genetic query optimizer), a randomized search, for larger queries to avoid exponential planning time.
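Steps 1-3 of the DP can be sketched directly. This toy version tracks only the cheapest plan per subset (no interesting orders), builds left-deep trees only, and uses an invented cost function (cardinality product scaled by the best pairwise join selectivity); real optimizers are considerably more careful.

```python
from itertools import combinations

def best_join_order(tables, card, join_sel):
    """tables: list of names; card: name -> row count;
    join_sel: frozenset({a, b}) -> selectivity of joining a with b."""
    # Base case: each single table is "joined" at the cost of its scan.
    best = {frozenset([t]): (float(card[t]), t) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            # Build the subset by adding one table to a smaller subset.
            for t in subset:
                rest = subset - {t}
                if rest not in best:
                    continue
                cost, plan = best[rest]
                new_cost = cost * card[t] * min(
                    join_sel.get(frozenset({t, r}), 1.0) for r in rest)
                if subset not in best or new_cost < best[subset][0]:
                    best[subset] = (new_cost, (plan, t))
    return best[frozenset(tables)]

cost, plan = best_join_order(
    ["a", "b", "c"],
    {"a": 1000, "b": 10, "c": 100},
    {frozenset({"a", "b"}): 0.01, frozenset({"b", "c"}): 0.1},
)
```

The memo table `best` is what makes this polynomial in the number of subsets rather than factorial in the number of orderings: every ordering of a subset collapses into one entry.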

Physical Operator Selection

Once the logical join order is fixed, the optimizer selects physical implementations for each operator. This is a secondary optimization pass that evaluates multiple physical alternatives:

  • Scan: Sequential scan, index scan, index-only scan (covers all needed columns), bitmap index scan (batches random I/Os into sorted heap fetches), or a combination of multiple indexes via bitmap AND/OR.
  • Join: Nested loop (best for small outer + indexed inner), hash join (best for large equality joins with no useful sort order), merge join (best when inputs are already sorted or a sort is needed anyway for ORDER BY).
  • Aggregation: Hash aggregate (group into hash table, one pass) vs. sort + stream aggregate (sort by group key, then scan linearly — avoids memory limits).
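The two aggregation strategies in the last bullet can be sketched side by side, here counting rows per group over (key, value) pairs. A minimal illustration: real executors stream tuples and respect a memory budget rather than materializing Python lists.

```python
from itertools import groupby

def hash_aggregate(rows):
    """One pass; the whole set of groups must fit in the hash table."""
    groups = {}
    for key, _ in rows:
        groups[key] = groups.get(key, 0) + 1
    return groups

def sort_stream_aggregate(rows):
    """Sort by key, then emit each group as the key changes:
    memory holds only the current group's state."""
    out = {}
    for key, grp in groupby(sorted(rows), key=lambda r: r[0]):
        out[key] = sum(1 for _ in grp)
    return out

rows = [("a", 1), ("b", 2), ("a", 3)]
# Both strategies produce the same result; they differ only in
# memory footprint and whether the output arrives sorted.
```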

The optimizer generates all valid physical plans for the chosen logical plan and picks the minimum-cost one. GUC parameters like enable_hashjoin = off can disable specific strategies for debugging or forcing a plan.

Adaptive Query Execution

Static optimization fails when cardinality estimates are wrong — a common occurrence with skewed data, correlated predicates, or after data changes that outpace statistics updates. Adaptive query execution (AQE) re-optimizes at runtime using actual statistics collected during execution.

Apache Spark’s AQE (introduced in Spark 3.0) re-optimizes after each shuffle stage: it reads actual partition size statistics from completed stages to make better decisions for downstream stages — dynamically coalescing small partitions, converting sort-merge joins to broadcast hash joins when the build side turns out to be small, and splitting skewed partitions so that a few oversized tasks do not dominate join runtime.
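The join-strategy switch can be reduced to a sketch: once a shuffle stage finishes, the actual byte size of the build side is known, and the planner can compare it against the broadcast threshold. The 10 MB default mirrors Spark's spark.sql.autoBroadcastJoinThreshold; everything else here is illustrative, not Spark's implementation.

```python
def choose_join(actual_build_bytes, threshold=10 * 1024 * 1024):
    """Pick a join strategy from *observed* (not estimated) sizes."""
    if actual_build_bytes <= threshold:
        return "broadcast_hash_join"   # ship the small side to every executor
    return "sort_merge_join"           # shuffle-and-sort both sides

# The pre-execution estimate said 500 MB, but the completed stage
# produced only 2 MB of output, so the cheaper strategy wins:
strategy = choose_join(2 * 1024 * 1024)
```

The point of AQE is exactly this substitution of observed statistics for estimates: the decision rule is unchanged, but its input is now trustworthy.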

PostgreSQL does not re-optimize mid-query, but it has made estimation more robust through multi-column extended statistics and improved MCV lists, while JIT compilation reduces per-tuple execution overhead. Some commercial databases (SQL Server, Oracle) support re-optimization across query executions via "cardinality feedback".

Caching Query Plans

Query parsing and optimization are expensive. Prepared statements amortize this cost by caching the plan across executions.

PostgreSQL uses a two-mode strategy for generic (parameterized) plans:

  • Custom plan: Re-optimized on each execution with the actual parameter values substituted. Better plan quality but pays planning overhead every time.
  • Generic plan: Planned once without specific parameter values and reused across executions. PostgreSQL plans the first 5 executions with custom plans, then switches to the generic plan if its estimated cost is no higher than the average custom-plan cost; plan_cache_mode can force either behavior.
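The two-mode heuristic can be sketched as a small state machine. The 5-execution warm-up matches PostgreSQL's behavior, but the cost bookkeeping here is a simplified illustration, not the server's actual logic.

```python
class PlanCache:
    """Toy model of per-statement custom-vs-generic plan choice."""

    def __init__(self, generic_cost):
        self.generic_cost = generic_cost   # estimated once, parameter-free
        self.custom_costs = []

    def choose(self, custom_cost):
        if len(self.custom_costs) < 5:
            self.custom_costs.append(custom_cost)
            return "custom"                # warm-up: always re-plan
        avg_custom = sum(self.custom_costs) / len(self.custom_costs)
        # Reuse the generic plan once it is no worse than the average
        # custom plan (which also pays planning overhead each time).
        return "generic" if self.generic_cost <= avg_custom else "custom"

cache = PlanCache(generic_cost=80.0)
first = [cache.choose(90.0) for _ in range(5)]   # five custom plans first
later = cache.choose(120.0)                      # then the generic plan sticks
```

The warm-up exists because a generic plan can be catastrophically wrong for skewed parameters (e.g., a value matching 90% of rows); five custom plans give the cache a baseline to compare against.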

Plan caches must be invalidated when the underlying schema or statistics change significantly — PostgreSQL does this by propagating catalog invalidation messages to cached plans. In connection poolers like PgBouncer in transaction-mode pooling, server-side prepared statements traditionally could not be used (recent PgBouncer releases add protocol-level support), which forces re-parsing and re-planning on every execution — a common source of performance surprises when migrating to a pooler.
