Query Optimizer Low-Level Design
A query optimizer transforms a parsed SQL query into an efficient execution plan. The optimizer evaluates many possible plans (different join orders, index choices, access methods) and selects the one with the lowest estimated cost. The quality of the cost estimate determines plan quality: bad statistics produce bad plans.
Statistics Collection
The optimizer relies on table and column statistics to estimate row counts and selectivity:
- Table statistics: row_count (total rows), page_count (disk pages), updated_at (staleness indicator).
- Column statistics: cardinality (distinct value count), null_fraction, and a histogram of value distribution (equal-width or equal-depth buckets). Histograms let the optimizer estimate selectivity for range predicates such as WHERE age BETWEEN 30 AND 40.
Statistics are collected by a background ANALYZE job triggered after a threshold of row changes (e.g., 10% of table size). Stale statistics are the most common cause of bad query plans in production.
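As a concrete illustration, a range predicate's selectivity can be read off an equal-depth histogram by summing the (fractional) buckets the range overlaps. This is a minimal sketch: the bucket boundaries are invented for the example, and real systems also fold in null_fraction and most-common-value lists.

```python
# Sketch: estimating range-predicate selectivity from an equal-depth
# histogram. Bucket boundaries below are illustrative, not from a catalog.

def estimate_range_selectivity(boundaries, lo, hi):
    """Estimate the fraction of rows with lo <= value <= hi.

    `boundaries` are the sorted edges of an equal-depth histogram:
    each of the len(boundaries) - 1 buckets holds the same row fraction,
    assumed uniform within the bucket.
    """
    num_buckets = len(boundaries) - 1
    if num_buckets <= 0:
        return 1.0  # no histogram: fall back to the pessimistic estimate
    total = 0.0
    for i in range(num_buckets):
        b_lo, b_hi = boundaries[i], boundaries[i + 1]
        overlap = min(hi, b_hi) - max(lo, b_lo)
        if overlap <= 0 or b_hi <= b_lo:
            continue
        # A partial overlap contributes a fraction of this bucket's share.
        total += (overlap / (b_hi - b_lo)) / num_buckets
    return min(1.0, total)

# WHERE age BETWEEN 30 AND 40 against boundaries [0, 25, 50, 75, 100]:
# the range covers 10/25 of the second bucket -> selectivity 0.1.
```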
Cost Model
Each plan node has an estimated cost composed of:
- I/O cost: estimated disk pages read. Sequential scans read all pages; index scans read index pages plus heap pages for matched rows.
- CPU cost: estimated tuple processing cost. Joins, filters, and projections each have a per-tuple CPU coefficient.
Total plan cost = sum of (I/O cost + CPU cost) across all nodes in the plan tree. The optimizer selects the plan with the minimum total cost.
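A minimal numeric sketch of this model, using coefficients similar to common defaults (Postgres, for example, defaults to seq_page_cost = 1.0 and random_page_cost = 4.0); the table sizes are invented for illustration:

```python
# Sketch: cost crossover between a sequential scan and an index scan,
# using coefficients similar to common defaults. Table sizes are made up.

SEQ_PAGE_COST = 1.0       # cost to read one page sequentially
RANDOM_PAGE_COST = 4.0    # cost to read one page at random
CPU_TUPLE_COST = 0.01     # cost to process one tuple

def seq_scan_cost(page_count: int, row_count: int) -> float:
    # Reads every page, processes every row.
    return page_count * SEQ_PAGE_COST + row_count * CPU_TUPLE_COST

def index_scan_cost(matched_rows: float) -> float:
    # Pessimistic: one random heap page fetch per matched row.
    return matched_rows * RANDOM_PAGE_COST + matched_rows * CPU_TUPLE_COST

# 10,000-page table with 1,000,000 rows: seq scan costs 20,000 units.
# At 0.1% selectivity (1,000 rows) the index scan costs 4,010 units and
# wins; at 5% selectivity (50,000 rows) it costs 200,500 and loses.
```

The crossover point is why the optimizer abandons an index once the predicate matches more than a few percent of rows.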
Join Order Enumeration
For N tables in a query, the number of possible join orders is O(N!). Optimizers use different strategies based on N:
- Small N (up to ~8 tables): dynamic programming over subsets. The optimizer computes the optimal join order for every subset of tables bottom-up, reusing sub-results. This is the System R / Selinger algorithm.
- Large N: heuristics such as greedy nearest-neighbor (join the table that produces the smallest intermediate result at each step) or genetic algorithm (GEQO in Postgres). These sacrifice optimality for planning time.
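The dynamic-programming approach can be sketched as follows. The cost and cardinality formulas are deliberate stand-ins (join cost as the sum of input cardinalities, one uniform join selectivity), not the real System R model:

```python
# Sketch of Selinger-style dynamic programming over table subsets.
# Cost model is a stand-in: join cost = sum of input cardinalities,
# output cardinality = product of inputs * an assumed uniform selectivity.

from itertools import combinations

def dp_join_order(tables, row_counts, join_selectivity=0.001):
    """Return (best_cost, join_order) by DP over all subsets of tables."""
    # best[subset] = (cost, cardinality, order)
    best = {frozenset([t]): (0.0, row_counts[t], [t]) for t in tables}
    for size in range(2, len(tables) + 1):
        for combo in combinations(tables, size):
            s = frozenset(combo)
            # Try every split of s into two smaller, already-solved subsets.
            for left_size in range(1, size):
                for left_combo in combinations(combo, left_size):
                    left = frozenset(left_combo)
                    right = s - left
                    lc, lcard, lorder = best[left]
                    rc, rcard, rorder = best[right]
                    card = lcard * rcard * join_selectivity
                    cost = lc + rc + lcard + rcard  # stand-in join cost
                    if s not in best or cost < best[s][0]:
                        best[s] = (cost, card, lorder + rorder)
    cost, _card, order = best[frozenset(tables)]
    return cost, order
```

Each subset is solved once and reused, which is what brings O(N!) enumeration down to O(2^N) subproblems.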
Index Selection
For each table access, the optimizer considers available indexes:
- Selectivity: an index is beneficial when the predicate selects a small fraction of rows. If selectivity is above ~10-20%, a sequential scan is cheaper due to sequential I/O vs random I/O.
- Covering index: if all columns referenced by the query are in the index, the optimizer can perform an index-only scan, skipping heap access entirely. This is the most impactful index optimization for read-heavy queries.
- Multi-column index: column order matters. The optimizer uses the index only if predicates match the leading columns.
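The leading-column rule can be sketched as a prefix check; the column names are hypothetical:

```python
# Sketch: a multi-column index applies only to the longest prefix of its
# column list that the query's predicates cover. Column names invented.

def usable_index_prefix(index_columns, predicate_columns):
    """Return the leading index columns the predicates can use."""
    usable = []
    for col in index_columns:
        if col in predicate_columns:
            usable.append(col)
        else:
            break  # a gap in the prefix stops index usage
    return usable

# An index on (tenant_id, created_at) serves a filter on tenant_id alone,
# but a filter on created_at alone cannot use it.
```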
Predicate Pushdown
Predicate pushdown moves WHERE filter conditions as close to the table scan as possible in the plan tree. This reduces the number of rows flowing up through joins and aggregations, directly reducing I/O and CPU cost. For example, a filter on a joined table's column is pushed below the join into the table scan of that table.
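A sketch of the transformation on a toy dict-based plan tree; the node shapes are illustrative, not a real optimizer IR:

```python
# Sketch: pushing a single-table Filter below a Join in a dict-based
# plan tree. Node shapes here are invented for illustration.

def push_down_filter(node):
    """If a Filter above a Join references only one table, re-parent it
    directly above that table's scan and return the new subtree root."""
    if node.get("node_type") != "Filter":
        return node
    child = node["child"]
    if child.get("node_type") != "Join":
        return node
    target = node["filter_table"]
    for side in ("left", "right"):
        scan = child[side]
        if scan.get("table_name") == target:
            child[side] = {**node, "child": scan}  # Filter now sits on the scan
            return child  # the Join becomes the new root
    return node

plan = {
    "node_type": "Filter", "filter_table": "orders",
    "predicate": "orders.status = 'open'",
    "child": {
        "node_type": "Join",
        "left": {"node_type": "SeqScan", "table_name": "orders"},
        "right": {"node_type": "SeqScan", "table_name": "users"},
    },
}
pushed = push_down_filter(plan)
# The Join is now the root; the Filter sits directly on the orders scan.
```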
Plan Caching
Parsing and optimizing a query is CPU-intensive. The plan cache stores the execution plan for a normalized query fingerprint (parameter values replaced with placeholders). On re-execution, the fingerprint is looked up in the cache and the cached plan is executed directly. Plan cache invalidation is triggered by schema changes (DDL) or significant statistics drift (after ANALYZE). Adaptive plan caching re-optimizes plans that have been running with a significantly different actual vs estimated row count.
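Fingerprinting can be sketched with regex-based literal replacement; real fingerprinters normalize the parse tree instead, so this is a rough illustration only:

```python
# Sketch: normalizing SQL into a plan-cache fingerprint by replacing
# literals with placeholders. Regex-based; a real implementation works
# on the parse tree to avoid false matches.

import hashlib
import re

def fingerprint(sql: str) -> str:
    normalized = sql.strip().lower()
    normalized = re.sub(r"'[^']*'", "?", normalized)          # string literals
    normalized = re.sub(r"\b\d+(\.\d+)?\b", "?", normalized)  # numeric literals
    normalized = re.sub(r"\s+", " ", normalized)              # collapse whitespace
    return hashlib.sha256(normalized.encode()).hexdigest()

# Same cache entry regardless of parameter values:
# fingerprint("SELECT * FROM users WHERE id = 42")
#   == fingerprint("select * from users  where id = 7")
```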
Slow Query Detection and Plan Capture
Queries exceeding an execution time threshold are logged to the SlowQueryLog with the full execution plan and actual vs estimated row counts. This data drives index recommendations and manual plan tuning. Comparing estimated vs actual rows at each node exposes where the optimizer's statistics model breaks down.
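The estimated-vs-actual comparison can be sketched as a tree walk that flags nodes diverging by more than an order of magnitude; the node shape is illustrative:

```python
# Sketch: flag plan nodes whose actual row count diverges from the
# estimate by more than an order of magnitude. Node shape is invented.

def misestimated_nodes(node, threshold=10.0):
    """Yield (node_type, estimated, actual) for badly estimated nodes."""
    est = max(node.get("estimated_rows", 0), 1)
    act = max(node.get("actual_rows", 0), 1)
    ratio = max(est, act) / min(est, act)
    if ratio >= threshold:
        yield (node.get("node_type"), est, act)
    for child in node.get("children", []):
        yield from misestimated_nodes(child, threshold)

plan = {"node_type": "HashJoin", "estimated_rows": 100, "actual_rows": 50_000,
        "children": [{"node_type": "SeqScan", "estimated_rows": 1000,
                      "actual_rows": 1200, "children": []}]}
flagged = list(misestimated_nodes(plan))
# Only the join is flagged: 50,000 actual vs 100 estimated is a 500x miss.
```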
SQL Schema
CREATE TABLE TableStatistic (
    table_name TEXT PRIMARY KEY,
    row_count BIGINT NOT NULL,
    page_count BIGINT NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE ColumnStatistic (
    id BIGSERIAL PRIMARY KEY,
    table_name TEXT NOT NULL,
    column_name TEXT NOT NULL,
    cardinality BIGINT NOT NULL,
    null_fraction NUMERIC(5,4) NOT NULL DEFAULT 0,
    histogram JSONB,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (table_name, column_name)
);

CREATE TABLE QueryPlanCache (
    query_fingerprint TEXT PRIMARY KEY,
    plan JSONB NOT NULL,
    estimated_cost NUMERIC NOT NULL,
    hit_count BIGINT NOT NULL DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    invalidated_at TIMESTAMPTZ
);

CREATE TABLE SlowQueryLog (
    id BIGSERIAL PRIMARY KEY,
    query_hash TEXT NOT NULL,
    sql_text TEXT NOT NULL,
    plan JSONB NOT NULL,
    execution_ms INTEGER NOT NULL,
    captured_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX ON SlowQueryLog (execution_ms DESC);
CREATE INDEX ON SlowQueryLog (query_hash, captured_at DESC);
Python Implementation
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class PlanNode:
    node_type: str  # SeqScan, IndexScan, HashJoin, etc.
    table_name: str
    estimated_rows: float
    estimated_cost: float
    children: List["PlanNode"] = field(default_factory=list)
def estimate_cost(plan_node: PlanNode, stats: Dict[str, Any]) -> float:
    """Recursively estimate the total cost (I/O + CPU) of a plan subtree."""
    SEQ_PAGE_COST = 1.0
    RANDOM_PAGE_COST = 4.0
    CPU_TUPLE_COST = 0.01
    CPU_OPERATOR_COST = 0.0025

    ts = stats.get(plan_node.table_name, {})
    page_count = ts.get("page_count", 1000)
    row_count = ts.get("row_count", 10000)

    if plan_node.node_type == "SeqScan":
        io_cost = page_count * SEQ_PAGE_COST
        cpu_cost = row_count * CPU_TUPLE_COST
    elif plan_node.node_type == "IndexScan":
        # Pessimistic: one random heap page fetch per matched row.
        matched_rows = plan_node.estimated_rows
        io_cost = matched_rows * RANDOM_PAGE_COST
        cpu_cost = matched_rows * (CPU_TUPLE_COST + CPU_OPERATOR_COST)
    elif plan_node.node_type in ("HashJoin", "NestedLoopJoin", "MergeJoin"):
        # Join I/O is attributed to the children; the join itself is CPU.
        io_cost = 0.0
        cpu_cost = plan_node.estimated_rows * CPU_OPERATOR_COST
    else:
        io_cost = 0.0
        cpu_cost = plan_node.estimated_rows * CPU_TUPLE_COST

    child_costs = sum(estimate_cost(c, stats) for c in plan_node.children)
    return io_cost + cpu_cost + child_costs
def select_join_order(tables: List[str], predicates: List[Dict], stats: Dict) -> List[str]:
    """Greedy heuristic: at each step, join the remaining table with the
    fewest estimated rows.

    Simplification: base-table row count stands in for intermediate
    result size; a full implementation would use `predicates` to
    estimate join selectivity at each step.
    """
    if not tables:
        return []
    remaining = list(tables)
    order = [remaining.pop(0)]  # driving table: first listed
    while remaining:
        best = min(remaining, key=lambda t: stats.get(t, {}).get("row_count", float("inf")))
        order.append(best)
        remaining.remove(best)
    return order
def recommend_index(slow_query: Dict) -> Optional[str]:
    """Heuristic: if the plan contains a SeqScan on a large table with a
    WHERE predicate, recommend an index on the filtered column."""
    plan = slow_query.get("plan", {})
    table = plan.get("table_name")
    node_type = plan.get("node_type")
    filter_col = plan.get("filter_column")
    estimated_rows = plan.get("estimated_rows", 0)
    if node_type == "SeqScan" and estimated_rows > 10000 and table and filter_col:
        return f"CREATE INDEX ON {table} ({filter_col});"
    return None
Frequently Asked Questions
How does statistics staleness cause bad query plans?
The optimizer uses row count and histogram data to estimate selectivity. If a table has grown from 10,000 rows to 10 million rows since the last ANALYZE, the optimizer still plans for 10,000 rows. It may choose a nested loop join that is efficient for small tables but catastrophic at scale, or skip an index that would filter 99% of rows. Regular ANALYZE (triggered by autovacuum or explicit schedule) is the primary defense against plan regression from statistics drift.
When should a plan cache entry be invalidated?
Plan cache entries should be invalidated on: (1) DDL changes to any table referenced in the query (column added/dropped, index created/dropped); (2) significant statistics drift detected after ANALYZE (e.g., row count changed by more than 20%); (3) adaptive invalidation when actual vs estimated row counts diverge by more than an order of magnitude across multiple executions. Invalidation resets the fingerprint entry; the next execution re-optimizes and stores a new plan.
Why does join order enumeration become exponential and how is it controlled?
For N tables, the number of left-deep join trees is N!, and bushy trees are more numerous still. Dynamic programming reduces this to O(2^N * N) by solving subproblems once and reusing them, but it still becomes impractical above ~12 tables. Postgres switches to the GEQO genetic algorithm above geqo_threshold (default 12). For application queries, keeping JOIN counts below that threshold keeps exhaustive enumeration tractable, so the optimizer can find the plan that is optimal under its cost model.
What heuristics drive index recommendations from slow query logs?
The primary signal is a SeqScan on a large table (above a row count threshold) with a WHERE predicate on an unindexed column. Secondary signals are: high actual vs estimated row ratio (statistics mismatch), and queries that appear repeatedly in the slow log (high frequency). The recommended index targets the filter column; if multiple predicates exist, a composite index covering the highest-selectivity column first is recommended. Covering indexes (adding SELECT columns to the index) are recommended when the query is read-heavy and the column set is small.