Query Optimizer Low-Level Design
A query optimizer transforms a parsed SQL query into an efficient execution plan. The optimizer evaluates many possible plans (different join orders, index choices, access methods) and selects the one with the lowest estimated cost. The quality of the cost estimate determines plan quality: bad statistics produce bad plans.
Statistics Collection
The optimizer relies on table and column statistics to estimate row counts and selectivity:
- Table statistics: row_count (total rows), page_count (disk pages), updated_at (staleness indicator).
- Column statistics: cardinality (distinct value count), null_fraction, and a histogram of value distribution (equal-width or equal-depth buckets). Histograms let the optimizer estimate selectivity for range predicates such as WHERE age BETWEEN 30 AND 40.
Statistics are collected by a background ANALYZE job triggered after a threshold of row changes (e.g., 10% of table size). Stale statistics are the most common cause of bad query plans in production.
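As a concrete illustration, a range predicate's selectivity can be read off an equal-depth histogram by summing the (fractional) buckets the range overlaps. This is a minimal sketch: the bucket boundaries are invented for the example, and real systems also fold in null_fraction and most-common-value lists.

```python
# Sketch: estimating range-predicate selectivity from an equal-depth
# histogram. Bucket boundaries below are illustrative, not from a catalog.

def estimate_range_selectivity(boundaries, lo, hi):
    """Estimate the fraction of rows with lo <= value <= hi.

    `boundaries` are the sorted edges of an equal-depth histogram:
    each of the len(boundaries) - 1 buckets holds the same row fraction,
    assumed uniform within the bucket.
    """
    num_buckets = len(boundaries) - 1
    if num_buckets <= 0:
        return 1.0  # no histogram: fall back to the pessimistic estimate
    total = 0.0
    for i in range(num_buckets):
        b_lo, b_hi = boundaries[i], boundaries[i + 1]
        overlap = min(hi, b_hi) - max(lo, b_lo)
        if overlap <= 0 or b_hi <= b_lo:
            continue
        # A partial overlap contributes a fraction of this bucket's share.
        total += (overlap / (b_hi - b_lo)) / num_buckets
    return min(1.0, total)

# WHERE age BETWEEN 30 AND 40 against boundaries [0, 25, 50, 75, 100]:
# the range covers 10/25 of the second bucket -> selectivity 0.1.
```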
Cost Model
Each plan node has an estimated cost composed of:
- I/O cost: estimated disk pages read. Sequential scans read all pages; index scans read index pages plus heap pages for matched rows.
- CPU cost: estimated tuple processing cost. Joins, filters, and projections each have a per-tuple CPU coefficient.
Total plan cost = sum of (I/O cost + CPU cost) across all nodes in the plan tree. The optimizer selects the plan with the minimum total cost.
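A minimal numeric sketch of this model, using coefficients similar to common defaults (Postgres, for example, defaults to seq_page_cost = 1.0 and random_page_cost = 4.0); the table sizes are invented for illustration:

```python
# Sketch: cost crossover between a sequential scan and an index scan,
# using coefficients similar to common defaults. Table sizes are made up.

SEQ_PAGE_COST = 1.0       # cost to read one page sequentially
RANDOM_PAGE_COST = 4.0    # cost to read one page at random
CPU_TUPLE_COST = 0.01     # cost to process one tuple

def seq_scan_cost(page_count: int, row_count: int) -> float:
    # Reads every page, processes every row.
    return page_count * SEQ_PAGE_COST + row_count * CPU_TUPLE_COST

def index_scan_cost(matched_rows: float) -> float:
    # Pessimistic: one random heap page fetch per matched row.
    return matched_rows * RANDOM_PAGE_COST + matched_rows * CPU_TUPLE_COST

# 10,000-page table with 1,000,000 rows: seq scan costs 20,000 units.
# At 0.1% selectivity (1,000 rows) the index scan costs 4,010 units and
# wins; at 5% selectivity (50,000 rows) it costs 200,500 and loses.
```

The crossover point is why the optimizer abandons an index once the predicate matches more than a few percent of rows.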
Join Order Enumeration
For N tables in a query, the number of possible join orders is O(N!). Optimizers use different strategies based on N:
- Small N (up to ~8 tables): dynamic programming over subsets. The optimizer computes the optimal join order for every subset of tables bottom-up, reusing sub-results. This is the System R / Selinger algorithm.
- Large N: heuristics such as greedy nearest-neighbor (join the table that produces the smallest intermediate result at each step) or genetic algorithm (GEQO in Postgres). These sacrifice optimality for planning time.
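The dynamic-programming approach can be sketched as follows. The cost and cardinality formulas are deliberate stand-ins (join cost as the sum of input cardinalities, one uniform join selectivity), not the real System R model:

```python
# Sketch of Selinger-style dynamic programming over table subsets.
# Cost model is a stand-in: join cost = sum of input cardinalities,
# output cardinality = product of inputs * an assumed uniform selectivity.

from itertools import combinations

def dp_join_order(tables, row_counts, join_selectivity=0.001):
    """Return (best_cost, join_order) by DP over all subsets of tables."""
    # best[subset] = (cost, cardinality, order)
    best = {frozenset([t]): (0.0, row_counts[t], [t]) for t in tables}
    for size in range(2, len(tables) + 1):
        for combo in combinations(tables, size):
            s = frozenset(combo)
            # Try every split of s into two smaller, already-solved subsets.
            for left_size in range(1, size):
                for left_combo in combinations(combo, left_size):
                    left = frozenset(left_combo)
                    right = s - left
                    lc, lcard, lorder = best[left]
                    rc, rcard, rorder = best[right]
                    card = lcard * rcard * join_selectivity
                    cost = lc + rc + lcard + rcard  # stand-in join cost
                    if s not in best or cost < best[s][0]:
                        best[s] = (cost, card, lorder + rorder)
    cost, _card, order = best[frozenset(tables)]
    return cost, order
```

Each subset is solved once and reused, which is what brings O(N!) enumeration down to O(2^N) subproblems.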
Index Selection
For each table access, the optimizer considers available indexes:
- Selectivity: an index is beneficial when the predicate selects a small fraction of rows. If selectivity is above ~10-20%, a sequential scan is cheaper due to sequential I/O vs random I/O.
- Covering index: if all columns referenced by the query are in the index, the optimizer can perform an index-only scan, skipping heap access entirely. This is the most impactful index optimization for read-heavy queries.
- Multi-column index: column order matters. The optimizer uses the index only if predicates match the leading columns.
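The leading-column rule can be sketched as a prefix check; the column names are hypothetical:

```python
# Sketch: a multi-column index applies only to the longest prefix of its
# column list that the query's predicates cover. Column names invented.

def usable_index_prefix(index_columns, predicate_columns):
    """Return the leading index columns the predicates can use."""
    usable = []
    for col in index_columns:
        if col in predicate_columns:
            usable.append(col)
        else:
            break  # a gap in the prefix stops index usage
    return usable

# An index on (tenant_id, created_at) serves a filter on tenant_id alone,
# but a filter on created_at alone cannot use it.
```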
Predicate Pushdown
Predicate pushdown moves WHERE filter conditions as close to the table scan as possible in the plan tree. This reduces the number of rows flowing up through joins and aggregations, directly reducing I/O and CPU cost. For example, a filter on a joined table's column is pushed below the join into the table scan of that table.
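A sketch of the transformation on a toy dict-based plan tree; the node shapes are illustrative, not a real optimizer IR:

```python
# Sketch: pushing a single-table Filter below a Join in a dict-based
# plan tree. Node shapes here are invented for illustration.

def push_down_filter(node):
    """If a Filter above a Join references only one table, re-parent it
    directly above that table's scan and return the new subtree root."""
    if node.get("node_type") != "Filter":
        return node
    child = node["child"]
    if child.get("node_type") != "Join":
        return node
    target = node["filter_table"]
    for side in ("left", "right"):
        scan = child[side]
        if scan.get("table_name") == target:
            child[side] = {**node, "child": scan}  # Filter now sits on the scan
            return child  # the Join becomes the new root
    return node

plan = {
    "node_type": "Filter", "filter_table": "orders",
    "predicate": "orders.status = 'open'",
    "child": {
        "node_type": "Join",
        "left": {"node_type": "SeqScan", "table_name": "orders"},
        "right": {"node_type": "SeqScan", "table_name": "users"},
    },
}
pushed = push_down_filter(plan)
# The Join is now the root; the Filter sits directly on the orders scan.
```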
Plan Caching
Parsing and optimizing a query is CPU-intensive. The plan cache stores the execution plan for a normalized query fingerprint (parameter values replaced with placeholders). On re-execution, the fingerprint is looked up in the cache and the cached plan is executed directly. Plan cache invalidation is triggered by schema changes (DDL) or significant statistics drift (after ANALYZE). Adaptive plan caching re-optimizes plans that have been running with a significantly different actual vs estimated row count.
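Fingerprinting can be sketched with regex-based literal replacement; real fingerprinters normalize the parse tree instead, so this is a rough illustration only:

```python
# Sketch: normalizing SQL into a plan-cache fingerprint by replacing
# literals with placeholders. Regex-based; a real implementation works
# on the parse tree to avoid false matches.

import hashlib
import re

def fingerprint(sql: str) -> str:
    normalized = sql.strip().lower()
    normalized = re.sub(r"'[^']*'", "?", normalized)          # string literals
    normalized = re.sub(r"\b\d+(\.\d+)?\b", "?", normalized)  # numeric literals
    normalized = re.sub(r"\s+", " ", normalized)              # collapse whitespace
    return hashlib.sha256(normalized.encode()).hexdigest()

# Same cache entry regardless of parameter values:
# fingerprint("SELECT * FROM users WHERE id = 42")
#   == fingerprint("select * from users  where id = 7")
```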
Slow Query Detection and Plan Capture
Queries exceeding an execution time threshold are logged to the SlowQueryLog with the full execution plan and actual vs estimated row counts. This data drives index recommendations and manual plan tuning. Comparing estimated vs actual rows at each node exposes where the optimizer's statistics model breaks down.
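The estimated-vs-actual comparison can be sketched as a tree walk that flags nodes diverging by more than an order of magnitude; the node shape is illustrative:

```python
# Sketch: flag plan nodes whose actual row count diverges from the
# estimate by more than an order of magnitude. Node shape is invented.

def misestimated_nodes(node, threshold=10.0):
    """Yield (node_type, estimated, actual) for badly estimated nodes."""
    est = max(node.get("estimated_rows", 0), 1)
    act = max(node.get("actual_rows", 0), 1)
    ratio = max(est, act) / min(est, act)
    if ratio >= threshold:
        yield (node.get("node_type"), est, act)
    for child in node.get("children", []):
        yield from misestimated_nodes(child, threshold)

plan = {"node_type": "HashJoin", "estimated_rows": 100, "actual_rows": 50_000,
        "children": [{"node_type": "SeqScan", "estimated_rows": 1000,
                      "actual_rows": 1200, "children": []}]}
flagged = list(misestimated_nodes(plan))
# Only the join is flagged: 50,000 actual vs 100 estimated is a 500x miss.
```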
SQL Schema
CREATE TABLE TableStatistic (
    table_name TEXT PRIMARY KEY,
    row_count BIGINT NOT NULL,
    page_count BIGINT NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE ColumnStatistic (
    id BIGSERIAL PRIMARY KEY,
    table_name TEXT NOT NULL,
    column_name TEXT NOT NULL,
    cardinality BIGINT NOT NULL,
    null_fraction NUMERIC(5,4) NOT NULL DEFAULT 0,
    histogram JSONB,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (table_name, column_name)
);

CREATE TABLE QueryPlanCache (
    query_fingerprint TEXT PRIMARY KEY,
    plan JSONB NOT NULL,
    estimated_cost NUMERIC NOT NULL,
    hit_count BIGINT NOT NULL DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    invalidated_at TIMESTAMPTZ
);

CREATE TABLE SlowQueryLog (
    id BIGSERIAL PRIMARY KEY,
    query_hash TEXT NOT NULL,
    sql_text TEXT NOT NULL,
    plan JSONB NOT NULL,
    execution_ms INTEGER NOT NULL,
    captured_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX ON SlowQueryLog (execution_ms DESC);
CREATE INDEX ON SlowQueryLog (query_hash, captured_at DESC);
Python Implementation
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class PlanNode:
    node_type: str  # SeqScan, IndexScan, HashJoin, etc.
    table_name: str
    estimated_rows: float
    estimated_cost: float
    children: List["PlanNode"] = field(default_factory=list)
def estimate_cost(plan_node: PlanNode, stats: Dict[str, Any]) -> float:
    """Recursively estimate the total cost (I/O + CPU) of a plan subtree."""
    SEQ_PAGE_COST = 1.0
    RANDOM_PAGE_COST = 4.0
    CPU_TUPLE_COST = 0.01
    CPU_OPERATOR_COST = 0.0025

    ts = stats.get(plan_node.table_name, {})
    page_count = ts.get("page_count", 1000)
    row_count = ts.get("row_count", 10000)

    if plan_node.node_type == "SeqScan":
        io_cost = page_count * SEQ_PAGE_COST
        cpu_cost = row_count * CPU_TUPLE_COST
    elif plan_node.node_type == "IndexScan":
        # Pessimistic: one random heap page fetch per matched row.
        matched_rows = plan_node.estimated_rows
        io_cost = matched_rows * RANDOM_PAGE_COST
        cpu_cost = matched_rows * (CPU_TUPLE_COST + CPU_OPERATOR_COST)
    elif plan_node.node_type in ("HashJoin", "NestedLoopJoin", "MergeJoin"):
        # Join I/O is attributed to the children; the join itself is CPU.
        io_cost = 0.0
        cpu_cost = plan_node.estimated_rows * CPU_OPERATOR_COST
    else:
        io_cost = 0.0
        cpu_cost = plan_node.estimated_rows * CPU_TUPLE_COST

    child_costs = sum(estimate_cost(c, stats) for c in plan_node.children)
    return io_cost + cpu_cost + child_costs
def select_join_order(tables: List[str], predicates: List[Dict], stats: Dict) -> List[str]:
    """Greedy heuristic: at each step, join the remaining table with the
    fewest estimated rows.

    Simplification: base-table row count stands in for intermediate
    result size; a full implementation would use `predicates` to
    estimate join selectivity at each step.
    """
    if not tables:
        return []
    remaining = list(tables)
    order = [remaining.pop(0)]  # driving table: first listed
    while remaining:
        best = min(remaining, key=lambda t: stats.get(t, {}).get("row_count", float("inf")))
        order.append(best)
        remaining.remove(best)
    return order
def recommend_index(slow_query: Dict) -> Optional[str]:
    """Heuristic: if the plan contains a SeqScan on a large table with a
    WHERE predicate, recommend an index on the filtered column."""
    plan = slow_query.get("plan", {})
    table = plan.get("table_name")
    node_type = plan.get("node_type")
    filter_col = plan.get("filter_column")
    estimated_rows = plan.get("estimated_rows", 0)
    if node_type == "SeqScan" and estimated_rows > 10000 and table and filter_col:
        return f"CREATE INDEX ON {table} ({filter_col});"
    return None
Frequently Asked Questions
How does statistics staleness cause bad query plans?
The optimizer uses row count and histogram data to estimate selectivity. If a table has grown from 10,000 rows to 10 million rows since the last ANALYZE, the optimizer still plans for 10,000 rows. It may choose a nested loop join that is efficient for small tables but catastrophic at scale, or skip an index that would filter 99% of rows. Regular ANALYZE (triggered by autovacuum or explicit schedule) is the primary defense against plan regression from statistics drift.
When should a plan cache entry be invalidated?
Plan cache entries should be invalidated on: (1) DDL changes to any table referenced in the query (column added/dropped, index created/dropped); (2) significant statistics drift detected after ANALYZE (e.g., row count changed by more than 20%); (3) adaptive invalidation when actual vs estimated row counts diverge by more than an order of magnitude across multiple executions. Invalidation resets the fingerprint entry; the next execution re-optimizes and stores a new plan.
Why does join order enumeration become exponential and how is it controlled?
For N tables, the number of left-deep join trees is N!, and bushy trees are more numerous still. Dynamic programming reduces this to O(2^N * N) by solving subproblems once and reusing them, but it still becomes impractical above ~12 tables. Postgres switches to the GEQO genetic algorithm above geqo_threshold (default 12). For application queries, keeping JOIN counts below that threshold keeps exhaustive enumeration tractable, so the optimizer can find the plan that is optimal under its cost model.
What heuristics drive index recommendations from slow query logs?
The primary signal is a SeqScan on a large table (above a row count threshold) with a WHERE predicate on an unindexed column. Secondary signals are: high actual vs estimated row ratio (statistics mismatch), and queries that appear repeatedly in the slow log (high frequency). The recommended index targets the filter column; if multiple predicates exist, a composite index covering the highest-selectivity column first is recommended. Covering indexes (adding SELECT columns to the index) are recommended when the query is read-heavy and the column set is small.