What is a data warehouse and how does it differ from a database?

A data warehouse is a centralized repository designed for analytical query workloads (OLAP), aggregating large volumes of historical data from multiple source systems. Unlike an operational database (OLTP), which is optimized for low-latency reads and writes on current records, a data warehouse is optimized for complex aggregations, full-table scans, and reporting across years of data. Data warehouses are typically column-oriented, denormalized, and updated in batch loads rather than row-by-row transactions.

What is a star schema and why is it used in data warehouses?

A star schema organizes warehouse data into a central fact table surrounded by dimension tables. The fact table stores measurable events (e.g., sales, clicks) with foreign keys to dimensions (e.g., date, product, customer). It's used in data warehouses because it simplifies queries, reduces the number of joins needed, and is highly compatible with columnar storage engines. The denormalized structure trades storage for query speed, which is the right tradeoff in analytical workloads where reads dominate.

How does partition pruning improve data warehouse query performance?

Partition pruning allows the query engine to skip entire partitions of a table that cannot contain rows matching a WHERE clause predicate. For example, if a table is partitioned by date and a query filters on a single month, the engine reads only that month's partition files instead of scanning the full dataset. This dramatically reduces I/O and speeds up queries. Effective pruning requires that the partition key appears in the filter predicate and that statistics or metadata are available for the engine to identify irrelevant partitions at plan time.

How does a data warehouse handle slowly changing dimensions (SCD)?

Slowly changing dimensions (SCD) track how dimension attributes change over time. The most common strategies are: SCD Type 1, which overwrites the old value and loses history; SCD Type 2, which inserts a new row with a new surrogate key and validity date range, preserving full history; and SCD Type 3, which adds a column for the previous value, retaining only one prior state. SCD Type 2 is the most widely used in practice because it lets fact rows point to the exact dimension snapshot that was current at the time of the event, enabling accurate historical reporting.

Low Level Design: Data Warehouse

⏱ 9 min read

A data warehouse (DWH) is an analytical store that consolidates data from multiple operational systems, optimizes it for read-heavy analytical queries, and exposes it to BI tools and data scientists. This post walks through every layer of a production-grade design: schema modeling, ingestion pipeline, storage engine choices, query optimization, and access control.

Requirements

Functional

Ingest structured data from OLTP databases, event streams, and third-party APIs.
Support SQL analytics against billions of rows with sub-second to low-minute latency.
Provide historical point-in-time queries (slowly changing dimensions).
Enforce row-level and column-level security per team / role.

Non-Functional

Throughput: ingest 500 GB/day of raw data.
Storage: retain 5 years of history (multi-petabyte scale).
Query SLA: 95th-percentile analytical query < 30 s on 1 TB scans.
Availability: 99.9% for query serving; ingestion can tolerate short delays.

Data Model — Star Schema

The star schema separates measurements (fact tables) from descriptive context (dimension tables). One central fact table joins to several dimension tables, keeping joins shallow and query plans simple.

-- Dimension: date
CREATE TABLE dim_date (
    date_key        INT PRIMARY KEY,          -- YYYYMMDD surrogate
    full_date       DATE        NOT NULL,
    year            SMALLINT    NOT NULL,
    quarter         SMALLINT    NOT NULL,
    month           SMALLINT    NOT NULL,
    week_of_year    SMALLINT    NOT NULL,
    day_of_week     SMALLINT    NOT NULL,
    is_weekend      BOOLEAN     NOT NULL,
    is_holiday      BOOLEAN     NOT NULL DEFAULT FALSE
);

-- Dimension: user (SCD Type 2)
CREATE TABLE dim_user (
    user_sk         BIGSERIAL   PRIMARY KEY,  -- surrogate key
    user_nk         BIGINT      NOT NULL,     -- natural / business key
    email           TEXT        NOT NULL,
    country_code    CHAR(2)     NOT NULL,
    plan            VARCHAR(32) NOT NULL,
    row_status      VARCHAR(8)  NOT NULL CHECK (row_status IN ('active','expired')),
    valid_from      TIMESTAMPTZ NOT NULL,
    valid_to        TIMESTAMPTZ NOT NULL DEFAULT '9999-12-31',
    is_current      BOOLEAN     NOT NULL DEFAULT TRUE
);
CREATE INDEX idx_dim_user_nk ON dim_user(user_nk, is_current);

-- Dimension: product
CREATE TABLE dim_product (
    product_sk      INT         PRIMARY KEY,
    product_nk      INT         NOT NULL,
    sku             VARCHAR(64) NOT NULL,
    name            TEXT        NOT NULL,
    category        VARCHAR(128),
    brand           VARCHAR(128),
    unit_cost       NUMERIC(12,4),
    is_current      BOOLEAN     NOT NULL DEFAULT TRUE,
    valid_from      TIMESTAMPTZ NOT NULL,
    valid_to        TIMESTAMPTZ NOT NULL DEFAULT '9999-12-31'
);

-- Fact: orders (append-only, immutable)
CREATE TABLE fact_orders (
    order_sk        BIGSERIAL   PRIMARY KEY,
    order_nk        BIGINT      NOT NULL,     -- source system order ID
    date_key        INT         NOT NULL REFERENCES dim_date(date_key),
    user_sk         BIGINT      NOT NULL REFERENCES dim_user(user_sk),
    product_sk      INT         NOT NULL REFERENCES dim_product(product_sk),
    quantity        INT         NOT NULL,
    unit_price      NUMERIC(12,4) NOT NULL,
    discount_amt    NUMERIC(12,4) NOT NULL DEFAULT 0,
    gross_revenue   NUMERIC(14,4) GENERATED ALWAYS AS (quantity * unit_price) STORED,
    net_revenue     NUMERIC(14,4) GENERATED ALWAYS AS (quantity * unit_price - discount_amt) STORED,
    loaded_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
)
PARTITION BY RANGE (date_key);   -- monthly partitions

-- Monthly partition example
CREATE TABLE fact_orders_202501
    PARTITION OF fact_orders
    FOR VALUES FROM (20250101) TO (20250201);

Slowly Changing Dimensions (SCD Type 2)

When a user upgrades their plan the old row gets valid_to = NOW() and is_current = FALSE; a new row is inserted with the updated plan. The fact table always stores the surrogate key (user_sk) at the time of the event, preserving historical accuracy without rewriting fact rows.

ETL / ELT Ingestion Pipeline

Source Systems
    |
    v
[Extract Layer]
  - CDC from Postgres via Debezium (binlog streaming → Kafka topic)
  - REST API pollers (cron-based micro-batch, 5-min cadence)
  - File drops (S3 landing zone, event-triggered)
    |
    v
[Raw / Bronze Layer]  (append-only Parquet on S3 / GCS)
  - Schema-on-read; no transformations
  - Retained 90 days for reprocessing
    |
    v
[Transform / Silver Layer]  (dbt models or Spark jobs)
  - Deduplicate by (source_id, event_ts)
  - Cast types, validate nulls & ranges
  - Resolve surrogate keys (dim lookups)
  - Apply SCD Type 2 logic for dimensions
    |
    v
[Serve / Gold Layer]  (columnar warehouse: BigQuery / Snowflake / Redshift)
  - Star-schema tables loaded via COPY or streaming insert
  - Materialized aggregates (daily/weekly roll-ups)
    |
    v
[BI Tools / Ad-hoc SQL / Data Science]

Idempotent Loads

Every load job writes to a staging table first, then runs a MERGE (upsert) into the target. The merge key for fact tables is order_nk. This makes reruns safe: duplicate messages from Kafka produce no duplicate rows.

-- Idempotent upsert pattern (BigQuery-style)
MERGE fact_orders AS target
USING staging_fact_orders AS src
  ON target.order_nk = src.order_nk
WHEN NOT MATCHED THEN
  INSERT (order_nk, date_key, user_sk, product_sk, quantity, unit_price, discount_amt, loaded_at)
  VALUES (src.order_nk, src.date_key, src.user_sk, src.product_sk,
          src.quantity, src.unit_price, src.discount_amt, NOW());

Storage Engine & Physical Design

Columnar Storage

Analytical queries typically read 3–5 columns out of 50+. Row-oriented storage reads entire rows; columnar formats (Parquet, ORC) read only the requested columns. Compression ratios for numeric columns routinely reach 10:1 with dictionary encoding and run-length encoding (RLE).

Partition Pruning

Partitioning fact tables by date_key (monthly) lets the query planner skip entire files when a WHERE clause constrains the date range. A query for January 2025 touches only the January partition file set rather than the full table.

Clustering / Sort Keys

Within each partition, rows are sorted by (user_sk, product_sk). Queries filtering on user or product scan far fewer row groups because the min/max statistics in Parquet row-group footers eliminate non-matching groups without reading data.

Materialized Aggregates

CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT
    date_key,
    product_sk,
    SUM(gross_revenue)  AS total_gross,
    SUM(net_revenue)    AS total_net,
    COUNT(DISTINCT user_sk) AS unique_buyers
FROM fact_orders
GROUP BY date_key, product_sk;
-- Refreshed nightly; serves 80% of dashboard queries

Query Optimizer Design

Statistics collection: run ANALYZE after each major load to refresh column histograms and NDV (number of distinct values) estimates.
Join ordering: always join the large fact table last; broadcast small dimension tables (< 100 MB) to all workers to avoid shuffle.
Predicate pushdown: push WHERE filters into the scan operator before any join or aggregation. The columnar reader uses row-group min/max statistics to skip blocks entirely.
Result caching: cache deterministic query results keyed by SQL hash + partition snapshot ID. Dashboard re-renders hit the cache rather than the warehouse engine.

Access Control

-- Role hierarchy
CREATE ROLE analyst;
CREATE ROLE senior_analyst IN ROLE analyst;
CREATE ROLE data_engineer  IN ROLE senior_analyst;

-- Column-level: hide PII from analysts
GRANT SELECT (order_sk, date_key, product_sk, gross_revenue, net_revenue)
    ON fact_orders TO analyst;

-- Row-level security: each team sees only their region
CREATE POLICY region_isolation ON fact_orders
    USING (region_code = current_setting('app.user_region'));
ALTER TABLE fact_orders ENABLE ROW LEVEL SECURITY;

Key Design Decisions & Trade-offs

Decision	Choice	Alternative	Reason
Schema style	Star schema	Data vault / OBT	Simpler queries, easier BI tool integration
SCD strategy	Type 2 (new rows)	Type 1 (overwrite)	Preserves full history for auditing
Ingestion mode	CDC + micro-batch	Full nightly dump	Fresher data, less I/O on source DBs
Storage format	Parquet (columnar)	CSV / JSON	10x compression, 5–100x faster analytic scans
Partition key	date_key (month)	hash partitioning	Query patterns are almost always time-bounded

Failure Handling & Edge Cases

Late-arriving data: CDC events can arrive out of order. The silver layer deduplicates on (source_id, event_ts) and uses the highest event_ts seen per key. Partitions older than 24 h are re-merged nightly to absorb late records.
Schema drift: source tables add columns over time. The raw layer stores JSON blobs alongside typed columns; a schema registry (Confluent / AWS Glue) tracks versions and emits alerts on breaking changes.

Pipeline failure mid-load:

Surrogate key resolution failure: if a fact row references an unknown dimension key the row is routed to a quarantine table with an error code for manual triage rather than failing the entire batch.
Partition explosion: high-cardinality partition keys (e.g., user_id) create millions of small files. Enforce time-based partition keys only; use clustering/sort keys for high-cardinality columns.

Scalability Considerations

Horizontal compute scaling: separate storage (S3/GCS) from compute (Spark / BigQuery slots). Spin up workers only during transformation windows; idle clusters cost nothing.
Incremental models: dbt incremental models process only new/changed source rows rather than full table scans, keeping transform runtimes sub-linear as data grows.
Compaction jobs: small-file accumulation degrades scan performance. A nightly compaction job rewrites each partition into a small number of optimally-sized Parquet files (128–256 MB each).
Tiered storage: move partitions older than 2 years to cheaper object storage tiers (e.g., S3 Glacier Instant Retrieval). Query routing logic skips the cold tier unless the query explicitly requests historical ranges.
Concurrency control: a query queue with priority classes (dashboard > ad-hoc > batch export) prevents runaway queries from starving BI dashboards. Slot reservations guarantee minimum throughput per team.

Summary

A production data warehouse rests on four pillars: a star schema that keeps analytical SQL simple, a CDC-driven ELT pipeline that delivers fresh data with idempotent loads, columnar partitioned storage that makes large scans fast, and layered access control that enforces data governance. The hardest operational challenges are late-arriving data, schema drift, and small-file accumulation — each requires an explicit mitigation strategy built into the pipeline from day one. Get those right and the warehouse scales from gigabytes to petabytes with minimal architectural change.