A data warehouse (DWH) is an analytical store that consolidates data from multiple operational systems, optimizes it for read-heavy analytical queries, and exposes it to BI tools and data scientists. This post walks through every layer of a production-grade design: schema modeling, ingestion pipeline, storage engine choices, query optimization, and access control.
Requirements
Functional
- Ingest structured data from OLTP databases, event streams, and third-party APIs.
- Support SQL analytics against billions of rows with sub-second to low-minute latency.
- Provide historical point-in-time queries (slowly changing dimensions).
- Enforce row-level and column-level security per team / role.
Non-Functional
- Throughput: ingest 500 GB/day of raw data.
- Storage: retain 5 years of history (multi-petabyte scale).
- Query SLA: 95th-percentile analytical query < 30 s on 1 TB scans.
- Availability: 99.9% for query serving; ingestion can tolerate short delays.
Data Model — Star Schema
The star schema separates measurements (fact tables) from descriptive context (dimension tables). One central fact table joins to several dimension tables, keeping joins shallow and query plans simple.
-- Dimension: date
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- YYYYMMDD surrogate
full_date DATE NOT NULL,
year SMALLINT NOT NULL,
quarter SMALLINT NOT NULL,
month SMALLINT NOT NULL,
week_of_year SMALLINT NOT NULL,
day_of_week SMALLINT NOT NULL,
is_weekend BOOLEAN NOT NULL,
is_holiday BOOLEAN NOT NULL DEFAULT FALSE
);
-- Dimension: user (SCD Type 2)
CREATE TABLE dim_user (
user_sk BIGSERIAL PRIMARY KEY, -- surrogate key
user_nk BIGINT NOT NULL, -- natural / business key
email TEXT NOT NULL,
country_code CHAR(2) NOT NULL,
plan VARCHAR(32) NOT NULL,
row_status VARCHAR(8) NOT NULL CHECK (row_status IN ('active','expired')),
valid_from TIMESTAMPTZ NOT NULL,
valid_to TIMESTAMPTZ NOT NULL DEFAULT '9999-12-31',
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
CREATE INDEX idx_dim_user_nk ON dim_user(user_nk, is_current);
-- Dimension: product
CREATE TABLE dim_product (
product_sk INT PRIMARY KEY,
product_nk INT NOT NULL,
sku VARCHAR(64) NOT NULL,
name TEXT NOT NULL,
category VARCHAR(128),
brand VARCHAR(128),
unit_cost NUMERIC(12,4),
is_current BOOLEAN NOT NULL DEFAULT TRUE,
valid_from TIMESTAMPTZ NOT NULL,
valid_to TIMESTAMPTZ NOT NULL DEFAULT '9999-12-31'
);
-- Fact: orders (append-only, immutable)
CREATE TABLE fact_orders (
order_sk BIGSERIAL PRIMARY KEY,
order_nk BIGINT NOT NULL, -- source system order ID
date_key INT NOT NULL REFERENCES dim_date(date_key),
user_sk BIGINT NOT NULL REFERENCES dim_user(user_sk),
product_sk INT NOT NULL REFERENCES dim_product(product_sk),
quantity INT NOT NULL,
unit_price NUMERIC(12,4) NOT NULL,
discount_amt NUMERIC(12,4) NOT NULL DEFAULT 0,
gross_revenue NUMERIC(14,4) GENERATED ALWAYS AS (quantity * unit_price) STORED,
net_revenue NUMERIC(14,4) GENERATED ALWAYS AS (quantity * unit_price - discount_amt) STORED,
loaded_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
)
PARTITION BY RANGE (date_key); -- monthly partitions
-- Monthly partition example
CREATE TABLE fact_orders_202501
PARTITION OF fact_orders
FOR VALUES FROM (20250101) TO (20250201);
Slowly Changing Dimensions (SCD Type 2)
When a user upgrades their plan the old row gets valid_to = NOW() and is_current = FALSE; a new row is inserted with the updated plan. The fact table always stores the surrogate key (user_sk) at the time of the event, preserving historical accuracy without rewriting fact rows.
ETL / ELT Ingestion Pipeline
Source Systems
|
v
[Extract Layer]
- CDC from Postgres via Debezium (binlog streaming → Kafka topic)
- REST API pollers (cron-based micro-batch, 5-min cadence)
- File drops (S3 landing zone, event-triggered)
|
v
[Raw / Bronze Layer] (append-only Parquet on S3 / GCS)
- Schema-on-read; no transformations
- Retained 90 days for reprocessing
|
v
[Transform / Silver Layer] (dbt models or Spark jobs)
- Deduplicate by (source_id, event_ts)
- Cast types, validate nulls & ranges
- Resolve surrogate keys (dim lookups)
- Apply SCD Type 2 logic for dimensions
|
v
[Serve / Gold Layer] (columnar warehouse: BigQuery / Snowflake / Redshift)
- Star-schema tables loaded via COPY or streaming insert
- Materialized aggregates (daily/weekly roll-ups)
|
v
[BI Tools / Ad-hoc SQL / Data Science]
Idempotent Loads
Every load job writes to a staging table first, then runs a MERGE (upsert) into the target. The merge key for fact tables is order_nk. This makes reruns safe: duplicate messages from Kafka produce no duplicate rows.
-- Idempotent upsert pattern (BigQuery-style)
MERGE fact_orders AS target
USING staging_fact_orders AS src
ON target.order_nk = src.order_nk
WHEN NOT MATCHED THEN
INSERT (order_nk, date_key, user_sk, product_sk, quantity, unit_price, discount_amt, loaded_at)
VALUES (src.order_nk, src.date_key, src.user_sk, src.product_sk,
src.quantity, src.unit_price, src.discount_amt, NOW());
Storage Engine & Physical Design
Columnar Storage
Analytical queries typically read 3–5 columns out of 50+. Row-oriented storage reads entire rows; columnar formats (Parquet, ORC) read only the requested columns. Compression ratios for numeric columns routinely reach 10:1 with dictionary encoding and run-length encoding (RLE).
Partition Pruning
Partitioning fact tables by date_key (monthly) lets the query planner skip entire files when a WHERE clause constrains the date range. A query for January 2025 touches only the January partition file set rather than the full table.
Clustering / Sort Keys
Within each partition, rows are sorted by (user_sk, product_sk). Queries filtering on user or product scan far fewer row groups because the min/max statistics in Parquet row-group footers eliminate non-matching groups without reading data.
Materialized Aggregates
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT
date_key,
product_sk,
SUM(gross_revenue) AS total_gross,
SUM(net_revenue) AS total_net,
COUNT(DISTINCT user_sk) AS unique_buyers
FROM fact_orders
GROUP BY date_key, product_sk;
-- Refreshed nightly; serves 80% of dashboard queries
Query Optimizer Design
- Statistics collection: run ANALYZE after each major load to refresh column histograms and NDV (number of distinct values) estimates.
- Join ordering: always join the large fact table last; broadcast small dimension tables (< 100 MB) to all workers to avoid shuffle.
- Predicate pushdown: push WHERE filters into the scan operator before any join or aggregation. The columnar reader uses row-group min/max statistics to skip blocks entirely.
- Result caching: cache deterministic query results keyed by SQL hash + partition snapshot ID. Dashboard re-renders hit the cache rather than the warehouse engine.
Access Control
-- Role hierarchy
CREATE ROLE analyst;
CREATE ROLE senior_analyst IN ROLE analyst;
CREATE ROLE data_engineer IN ROLE senior_analyst;
-- Column-level: hide PII from analysts
GRANT SELECT (order_sk, date_key, product_sk, gross_revenue, net_revenue)
ON fact_orders TO analyst;
-- Row-level security: each team sees only their region
CREATE POLICY region_isolation ON fact_orders
USING (region_code = current_setting('app.user_region'));
ALTER TABLE fact_orders ENABLE ROW LEVEL SECURITY;
Key Design Decisions & Trade-offs
| Decision | Choice | Alternative | Reason |
|---|---|---|---|
| Schema style | Star schema | Data vault / OBT | Simpler queries, easier BI tool integration |
| SCD strategy | Type 2 (new rows) | Type 1 (overwrite) | Preserves full history for auditing |
| Ingestion mode | CDC + micro-batch | Full nightly dump | Fresher data, less I/O on source DBs |
| Storage format | Parquet (columnar) | CSV / JSON | 10x compression, 5–100x faster analytic scans |
| Partition key | date_key (month) | hash partitioning | Query patterns are almost always time-bounded |
Failure Handling & Edge Cases
- Late-arriving data: CDC events can arrive out of order. The silver layer deduplicates on
(source_id, event_ts)and uses the highestevent_tsseen per key. Partitions older than 24 h are re-merged nightly to absorb late records. - Schema drift: source tables add columns over time. The raw layer stores JSON blobs alongside typed columns; a schema registry (Confluent / AWS Glue) tracks versions and emits alerts on breaking changes.
- Surrogate key resolution failure: if a fact row references an unknown dimension key the row is routed to a quarantine table with an error code for manual triage rather than failing the entire batch.
- Partition explosion: high-cardinality partition keys (e.g., user_id) create millions of small files. Enforce time-based partition keys only; use clustering/sort keys for high-cardinality columns.
- Horizontal compute scaling: separate storage (S3/GCS) from compute (Spark / BigQuery slots). Spin up workers only during transformation windows; idle clusters cost nothing.
- Incremental models: dbt incremental models process only new/changed source rows rather than full table scans, keeping transform runtimes sub-linear as data grows.
- Compaction jobs: small-file accumulation degrades scan performance. A nightly compaction job rewrites each partition into a small number of optimally-sized Parquet files (128–256 MB each).
- Tiered storage: move partitions older than 2 years to cheaper object storage tiers (e.g., S3 Glacier Instant Retrieval). Query routing logic skips the cold tier unless the query explicitly requests historical ranges.
- Concurrency control: a query queue with priority classes (dashboard > ad-hoc > batch export) prevents runaway queries from starving BI dashboards. Slot reservations guarantee minimum throughput per team.
Scalability Considerations
Summary
A production data warehouse rests on four pillars: a star schema that keeps analytical SQL simple, a CDC-driven ELT pipeline that delivers fresh data with idempotent loads, columnar partitioned storage that makes large scans fast, and layered access control that enforces data governance. The hardest operational challenges are late-arriving data, schema drift, and small-file accumulation — each requires an explicit mitigation strategy built into the pipeline from day one. Get those right and the warehouse scales from gigabytes to petabytes with minimal architectural change.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture