A data warehouse (DWH) is an analytical store that consolidates data from multiple operational systems, optimizes it for read-heavy analytical queries, and exposes it to BI tools and data scientists. This post walks through every layer of a production-grade design: schema modeling, ingestion pipeline, storage engine choices, query optimization, and access control.
Requirements
Functional
- Ingest structured data from OLTP databases, event streams, and third-party APIs.
- Support SQL analytics against billions of rows with sub-second to low-minute latency.
- Provide historical point-in-time queries (slowly changing dimensions).
- Enforce row-level and column-level security per team / role.
Non-Functional
- Throughput: ingest 500 GB/day of raw data.
- Storage: retain 5 years of history (multi-petabyte scale).
- Query SLA: 95th-percentile analytical query < 30 s on 1 TB scans.
- Availability: 99.9% for query serving; ingestion can tolerate short delays.
Data Model — Star Schema
The star schema separates measurements (fact tables) from descriptive context (dimension tables). One central fact table joins to several dimension tables, keeping joins shallow and query plans simple.
-- Dimension: date
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- YYYYMMDD surrogate
full_date DATE NOT NULL,
year SMALLINT NOT NULL,
quarter SMALLINT NOT NULL,
month SMALLINT NOT NULL,
week_of_year SMALLINT NOT NULL,
day_of_week SMALLINT NOT NULL,
is_weekend BOOLEAN NOT NULL,
is_holiday BOOLEAN NOT NULL DEFAULT FALSE
);
-- Dimension: user (SCD Type 2)
CREATE TABLE dim_user (
user_sk BIGSERIAL PRIMARY KEY, -- surrogate key
user_nk BIGINT NOT NULL, -- natural / business key
email TEXT NOT NULL,
country_code CHAR(2) NOT NULL,
plan VARCHAR(32) NOT NULL,
row_status VARCHAR(8) NOT NULL CHECK (row_status IN ('active','expired')),
valid_from TIMESTAMPTZ NOT NULL,
valid_to TIMESTAMPTZ NOT NULL DEFAULT '9999-12-31',
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
CREATE INDEX idx_dim_user_nk ON dim_user(user_nk, is_current);
-- Dimension: product
CREATE TABLE dim_product (
product_sk INT PRIMARY KEY,
product_nk INT NOT NULL,
sku VARCHAR(64) NOT NULL,
name TEXT NOT NULL,
category VARCHAR(128),
brand VARCHAR(128),
unit_cost NUMERIC(12,4),
is_current BOOLEAN NOT NULL DEFAULT TRUE,
valid_from TIMESTAMPTZ NOT NULL,
valid_to TIMESTAMPTZ NOT NULL DEFAULT '9999-12-31'
);
-- Fact: orders (append-only, immutable)
CREATE TABLE fact_orders (
order_sk BIGSERIAL,
order_nk BIGINT NOT NULL, -- source system order ID
date_key INT NOT NULL REFERENCES dim_date(date_key),
user_sk BIGINT NOT NULL REFERENCES dim_user(user_sk),
product_sk INT NOT NULL REFERENCES dim_product(product_sk),
quantity INT NOT NULL,
unit_price NUMERIC(12,4) NOT NULL,
discount_amt NUMERIC(12,4) NOT NULL DEFAULT 0,
gross_revenue NUMERIC(14,4) GENERATED ALWAYS AS (quantity * unit_price) STORED,
net_revenue NUMERIC(14,4) GENERATED ALWAYS AS (quantity * unit_price - discount_amt) STORED,
loaded_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (order_sk, date_key) -- PostgreSQL requires the partition key in the PK
)
PARTITION BY RANGE (date_key); -- monthly partitions
-- Monthly partition example
CREATE TABLE fact_orders_202501
PARTITION OF fact_orders
FOR VALUES FROM (20250101) TO (20250201);
Slowly Changing Dimensions (SCD Type 2)
When a user upgrades their plan, the old row gets valid_to = NOW() and is_current = FALSE, and a new row is inserted with the updated plan. The fact table always stores the surrogate key (user_sk) that was current at the time of the event, preserving historical accuracy without rewriting fact rows.
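The expire-and-insert step can be sketched with an in-memory SQLite table standing in for dim_user. The upgrade_plan helper, the column subset, and the sample values are illustrative, not part of the schema above:

```python
import sqlite3

# Simplified dim_user: only the columns needed to show SCD Type 2 mechanics.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_user (
        user_sk    INTEGER PRIMARY KEY AUTOINCREMENT,
        user_nk    INTEGER NOT NULL,
        plan       TEXT NOT NULL,
        valid_from TEXT NOT NULL,
        valid_to   TEXT NOT NULL DEFAULT '9999-12-31',
        is_current INTEGER NOT NULL DEFAULT 1
    )
""")
conn.execute(
    "INSERT INTO dim_user (user_nk, plan, valid_from) VALUES (42, 'free', '2024-01-01')"
)

def upgrade_plan(conn, user_nk, new_plan, change_ts):
    """Expire the current row, then insert a new current row (SCD Type 2)."""
    conn.execute(
        "UPDATE dim_user SET valid_to = ?, is_current = 0 "
        "WHERE user_nk = ? AND is_current = 1",
        (change_ts, user_nk),
    )
    conn.execute(
        "INSERT INTO dim_user (user_nk, plan, valid_from) VALUES (?, ?, ?)",
        (user_nk, new_plan, change_ts),
    )

upgrade_plan(conn, 42, "pro", "2025-06-01")
rows = conn.execute(
    "SELECT plan, is_current FROM dim_user WHERE user_nk = 42 ORDER BY user_sk"
).fetchall()
print(rows)  # [('free', 0), ('pro', 1)] — history kept, one current row
```

Any fact row loaded before the upgrade keeps pointing at the 'free' row's surrogate key, which is exactly what makes point-in-time reporting work.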
ETL / ELT Ingestion Pipeline
Source Systems
|
v
[Extract Layer]
- CDC from Postgres via Debezium (WAL logical decoding → Kafka topic)
- REST API pollers (cron-based micro-batch, 5-min cadence)
- File drops (S3 landing zone, event-triggered)
|
v
[Raw / Bronze Layer] (append-only Parquet on S3 / GCS)
- Schema-on-read; no transformations
- Retained 90 days for reprocessing
|
v
[Transform / Silver Layer] (dbt models or Spark jobs)
- Deduplicate by (source_id, event_ts)
- Cast types, validate nulls & ranges
- Resolve surrogate keys (dim lookups)
- Apply SCD Type 2 logic for dimensions
|
v
[Serve / Gold Layer] (columnar warehouse: BigQuery / Snowflake / Redshift)
- Star-schema tables loaded via COPY or streaming insert
- Materialized aggregates (daily/weekly roll-ups)
|
v
[BI Tools / Ad-hoc SQL / Data Science]
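The silver-layer dedup rule from the pipeline above (one row per key, highest event_ts wins) can be sketched in Python; the dedupe helper and the record fields are illustrative:

```python
# Keep one record per source_id, preferring the highest event_ts seen,
# so late or out-of-order CDC events converge on the latest state.
def dedupe(records):
    latest = {}
    for rec in records:
        key = rec["source_id"]
        if key not in latest or rec["event_ts"] > latest[key]["event_ts"]:
            latest[key] = rec
    return list(latest.values())

events = [
    {"source_id": 1, "event_ts": 100, "status": "placed"},
    {"source_id": 1, "event_ts": 300, "status": "shipped"},
    {"source_id": 1, "event_ts": 200, "status": "paid"},  # late arrival, superseded
    {"source_id": 2, "event_ts": 150, "status": "placed"},
]
print(dedupe(events))  # two rows: order 1 at 'shipped', order 2 at 'placed'
```

In dbt or Spark this is typically a window function (ROW_NUMBER over the key, ordered by event_ts descending), but the selection logic is the same.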
Idempotent Loads
Every load job writes to a staging table first, then runs a MERGE (upsert) into the target. The merge key for fact tables is order_nk. This makes reruns safe: duplicate messages from Kafka produce no duplicate rows.
-- Idempotent upsert pattern (BigQuery-style)
MERGE fact_orders AS target
USING staging_fact_orders AS src
ON target.order_nk = src.order_nk
WHEN NOT MATCHED THEN
INSERT (order_nk, date_key, user_sk, product_sk, quantity, unit_price, discount_amt, loaded_at)
VALUES (src.order_nk, src.date_key, src.user_sk, src.product_sk,
src.quantity, src.unit_price, src.discount_amt, CURRENT_TIMESTAMP());
Storage Engine & Physical Design
Columnar Storage
Analytical queries typically read 3–5 columns out of 50+. Row-oriented storage reads entire rows; columnar formats (Parquet, ORC) read only the requested columns. Compression ratios for numeric columns routinely reach 10:1 with dictionary encoding and run-length encoding (RLE).
Partition Pruning
Partitioning fact tables by date_key (monthly) lets the query planner skip entire files when a WHERE clause constrains the date range. A query for January 2025 touches only the January partition file set rather than the full table.
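Planner-side pruning is just interval overlap against partition metadata. A minimal sketch, with hypothetical partition names and a simplified (min, max) metadata shape:

```python
# Each monthly partition carries min/max date_key metadata; partitions whose
# range cannot overlap the query predicate are skipped without any I/O.
partitions = {
    "fact_orders_202412": (20241201, 20241231),
    "fact_orders_202501": (20250101, 20250131),
    "fact_orders_202502": (20250201, 20250228),
}

def prune(partitions, lo, hi):
    """Return only the partitions whose [min, max] range overlaps [lo, hi]."""
    return [name for name, (pmin, pmax) in partitions.items()
            if pmax >= lo and pmin <= hi]

# WHERE date_key BETWEEN 20250101 AND 20250131 → only January is scanned
print(prune(partitions, 20250101, 20250131))
```

Pruning only fires when the partition key appears in the predicate; a filter on order_nk alone would still scan every partition.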
Clustering / Sort Keys
Within each partition, rows are sorted by (user_sk, product_sk). Queries filtering on user or product scan far fewer row groups because the min/max statistics in Parquet row-group footers eliminate non-matching groups without reading data.
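The same skip-by-statistics idea applies one level down. The sketch below uses a simplified stand-in for Parquet row-group footer metadata (the dict shape is illustrative, not the actual Parquet API):

```python
# Because rows are sorted by user_sk within a partition, each row group's
# min/max tightly bounds the column, and a point filter eliminates most
# groups before any data pages are read.
row_groups = [
    {"file_offset": 0,       "user_sk_min": 1,     "user_sk_max": 4999},
    {"file_offset": 1 << 20, "user_sk_min": 5000,  "user_sk_max": 9999},
    {"file_offset": 2 << 20, "user_sk_min": 10000, "user_sk_max": 14999},
]

def groups_to_read(row_groups, user_sk):
    return [g for g in row_groups
            if g["user_sk_min"] <= user_sk <= g["user_sk_max"]]

hits = groups_to_read(row_groups, 7321)
print(len(hits))  # only 1 of 3 row groups is read
```

Without the sort, every row group's min/max would span nearly the full user_sk domain and nothing could be skipped, which is why clustering and statistics work as a pair.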
Materialized Aggregates
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT
date_key,
product_sk,
SUM(gross_revenue) AS total_gross,
SUM(net_revenue) AS total_net,
COUNT(DISTINCT user_sk) AS unique_buyers
FROM fact_orders
GROUP BY date_key, product_sk;
-- Refreshed nightly; serves 80% of dashboard queries
Query Optimizer Design
- Statistics collection: run ANALYZE after each major load to refresh column histograms and NDV (number of distinct values) estimates.
- Join ordering: always join the large fact table last; broadcast small dimension tables (< 100 MB) to all workers to avoid shuffle.
- Predicate pushdown: push WHERE filters into the scan operator before any join or aggregation. The columnar reader uses row-group min/max statistics to skip blocks entirely.
- Result caching: cache deterministic query results keyed by SQL hash + partition snapshot ID. Dashboard re-renders hit the cache rather than the warehouse engine.
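The result-cache key from the last bullet can be sketched as follows. The whitespace/case normalization here is a simplification (production engines normalize the parsed AST), and the snapshot IDs are hypothetical:

```python
import hashlib

# Cache key = digest of normalized SQL + partition snapshot ID, so the cache
# invalidates automatically when a new load changes the snapshot ID.
def cache_key(sql: str, snapshot_id: str) -> str:
    normalized = " ".join(sql.split()).lower()
    return hashlib.sha256(f"{normalized}|{snapshot_id}".encode()).hexdigest()

k1 = cache_key("SELECT date_key, SUM(net_revenue) FROM fact_orders GROUP BY 1", "snap-001")
k2 = cache_key("select date_key,  sum(net_revenue) from fact_orders group by 1", "snap-001")
k3 = cache_key("SELECT date_key, SUM(net_revenue) FROM fact_orders GROUP BY 1", "snap-002")
print(k1 == k2, k1 == k3)  # True False — same query hits; new data misses
```

Keying on the snapshot rather than a TTL means dashboards never serve stale results after a load, yet identical re-renders between loads cost nothing.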
Access Control
-- Role hierarchy
CREATE ROLE analyst;
CREATE ROLE senior_analyst IN ROLE analyst;
CREATE ROLE data_engineer IN ROLE senior_analyst;
-- Column-level: hide PII from analysts
GRANT SELECT (order_sk, date_key, product_sk, gross_revenue, net_revenue)
ON fact_orders TO analyst;
-- Row-level security: each team sees only their region
-- (assumes a region_code column on fact_orders, omitted from the DDL above)
CREATE POLICY region_isolation ON fact_orders
USING (region_code = current_setting('app.user_region'));
ALTER TABLE fact_orders ENABLE ROW LEVEL SECURITY;
Key Design Decisions & Trade-offs
| Decision | Choice | Alternative | Reason |
|---|---|---|---|
| Schema style | Star schema | Data vault / OBT | Simpler queries, easier BI tool integration |
| SCD strategy | Type 2 (new rows) | Type 1 (overwrite) | Preserves full history for auditing |
| Ingestion mode | CDC + micro-batch | Full nightly dump | Fresher data, less I/O on source DBs |
| Storage format | Parquet (columnar) | CSV / JSON | 10x compression, 5–100x faster analytic scans |
| Partition key | date_key (month) | hash partitioning | Query patterns are almost always time-bounded |
Failure Handling & Edge Cases
- Late-arriving data: CDC events can arrive out of order. The silver layer deduplicates on (source_id, event_ts) and uses the highest event_ts seen per key. Partitions older than 24 h are re-merged nightly to absorb late records.
- Schema drift: source tables add columns over time. The raw layer stores JSON blobs alongside typed columns; a schema registry (Confluent / AWS Glue) tracks versions and emits alerts on breaking changes.
- Surrogate key resolution failure: if a fact row references an unknown dimension key the row is routed to a quarantine table with an error code for manual triage rather than failing the entire batch.
- Partition explosion: high-cardinality partition keys (e.g., user_id) create millions of small files. Enforce time-based partition keys only; use clustering/sort keys for high-cardinality columns.
Scalability Considerations
- Horizontal compute scaling: separate storage (S3/GCS) from compute (Spark / BigQuery slots). Spin up workers only during transformation windows; idle clusters cost nothing.
- Incremental models: dbt incremental models process only new/changed source rows rather than full table scans, keeping transform runtimes sub-linear as data grows.
- Compaction jobs: small-file accumulation degrades scan performance. A nightly compaction job rewrites each partition into a small number of optimally sized Parquet files (128–256 MB each).
- Tiered storage: move partitions older than 2 years to cheaper object storage tiers (e.g., S3 Glacier Instant Retrieval). Query routing logic skips the cold tier unless the query explicitly requests historical ranges.
- Concurrency control: a query queue with priority classes (dashboard > ad-hoc > batch export) prevents runaway queries from starving BI dashboards. Slot reservations guarantee minimum throughput per team.
Summary
A production data warehouse rests on four pillars: a star schema that keeps analytical SQL simple, a CDC-driven ELT pipeline that delivers fresh data with idempotent loads, columnar partitioned storage that makes large scans fast, and layered access control that enforces data governance. The hardest operational challenges are late-arriving data, schema drift, and small-file accumulation — each requires an explicit mitigation strategy built into the pipeline from day one. Get those right and the warehouse scales from gigabytes to petabytes with minimal architectural change.
FAQ
Q: What is a data warehouse and how does it differ from a database?
A: A data warehouse is a centralized repository designed for analytical query workloads (OLAP), aggregating large volumes of historical data from multiple source systems. Unlike an operational database (OLTP), which is optimized for low-latency reads and writes on current records, a data warehouse is optimized for complex aggregations, full-table scans, and reporting across years of data. Data warehouses are typically column-oriented, denormalized, and updated in batch loads rather than row-by-row transactions.
Q: What is a star schema and why is it used in data warehouses?
A: A star schema organizes warehouse data into a central fact table surrounded by dimension tables. The fact table stores measurable events (e.g., sales, clicks) with foreign keys to dimensions (e.g., date, product, customer). It simplifies queries, reduces the number of joins needed, and is highly compatible with columnar storage engines. The denormalized structure trades storage for query speed, which is the right tradeoff in analytical workloads where reads dominate.
Q: How does partition pruning improve data warehouse query performance?
A: Partition pruning allows the query engine to skip entire partitions of a table that cannot contain rows matching a WHERE clause predicate. For example, if a table is partitioned by date and a query filters on a single month, the engine reads only that month's partition files instead of scanning the full dataset. This dramatically reduces I/O and speeds up queries. Effective pruning requires that the partition key appears in the filter predicate and that statistics or metadata are available for the engine to identify irrelevant partitions at plan time.
Q: How does a data warehouse handle slowly changing dimensions (SCD)?
A: Slowly changing dimensions (SCD) track how dimension attributes change over time. The most common strategies are: SCD Type 1, which overwrites the old value and loses history; SCD Type 2, which inserts a new row with a new surrogate key and validity date range, preserving full history; and SCD Type 3, which adds a column for the previous value, retaining only one prior state. SCD Type 2 is the most widely used in practice because it lets fact rows point to the exact dimension snapshot that was current at the time of the event, enabling accurate historical reporting.