Low Level Design: Data Warehouse Design

A data warehouse is an analytical database optimized for read-heavy, aggregation-heavy workloads over large historical datasets. Unlike OLTP databases designed for transactional point lookups, a data warehouse stores denormalized fact and dimension tables in columnar format, enabling queries that scan billions of rows in seconds. This design appears in analytics platforms, business intelligence tools, and data lakehouse architectures.

Star Schema: Fact and Dimension Tables

The star schema organizes data into a central fact table (events with measurable quantities) surrounded by dimension tables (descriptive attributes). A fact table for e-commerce orders would store: order_id, customer_id, product_id, date_id, quantity, revenue. Dimension tables hold slowly changing attributes: customer name/segment, product category, date hierarchy (year/quarter/month/week/day). Joins stay star-shaped: the fact table joins directly to each of N dimension tables, avoiding the multi-hop join chains that degrade query performance.

-- Star schema example
CREATE TABLE fact_orders (
  order_id     BIGINT,
  customer_id  INT,      -- FK to dim_customers
  product_id   INT,      -- FK to dim_products
  date_id      INT,      -- FK to dim_date
  quantity     INT,
  revenue_usd  DECIMAL(12,2),
  cost_usd     DECIMAL(12,2)
) WITH (orientation=column, compress=YES);  -- columnar storage (engine-specific option; syntax varies)

-- Dimension table
CREATE TABLE dim_customers (
  customer_id  INT PRIMARY KEY,
  name         VARCHAR(255),
  country      VARCHAR(64),
  segment      VARCHAR(32),   -- enterprise, smb, consumer
  acquired_at  DATE
);

-- Analytical query: revenue by country, last 90 days
SELECT c.country, SUM(f.revenue_usd) AS revenue
FROM fact_orders f
JOIN dim_customers c USING (customer_id)
JOIN dim_date d USING (date_id)
WHERE d.date >= CURRENT_DATE - 90
GROUP BY c.country ORDER BY revenue DESC;

Columnar Storage: Why It Dominates OLAP

Row-oriented storage (PostgreSQL, MySQL) reads entire rows from disk. For a query touching only 3 of 50 columns, 94% of I/O is wasted. Columnar storage (Parquet, BigQuery, Redshift, Snowflake) stores each column contiguously. The same query reads only the 3 relevant columns. Combined with dictionary encoding (replace repeated strings with integer codes) and run-length encoding (store repeat values as count + value), typical compression ratios are 5-10x. A 1TB row-oriented table becomes 100-200GB columnar, with proportional I/O savings.
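The two encodings above can be illustrated with a small sketch (the values are invented for illustration):

-- Dictionary + run-length encoding of a low-cardinality column
--   raw country column:  USA, USA, USA, UK, UK, USA
--   dictionary:          0 -> USA, 1 -> UK
--   dictionary-encoded:  0, 0, 0, 1, 1, 0
--   run-length-encoded:  (0,3), (1,2), (0,1)   -- (code, run length)
-- Six variable-length strings shrink to three small (code, count) pairs;
-- sorted or clustered data produces longer runs and better ratios.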

Partitioning and Clustering

Partition fact tables by date (most queries filter on a time range). BigQuery and Snowflake use partition pruning: a WHERE clause on the partition column eliminates entire partition files without scanning them. Within partitions, cluster by high-cardinality filter columns (e.g., CLUSTER BY customer_id, product_id). Clustering co-locates rows with the same dimension values, improving filter selectivity. Together, partitioning + clustering can reduce query cost by 10-100x for typical BI workloads.
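As a sketch, BigQuery-style DDL for a partitioned, clustered fact table might look like the following; the order_date DATE column is an assumption (the earlier schema uses an integer date_id), and the exact syntax varies by engine:

-- BigQuery-style partitioning + clustering (syntax is engine-specific)
CREATE TABLE fact_orders (
  order_id     BIGINT,
  customer_id  INT,
  product_id   INT,
  order_date   DATE,      -- partition column (assumed here in place of date_id)
  quantity     INT,
  revenue_usd  NUMERIC
)
PARTITION BY order_date               -- WHERE order_date >= ... prunes whole partitions
CLUSTER BY customer_id, product_id;   -- co-locates rows for selective filters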

ELT Pipeline: Ingestion Architecture

Modern data warehouses use ELT (Extract, Load, Transform) rather than ETL. Raw data lands in object storage (S3/GCS) first, then gets loaded into the warehouse in its raw form, then transformed with SQL using dbt or Dataform. The pipeline: OLTP databases → CDC (Debezium) → Kafka → Flink (deduplication, schema normalization) → S3 (Parquet files) → warehouse COPY command → staging tables → dbt transformation → production tables. This separates ingestion concerns from transformation, enabling replay and debugging.
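The warehouse-side load step can be sketched with Snowflake-style COPY syntax; the stage and table names are assumptions, and credential/stage setup is omitted:

-- Snowflake-style load of Parquet files from object storage into staging
COPY INTO staging_orders
FROM @raw_stage/orders/                   -- external stage pointing at S3 (assumed)
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;  -- map Parquet columns to table columns by name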

Slowly Changing Dimensions (SCD Type 2)

Customer attributes change over time (customer moves country, changes segment). SCD Type 2 handles this by adding new rows rather than updating: add valid_from, valid_to, and is_current columns. When a customer moves from USA to UK, the old row gets valid_to = today, is_current = FALSE, and a new row is inserted with valid_from = today, is_current = TRUE. Historical fact rows correctly join to the dimension row that was current at the time of the event, preserving historical accuracy.
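The flip described above can be sketched as two statements; note that a production SCD Type 2 dimension would key on a surrogate key rather than the customer_id primary key shown earlier, and the literal values here are invented:

-- SCD Type 2: expire the current row, then insert the new version
UPDATE dim_customers
SET valid_to = CURRENT_DATE,
    is_current = FALSE
WHERE customer_id = 42
  AND is_current = TRUE;

INSERT INTO dim_customers
  (customer_id, name, country, segment, valid_from, valid_to, is_current)
VALUES
  (42, 'Jane Doe', 'UK', 'enterprise', CURRENT_DATE, NULL, TRUE);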

Materialized Views and Aggregation Tables

Pre-aggregate common query patterns into materialized views or aggregate tables. A dashboard querying daily revenue by region should not re-scan 3 years of fact data on every page load. Maintain a daily_revenue_by_region table updated incrementally each day. Tools like dbt handle incremental materialization with merge semantics. For sub-second BI dashboards, export pre-aggregated results to Redis or an OLAP-specialized engine like Apache Druid or ClickHouse that keeps aggregated data in memory.
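One way to maintain the daily_revenue_by_region table incrementally is a MERGE keyed on (day, country), as sketched below; table and column names beyond those in the star schema above are assumptions:

-- Incremental daily refresh via MERGE (re-aggregates only yesterday)
MERGE INTO daily_revenue_by_region t
USING (
  SELECT d.date AS day, c.country, SUM(f.revenue_usd) AS revenue
  FROM fact_orders f
  JOIN dim_customers c USING (customer_id)
  JOIN dim_date d USING (date_id)
  WHERE d.date = CURRENT_DATE - 1
  GROUP BY d.date, c.country
) s
ON t.day = s.day AND t.country = s.country
WHEN MATCHED THEN UPDATE SET revenue = s.revenue
WHEN NOT MATCHED THEN INSERT (day, country, revenue)
  VALUES (s.day, s.country, s.revenue);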

Key Interview Discussion Points

  • Star schema vs. snowflake schema: when to normalize dimension tables
  • Columnar storage: dictionary encoding, RLE compression, vectorized execution
  • Partition pruning vs. clustering: which eliminates more I/O for your query patterns
  • SCD Type 1 vs. Type 2 vs. Type 3: choosing based on historical accuracy requirements
  • dbt incremental models: append vs. merge strategies for large fact tables
  • Data lakehouse: Delta Lake / Iceberg enabling ACID transactions on object storage