System Design Interview: Data Warehouse and OLAP Architecture

What Is a Data Warehouse?

A data warehouse (DW) is an analytical database optimized for reading large volumes of historical data. It serves business intelligence (BI), reporting, and data science workloads — queries like “what was our revenue by product category last quarter?” or “how has user retention changed month over month?” These queries scan millions to billions of rows, aggregate across many dimensions, and must return in seconds to minutes. Traditional row-oriented databases (PostgreSQL, MySQL) are designed for OLTP (single-row lookups and updates) and perform poorly for these analytical patterns.

Columnar Storage: Why It Matters

Row-oriented storage stores all columns of a row together: [user_id, name, email, created_at, country, revenue]. A query for “total revenue by country” reads all columns for every row, even though it only needs two. Columnar storage stores each column separately: all user_ids together, all countries together. The same query reads only the country and revenue columns — skipping two-thirds of the disk I/O in this six-column example, and far more on real tables with dozens of columns. Additional benefits:

  • Compression: a column of countries has high repetition (US, US, US, UK, US…). Run-length encoding can achieve 10-20× compression, so the query reads a tenth or less of the raw bytes.
  • Vectorized execution: SIMD instructions process 256-512 bits of column data at once — comparing 8-16 32-bit values in a single CPU instruction.
  • Late materialization: apply filters on each column independently before joining. Only rows passing all filters are materialized into full records.

ClickHouse, BigQuery, Snowflake, Redshift, and DuckDB all use columnar storage.
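The effect shows up directly in query shape. As a sketch (the events table and its columns are illustrative), compare what each query forces a columnar engine to read:

```sql
-- Hypothetical 100-column events table in a columnar warehouse.

-- Reads every column file for every row: defeats column pruning.
SELECT * FROM events WHERE country = 'US';

-- Reads only the country and revenue column files;
-- the other 98 columns are never touched on disk.
SELECT country, SUM(revenue) AS total_revenue
FROM events
GROUP BY country;
```

On pay-per-byte engines like BigQuery, the difference is visible in the bytes-scanned estimate before the query even runs.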

Star Schema Design

The star schema is the standard data warehouse modeling approach. At the center: a fact table recording business events at fine grain (sales, order line items, page views) — many rows, mostly foreign keys and numeric measures. Surrounding it: dimension tables with descriptive attributes.


-- Fact table: one row per order line item
CREATE TABLE fact_order_items (
    order_id        BIGINT,
    product_id      INT,
    customer_id     INT,
    date_id         INT,
    quantity        INT,
    unit_price      DECIMAL(10,2),
    discount        DECIMAL(5,2),
    revenue         DECIMAL(10,2)
) PARTITIONED BY (date_id);

-- Dimension tables: descriptive attributes
CREATE TABLE dim_product (
    product_id      INT PRIMARY KEY,
    name            TEXT,
    category        TEXT,
    subcategory     TEXT,
    brand           TEXT,
    supplier_id     INT
);

CREATE TABLE dim_date (
    date_id         INT PRIMARY KEY,
    date            DATE,
    day_of_week     TEXT,
    month           INT,
    quarter         INT,
    year            INT,
    is_holiday      BOOLEAN
);

-- Typical analytical query:
SELECT p.category, d.quarter, SUM(f.revenue) as total_revenue
FROM fact_order_items f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.year = 2025
GROUP BY p.category, d.quarter
ORDER BY d.quarter, total_revenue DESC;

ELT vs ETL

Traditional ETL (Extract-Transform-Load): extract data from sources, transform it in a staging server or ETL tool, then load the final cleaned data into the DW. This requires significant compute outside the warehouse.

Modern ELT (Extract-Load-Transform): extract raw data from sources, load it into the warehouse as-is (cheap with cloud storage), then use the warehouse’s massive compute to transform it. dbt (data build tool) defines transformations as SQL SELECT statements, compiles them to CREATE TABLE AS or CREATE VIEW statements, and executes them in the warehouse. This approach is more scalable, more testable (dbt tests run in the warehouse), and easier to iterate on than traditional ETL.
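A minimal dbt model might look like the following (the model name, source model, and columns are illustrative). dbt compiles the SELECT into a CREATE TABLE AS statement and runs it inside the warehouse:

```sql
-- models/marts/fct_daily_revenue.sql  (hypothetical dbt model)
{{ config(materialized='table') }}

SELECT
    order_date,
    SUM(revenue)                AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM {{ ref('stg_orders') }}    -- upstream staging model
GROUP BY order_date
```

The {{ ref() }} call is how dbt builds its dependency graph: it resolves to the staging table's fully qualified name and ensures upstream models run first.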

Partitioning and Clustering

Partitioning: divide a large table into smaller physical segments based on a column. Queries that filter on the partition column only scan relevant partitions, skipping the rest (partition pruning). Common partitioning columns: date/time (the most common — queries usually have time ranges), region/country (for geo-specific queries).


-- BigQuery: partition by a date column (schema abbreviated)
CREATE TABLE orders (
    order_id    INT64,
    created_at  TIMESTAMP
)
PARTITION BY DATE(created_at)
OPTIONS (partition_expiration_days = 365);

-- Query scans only partitions matching the filter:
SELECT order_id FROM orders
WHERE created_at >= TIMESTAMP '2025-01-01';
-- Reads only partitions from 2025-01-01 onward, not 3 years of history

Clustering: within each partition, sort rows by one or more columns. BigQuery and Snowflake call this clustering; Redshift calls it sort keys. A query filtering on the cluster column reads contiguous blocks on disk (not scattered), improving scan efficiency by 2-10×. Cluster on the columns most commonly used in WHERE clauses (after the partition column).
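Clustering syntax varies by warehouse. A sketch across the three systems (table and column names are illustrative):

```sql
-- BigQuery: partition by date, cluster by common filter columns
CREATE TABLE orders (
    order_id    INT64,
    customer_id INT64,
    country     STRING,
    created_at  TIMESTAMP
)
PARTITION BY DATE(created_at)
CLUSTER BY country, customer_id;

-- Snowflake: a clustering key on an existing table
ALTER TABLE orders CLUSTER BY (country, customer_id);

-- Redshift: sort keys declared at table creation
-- CREATE TABLE orders (...) COMPOUND SORTKEY (country, customer_id);
```

Order matters: put the most selective, most frequently filtered column first, since later cluster columns only help within runs of the earlier ones.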

Snowflake vs BigQuery vs Redshift

| Feature | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Architecture | Multi-cluster shared storage | Serverless (auto-scales) | Cluster-based (MPP) |
| Pricing | Credits per compute-second + storage | $5/TB scanned (or flat-rate slots) | Instance-hours + storage |
| Scaling | Add/remove virtual warehouses instantly | Fully managed, auto-scales | Resize cluster (minutes) |
| Concurrency | Unlimited virtual warehouses | High (serverless) | Limited by cluster size |
| Semi-structured data | VARIANT type, FLATTEN | Native JSON/ARRAY | SUPER type |

Snowflake separates storage (S3) from compute (virtual warehouses). Multiple teams can run different-sized warehouses against the same data simultaneously without contention. BigQuery is serverless — no cluster to manage; ideal for unpredictable workloads.
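In Snowflake, compute isolation looks like the following sketch (warehouse names and sizes are illustrative): each team gets its own virtual warehouse, and both read the same underlying tables.

```sql
-- BI team: modest, aggressively auto-suspending compute
CREATE WAREHOUSE bi_wh
    WAREHOUSE_SIZE = 'MEDIUM'
    AUTO_SUSPEND = 60          -- suspend after 60s idle: cost drops to zero
    AUTO_RESUME = TRUE;

-- Data science team: larger compute, same data, no contention
CREATE WAREHOUSE data_science_wh
    WAREHOUSE_SIZE = 'XLARGE'
    AUTO_SUSPEND = 300
    AUTO_RESUME = TRUE;

USE WAREHOUSE bi_wh;           -- queries in this session bill to bi_wh
```

Because storage is shared, neither team's workload slows the other down, and each warehouse's cost is attributable to its team.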

Slowly Changing Dimensions (SCD)

Dimension data changes over time: a customer moves cities, a product changes categories. How do you preserve historical accuracy in reports? Three strategies:

  • SCD Type 1: overwrite — update the dimension record in place. Simple but loses history (“what was this customer’s city when they placed that order 2 years ago?” is unanswerable).
  • SCD Type 2: add a new row for each change with effective_date and end_date (or is_current flag). Fact table foreign keys point to the specific version of the dimension active at transaction time. Preserves full history. Standard approach for customer and product dimensions.
  • SCD Type 3: add a “previous_value” column. Tracks only one level of change — simpler but limited.
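An SCD Type 2 dimension can be sketched as follows (columns and values are illustrative). Each change closes the current row and appends a new versioned row:

```sql
-- SCD Type 2: one row per version of each customer
CREATE TABLE dim_customer (
    customer_key    BIGINT,     -- surrogate key, unique per version
    customer_id     INT,        -- natural/business key
    city            TEXT,
    effective_date  DATE,
    end_date        DATE,       -- '9999-12-31' for the open-ended row
    is_current      BOOLEAN
);

-- Customer 42 moves to Berlin: close the old row, open a new one
UPDATE dim_customer
SET end_date = '2025-06-01', is_current = FALSE
WHERE customer_id = 42 AND is_current = TRUE;

INSERT INTO dim_customer VALUES
    (1001, 42, 'Berlin', '2025-06-01', '9999-12-31', TRUE);
```

Fact rows store the surrogate key (customer_key) that was current at transaction time, so historical reports join to the version of the customer that actually existed then.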

Query Optimization in Column Stores

  • Avoid SELECT *: reads all columns; defeats columnar advantage. Always specify needed columns.
  • Filter on partition columns first: the optimizer will prune partitions; put date filters in WHERE.
  • Use approximate functions for exploration: APPROX_COUNT_DISTINCT() is 50× faster than COUNT(DISTINCT) with 1-2% error — acceptable for dashboards.
  • Materialize expensive subqueries: CREATE TABLE AS (expensive query) — subsequent queries hit cached results. In dbt: {{ config(materialized='table') }}.
  • Avoid cross joins and accidental Cartesian products: always confirm JOIN cardinality. A cross join of two 1M-row tables produces 1 trillion rows.
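Two of the points above in query form (the events table is illustrative; the approximate-count function name varies by warehouse):

```sql
-- Exact distinct count: memory-heavy shuffle of every user_id
SELECT COUNT(DISTINCT user_id) FROM events;

-- HyperLogLog-based approximation (~1-2% error), far cheaper.
-- BigQuery/Redshift: APPROX_COUNT_DISTINCT; Snowflake: HLL; ClickHouse: uniq
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

-- Materialize an expensive aggregate once; dashboards query the result
CREATE TABLE daily_active_users AS
SELECT DATE(event_time) AS day,
       APPROX_COUNT_DISTINCT(user_id) AS dau
FROM events
GROUP BY DATE(event_time);
```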

Key Interview Points

  • Columnar storage skips irrelevant columns — 10-100× I/O reduction for analytical queries
  • Star schema: fact table (measurements) surrounded by dimension tables (attributes)
  • Modern ELT: load raw data, transform with dbt SQL — warehouse compute is cheap and scalable
  • Partition by date; cluster by high-cardinality filter columns within partitions
  • SCD Type 2 preserves historical accuracy for dimension changes
  • Snowflake: compute/storage separation with multi-warehouse concurrency; BigQuery: serverless auto-scaling

Frequently Asked Questions

Why is columnar storage faster than row storage for analytical queries?

Row-oriented databases store all columns of a row contiguously on disk: [user_id, name, email, country, revenue]. An analytical query for "total revenue by country" must read all columns for every row — even though it only needs two. For a table with 100 columns and 1 billion rows, this means reading all 100 columns when only 2 are needed. Columnar storage organizes data by column: all user_ids together, all countries together, all revenues together. The same query reads only the country and revenue column files — skipping 98% of disk I/O. Additional speedups:

  • Column compression: a column containing "US, US, UK, US, US, US, CA" compresses dramatically with run-length encoding (US×2, UK×1, US×3, CA×1) — sometimes a 10-20× compression ratio, so the query reads as little as 1/20th the raw data.
  • Vectorized execution: SIMD CPU instructions can process 8-16 column values in a single instruction cycle.
  • Late materialization: filters are applied column-by-column before reconstructing full records — only rows passing all filters are assembled.

For OLTP (looking up a single user by ID), row storage is better — one disk seek gets all columns. For OLAP (scanning millions of rows, touching a few columns), columnar wins by a large margin.

What is the difference between a fact table and a dimension table in a star schema?

In a star schema data warehouse, a fact table records measurable business events or transactions with numeric metrics. Examples: sales transactions (quantity, revenue, discount), web events (page views, clicks, session duration), financial trades (price, volume). Fact tables are typically very long — billions of rows in large warehouses — but narrow: few columns, mostly foreign keys to dimension tables and numeric measures.

A dimension table describes the "who, what, where, when" context of facts. Examples: dim_customer (name, city, country, segment), dim_product (name, category, brand, price), dim_date (date, day_of_week, month, quarter, is_holiday). Dimension tables have many descriptive columns but relatively few rows (thousands to millions).

The "star" shape: the fact table sits at the center with foreign key relationships pointing outward to each dimension table. Analytical queries join the fact table to dimension tables to add context: SUM(revenue) GROUP BY customer.country, date.quarter requires joining fact_sales to dim_customer and dim_date. The star schema is optimized for this join pattern: simple queries, one hop from fact to any dimension.

How does Snowflake's architecture separate compute from storage?

Traditional data warehouses (Redshift, Teradata) tightly couple storage and compute — adding more compute requires provisioning more nodes with local disks, and storage scales with compute even when you only need one or the other. Snowflake uses a three-layer architecture:

  • Cloud storage (S3, Azure Blob, GCS): all data is stored in Snowflake's compressed columnar format (micro-partitions of ~16MB each) in object storage. Storage scales independently and is extremely cheap.
  • Virtual warehouses (compute): clusters of cloud compute instances that do the actual query execution. Each virtual warehouse caches the data it needs from object storage on local SSD for fast access. You can have dozens of virtual warehouses (one per team) running simultaneously against the same data without contention — perfect for multi-tenant analytics. Warehouses are suspended when idle (costing zero) and resume in seconds.
  • Cloud services layer: query parser, optimizer, metadata store, access control.

This decoupling means a BI team can run a large warehouse for their dashboards while the data science team runs a separate warehouse for ML training — both read the same tables in object storage, with no data copying. You pay for compute only while queries run.

