A data warehouse is a system optimized for analytical queries across large volumes of historical data. Unlike transactional databases, warehouses are designed to answer questions like "what were total sales by region over the last 12 months?" — aggregations over millions or billions of rows. The low-level design decisions — schema, storage format, partitioning, and platform — determine whether those queries run in seconds or hours.
OLTP vs OLAP: Different Workloads, Different Designs
OLTP (Online Transaction Processing) databases are optimized for transactional workloads: insert one order, update one account balance, fetch one user record. They are normalized to reduce redundancy, use row-oriented storage, and optimize for low latency per individual row operation.
OLAP (Online Analytical Processing) databases are optimized for analytical workloads: compute total revenue by product category for the past quarter, find the top 1000 customers by lifetime value. They are denormalized to reduce joins, use columnar storage, and optimize for high-throughput scans across millions of rows. The data may be the same, but a completely different access pattern demands a fundamentally different design.
Star Schema: The Core Design Pattern
The star schema is the dominant design pattern for data warehouses. It consists of a central fact table surrounded by dimension tables — visually resembling a star.
The fact table sits at the center. It contains numeric measures (revenue, quantity, duration) and foreign keys to dimension tables. Each row represents one event or transaction at the defined grain. Fact tables are wide, tall, and the primary target of analytical queries.
Dimension tables surround the fact table. They contain descriptive attributes: customer name, product category, store region, date attributes (day of week, fiscal quarter, holiday flag). Dimensions are denormalized — redundant data is acceptable in a warehouse because storage is cheap and joins are expensive at scale.
A typical star schema for e-commerce: fact_orders (order_id, customer_id, product_id, date_id, store_id, quantity, unit_price, discount, revenue) + dim_customer, dim_product, dim_date, dim_store. Most analytical queries join the fact table to one or two dimensions — simple, fast, and easy to understand.
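The shape of that query pattern can be sketched with Python's built-in sqlite3 module. This is a toy, not a warehouse: the table names follow the e-commerce example above, but the columns and sample data are invented for illustration.

```python
import sqlite3

# Minimal star schema: one fact table, two of the dimensions from the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, date TEXT, fiscal_quarter TEXT);
CREATE TABLE fact_orders (
    order_id INTEGER, product_id INTEGER, date_id INTEGER,
    quantity INTEGER, unit_price REAL, revenue REAL
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware"),
                  (3, "Ebook", "Digital")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(20240101, "2024-01-01", "Q1"), (20240401, "2024-04-01", "Q2")])
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?, ?, ?)",
                 [(100, 1, 20240101, 2, 10.0, 20.0),
                  (101, 3, 20240101, 1, 5.0, 5.0),
                  (102, 2, 20240401, 1, 30.0, 30.0)])

# The typical analytical query: join the fact table to one dimension, aggregate.
rows = conn.execute("""
    SELECT p.category, SUM(f.revenue) AS total_revenue
    FROM fact_orders f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY total_revenue DESC
""").fetchall()
print(rows)  # [('Hardware', 50.0), ('Digital', 5.0)]
```

The single join from fact to dimension is the whole point of the star: most questions need one hop, not a chain of them.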
Snowflake Schema: Normalized Dimensions
The snowflake schema extends the star schema by normalizing dimension tables. Instead of storing product_category directly in dim_product, you split it into a separate dim_category table that dim_product references.
Benefits: reduced storage (category name stored once instead of once per product), easier updates to dimension attributes. Drawbacks: more joins required for every query, more complex SQL, harder for analysts to write ad-hoc queries.
In modern cloud warehouses with cheap columnar storage and fast join engines, the storage savings from snowflaking are rarely worth the query complexity. Most practitioners default to star schema and only snowflake when dimension tables are genuinely very large.
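For contrast, here is the same category-revenue question against a snowflaked version of the product dimension (hypothetical `dim_category` table, invented sample data): the normalization saves a few bytes but costs an extra join in every query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Snowflaked dimension: category pulled out of dim_product into its own table.
conn.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT,
                          category_id INTEGER REFERENCES dim_category);
CREATE TABLE fact_orders (product_id INTEGER, revenue REAL);
INSERT INTO dim_category VALUES (1, 'Hardware'), (2, 'Digital');
INSERT INTO dim_product VALUES (10, 'Widget', 1), (11, 'Ebook', 2);
INSERT INTO fact_orders VALUES (10, 20.0), (11, 5.0), (10, 30.0);
""")
# Revenue by category now needs TWO joins instead of one.
rows = conn.execute("""
    SELECT c.category_name, SUM(f.revenue) AS total_revenue
    FROM fact_orders f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY total_revenue DESC
""").fetchall()
print(rows)  # [('Hardware', 50.0), ('Digital', 5.0)]
```

Multiply that extra join across every dashboard and ad-hoc query, and the star schema's redundancy starts to look cheap.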
Fact Table Design: Grain, Additivity, and Measures
The grain of a fact table defines what one row represents. "One row per order line item" is finer grain than "one row per order." Finer grain is more flexible — you can always aggregate up, but you can’t disaggregate down. Declare the grain explicitly before building the table.
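The "aggregate up, never disaggregate down" asymmetry is easy to see with a toy line-item table (hypothetical names and data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Grain: one row per order LINE ITEM — finer than one row per order.
conn.executescript("""
CREATE TABLE fact_order_lines (order_id INTEGER, product_id INTEGER, revenue REAL);
INSERT INTO fact_order_lines VALUES (1, 10, 20.0), (1, 11, 5.0), (2, 10, 30.0);
""")
# Rolling UP to order grain is always possible from line-item grain...
per_order = conn.execute("""
    SELECT order_id, SUM(revenue) FROM fact_order_lines
    GROUP BY order_id ORDER BY order_id
""").fetchall()
print(per_order)  # [(1, 25.0), (2, 30.0)]
# ...but had the table been built at order grain (one row per order),
# the per-product breakdown inside each order would be unrecoverable.
```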
Measures fall into three categories based on how they can be aggregated:
- Additive measures can be summed across all dimensions: revenue, quantity, cost. Sum revenue across time, product, and region freely.
- Semi-additive measures can be summed across some dimensions but not all: account balance is additive across accounts but not across time (summing daily balances doesn’t give you meaningful totals). Use average or point-in-time snapshots instead.
- Non-additive measures cannot be summed at all: ratios, percentages, margins. Store the numerator and denominator as separate additive facts, compute the ratio at query time.
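The non-additive case is where most mistakes happen, so here is a small sketch (invented region/revenue/cost data) of the recommended pattern: sum the additive numerator and denominator first, divide last.

```python
# Store additive numerator (revenue) and denominator (cost) as facts,
# compute the non-additive ratio (margin) only at query time.
rows = [  # (region, revenue, cost) — revenue and cost are additive
    ("EU", 100.0, 60.0),
    ("EU", 50.0, 20.0),
    ("US", 200.0, 150.0),
]
totals = {}
for region, revenue, cost in rows:
    r, c = totals.get(region, (0.0, 0.0))
    totals[region] = (r + revenue, c + cost)

# Correct: ratio of sums. EU margin = (150-80)/150 ≈ 0.467, US = 0.25.
margins = {region: (r - c) / r for region, (r, c) in totals.items()}

# Wrong: averaging per-row margins. EU rows have margins 0.4 and 0.6,
# whose average (0.5) is NOT the true EU margin.
naive_eu = ((100.0 - 60.0) / 100.0 + (50.0 - 20.0) / 50.0) / 2
print(margins["EU"], naive_eu)
```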
Slowly Changing Dimensions
Dimensions change over time: a customer moves to a new city, a product changes its category, a store gets rebranded. How you handle these changes determines whether historical reports are accurate.
SCD Type 1 (Overwrite): Simply update the existing row. Fast and simple, but historical queries now reflect the current value, not the value at the time of the transaction. Appropriate when history doesn’t matter: correcting a data entry error, updating a phone number.
SCD Type 2 (Add New Row): When a dimension attribute changes, close the old row (set end_date, is_current = false) and insert a new row with the new value (start_date = today, is_current = true). Fact table rows link to the dimension row that was current at transaction time, so historical reports remain accurate. Most important for attributes that affect business analysis: customer segment, product category, sales rep territory.
SCD Type 3 (Add Column): Add a "previous value" column alongside the current value. Tracks only one prior state — rarely sufficient in practice.
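The Type 2 close-and-insert step can be sketched as follows. The table layout (surrogate key plus `start_date`/`end_date`/`is_current`) follows the description above; the customer data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY,  -- surrogate key, one per version
    customer_id INTEGER,              -- natural/business key
    city TEXT, start_date TEXT, end_date TEXT, is_current INTEGER
)""")
conn.execute("INSERT INTO dim_customer VALUES (1, 42, 'Boston', '2020-01-01', NULL, 1)")

def scd2_update(conn, customer_id, new_city, change_date):
    """SCD Type 2: close the current row, insert a new version."""
    conn.execute("""UPDATE dim_customer
                    SET end_date = ?, is_current = 0
                    WHERE customer_id = ? AND is_current = 1""",
                 (change_date, customer_id))
    conn.execute("""INSERT INTO dim_customer
                    (customer_id, city, start_date, end_date, is_current)
                    VALUES (?, ?, ?, NULL, 1)""",
                 (customer_id, new_city, change_date))

scd2_update(conn, 42, "Denver", "2024-06-01")
history = conn.execute("""SELECT city, is_current FROM dim_customer
                          WHERE customer_id = 42 ORDER BY customer_sk""").fetchall()
print(history)  # [('Boston', 0), ('Denver', 1)]
```

Facts recorded before 2024-06-01 keep pointing at the Boston row via its surrogate key, which is exactly why historical reports stay correct.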
ETL vs ELT: Where Transformation Happens
ETL (Extract, Transform, Load) is the traditional approach: extract data from source systems, transform it in a separate processing layer (Spark, custom scripts, Informatica), then load clean, structured data into the warehouse. The warehouse receives only curated data.
ELT (Extract, Load, Transform) is the modern approach: load raw data into the warehouse first, then transform it using SQL inside the warehouse. This is enabled by cloud warehouses with massive parallel processing compute. Tools like dbt (data build tool) define transformations as SQL SELECT statements, compile them into warehouse-native SQL, and run them inside Snowflake/BigQuery/Redshift.
ELT advantages: raw data is preserved for reprocessing, transformations are versioned SQL in git, warehouse compute scales elastically, no separate transformation infrastructure to manage. ELT is now the dominant pattern for cloud data warehouses.
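The ELT flow can be miniaturized in sqlite3: land raw, loosely typed data first, then derive a clean staging table with a SQL SELECT, which is essentially what a dbt model compiles to. Table names (`raw_orders`, `stg_orders`) and data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# E + L: land the raw source data as-is, untyped and uncleaned.
conn.executescript("""
CREATE TABLE raw_orders (payload_amount TEXT, payload_status TEXT);
INSERT INTO raw_orders VALUES ('19.99', 'complete'), ('5.00', 'CANCELLED'),
                              ('12.50', 'Complete');
""")
# T: transform with SQL inside the warehouse. In the dbt pattern this
# SELECT would live in version control as its own model file.
conn.executescript("""
CREATE TABLE stg_orders AS
SELECT CAST(payload_amount AS REAL) AS amount,
       LOWER(payload_status)        AS status
FROM raw_orders
WHERE LOWER(payload_status) = 'complete';
""")
rows = conn.execute("SELECT amount, status FROM stg_orders ORDER BY amount").fetchall()
print(rows)  # [(12.5, 'complete'), (19.99, 'complete')]
```

Because `raw_orders` is preserved, a bug in the transform is fixed by editing the SELECT and rebuilding `stg_orders`, with no re-extraction from the source system.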
Columnar Storage: The Foundation of Warehouse Performance
Row-oriented storage (PostgreSQL, MySQL) stores all columns of a row together on disk. Reading one column requires reading every row’s worth of bytes and discarding the unused columns. For OLAP queries that read 3 columns out of 100, this wastes 97% of I/O.
Columnar storage stores each column’s values contiguously. Reading revenue for 100M rows reads only the revenue column — no wasted I/O. Columnar storage also compresses dramatically better: values in a column are the same type and often similar in value (run-length encoding compresses a status column with 3 distinct values to almost nothing). Compression ratios of 10-100x vs row storage are common, translating directly to faster queries (less data to read from disk or network).
Parquet and ORC are the open columnar file formats used in data lakes. All major cloud warehouses use columnar storage internally.
Partitioning and Clustering
Partitioning divides a table into physical segments based on a column value. Date partitioning is almost universal for fact tables: each day's data lives in its own partition. When a query filters on date_id BETWEEN '2024-01-01' AND '2024-03-31', the warehouse scans only the roughly 90 daily partitions in that range and skips the rest entirely (partition pruning). Without partitioning, every query does a full table scan across years of data.
Clustering / Sorting keys sort data within partitions by additional columns. Redshift SORTKEY, Snowflake clustering keys, BigQuery clustering columns. If queries frequently filter by region after filtering by date, clustering by region within each date partition means the query reads only the rows for the target region rather than scanning all rows in the partition. Most effective on columns used in WHERE and JOIN ON clauses.
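A toy in-memory model makes the pruning mechanics concrete. This sketch (invented data, dates as strings so range comparison works lexically) skips whole partitions by date before touching any rows:

```python
from collections import defaultdict

# Toy fact table partitioned by date; each row is (region, revenue).
partitions = defaultdict(list)
for day, region, revenue in [
    ("2024-01-15", "EU", 10.0), ("2024-02-03", "US", 20.0),
    ("2024-03-20", "US", 5.0),  ("2024-04-10", "EU", 30.0),
]:
    partitions[day].append((region, revenue))

def query(partitions, start, end, region):
    scanned, total = 0, 0.0
    for day, rows in partitions.items():
        if not (start <= day <= end):
            continue                  # partition pruning: skip whole segment
        for r, rev in rows:           # clustering by region would narrow this too
            scanned += 1
            if r == region:
                total += rev
    return total, scanned

total, scanned = query(partitions, "2024-01-01", "2024-03-31", "US")
print(total, scanned)  # 25.0 revenue; 3 rows scanned, the April partition pruned
```

In a real warehouse the same idea operates on column statistics per segment, but the effect is identical: data outside the filter range is never read.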
Materialized Views and Pre-Aggregation
For dashboards and reports that run the same heavy aggregations repeatedly, materialized views pre-compute and store the result. Instead of re-aggregating 1B rows on every dashboard load, the materialized view holds the pre-computed daily totals (365 rows per year) and the dashboard queries those instead.
Refresh strategies: full refresh (recompute everything on schedule — simple, expensive for large tables), incremental refresh (compute only new/changed data since last refresh — complex but efficient). Snowflake and BigQuery support both. dbt materializations (table, incremental, materialized view) are how most teams manage this in practice.
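An incremental refresh can be sketched as an upsert keyed on the partition date: recompute only days at or after a watermark, leave older rows untouched. Table names and data are invented; the ON CONFLICT upsert requires SQLite 3.24+.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_orders (order_date TEXT, revenue REAL);
CREATE TABLE daily_revenue (order_date TEXT PRIMARY KEY, total REAL);
INSERT INTO fact_orders VALUES ('2024-01-01', 10.0), ('2024-01-01', 5.0),
                               ('2024-01-02', 20.0);
""")

def incremental_refresh(conn, since):
    """Recompute only days at or after the watermark, then upsert them."""
    conn.execute("""
        INSERT INTO daily_revenue (order_date, total)
        SELECT order_date, SUM(revenue) FROM fact_orders
        WHERE order_date >= ? GROUP BY order_date
        ON CONFLICT(order_date) DO UPDATE SET total = excluded.total
    """, (since,))

incremental_refresh(conn, "2024-01-01")                  # initial build
conn.execute("INSERT INTO fact_orders VALUES ('2024-01-02', 7.0)")
incremental_refresh(conn, "2024-01-02")                  # only day 2 recomputed
rows = conn.execute("SELECT * FROM daily_revenue ORDER BY order_date").fetchall()
print(rows)  # [('2024-01-01', 15.0), ('2024-01-02', 27.0)]
```

This mirrors what a dbt incremental model does: a `WHERE` clause restricts the scan to new data, and a unique key drives the merge.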
Cloud Platform Architectures: Snowflake, BigQuery, Redshift
Snowflake separates compute and storage completely. Storage lives in S3/Azure Blob/GCS. Virtual warehouses are compute clusters that you start and stop independently — you pay only while they run queries. Multiple warehouses can query the same data simultaneously without contention. Time travel lets you query data as it existed at any point in a retention window of up to 90 days (depending on edition). Zero-copy clone creates an instant copy of a table or database for dev/test without duplicating storage.
BigQuery is fully serverless — no clusters to manage, no compute to start. You pay per bytes scanned (or use flat-rate slots). Nested and repeated fields natively model JSON-like data without joins. BI Engine caches frequently accessed data in memory for sub-second dashboard queries. Partitioning and clustering reduce bytes scanned and lower costs directly.
Redshift is PostgreSQL-based with a leader node that parses and plans queries and compute nodes that execute them in parallel. COPY command is optimized for bulk loading from S3. SORTKEY and DISTKEY control how data is physically sorted and distributed across compute nodes — critical for join performance. Redshift Spectrum queries data directly in S3 without loading, enabling a data lake + warehouse hybrid pattern.