What Is a Data Warehouse?
A data warehouse (DW) is an analytical database optimized for reading large volumes of historical data. It serves business intelligence (BI), reporting, and data science workloads — queries like “what was our revenue by product category last quarter?” or “how has user retention changed month over month?” These queries scan millions to billions of rows, aggregate across many dimensions, and must return in seconds to minutes. Traditional row-oriented databases (PostgreSQL, MySQL) are designed for OLTP (single-row lookups and updates) and perform poorly for these analytical patterns.
Columnar Storage: Why It Matters
Row-oriented storage keeps all columns of a row together: [user_id, name, email, created_at, country, revenue]. A query for "total revenue by country" reads every column of every row, even though it needs only two. Columnar storage stores each column separately: all user_ids together, all countries together. The same query reads only the country and revenue columns — two of six here, skipping roughly two-thirds of the disk I/O, and far more on the wide tables typical of warehouses. Additional benefits:
- Compression: a column of countries has high repetition (US, US, US, UK, US…). Run-length encoding often achieves 10-20× compression, so the query reads a tenth or less of the raw bytes.
- Vectorized execution: SIMD instructions process 256-512 bits of column data at once, e.g. comparing 8-16 32-bit values in a single CPU instruction.
- Late materialization: apply filters on each column independently before joining. Only rows passing all filters are materialized into full records.
ClickHouse, BigQuery, Snowflake, Redshift, and DuckDB all use columnar storage.
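The compression point above can be illustrated with a minimal run-length encoder (a toy sketch, not a real warehouse codec; the function name and sample data are invented):

```python
from itertools import groupby

def rle_encode(column):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

countries = ["US", "US", "US", "UK", "US", "US", "CA", "CA"]
encoded = rle_encode(countries)
print(encoded)  # [('US', 3), ('UK', 1), ('US', 2), ('CA', 2)]

# A sorted (clustered) column compresses even better: one run per value.
print(rle_encode(sorted(countries)))  # [('CA', 2), ('UK', 1), ('US', 5)]
```

The second call hints at why clustering (covered later) and compression interact: sorting a column maximizes run lengths.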
Star Schema Design
The star schema is the standard data warehouse modeling approach. At the center: a fact table recording measurable events at fine grain (sales, clicks, page views), typically the largest table by row count. Surrounding it: dimension tables with descriptive attributes.
-- Fact table: one row per order line item
CREATE TABLE fact_order_items (
  order_id BIGINT,
  product_id INT,
  customer_id INT,
  quantity INT,
  unit_price DECIMAL(10,2),
  discount DECIMAL(5,2),
  revenue DECIMAL(10,2)
)
PARTITIONED BY (date_id INT);  -- Hive/Spark style: the partition column is declared here, not in the column list
-- Dimension tables: descriptive attributes
CREATE TABLE dim_product (
  product_id INT PRIMARY KEY,
  name TEXT,
  category TEXT,
  subcategory TEXT,
  brand TEXT,
  supplier_id INT
);
CREATE TABLE dim_date (
  date_id INT PRIMARY KEY,
  date DATE,
  day_of_week TEXT,
  month INT,
  quarter INT,
  year INT,
  is_holiday BOOLEAN
);
-- Typical analytical query:
SELECT p.category, d.quarter, SUM(f.revenue) AS total_revenue
FROM fact_order_items f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.year = 2025
GROUP BY p.category, d.quarter
ORDER BY d.quarter, total_revenue DESC;
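The same join pattern can be exercised end to end with Python's built-in sqlite3 (a toy sketch: SQLite is row-oriented, so this demonstrates the schema and query shape, not columnar performance; the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_order_items (product_id INT, date_id INT, revenue REAL);
CREATE TABLE dim_product (product_id INT PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INT PRIMARY KEY, quarter INT, year INT);

INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO dim_date VALUES (20250115, 1, 2025), (20250420, 2, 2025);
INSERT INTO fact_order_items VALUES
  (1, 20250115, 10.0), (1, 20250420, 5.0), (2, 20250115, 7.5);
""")

# One hop from the fact table to each dimension, then aggregate.
rows = conn.execute("""
SELECT p.category, d.quarter, SUM(f.revenue) AS total_revenue
FROM fact_order_items f
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.year = 2025
GROUP BY p.category, d.quarter
ORDER BY d.quarter, total_revenue DESC
""").fetchall()
print(rows)  # [('Books', 1, 10.0), ('Games', 1, 7.5), ('Books', 2, 5.0)]
```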
ELT vs ETL
Traditional ETL (Extract-Transform-Load): extract data from sources, transform it on a staging server or in an ETL tool, then load the cleaned result into the DW. This requires significant compute outside the warehouse. Modern ELT (Extract-Load-Transform): extract raw data from sources, load it into the warehouse as-is (cheap with cloud storage), then use the warehouse's own massive compute to transform it. The main tool is dbt (data build tool): transformations are defined as SQL SELECT statements, which dbt compiles to CREATE TABLE AS or CREATE VIEW statements and executes in the warehouse. This approach is more scalable, more testable (dbt tests run in the warehouse), and easier to iterate on than traditional ETL.
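The load-then-transform flow can be sketched with sqlite3 standing in for the warehouse (a toy illustration of the CREATE TABLE AS pattern that dbt models compile to; the raw table and cleaning rules are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw data as-is, duplicates and string types and all.
conn.execute("CREATE TABLE raw_orders (order_id INT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [(1, "10.50"), (1, "10.50"), (2, "3.00")])

# Transform: inside the warehouse, as a SQL SELECT materialized to a table
# (roughly what a dbt model with materialized='table' compiles to).
conn.execute("""
CREATE TABLE stg_orders AS
SELECT DISTINCT order_id, CAST(amount AS REAL) AS amount
FROM raw_orders
""")
print(conn.execute("SELECT * FROM stg_orders ORDER BY order_id").fetchall())
# [(1, 10.5), (2, 3.0)]
```

The key design point: the transform is just SQL running where the data already lives, so it scales with the warehouse rather than with an external ETL server.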
Partitioning and Clustering
Partitioning: divide a large table into smaller physical segments based on a column. Queries that filter on the partition column only scan relevant partitions, skipping the rest (partition pruning). Common partitioning columns: date/time (the most common — queries usually have time ranges), region/country (for geo-specific queries).
-- BigQuery: partition by a date column (column list abbreviated)
CREATE TABLE orders (
  order_id INT64,
  created_at TIMESTAMP
)
PARTITION BY DATE(created_at)
OPTIONS(partition_expiration_days = 365);
-- Query scans only partitions matching the filter:
SELECT * FROM orders WHERE created_at >= '2025-01-01';
-- Reads only partitions from 2025-01-01 onward, skipping all older history
Clustering: within each partition, sort rows by one or more columns. BigQuery and Snowflake call this clustering; Redshift calls it sort keys. A query filtering on the cluster column reads contiguous blocks on disk (not scattered), improving scan efficiency by 2-10×. Cluster on the columns most commonly used in WHERE clauses (after the partition column).
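Both ideas can be mimicked in a few lines of Python: partitions as separate buckets keyed by month, clustering as keeping each bucket sorted so a filter touches one contiguous slice (a toy model of the storage layout, not a real engine; the sample data is invented):

```python
import bisect
from collections import defaultdict

# "Table" partitioned by month; each partition holds (customer_id, revenue)
# rows kept sorted (clustered) by customer_id.
partitions = defaultdict(list)
for month, customer_id, revenue in [
    ("2025-01", 7, 10.0), ("2025-01", 3, 4.0),
    ("2025-02", 5, 2.0), ("2025-03", 3, 8.0),
]:
    bisect.insort(partitions[month], (customer_id, revenue))

def query(month, customer_id):
    """Partition pruning: only the matching month's bucket is scanned.
    Clustering: binary search finds the customer's contiguous slice."""
    part = partitions[month]  # every other partition is skipped entirely
    lo = bisect.bisect_left(part, (customer_id, float("-inf")))
    hi = bisect.bisect_right(part, (customer_id, float("inf")))
    return sum(rev for _, rev in part[lo:hi])

print(query("2025-01", 3))  # 4.0
```

In a real warehouse the "buckets" are partitions on object storage and the "sorted slice" is a run of contiguous blocks, but the pruning logic is the same.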
Snowflake vs BigQuery vs Redshift
| Feature | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Architecture | Multi-cluster shared storage | Serverless (auto-scales) | Cluster-based (MPP) |
| Pricing | Credits per compute-second + storage | On-demand per TB scanned (or slot-based capacity) | Instance-hours + storage |
| Scaling | Add/remove virtual warehouses instantly | Fully managed, auto-scales | Resize cluster (minutes) |
| Concurrency | Unlimited virtual warehouses | High (serverless) | Limited by cluster size |
| Semi-structured data | VARIANT type, FLATTEN | JSON/ARRAY native | SUPER type |
Snowflake separates storage (S3) from compute (virtual warehouses). Multiple teams can run different-sized warehouses against the same data simultaneously without contention. BigQuery is serverless — no cluster to manage; ideal for unpredictable workloads.
Slowly Changing Dimensions (SCD)
Dimension data changes over time: a customer moves cities, a product changes categories. How do you preserve historical accuracy in reports? Three strategies:
- SCD Type 1: overwrite — update the dimension record in place. Simple but loses history (“what was this customer’s city when they placed that order 2 years ago?” is unanswerable).
- SCD Type 2: add a new row for each change with effective_date and end_date (or is_current flag). Fact table foreign keys point to the specific version of the dimension active at transaction time. Preserves full history. Standard approach for customer and product dimensions.
- SCD Type 3: add a “previous_value” column. Tracks only one level of change — simpler but limited.
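The SCD Type 2 expire-and-insert pattern can be sketched with sqlite3 (a minimal illustration; the dim_customer columns, dates, and helper name are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
  customer_key INTEGER PRIMARY KEY,  -- surrogate key: one per version
  customer_id INT,                   -- natural/business key
  city TEXT,
  effective_date TEXT,
  end_date TEXT,                     -- NULL while the row is current
  is_current INT
)""")
conn.execute("INSERT INTO dim_customer VALUES (1, 42, 'Boston', '2023-01-01', NULL, 1)")

def scd2_update(conn, customer_id, new_city, change_date):
    # 1. Close out the current version instead of overwriting it.
    conn.execute("""UPDATE dim_customer
                    SET end_date = ?, is_current = 0
                    WHERE customer_id = ? AND is_current = 1""",
                 (change_date, customer_id))
    # 2. Insert a new current version; facts dated after change_date
    #    join to this row via its surrogate key.
    conn.execute("""INSERT INTO dim_customer
                    (customer_id, city, effective_date, end_date, is_current)
                    VALUES (?, ?, ?, NULL, 1)""",
                 (customer_id, new_city, change_date))

scd2_update(conn, 42, "Denver", "2025-06-01")
print(conn.execute("""SELECT customer_key, city, is_current
                      FROM dim_customer ORDER BY customer_key""").fetchall())
# [(1, 'Boston', 0), (2, 'Denver', 1)]
```

Old facts keep pointing at the Boston row, so the question "what was this customer's city at order time?" stays answerable.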
Query Optimization in Column Stores
- Avoid SELECT *: reads all columns; defeats columnar advantage. Always specify needed columns.
- Filter on partition columns first: the optimizer will prune partitions; put date filters in WHERE.
- Use approximate functions for exploration: APPROX_COUNT_DISTINCT() is 50× faster than COUNT(DISTINCT) with 1-2% error — acceptable for dashboards.
- Materialize expensive subqueries: CREATE TABLE AS (expensive query) — subsequent queries read the precomputed table instead of recomputing. In dbt: {{ config(materialized='table') }}.
- Avoid cross joins and accidental Cartesian products: always confirm JOIN cardinality. A cross join of two 1M-row tables produces 1 trillion rows.
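The cardinality warning is easy to demonstrate: a join with a missing (or always-true) ON clause returns |A| × |B| rows. A cheap safety habit is to COUNT both sides before running an expensive join (toy sketch with sqlite3; table names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INT)")
conn.execute("CREATE TABLE b (y INT)")
conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(1000)])
conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(1000)])

# Missing join condition => Cartesian product: 1000 * 1000 rows.
n = conn.execute("SELECT COUNT(*) FROM a, b").fetchone()[0]
print(n)  # 1000000

# At 1M rows per side this would be 1_000_000 ** 2 = 1 trillion rows.
```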
Key Interview Points
- Columnar storage skips irrelevant columns — 10-100× I/O reduction for analytical queries
- Star schema: fact table (measurements) surrounded by dimension tables (attributes)
- Modern ELT: load raw data, transform with dbt SQL — warehouse compute is cheap and scalable
- Partition by date; cluster by high-cardinality filter columns within partitions
- SCD Type 2 preserves historical accuracy for dimension changes
- Snowflake: compute/storage separation with multi-warehouse concurrency; BigQuery: serverless auto-scaling
Frequently Asked Questions
Why is columnar storage faster than row storage for analytical queries?
Row-oriented databases store all columns of a row contiguously on disk: [user_id, name, email, country, revenue]. An analytical query for "total revenue by country" must read all columns for every row — even though it only needs two. For a table with 100 columns and 1 billion rows, this means reading all 100 columns when only 2 are needed. Columnar storage organizes data by column: all user_ids together, all countries together, all revenues together. The same query reads only the country and revenue column files — skipping 98% of disk I/O. Additional speedups: (1) Column compression: a column containing "US, US, UK, US, US, US, CA" compresses with run-length encoding to (US×2, UK×1, US×3, CA×1); repetitive columns often see 10-20× compression ratios, so the query reads as little as 1/20th the raw data. (2) Vectorized execution: SIMD CPU instructions can process 8-16 column values in a single instruction cycle. (3) Late materialization: filters are applied column-by-column before reconstructing full records — only rows passing all filters are assembled. For OLTP (looking up a single user by ID), row storage is better — one disk seek gets all columns. For OLAP (scanning millions of rows, touching a few columns), columnar wins by a large margin.
What is the difference between a fact table and a dimension table in a star schema?
In a star schema data warehouse, a fact table records measurable business events or transactions with numeric metrics. Examples: sales transactions (quantity, revenue, discount), web events (page views, clicks, session duration), financial trades (price, volume). Fact tables are typically very deep (billions of rows in large warehouses) but narrow: few columns, mostly foreign keys to dimension tables and numeric measures. A dimension table describes the "who, what, where, when" context of facts. Examples: dim_customer (name, city, country, segment), dim_product (name, category, brand, price), dim_date (date, day_of_week, month, quarter, is_holiday). Dimension tables have many descriptive columns but relatively few rows (thousands to millions). The "star" shape: the fact table is at the center with foreign key relationships pointing outward to each dimension table. Analytical queries join the fact table to dimension tables to add context: "SUM(revenue) GROUP BY customer.country, date.quarter" requires joining fact_sales to dim_customer and dim_date. The star schema is optimized for this join pattern: simple queries, one hop from fact to any dimension.
How does Snowflake's architecture separate compute from storage?
Traditional data warehouses (Redshift, Teradata) tightly couple storage and compute — adding more compute requires provisioning more nodes with local disks, and storage scales with compute even when you only need one or the other. Snowflake uses a three-layer architecture: (1) Cloud storage (S3, Azure Blob, GCS): all data is stored in Snowflake's compressed columnar format (micro-partitions of roughly 16MB compressed each) in object storage. Storage scales independently and is extremely cheap. (2) Virtual warehouses (compute): clusters of cloud compute instances (e.g. EC2 on AWS) that do the actual query execution. Each virtual warehouse loads the data it needs from object storage into local SSD cache for fast access. You can have dozens of virtual warehouses (one per team) running simultaneously against the same data without contention — perfect for multi-tenant analytics. Warehouses are suspended when idle (costing zero) and resume in seconds. (3) Cloud services layer: query parser, optimizer, metadata store, access control. This decoupling means a BI team can run a large warehouse for their dashboards while the data science team runs a separate warehouse for ML training — both read the same tables in object storage, no data copying. You pay for compute only when running queries.