System Design: Data Warehouse Architecture — Snowflake, BigQuery, Redshift, Star Schema, OLAP, Columnar Storage

Data warehouses are purpose-built for analytical queries — aggregations, joins, and scans across billions of rows that would bring an OLTP database to its knees. Snowflake, BigQuery, and Redshift power the analytics behind business intelligence, reporting, and data science at companies from startups to enterprises. This guide covers data warehouse architecture, schema design, and query optimization — essential for data engineering and system design interviews.

OLTP vs OLAP

OLTP (Online Transaction Processing): the operational database serving your application. Characteristics: short, simple queries (get user by ID, insert an order), high concurrency (thousands of transactions per second), row-oriented storage (each row stored contiguously on disk), normalized schema (3NF to minimize redundancy), and low latency (single-digit milliseconds). Examples: PostgreSQL, MySQL, DynamoDB. OLAP (Online Analytical Processing): the analytical database for reporting and business intelligence. Characteristics: complex queries (aggregate revenue by region by quarter with year-over-year comparison), scan millions to billions of rows, columnar storage (each column stored separately for efficient scans and compression), denormalized schema (star/snowflake schema for query simplicity), and higher latency acceptable (seconds to minutes). Examples: Snowflake, BigQuery, Redshift, ClickHouse. Why separate systems: running analytical queries on an OLTP database degrades transactional performance. A report scanning 100 million rows locks resources needed by user-facing queries. The data warehouse receives data from the OLTP database via ETL/ELT pipelines and serves analytical workloads without impacting production.

Columnar Storage

Row-oriented storage (PostgreSQL): stores all columns of a row together on disk. Reading one column requires reading the entire row. Efficient for OLTP (point lookups fetch the whole row). Inefficient for analytics (scanning one column reads all columns). Columnar storage (Redshift, BigQuery, Parquet): stores each column separately. Reading one column reads only that column data. Advantages for analytics: (1) I/O reduction — a query like SELECT AVG(price) FROM orders reads only the price column, skipping order_id, customer_id, shipping_address, etc. If price is 10% of the row size, columnar reads 10x less data. (2) Compression — columns contain homogeneous data (all integers, all dates, all short strings). Homogeneous data compresses much better than mixed-type rows. Run-length encoding for repeated values, dictionary encoding for low-cardinality strings, and delta encoding for timestamps achieve 5-20x compression. (3) Vectorized execution — process entire column vectors in CPU-friendly batches using SIMD instructions. Processes millions of values per second. Trade-off: columnar storage is inefficient for point lookups (fetching all columns of one row requires reading from multiple column files). This is acceptable for analytics where queries always scan ranges.

Star Schema and Dimensional Modeling

Star schema is the standard data warehouse schema design. Center: a fact table containing measurable events (orders, clicks, transactions). Each row is one event with: numeric measures (revenue, quantity, duration) and foreign keys to dimension tables. Surrounding: dimension tables containing descriptive attributes. dim_customer (customer_id, name, segment, region), dim_product (product_id, name, category, brand), dim_date (date_id, date, month, quarter, year, is_holiday), dim_store (store_id, name, city, state). Example query: SELECT d.quarter, p.category, SUM(f.revenue) FROM fact_orders f JOIN dim_date d ON f.date_id = d.date_id JOIN dim_product p ON f.product_id = p.product_id WHERE d.year = 2026 GROUP BY d.quarter, p.category. The star schema denormalizes dimensions (customer address is stored in dim_customer, not in a separate dim_address table). This simplifies queries (fewer JOINs) at the cost of some data redundancy. Snowflake schema: dimensions are normalized (dim_product -> dim_category -> dim_department). More normalized but more JOINs. Star schema is preferred for query simplicity and performance.

Snowflake, BigQuery, and Redshift

Snowflake: separates compute from storage. Data is stored in S3 (cheap, durable). Compute clusters (virtual warehouses) spin up to run queries and shut down when idle. Pay for storage and compute independently. Key features: automatic scaling (resize or add compute clusters on demand), zero-copy cloning (clone a database instantly for testing without duplicating storage), time travel (query data as of any point in the last 90 days), and multi-cloud (runs on AWS, GCP, Azure). BigQuery: serverless — no infrastructure to manage. Submit a query, BigQuery allocates resources, executes, and returns results. Pay per query (bytes scanned). Key features: no cluster management, automatic optimization, ML integration (BigQuery ML for in-warehouse model training), and petabyte-scale with no tuning. Cost control: partition and cluster tables to reduce bytes scanned. Redshift: AWS-managed columnar database. Provisioned clusters with fixed node types. Spectrum: query data in S3 directly without loading. Redshift Serverless: similar to BigQuery pay-per-query model. Most mature AWS-native option but requires more tuning than Snowflake or BigQuery. Choice: BigQuery for serverless simplicity, Snowflake for multi-cloud and compute-storage separation, Redshift for deep AWS integration.

Query Optimization in Data Warehouses

Optimization strategies: (1) Partitioning — divide tables by a column (typically date). A query filtering WHERE date >= 2026-01-01 reads only the 2026 partitions, skipping years of historical data. BigQuery supports partition pruning natively. (2) Clustering/Sort keys — within each partition, sort data by frequently filtered columns (customer_id, product_category). This enables efficient range scans and skip-scanning. Redshift: SORTKEY. BigQuery: CLUSTER BY. Snowflake: automatic clustering. (3) Materialized views — pre-computed query results that are automatically refreshed. A dashboard showing daily revenue can read from a materialized view (instant) instead of scanning the fact table (minutes). (4) Approximate aggregation — for large-scale analytics, use approximate functions: APPROX_COUNT_DISTINCT (HyperLogLog) instead of COUNT(DISTINCT) for 10-100x speedup with ~2% error. (5) Column pruning — select only needed columns. SELECT * reads all columns; SELECT revenue, customer_id reads only two. In columnar storage, this directly reduces I/O. (6) Avoid cross-joins and Cartesian products — these explode the result set. Always include JOIN conditions. In interviews, mention partitioning and clustering as the first optimization steps for any data warehouse query.

Scroll to Top