Data warehouses are purpose-built for analytical queries — aggregations, joins, and scans across billions of rows that would bring an OLTP database to its knees. Snowflake, BigQuery, and Redshift power the analytics behind business intelligence, reporting, and data science at companies from startups to enterprises. This guide covers data warehouse architecture, schema design, and query optimization — essential for data engineering and system design interviews.
OLTP vs OLAP
OLTP (Online Transaction Processing): the operational database serving your application. Characteristics: short, simple queries (get user by ID, insert an order), high concurrency (thousands of transactions per second), row-oriented storage (each row stored contiguously on disk), normalized schema (3NF to minimize redundancy), and low latency (single-digit milliseconds). Examples: PostgreSQL, MySQL, DynamoDB.

OLAP (Online Analytical Processing): the analytical database for reporting and business intelligence. Characteristics: complex queries (aggregate revenue by region by quarter with year-over-year comparison) that scan millions to billions of rows, columnar storage (each column stored separately for efficient scans and compression), denormalized schema (star/snowflake schema for query simplicity), and tolerance for higher latency (seconds to minutes). Examples: Snowflake, BigQuery, Redshift, ClickHouse.

Why separate systems: running analytical queries on an OLTP database degrades transactional performance. A report scanning 100 million rows ties up CPU, memory, and I/O needed by user-facing queries. The data warehouse receives data from the OLTP database via ETL/ELT pipelines and serves analytical workloads without impacting production.
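The two workload shapes can be sketched side by side with SQLite (a row store standing in for both here; the table and data are illustrative, not from any real system):

```python
import sqlite3

# In-memory database; schema and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER, region TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(i, i % 100, ["NA", "EU", "APAC"][i % 3], float(i)) for i in range(1, 10_001)],
)

# OLTP-style query: point lookup by primary key, touches one row.
row = conn.execute("SELECT revenue FROM orders WHERE order_id = 42").fetchone()

# OLAP-style query: aggregation that must scan every row in the table.
totals = conn.execute(
    "SELECT region, SUM(revenue) FROM orders GROUP BY region ORDER BY region"
).fetchall()
```

The point lookup uses the primary-key index and reads one row; the aggregation scans all 10,000 rows. At warehouse scale that scan is what an OLTP engine handles poorly and a columnar engine handles well.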
Columnar Storage
Row-oriented storage (PostgreSQL): stores all columns of a row together on disk. Reading one column requires reading the entire row. Efficient for OLTP (point lookups fetch the whole row); inefficient for analytics (scanning one column reads all columns). Columnar storage (Redshift, BigQuery, Parquet): stores each column separately, so reading one column reads only that column's data. Advantages for analytics: (1) I/O reduction — a query like SELECT AVG(price) FROM orders reads only the price column, skipping order_id, customer_id, shipping_address, etc. If price is 10% of the row size, columnar reads 10x less data. (2) Compression — columns contain homogeneous data (all integers, all dates, all short strings), which compresses much better than mixed-type rows. Run-length encoding for repeated values, dictionary encoding for low-cardinality strings, and delta encoding for timestamps achieve 5-20x compression. (3) Vectorized execution — process entire column vectors in CPU-friendly batches using SIMD instructions, handling millions of values per second. Trade-off: columnar storage is inefficient for point lookups (fetching all columns of one row requires reading from multiple column files). This is acceptable for analytics, where queries scan ranges rather than individual rows.
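Two of the encodings mentioned above can be sketched in a few lines (a toy illustration of the idea, not any engine's actual implementation; the `status` column is invented):

```python
from itertools import groupby

def run_length_encode(column):
    """RLE: collapse runs of repeated values into (value, count) pairs.
    Effective on sorted or low-cardinality columns."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def dictionary_encode(column):
    """Dictionary encoding: map each distinct string to a small integer
    code, so a wide string column becomes a narrow integer column."""
    dictionary = {}
    codes = []
    for value in column:
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes

# A low-cardinality string column, as columnar engines often see.
status = ["shipped"] * 6 + ["pending"] * 3 + ["shipped"] * 2

print(run_length_encode(status))  # 11 values stored as 3 (value, count) pairs
print(dictionary_encode(status))  # 2-entry dictionary plus small integer codes
```

Real engines pick an encoding per column chunk based on statistics, and often layer general-purpose compression (e.g., LZ4 or Zstandard) on top.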
Star Schema and Dimensional Modeling
Star schema is the standard data warehouse schema design. Center: a fact table containing measurable events (orders, clicks, transactions). Each row is one event with: numeric measures (revenue, quantity, duration) and foreign keys to dimension tables. Surrounding: dimension tables containing descriptive attributes. dim_customer (customer_id, name, segment, region), dim_product (product_id, name, category, brand), dim_date (date_id, date, month, quarter, year, is_holiday), dim_store (store_id, name, city, state). Example query: SELECT d.quarter, p.category, SUM(f.revenue) FROM fact_orders f JOIN dim_date d ON f.date_id = d.date_id JOIN dim_product p ON f.product_id = p.product_id WHERE d.year = 2026 GROUP BY d.quarter, p.category. The star schema denormalizes dimensions (customer address is stored in dim_customer, not in a separate dim_address table). This simplifies queries (fewer JOINs) at the cost of some data redundancy. Snowflake schema: dimensions are normalized (dim_product -> dim_category -> dim_department). More normalized but more JOINs. Star schema is preferred for query simplicity and performance.
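The example query above can be run end to end against a miniature star schema in SQLite (the dimensions are trimmed to only the columns the query needs, and all data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Minimal star schema: one fact table, two dimensions.
cur.execute("CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, quarter TEXT, year INTEGER)")
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
cur.execute("CREATE TABLE fact_orders (date_id INTEGER, product_id INTEGER, revenue REAL)")

cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(1, "Q1", 2026), (2, "Q2", 2026), (3, "Q1", 2025)])
cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(10, "electronics"), (20, "apparel")])
cur.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)",
                [(1, 10, 100.0), (1, 20, 50.0), (2, 10, 75.0), (3, 10, 999.0)])

# The query from the text: revenue by quarter and category for 2026.
rows = cur.execute("""
    SELECT d.quarter, p.category, SUM(f.revenue)
    FROM fact_orders f
    JOIN dim_date d ON f.date_id = d.date_id
    JOIN dim_product p ON f.product_id = p.product_id
    WHERE d.year = 2026
    GROUP BY d.quarter, p.category
    ORDER BY d.quarter, p.category
""").fetchall()
print(rows)  # [('Q1', 'apparel', 50.0), ('Q1', 'electronics', 100.0), ('Q2', 'electronics', 75.0)]
```

Note how the 2025 fact row is filtered out by a predicate on the date dimension, not on the fact table itself — the typical star-schema query shape.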
Snowflake, BigQuery, and Redshift
Snowflake: separates compute from storage. Data is stored in cloud object storage such as S3 (cheap, durable). Compute clusters (virtual warehouses) spin up to run queries and shut down when idle. Pay for storage and compute independently. Key features: automatic scaling (resize or add compute clusters on demand), zero-copy cloning (clone a database instantly for testing without duplicating storage), time travel (query data as of a past point within the configured retention period, up to 90 days), and multi-cloud (runs on AWS, GCP, Azure). BigQuery: serverless — no infrastructure to manage. Submit a query, BigQuery allocates resources, executes, and returns results. Pay per query (bytes scanned). Key features: no cluster management, automatic optimization, ML integration (BigQuery ML for in-warehouse model training), and petabyte-scale with no tuning. Cost control: partition and cluster tables to reduce bytes scanned. Redshift: AWS-managed columnar database. Provisioned clusters with fixed node types. Spectrum: query data in S3 directly without loading. Redshift Serverless: similar to BigQuery pay-per-query model. Most mature AWS-native option but requires more tuning than Snowflake or BigQuery. Choice: BigQuery for serverless simplicity, Snowflake for multi-cloud and compute-storage separation, Redshift for deep AWS integration.
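The pay-per-bytes-scanned cost model rewards partitioning and column pruning directly. A back-of-envelope sketch (all numbers, including the per-TiB rate, are illustrative assumptions — check current pricing; columns are assumed equal-width and partitions equal-sized):

```python
def estimated_query_cost_usd(rows, bytes_per_row, columns_selected, total_columns,
                             partitions_hit, total_partitions, price_per_tib=6.25):
    """Rough cost for a pay-per-bytes-scanned warehouse. Assumes equal-width
    columns and equal-sized partitions; price_per_tib is an illustrative
    rate, not a quote from any vendor's price list."""
    bytes_scanned = (rows * bytes_per_row
                     * (columns_selected / total_columns)
                     * (partitions_hit / total_partitions))
    return bytes_scanned / 2**40 * price_per_tib

# Full scan of a hypothetical 1-billion-row, 200-byte-per-row table...
full = estimated_query_cost_usd(1_000_000_000, 200, 50, 50, 365, 365)
# ...versus selecting 2 of 50 columns over 30 of 365 daily partitions.
pruned = estimated_query_cost_usd(1_000_000_000, 200, 2, 50, 30, 365)
print(round(full, 2), round(pruned, 4))
```

Under these assumptions the pruned query scans roughly 0.3% of the bytes — the same dashboard run daily adds up to a large cost difference.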
Query Optimization in Data Warehouses
Optimization strategies: (1) Partitioning — divide tables by a column (typically date). A query filtering WHERE date >= '2026-01-01' reads only the 2026 partitions, skipping years of historical data. BigQuery supports partition pruning natively. (2) Clustering/Sort keys — within each partition, sort data by frequently filtered columns (customer_id, product_category). This enables efficient range scans and skip-scanning. Redshift: SORTKEY. BigQuery: CLUSTER BY. Snowflake: automatic clustering. (3) Materialized views — pre-computed query results that are automatically refreshed. A dashboard showing daily revenue can read from a materialized view (instant) instead of scanning the fact table (minutes). (4) Approximate aggregation — for large-scale analytics, use approximate functions: APPROX_COUNT_DISTINCT (HyperLogLog) instead of COUNT(DISTINCT) for 10-100x speedup with ~2% error. (5) Column pruning — select only needed columns. SELECT * reads all columns; SELECT revenue, customer_id reads only two. In columnar storage, this directly reduces I/O. (6) Avoid cross-joins and Cartesian products — these explode the result set. Always include JOIN conditions. In interviews, mention partitioning and clustering as the first optimization steps for any data warehouse query.
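Partition pruning can be simulated in a few lines: the engine consults only partition metadata (the key) and never reads rows in partitions that fail the filter. A toy in-memory model (all structures and data are hypothetical):

```python
from collections import defaultdict
from datetime import date

# Hypothetical table partitioned by day: each day's rows live in a
# separate partition, as a warehouse would store them.
partitions = defaultdict(list)  # partition key (date) -> list of rows
start = date(2025, 1, 1).toordinal()
for day_offset in range(730):  # two years of daily partitions
    partitions[date.fromordinal(start + day_offset)].append({"revenue": 1.0})

def scan(min_date):
    """Evaluate SUM(revenue) WHERE date >= min_date with partition pruning:
    partitions whose key fails the filter are skipped via a metadata check
    alone — their rows are never touched."""
    partitions_scanned = 0
    total = 0.0
    for key, rows in partitions.items():
        if key < min_date:  # prune: no row I/O for this partition
            continue
        partitions_scanned += 1
        total += sum(r["revenue"] for r in rows)
    return partitions_scanned, total

scanned, total = scan(date(2026, 1, 1))
print(scanned, total)  # scans the 365 partitions from 2026, skips all of 2025
```

The same idea underlies clustering and zone maps: per-block min/max metadata lets the engine skip blocks without reading them.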