Data warehouses are purpose-built for analytical queries — aggregations, joins, and scans across billions of rows that would bring an OLTP database to its knees. Snowflake, BigQuery, and Redshift power the analytics behind business intelligence, reporting, and data science at companies from startups to enterprises. This guide covers data warehouse architecture, schema design, and query optimization — essential for data engineering and system design interviews.
OLTP vs OLAP
OLTP (Online Transaction Processing): the operational database serving your application. Characteristics: short, simple queries (get user by ID, insert an order), high concurrency (thousands of transactions per second), row-oriented storage (each row stored contiguously on disk), normalized schema (3NF to minimize redundancy), and low latency (single-digit milliseconds). Examples: PostgreSQL, MySQL, DynamoDB.

OLAP (Online Analytical Processing): the analytical database for reporting and business intelligence. Characteristics: complex queries (aggregate revenue by region by quarter with year-over-year comparison) that scan millions to billions of rows, columnar storage (each column stored separately for efficient scans and compression), denormalized schema (star/snowflake schema for query simplicity), and tolerance for higher latency (seconds to minutes). Examples: Snowflake, BigQuery, Redshift, ClickHouse.

Why separate systems: running analytical queries on an OLTP database degrades transactional performance. A report scanning 100 million rows ties up CPU, memory, and I/O needed by user-facing queries. The data warehouse receives data from the OLTP database via ETL/ELT pipelines and serves analytical workloads without impacting production.
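The two workload shapes can be sketched side by side with SQLite (a row store standing in for both here; the table and data are illustrative, not from any real system):

```python
import sqlite3

# In-memory database; schema and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER, region TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(i, i % 100, ["NA", "EU", "APAC"][i % 3], float(i)) for i in range(1, 10_001)],
)

# OLTP-style query: point lookup by primary key, touches one row.
row = conn.execute("SELECT revenue FROM orders WHERE order_id = 42").fetchone()

# OLAP-style query: aggregation that must scan every row in the table.
totals = conn.execute(
    "SELECT region, SUM(revenue) FROM orders GROUP BY region ORDER BY region"
).fetchall()
```

The point lookup uses the primary-key index and reads one row; the aggregation scans all 10,000 rows. At warehouse scale that scan is what an OLTP engine handles poorly and a columnar engine handles well.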
Columnar Storage
Row-oriented storage (PostgreSQL): stores all columns of a row together on disk. Reading one column requires reading the entire row. Efficient for OLTP (point lookups fetch the whole row); inefficient for analytics (scanning one column reads all columns). Columnar storage (Redshift, BigQuery, Parquet): stores each column separately, so reading one column reads only that column's data. Advantages for analytics: (1) I/O reduction — a query like SELECT AVG(price) FROM orders reads only the price column, skipping order_id, customer_id, shipping_address, etc. If price is 10% of the row size, columnar reads 10x less data. (2) Compression — columns contain homogeneous data (all integers, all dates, all short strings), which compresses much better than mixed-type rows. Run-length encoding for repeated values, dictionary encoding for low-cardinality strings, and delta encoding for timestamps achieve 5-20x compression. (3) Vectorized execution — process entire column vectors in CPU-friendly batches using SIMD instructions, handling millions of values per second. Trade-off: columnar storage is inefficient for point lookups (fetching all columns of one row requires reading from multiple column files). This is acceptable for analytics, where queries scan ranges rather than individual rows.
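Two of the encodings mentioned above can be sketched in a few lines (a toy illustration of the idea, not any engine's actual implementation; the `status` column is invented):

```python
from itertools import groupby

def run_length_encode(column):
    """RLE: collapse runs of repeated values into (value, count) pairs.
    Effective on sorted or low-cardinality columns."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def dictionary_encode(column):
    """Dictionary encoding: map each distinct string to a small integer
    code, so a wide string column becomes a narrow integer column."""
    dictionary = {}
    codes = []
    for value in column:
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes

# A low-cardinality string column, as columnar engines often see.
status = ["shipped"] * 6 + ["pending"] * 3 + ["shipped"] * 2

print(run_length_encode(status))  # 11 values stored as 3 (value, count) pairs
print(dictionary_encode(status))  # 2-entry dictionary plus small integer codes
```

Real engines pick an encoding per column chunk based on statistics, and often layer general-purpose compression (e.g., LZ4 or Zstandard) on top.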
Star Schema and Dimensional Modeling
Star schema is the standard data warehouse schema design. Center: a fact table containing measurable events (orders, clicks, transactions). Each row is one event with: numeric measures (revenue, quantity, duration) and foreign keys to dimension tables. Surrounding: dimension tables containing descriptive attributes. dim_customer (customer_id, name, segment, region), dim_product (product_id, name, category, brand), dim_date (date_id, date, month, quarter, year, is_holiday), dim_store (store_id, name, city, state). Example query: SELECT d.quarter, p.category, SUM(f.revenue) FROM fact_orders f JOIN dim_date d ON f.date_id = d.date_id JOIN dim_product p ON f.product_id = p.product_id WHERE d.year = 2026 GROUP BY d.quarter, p.category. The star schema denormalizes dimensions (customer address is stored in dim_customer, not in a separate dim_address table). This simplifies queries (fewer JOINs) at the cost of some data redundancy. Snowflake schema: dimensions are normalized (dim_product -> dim_category -> dim_department). More normalized but more JOINs. Star schema is preferred for query simplicity and performance.
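The example query above can be run end to end against a miniature star schema in SQLite (the dimensions are trimmed to only the columns the query needs, and all data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Minimal star schema: one fact table, two dimensions.
cur.execute("CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, quarter TEXT, year INTEGER)")
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
cur.execute("CREATE TABLE fact_orders (date_id INTEGER, product_id INTEGER, revenue REAL)")

cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(1, "Q1", 2026), (2, "Q2", 2026), (3, "Q1", 2025)])
cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(10, "electronics"), (20, "apparel")])
cur.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)",
                [(1, 10, 100.0), (1, 20, 50.0), (2, 10, 75.0), (3, 10, 999.0)])

# The query from the text: revenue by quarter and category for 2026.
rows = cur.execute("""
    SELECT d.quarter, p.category, SUM(f.revenue)
    FROM fact_orders f
    JOIN dim_date d ON f.date_id = d.date_id
    JOIN dim_product p ON f.product_id = p.product_id
    WHERE d.year = 2026
    GROUP BY d.quarter, p.category
    ORDER BY d.quarter, p.category
""").fetchall()
print(rows)  # [('Q1', 'apparel', 50.0), ('Q1', 'electronics', 100.0), ('Q2', 'electronics', 75.0)]
```

Note how the 2025 fact row is filtered out by a predicate on the date dimension, not on the fact table itself — the typical star-schema query shape.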
Snowflake, BigQuery, and Redshift
Snowflake: separates compute from storage. Data is stored in cloud object storage such as S3 (cheap, durable). Compute clusters (virtual warehouses) spin up to run queries and shut down when idle. Pay for storage and compute independently. Key features: automatic scaling (resize or add compute clusters on demand), zero-copy cloning (clone a database instantly for testing without duplicating storage), time travel (query data as of a past point within the configured retention period, up to 90 days), and multi-cloud (runs on AWS, GCP, Azure). BigQuery: serverless — no infrastructure to manage. Submit a query, BigQuery allocates resources, executes, and returns results. Pay per query (bytes scanned). Key features: no cluster management, automatic optimization, ML integration (BigQuery ML for in-warehouse model training), and petabyte-scale with no tuning. Cost control: partition and cluster tables to reduce bytes scanned. Redshift: AWS-managed columnar database. Provisioned clusters with fixed node types. Spectrum: query data in S3 directly without loading. Redshift Serverless: similar to BigQuery pay-per-query model. Most mature AWS-native option but requires more tuning than Snowflake or BigQuery. Choice: BigQuery for serverless simplicity, Snowflake for multi-cloud and compute-storage separation, Redshift for deep AWS integration.
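The pay-per-bytes-scanned cost model rewards partitioning and column pruning directly. A back-of-envelope sketch (all numbers, including the per-TiB rate, are illustrative assumptions — check current pricing; columns are assumed equal-width and partitions equal-sized):

```python
def estimated_query_cost_usd(rows, bytes_per_row, columns_selected, total_columns,
                             partitions_hit, total_partitions, price_per_tib=6.25):
    """Rough cost for a pay-per-bytes-scanned warehouse. Assumes equal-width
    columns and equal-sized partitions; price_per_tib is an illustrative
    rate, not a quote from any vendor's price list."""
    bytes_scanned = (rows * bytes_per_row
                     * (columns_selected / total_columns)
                     * (partitions_hit / total_partitions))
    return bytes_scanned / 2**40 * price_per_tib

# Full scan of a hypothetical 1-billion-row, 200-byte-per-row table...
full = estimated_query_cost_usd(1_000_000_000, 200, 50, 50, 365, 365)
# ...versus selecting 2 of 50 columns over 30 of 365 daily partitions.
pruned = estimated_query_cost_usd(1_000_000_000, 200, 2, 50, 30, 365)
print(round(full, 2), round(pruned, 4))
```

Under these assumptions the pruned query scans roughly 0.3% of the bytes — the same dashboard run daily adds up to a large cost difference.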
Query Optimization in Data Warehouses
Optimization strategies: (1) Partitioning — divide tables by a column (typically date). A query filtering WHERE date >= '2026-01-01' reads only the 2026 partitions, skipping years of historical data. BigQuery supports partition pruning natively. (2) Clustering/Sort keys — within each partition, sort data by frequently filtered columns (customer_id, product_category). This enables efficient range scans and skip-scanning. Redshift: SORTKEY. BigQuery: CLUSTER BY. Snowflake: automatic clustering. (3) Materialized views — pre-computed query results that are automatically refreshed. A dashboard showing daily revenue can read from a materialized view (instant) instead of scanning the fact table (minutes). (4) Approximate aggregation — for large-scale analytics, use approximate functions: APPROX_COUNT_DISTINCT (HyperLogLog) instead of COUNT(DISTINCT) for 10-100x speedup with ~2% error. (5) Column pruning — select only needed columns. SELECT * reads all columns; SELECT revenue, customer_id reads only two. In columnar storage, this directly reduces I/O. (6) Avoid cross-joins and Cartesian products — these explode the result set. Always include JOIN conditions. In interviews, mention partitioning and clustering as the first optimization steps for any data warehouse query.
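Partition pruning can be simulated in a few lines: the engine consults only partition metadata (the key) and never reads rows in partitions that fail the filter. A toy in-memory model (all structures and data are hypothetical):

```python
from collections import defaultdict
from datetime import date

# Hypothetical table partitioned by day: each day's rows live in a
# separate partition, as a warehouse would store them.
partitions = defaultdict(list)  # partition key (date) -> list of rows
start = date(2025, 1, 1).toordinal()
for day_offset in range(730):  # two years of daily partitions
    partitions[date.fromordinal(start + day_offset)].append({"revenue": 1.0})

def scan(min_date):
    """Evaluate SUM(revenue) WHERE date >= min_date with partition pruning:
    partitions whose key fails the filter are skipped via a metadata check
    alone — their rows are never touched."""
    partitions_scanned = 0
    total = 0.0
    for key, rows in partitions.items():
        if key < min_date:  # prune: no row I/O for this partition
            continue
        partitions_scanned += 1
        total += sum(r["revenue"] for r in rows)
    return partitions_scanned, total

scanned, total = scan(date(2026, 1, 1))
print(scanned, total)  # scans the 365 partitions from 2026, skips all of 2025
```

The same idea underlies clustering and zone maps: per-block min/max metadata lets the engine skip blocks without reading them.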