Low Level Design: Data Warehouse Design

A data warehouse is an analytical database optimized for read-heavy, aggregation-heavy workloads over large historical datasets. Unlike OLTP databases designed for transactional point lookups, a data warehouse stores denormalized fact and dimension tables in columnar format, enabling queries that scan billions of rows in seconds. This design appears in analytics platforms, business intelligence tools, and data lakehouse architectures.

Star Schema: Fact and Dimension Tables

The star schema organizes data into a central fact table (events with measurable quantities) surrounded by dimension tables (descriptive attributes). A fact table for e-commerce orders would store: order_id, customer_id, product_id, date_id, quantity, revenue. Dimension tables hold slowly changing attributes: customer name/segment, product category, date hierarchy (year/quarter/month/week/day). Joins stay star-shaped: the fact table joins directly to each of N dimension tables, avoiding the multi-hop join chains that degrade query performance.

-- Star schema example
CREATE TABLE fact_orders (
  order_id     BIGINT,
  customer_id  INT,      -- FK to dim_customers
  product_id   INT,      -- FK to dim_products
  date_id      INT,      -- FK to dim_date
  quantity     INT,
  revenue_usd  DECIMAL(12,2),
  cost_usd     DECIMAL(12,2)
) WITH (orientation=column, compress=YES);  -- columnar storage (engine-specific option; syntax varies)

-- Dimension table
CREATE TABLE dim_customers (
  customer_id  INT PRIMARY KEY,
  name         VARCHAR(255),
  country      VARCHAR(64),
  segment      VARCHAR(32),   -- enterprise, smb, consumer
  acquired_at  DATE
);

-- Analytical query: revenue by country, last 90 days
SELECT c.country, SUM(f.revenue_usd) AS revenue
FROM fact_orders f
JOIN dim_customers c USING (customer_id)
JOIN dim_date d USING (date_id)
WHERE d.date >= CURRENT_DATE - 90
GROUP BY c.country ORDER BY revenue DESC;

Columnar Storage: Why It Dominates OLAP

Row-oriented storage (PostgreSQL, MySQL) reads entire rows from disk. For a query touching only 3 of 50 columns, 94% of I/O is wasted. Columnar storage (Parquet, BigQuery, Redshift, Snowflake) stores each column contiguously. The same query reads only the 3 relevant columns. Combined with dictionary encoding (replace repeated strings with integer codes) and run-length encoding (store repeat values as count + value), typical compression ratios are 5-10x. A 1TB row-oriented table becomes 100-200GB columnar, with proportional I/O savings.
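The two encodings above can be illustrated with a small sketch (the values are invented for illustration):

-- Dictionary + run-length encoding of a low-cardinality column
--   raw country column:  USA, USA, USA, UK, UK, USA
--   dictionary:          0 -> USA, 1 -> UK
--   dictionary-encoded:  0, 0, 0, 1, 1, 0
--   run-length-encoded:  (0,3), (1,2), (0,1)   -- (code, run length)
-- Six variable-length strings shrink to three small (code, count) pairs;
-- sorted or clustered data produces longer runs and better ratios.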

Partitioning and Clustering

Partition fact tables by date (most queries filter on a time range). BigQuery and Snowflake use partition pruning: a WHERE clause on the partition column eliminates entire partition files without scanning them. Within partitions, cluster by high-cardinality filter columns (e.g., CLUSTER BY customer_id, product_id). Clustering co-locates rows with the same dimension values, improving filter selectivity. Together, partitioning + clustering can reduce query cost by 10-100x for typical BI workloads.
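As a sketch, BigQuery-style DDL for a partitioned, clustered fact table might look like the following; the order_date DATE column is an assumption (the earlier schema uses an integer date_id), and the exact syntax varies by engine:

-- BigQuery-style partitioning + clustering (syntax is engine-specific)
CREATE TABLE fact_orders (
  order_id     BIGINT,
  customer_id  INT,
  product_id   INT,
  order_date   DATE,      -- partition column (assumed here in place of date_id)
  quantity     INT,
  revenue_usd  NUMERIC
)
PARTITION BY order_date               -- WHERE order_date >= ... prunes whole partitions
CLUSTER BY customer_id, product_id;   -- co-locates rows for selective filters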

ELT Pipeline: Ingestion Architecture

Modern data warehouses use ELT (Extract, Load, Transform) rather than ETL. Raw data lands in object storage (S3/GCS) first, then gets loaded into the warehouse in its raw form, then transformed with SQL using dbt or Dataform. The pipeline: OLTP databases → CDC (Debezium) → Kafka → Flink (deduplication, schema normalization) → S3 (Parquet files) → warehouse COPY command → staging tables → dbt transformation → production tables. This separates ingestion concerns from transformation, enabling replay and debugging.
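The warehouse-side load step can be sketched with Snowflake-style COPY syntax; the stage and table names are assumptions, and credential/stage setup is omitted:

-- Snowflake-style load of Parquet files from object storage into staging
COPY INTO staging_orders
FROM @raw_stage/orders/                   -- external stage pointing at S3 (assumed)
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;  -- map Parquet columns to table columns by name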

Slowly Changing Dimensions (SCD Type 2)

Customer attributes change over time (customer moves country, changes segment). SCD Type 2 handles this by adding new rows rather than updating: add valid_from, valid_to, and is_current columns. When a customer moves from USA to UK, the old row gets valid_to = today, is_current = FALSE, and a new row is inserted with valid_from = today, is_current = TRUE. Historical fact rows correctly join to the dimension row that was current at the time of the event, preserving historical accuracy.
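The flip described above can be sketched as two statements; note that a production SCD Type 2 dimension would key on a surrogate key rather than the customer_id primary key shown earlier, and the literal values here are invented:

-- SCD Type 2: expire the current row, then insert the new version
UPDATE dim_customers
SET valid_to = CURRENT_DATE,
    is_current = FALSE
WHERE customer_id = 42
  AND is_current = TRUE;

INSERT INTO dim_customers
  (customer_id, name, country, segment, valid_from, valid_to, is_current)
VALUES
  (42, 'Jane Doe', 'UK', 'enterprise', CURRENT_DATE, NULL, TRUE);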

Materialized Views and Aggregation Tables

Pre-aggregate common query patterns into materialized views or aggregate tables. A dashboard querying daily revenue by region should not re-scan 3 years of fact data on every page load. Maintain a daily_revenue_by_region table updated incrementally each day. Tools like dbt handle incremental materialization with merge semantics. For sub-second BI dashboards, export pre-aggregated results to Redis or an OLAP-specialized engine like Apache Druid or ClickHouse that keeps aggregated data in memory.
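One way to maintain the daily_revenue_by_region table incrementally is a MERGE keyed on (day, country), as sketched below; table and column names beyond those in the star schema above are assumptions:

-- Incremental daily refresh via MERGE (re-aggregates only yesterday)
MERGE INTO daily_revenue_by_region t
USING (
  SELECT d.date AS day, c.country, SUM(f.revenue_usd) AS revenue
  FROM fact_orders f
  JOIN dim_customers c USING (customer_id)
  JOIN dim_date d USING (date_id)
  WHERE d.date = CURRENT_DATE - 1
  GROUP BY d.date, c.country
) s
ON t.day = s.day AND t.country = s.country
WHEN MATCHED THEN UPDATE SET revenue = s.revenue
WHEN NOT MATCHED THEN INSERT (day, country, revenue)
  VALUES (s.day, s.country, s.revenue);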

Key Interview Discussion Points

  • Star schema vs. snowflake schema: when to normalize dimension tables
  • Columnar storage: dictionary encoding, RLE compression, vectorized execution
  • Partition pruning vs. clustering: which eliminates more I/O for your query patterns
  • SCD Type 1 vs. Type 2 vs. Type 3: choosing based on historical accuracy requirements
  • dbt incremental models: append vs. merge strategies for large fact tables
  • Data lakehouse: Delta Lake / Iceberg enabling ACID transactions on object storage