A columnar database stores table data organized by column rather than by row, making it dramatically more efficient for analytical queries that scan many rows but only a few columns. While row-oriented databases (PostgreSQL, MySQL) read entire rows from disk (wasting I/O on unused columns), columnar databases (BigQuery, Redshift, ClickHouse, Parquet files) read only the queried columns. This design is the foundation of modern data warehouses and OLAP systems capable of scanning billions of rows per second.
Row vs. Columnar Storage Layout
Row store: each row is stored contiguously (id, name, age, salary, department). A scan for average salary must read every column, including name, age, and department: wasted I/O. Columnar store: each column is stored contiguously (all ids together, all names together, all salaries together), so the same scan reads only the salary column: minimal I/O. For a table with 50 columns and a query touching 3, columnar storage cuts I/O by 94%. The trade-off: inserting a single row requires writing to 50 column segments, and reading a full row requires reassembling it from 50 column files. Columnar storage is optimal for analytical reads and poorly suited to write-heavy transactional workloads.
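The layout difference can be sketched in a few lines of plain Python (an illustration only, not a real storage engine; the table values are made up):

```python
# Row store: each record stored together. Any scan touches whole rows.
rows = [
    (1, "Ada",   36, 120_000, "eng"),
    (2, "Grace", 45, 135_000, "eng"),
    (3, "Alan",  41,  99_000, "research"),
]

# Column store: one contiguous sequence per column.
columns = {
    "id":         [1, 2, 3],
    "name":       ["Ada", "Grace", "Alan"],
    "age":        [36, 45, 41],
    "salary":     [120_000, 135_000, 99_000],
    "department": ["eng", "eng", "research"],
}

# Average salary from the row store reads every field of every row.
avg_row = sum(r[3] for r in rows) / len(rows)

# The column store reads just the salary column; the other four
# columns are never touched.
avg_col = sum(columns["salary"]) / len(columns["salary"])

assert avg_row == avg_col == 118_000
```

In a real engine the same principle applies at the level of disk pages: only the pages holding the queried columns are ever fetched.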
-- ClickHouse: columnar OLAP database
CREATE TABLE events (
    event_date  Date,
    user_id     UInt64,
    event_type  LowCardinality(String), -- dictionary encoded
    page_url    String,
    duration_ms UInt32,
    revenue     Decimal(10,2)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)          -- partition pruning
ORDER BY (event_date, event_type, user_id) -- sorting key for range queries
SETTINGS index_granularity = 8192;         -- sparse index every 8192 rows
-- Analytical query: scans only event_date, event_type, revenue columns
-- ClickHouse reads 3/6 columns, skips irrelevant partitions
SELECT event_type, SUM(revenue) FROM events
WHERE event_date >= today() - 30
GROUP BY event_type ORDER BY SUM(revenue) DESC;
Columnar Compression Techniques
Columnar storage compresses extremely well because values within a column share a type and are often similar or repeated. Common techniques:
- Dictionary encoding: replace repeated string values with integer codes (e.g., country names: USA=1, UK=2, Germany=3). Reduces low-cardinality string columns by 5-10x.
- Run-length encoding (RLE): store (value, count) pairs for repeated consecutive values; ideal for sorted or near-sorted columns (status=active appearing 10,000 times in a row collapses to one pair).
- Delta encoding: store differences between consecutive values; ideal for monotonic sequences such as timestamps (each row is +1 second).
- Bit-packing: pack small integers into fewer bits than their declared width.
Combined, these techniques routinely achieve 5-10x compression over uncompressed row data.
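Toy versions of three of these encodings, as a sketch (real engines emit compact binary formats, not Python lists):

```python
def dictionary_encode(values):
    """Replace repeated strings with small integer codes."""
    codes, dictionary = [], {}
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return codes, dictionary

def rle_encode(values):
    """Run-length encoding: a list of [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def delta_encode(values):
    """Store the first value plus differences between neighbors."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

countries = ["USA", "UK", "USA", "Germany", "USA"]
codes, dictionary = dictionary_encode(countries)
# codes == [0, 1, 0, 2, 0]

status = ["active"] * 4 + ["inactive"] * 2
# rle_encode(status) == [["active", 4], ["inactive", 2]]

timestamps = [1700000000, 1700000001, 1700000002, 1700000005]
# delta_encode(timestamps) == [1700000000, 1, 1, 3]
```

Note how delta encoding turns large timestamps into tiny integers, which bit-packing can then squeeze further; encodings are typically stacked.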
Vectorized Execution
Vectorized execution processes data in batches (vectors) of 1024-8192 values using SIMD CPU instructions (SSE, AVX). Instead of a row-at-a-time loop (one function call per row), the engine calls a function once per batch: sum(salary_column[0:8192]). A single SIMD instruction then operates on multiple values at once (four float64 values with AVX2, eight with AVX-512). Combined with columnar storage (the column is already a contiguous array), vectorized execution is 10-100x faster than row-at-a-time processing for aggregation queries. ClickHouse, DuckDB, Apache Arrow, and Velox all use vectorized execution engines.
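A rough sketch of the control-flow difference in pure Python. This only models the dispatch pattern (one call per row vs one call per batch); real engines run SIMD intrinsics over contiguous native arrays, which stdlib Python cannot express:

```python
from array import array

salaries = array("d", range(100_000))  # contiguous float64 column

def sum_row_at_a_time(col):
    """One 'operator step' per row, with per-value interpreter overhead."""
    total = 0.0
    for v in col:
        total += v
    return total

def sum_vectorized(col, batch=8192):
    """One call per batch; inside the batch, a single tight builtin
    loop runs over contiguous memory."""
    total = 0.0
    for i in range(0, len(col), batch):
        total += sum(col[i:i + batch])  # one call covers 8192 values
    return total

assert sum_row_at_a_time(salaries) == sum_vectorized(salaries)
```

The batched version amortizes per-call overhead across thousands of values, which is the same effect (at a coarser grain) that SIMD delivers at the instruction level.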
Key Interview Discussion Points
- Zone maps / min-max indexes: each column chunk stores min/max values; queries with WHERE salary > 100000 can skip entire chunks where max < 100000 without decompressing
- Sort order matters: sorting data by the most-frequently-filtered column (event_date) improves compression (same dates cluster) and enables efficient range scans
- Parquet as an open columnar format: Apache Parquet is a columnar file format supported by Spark, Flink, BigQuery, Athena — the universal format for data lakes
- HTAP (Hybrid Transactional/Analytical Processing): systems like TiDB and SingleStore serve both OLTP and OLAP from one database by maintaining both row and columnar storage
- Distribution keys in Redshift: the distribution key (how rows are sharded across nodes) is critical for join performance; distribute both tables on the join key to avoid data movement (shuffling) during joins
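The zone-map idea from the first bullet can be sketched in a few lines (a minimal illustration with made-up data, not a real index implementation):

```python
def build_zone_map(column, chunk_size):
    """Split a column into chunks and record each chunk's min/max."""
    chunks, zones = [], []
    for i in range(0, len(column), chunk_size):
        chunk = column[i:i + chunk_size]
        chunks.append(chunk)
        zones.append((min(chunk), max(chunk)))
    return chunks, zones

def scan_gt(chunks, zones, threshold):
    """WHERE value > threshold, skipping chunks whose max <= threshold."""
    hits, chunks_scanned = [], 0
    for chunk, (lo, hi) in zip(chunks, zones):
        if hi <= threshold:   # whole chunk excluded by the zone map
            continue
        chunks_scanned += 1
        hits.extend(v for v in chunk if v > threshold)
    return hits, chunks_scanned

# Sorted salary column: 20,000 values in 20 chunks of 1,000.
salaries = list(range(0, 200_000, 10))
chunks, zones = build_zone_map(salaries, 1_000)
hits, chunks_scanned = scan_gt(chunks, zones, 190_000)
assert chunks_scanned == 1   # 19 of 20 chunks skipped without a read
```

Because the column is sorted, the min/max ranges of the chunks don't overlap, so the filter prunes all but the tail chunk. This is also why the sort order chosen in ORDER BY directly affects how much data a filtered query must touch.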