Low Level Design: Columnar Database Internals

Columnar databases power modern analytics workloads — data warehouses, BI dashboards, log analytics, and time-series aggregations. Understanding their internals helps you make the right storage and query engine choices, and explains why scanning a few columns across billions of rows can be faster than a row-oriented scan over a small fraction of that data.

Row Storage vs Columnar Storage

In a traditional row-oriented database (PostgreSQL, MySQL), each row is stored contiguously on disk: [id=1, name="Alice", age=30, salary=90000] [id=2, name="Bob", age=25, salary=75000]. Reading any column means reading all columns in that row — you pay full I/O even when your query only needs one field.

In a columnar store, each column is stored contiguously: [1,2,3,...] ["Alice","Bob",...] [30,25,...] [90000,75000,...]. A query computing AVG(salary) reads only the salary column and skips every other column entirely. For a table with 100 columns and a query touching 3, that is up to a 33x reduction in I/O.
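As a toy illustration of that I/O asymmetry, the following pure-Python sketch uses byte counts of in-memory values as a stand-in for disk reads (the table and field names are made up for the example):

```python
# Toy model of the I/O difference; stringified byte counts stand in for disk pages.
rows = [{"id": i, "name": f"user{i}", "age": 20 + i % 50, "salary": 50000 + i}
        for i in range(1000)]

# Row store: AVG(salary) touches every field of every row.
row_bytes = sum(len(str(v)) for r in rows for v in r.values())

# Column store: the same query reads only the salary column.
salary_col = [r["salary"] for r in rows]
col_bytes = sum(len(str(v)) for v in salary_col)

print(f"column scan reads {col_bytes / row_bytes:.0%} of the bytes")
avg_salary = sum(salary_col) / len(salary_col)
```

With more columns, or wide text columns, the gap grows: the columnar read cost depends only on the columns the query touches.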

OLTP workloads (transactional systems) read and write full rows frequently — row stores win there. OLAP workloads (analytical queries) scan few columns across millions of rows — columnar stores win there. This is why you see PostgreSQL for your application database and Snowflake or ClickHouse for your data warehouse.

Columnar Compression Techniques

Storing values of the same type — and often the same domain — together makes compression extremely effective, typically far better than general-purpose compression applied to row-oriented data.

  • Run-length encoding (RLE): replace consecutive repeated values with (value, count) pairs. A status column with 10,000 consecutive "active" records becomes a single pair. Effective on sorted or low-cardinality columns.
  • Dictionary encoding: assign integers to unique string values; store only integers in the column. A country column with 200 distinct values stored across 10 million rows stores one dictionary of 200 strings and 10 million small integers instead of 10 million variable-length strings.
  • Delta encoding: store differences between consecutive values rather than absolute values. Timestamps in a time series increment by small amounts — storing deltas produces tiny values that compress further with bitpacking.
  • Bitpacking: if a column’s values fit in fewer than 64 bits (e.g., ages 0–150 fit in 8 bits), store them packed. A column of 64-bit integers representing values 0–255 can be stored in 1/8 the space.
  • Frame of Reference (FOR): subtract a base value from all values in a block, then bitpack the smaller residuals.
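The encodings above can be sketched in a few lines of Python. This is illustrative only: the helper names are invented, and real engines pack the resulting small integers into contiguous bit-packed buffers rather than Python lists.

```python
def rle_encode(values):
    """Run-length encoding: collapse consecutive runs into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def dict_encode(values):
    """Dictionary encoding: map each unique value to a small integer code."""
    dictionary = {}
    codes = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return dictionary, codes

def delta_encode(values):
    """Delta encoding: store the first value plus successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def bits_needed(values):
    """Bitpacking width: bits required for the largest value in the block."""
    return max(v.bit_length() for v in values) or 1

status = ["active"] * 4 + ["closed"] * 2
print(rle_encode(status))                 # [('active', 4), ('closed', 2)]
ts = [1700000000, 1700000005, 1700000009]
print(delta_encode(ts))                   # [1700000000, 5, 4]
print(bits_needed(delta_encode(ts)[1:]))  # the deltas fit in 3 bits
```

Note how delta encoding feeds bitpacking: the raw timestamps need 31 bits each, but their deltas fit in 3.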

These techniques compound. A timestamp column might be delta-encoded, frame-of-reference shifted, and then bitpacked, achieving 10–50x compression versus raw storage.

Parquet File Format

Parquet is the dominant open columnar file format for data lakes (Hadoop, S3). It is self-describing (schema embedded in file footer) and designed for efficient partial reads.

Structure from outer to inner:

  • Row groups: the file is divided into horizontal slices of ~128MB each. A row group contains all columns for a subset of rows. Reading a row group loads data from disk sequentially.
  • Column chunks: within a row group, each column’s data is stored together. A query reading two columns from one row group seeks to each column chunk independently.
  • Pages: column chunks are split into pages (~1MB) which are the unit of encoding and compression. Each page has a header with value count and encoding metadata.
  • File footer: contains the schema, row group metadata, and column statistics. Read first to understand file layout without reading data pages.

Column statistics in the footer (min value, max value, null count per column chunk) enable predicate pushdown: WHERE date = '2024-01-15' can skip entire row groups whose date range doesn’t include that date, without reading any data pages.

Zone Maps and Min-Max Indexes

A zone map (also called a min-max index or small materialized aggregate) stores the minimum and maximum value of each column in a storage unit (row group, data part, or block). Before reading any data, the query engine checks the zone map: if the predicate value is outside [min, max], the entire unit is skipped.

Example: a table partitioned by day, with each day’s data sorted by user_id. A query for user_id = 12345 can skip any data part whose min_user_id > 12345 or max_user_id < 12345. If the data is well sorted and the predicate is selective, zone maps alone can eliminate 90%+ of the I/O.

Zone maps are trivially cheap to maintain (just track min/max during writes) and require no extra index structures. They are used pervasively in Parquet, ClickHouse MergeTree, Snowflake micro-partitions, and Delta Lake.
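A zone map is simple enough to sketch directly. The block size and column values below are made up; in practice the unit is a Parquet row group or a ClickHouse granule, and the (min, max) pairs are tracked as data is written:

```python
def build_zone_map(column, block_size=4):
    """One (min, max) entry per fixed-size block, maintained on write."""
    return [(min(column[i:i + block_size]), max(column[i:i + block_size]))
            for i in range(0, len(column), block_size)]

def blocks_to_read(zones, value):
    """Skip any block whose [min, max] range cannot contain the value."""
    return [i for i, (lo, hi) in enumerate(zones) if lo <= value <= hi]

user_ids = [1, 2, 3, 5, 8, 9, 11, 12, 20, 25, 30, 31]  # sorted on write
zones = build_zone_map(user_ids)
print(zones)                      # [(1, 5), (8, 12), (20, 31)]
print(blocks_to_read(zones, 11))  # [1] -> two of three blocks skipped
```

The pruning works for any comparison predicate (=, <, >, BETWEEN), and its effectiveness depends entirely on how well the sort order clusters values into blocks.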

Vectorized Execution

Traditional row-at-a-time query execution calls operator functions once per row — the interpreter overhead (function call, type dispatch, null checks) dominates for simple operations. Vectorized execution processes batches of 1,024–8,192 values per operator call instead.

Benefits:

  • SIMD utilization: modern CPUs have AVX-512 instructions that operate on 512 bits simultaneously. A batch of 16 32-bit integers can be compared or added in a single CPU instruction.
  • CPU cache efficiency: a vector of 1,024 integers fits in L1 cache. Processing it entirely before moving to the next vector avoids cache misses.
  • Reduced interpreter overhead: function call overhead is amortized over 1,024 rows instead of paid once per row.
  • Branch prediction: tight loops over homogeneous data are highly predictable.

DuckDB and ClickHouse are built from the ground up for vectorized execution. MonetDB pioneered the approach. Vectorized systems can be 10–100x faster than row-at-a-time interpreters on analytical queries.
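The operator interface can be sketched in pure Python. This shows only the batch-at-a-time control flow (one operator call per batch, not per value); the operator names are invented, and a real engine would replace the inner list comprehensions with SIMD kernels over contiguous arrays:

```python
BATCH = 1024

def scan(column, batch_size=BATCH):
    """Source operator: yield the column in fixed-size batches."""
    for i in range(0, len(column), batch_size):
        yield column[i:i + batch_size]

def gt(batches, threshold):
    """Filter operator: one function call per batch, not per value."""
    for batch in batches:
        yield [v for v in batch if v > threshold]

def total(batches):
    """Sink operator: aggregate across all batches."""
    return sum(sum(b) for b in batches)

col = list(range(10_000))
print(total(gt(scan(col), 9_995)))  # 9996 + 9997 + 9998 + 9999 = 39990
```

For a 10,000-row column, `gt` is invoked 10 times instead of 10,000 — that amortization of per-call overhead is the core of the vectorized model.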

Late Materialization

Early materialization reconstructs full rows as soon as possible and passes them through the pipeline. Late materialization defers row reconstruction until after all predicates and projections have been applied.

Process: apply predicate on compressed column A → get a bitset of matching row positions → apply predicate on column B using only those positions → intersect bitsets → only then fetch values from output columns for the surviving rows.

Late materialization means you never decompress columns for rows that will be filtered out. On selective queries this massively reduces decompression work. It also keeps columns in their compressed form longer, allowing operations like equality checks directly on dictionary codes without decoding to strings.
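The steps above can be sketched with two dictionary-encoded columns and a selection set of row positions (all column contents here are made up for illustration):

```python
# Late-materialization sketch: filter on encoded columns first, fetch values last.
country_dict  = {"US": 0, "DE": 1, "FR": 2}
country_codes = [0, 1, 0, 2, 0, 1]        # dictionary-encoded column A
status_codes  = [1, 1, 0, 1, 1, 0]        # column B; 1 = active
revenue       = [10, 20, 30, 40, 50, 60]  # output column, touched only at the end

# Predicate on column A runs directly on dictionary codes -- no string decoding.
sel_a = {i for i, c in enumerate(country_codes) if c == country_dict["US"]}
# Predicate on column B is evaluated only at the surviving positions.
sel = {i for i in sel_a if status_codes[i] == 1}
# Materialize: fetch output values only for rows that passed every filter.
print(sorted(revenue[i] for i in sel))  # [10, 50]
```

Note that the string "US" is looked up once in the dictionary; every per-row comparison afterwards is an integer compare on the encoded column.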

Partition Pruning

Partitioning divides a table into independent physical segments based on a partition key (typically date). Each partition is stored as a separate set of files. A query with a filter on the partition key can skip partitions entirely — not just skip pages within a file, but avoid opening the files at all.

A table partitioned by day with 3 years of data contains ~1,095 partitions. A query for the last 7 days reads 7 partitions and ignores 1,088. This happens before any I/O beyond listing partition directories. Combined with zone maps inside partitions, a highly selective query might read less than 0.1% of total data.

Partition pruning requires the query to include a predicate on the partition key that can be evaluated statically. Dynamic partition pruning (common in Spark and modern query planners) extends this by evaluating the predicate as part of query planning, even when the value comes from a subquery or join.
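Static pruning is just a filter over partition keys before any file is opened. The `dt=YYYY-MM-DD` directory naming and file names below are illustrative (Hive-style layouts use this convention, but nothing here depends on it):

```python
# Partition-pruning sketch, assuming one partition directory per day.
from datetime import date, timedelta

partitions = {f"dt={date(2024, 1, 1) + timedelta(days=d):%Y-%m-%d}": f"part-{d}.parquet"
              for d in range(30)}

def prune(partitions, lo, hi):
    """Keep only partitions whose key falls inside the predicate range.
    ISO date strings compare correctly as plain strings."""
    return {k: v for k, v in partitions.items() if lo <= k.split("=")[1] <= hi}

survivors = prune(partitions, "2024-01-24", "2024-01-30")
print(len(partitions), "->", len(survivors))  # 30 -> 7
```

Everything outside the range is eliminated from the file listing itself; the query engine never issues a read against the pruned partitions.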

ClickHouse: MergeTree Engine

ClickHouse is an open-source OLAP database designed for sub-second queries over billions of rows. Its core storage engine is MergeTree.

  • Data parts: inserts create immutable parts on disk (like LSM-tree SSTables). Background merges combine small parts into larger ones.
  • Primary index (sparse): not a B-tree but a sparse index stored in memory. Every 8,192 rows (one "granule") stores one index entry. Lookup finds the range of granules to read, not an exact row. Dramatically smaller than a dense B-tree index.
  • Sorting key: data within each part is sorted by the primary key. This maximizes RLE compression effectiveness and zone map selectivity.
  • Columnar storage: each column stored in its own .bin file with .mrk mark files for granule offsets.
  • Materialized views and projections: pre-aggregate data on write for common query patterns.
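The sparse index can be sketched as one mark per granule over a sorted key column (a simplified model: real MergeTree marks also carry file offsets, and a key can match a range of granules rather than exactly one):

```python
# Sparse primary-index sketch: one index entry per granule, not per row.
import bisect

GRANULE = 8192

def build_sparse_index(sorted_keys, granule=GRANULE):
    """Record the first key of every granule (ClickHouse-style marks)."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule)]

def granule_for(marks, key):
    """Binary-search the marks for the granule that could contain the key."""
    return max(bisect.bisect_right(marks, key) - 1, 0)

keys = list(range(0, 1_000_000, 3))  # sorted primary key column
marks = build_sparse_index(keys)
print(len(keys), "rows ->", len(marks), "index entries")
print("scan granule", granule_for(marks, 500_000))  # read ~8192 rows, not 333k
```

A dense B-tree over the same column would need one entry per row; the sparse index trades exact lookups (it returns a granule to scan, not a row) for an index small enough to pin in memory.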

DuckDB and Snowflake

DuckDB is an in-process OLAP database (like SQLite but for analytics). It runs inside your application process with no server, reads Parquet natively from disk or S3, and uses vectorized execution with SIMD. Ideal for local analytics, ETL scripts, and notebook workflows. A single DuckDB process can query 100GB Parquet files faster than many cluster-based systems because it avoids network overhead entirely.

Snowflake separates storage (S3) from compute (virtual warehouses). Data is stored in micro-partitions — 50–500MB immutable compressed columnar files with automatic zone maps. Snowflake clusters data within micro-partitions automatically based on query patterns (automatic clustering). Compute nodes cache micro-partitions locally on SSD. Scale compute up or down without touching storage. The architecture eliminates capacity planning: storage and compute scale independently.

Interview Checklist

  • Explain why columnar storage is better for OLAP: only read needed columns, better compression
  • Describe RLE, dictionary encoding, delta encoding, and bitpacking with concrete examples
  • Explain Parquet row groups, column chunks, and footer statistics
  • Describe zone maps and how they enable predicate pushdown without a traditional index
  • Explain vectorized execution: batches of 1024+ values, SIMD, cache efficiency
  • Describe late materialization: filter on compressed columns, reconstruct rows last
  • Explain partition pruning: skip entire partitions based on partition key predicates
  • Know ClickHouse MergeTree: sparse primary index, granules, data parts and merges
  • Know DuckDB for in-process analytics and Snowflake for cloud-native separation of storage and compute

Frequently Asked Questions

Why is columnar storage better for analytics queries?

Analytics queries (OLAP) typically read a few columns across millions of rows — for example, SELECT sum(revenue) FROM orders WHERE date > '2024-01-01'. Row storage reads entire rows including irrelevant columns, wasting I/O. Columnar storage reads only the needed columns, reducing I/O by 10-100x. Additionally, a column contains values of the same type and often similar values, enabling much higher compression ratios. Decompression is faster than the I/O savings, making columnar consistently faster for analytics.

How does dictionary encoding work in columnar databases?

Dictionary encoding replaces repeated string values with integer codes. The dictionary maps each unique string to an integer (e.g., 200 country names to the codes 0-199). The column stores integers instead of strings. A filter like WHERE country = 'US' translates the literal into its dictionary code once and then compares integers, never decoding the stored values back to strings. This is especially effective for low-cardinality columns (gender, country, status) where a small dictionary covers many rows. ClickHouse and Parquet both use dictionary encoding.

What is vectorized execution and how does it improve query performance?

Vectorized execution processes batches of values (vectors of 1024+ elements) rather than one row at a time. Each operator receives a batch, applies an operation using SIMD (Single Instruction Multiple Data) CPU instructions, and passes the result batch to the next operator. This improves CPU cache utilization (data fits in L1/L2 cache), enables SIMD parallelism (process 4-16 values per clock cycle), and reduces per-row interpretation overhead. DuckDB and ClickHouse use vectorized execution.

What is late materialization in a columnar query engine?

Late materialization defers reconstructing full rows until as late as possible in the query plan. The engine first evaluates predicates on individual compressed columns, producing a selection vector (bitset of matching row IDs). Only after filtering does it fetch values from other columns for the matching rows. This avoids decompressing and reading data for rows that will be filtered out. For highly selective queries (1% selectivity), late materialization can be 10x faster than eagerly assembling rows.

How does Parquet support predicate pushdown?

Parquet stores min/max statistics for each column chunk (group of rows). When a query has a predicate like WHERE date > '2024-06-01', the query engine reads only the column chunk statistics (from the file footer) and skips entire row groups where max(date) <= '2024-06-01'. This zone map (or min-max index) pruning avoids decompressing and reading data pages for eliminated row groups. For sorted or clustered data, this can skip the vast majority of the file.
