Analytics vs Transactional Systems
OLTP (Online Transaction Processing) systems like PostgreSQL are optimized for short, frequent reads and writes to individual rows — user profiles, orders, payments. OLAP (Online Analytical Processing) systems are optimized for aggregate queries across billions of rows — “what was total revenue by country last quarter?” These workloads are fundamentally different: OLTP needs row-level access; OLAP needs to scan entire columns and aggregate. Columnar storage (ClickHouse, BigQuery, Snowflake, Redshift) stores each column as a contiguous byte sequence, enabling CPU-efficient sequential scans and high compression ratios. A query summing one column in a 100-column table reads only 1% of the data.
Columnar Storage Internals
Row storage: [user1_id, user1_name, user1_age, user2_id, user2_name, user2_age, …]. Columnar storage: [user1_id, user2_id, …][user1_name, user2_name, …][user1_age, user2_age, …]. Advantages: (1) Compression: values in the same column are homogeneous (all integers, all dates) and often low-cardinality — dictionary encoding can compress a VARCHAR country column from 8 bytes to 1 byte per row. (2) Vectorized execution: the query engine processes column chunks using SIMD CPU instructions, operating on many values per instruction (e.g., 8 or 16 32-bit values in a 256- or 512-bit register) instead of one at a time. ClickHouse scans billions of rows per second on a single machine using this approach. (3) Column pruning: queries that touch 3 of 100 columns read 97% less I/O.
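The dictionary-encoding advantage can be sketched in a few lines of Python. This is illustrative only — real engines (ClickHouse's LowCardinality type, Parquet dictionary pages) use binary formats and per-block dictionaries, but the idea is the same: replace each repeated string with a small integer code.

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer code.

    With <= 256 distinct values (e.g. country codes), each code fits
    in 1 byte instead of a multi-byte string per row.
    """
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

column = ["US", "DE", "US", "FR", "US", "DE"]
dictionary, codes = dictionary_encode(column)
# codes is a list of small ints; the dictionary is stored once per block.
```

Because the codes are small integers from a sorted, repetitive column, they also compress further under run-length or bit-packing schemes stacked on top.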
Data Ingestion Architecture
Analytics platforms receive data from multiple sources: application databases (via CDC — Change Data Capture, e.g. Debezium reading the PostgreSQL WAL or the MySQL binlog), event streams (Kafka topics from web/app events), and batch files (S3, GCS). The ingestion pipeline: raw events land in Kafka, a Flink job does lightweight validation and enrichment (user ID lookup, currency conversion), then writes to the analytics store in micro-batches. Micro-batch size is a latency-throughput tradeoff: smaller batches mean lower latency but more per-insert overhead; ClickHouse recommends batches of at least 1,000 rows. For sub-minute latency analytics, use a streaming buffer layer: new events go into a fast key-value store (Redis, DynamoDB) and are merged with historical data from the columnar store at query time.
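The micro-batching step can be sketched as a small buffer that flushes on a row-count or age threshold. A minimal Python sketch, assuming a `flush_fn` callback standing in for the real sink (e.g. a ClickHouse INSERT) — names and thresholds are illustrative:

```python
import time

class MicroBatcher:
    """Accumulate events; flush when the batch is big enough or old enough."""

    def __init__(self, flush_fn, max_rows=1000, max_age_s=5.0):
        self.flush_fn = flush_fn      # stand-in for the real insert call
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.buffer = []
        self.first_event_at = None

    def add(self, event):
        if not self.buffer:
            self.first_event_at = time.monotonic()
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_rows or
                time.monotonic() - self.first_event_at >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

batches = []
b = MicroBatcher(batches.append, max_rows=3)
for i in range(7):
    b.add({"event_id": i})
b.flush()  # flush the trailing partial batch on shutdown
```

The two thresholds are exactly the latency-throughput dial from the text: raising `max_rows` amortizes insert overhead; lowering `max_age_s` bounds how stale a dashboard can get.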
Pre-aggregation and Materialized Views
Ad hoc queries across billions of rows are too slow for dashboards (10+ seconds vs a sub-second requirement). Pre-aggregate at write time: maintain materialized views that roll up raw events into hourly and daily summaries. A dashboard showing “daily active users over 90 days” queries 90 pre-aggregated rows instead of 90 * millions of raw events. ClickHouse's SummingMergeTree engine merges rows with the same sort key by summing their numeric columns — a good fit for pre-aggregated metrics, though merges run asynchronously in the background, so queries should still GROUP BY and sum() for exact results. Tradeoff: pre-aggregation at fixed granularities loses flexibility for ad hoc queries. Lambda architecture addresses this: a batch layer (daily aggregates) handles historical data at scale; a speed layer (real-time stream) handles recent data with lower accuracy. Queries merge results from both layers.
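The SummingMergeTree-style rollup amounts to: rows sharing a sort key collapse into one row whose numeric columns are summed. A Python sketch of that merge, with an illustrative (hour, country) sort key and made-up metric columns:

```python
from collections import defaultdict

def rollup(rows):
    """Merge rows with the same sort key by summing numeric metrics.

    In ClickHouse this merge runs asynchronously during background
    part merges, which is why queries still need GROUP BY + sum().
    """
    merged = defaultdict(lambda: {"revenue": 0, "events": 0})
    for row in rows:
        key = (row["hour"], row["country"])
        merged[key]["revenue"] += row["revenue"]
        merged[key]["events"] += row["events"]
    return dict(merged)

raw = [
    {"hour": "2024-01-01T10", "country": "US", "revenue": 10, "events": 1},
    {"hour": "2024-01-01T10", "country": "US", "revenue": 5,  "events": 1},
    {"hour": "2024-01-01T10", "country": "DE", "revenue": 7,  "events": 1},
]
summary = rollup(raw)  # 3 raw rows collapse into 2 summary rows
```

A dashboard then reads the handful of summary rows per hour instead of every raw event, which is where the 90-rows-vs-millions win comes from.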
Partitioning and Indexing in ClickHouse
ClickHouse combines date-based partitioning with a primary sort key (ORDER BY) for fast range queries. Data is physically sorted by the ORDER BY columns, making range scans on those columns extremely efficient. Primary index: a sparse index stores one entry per granule (8,192 rows by default). A query on the primary key column narrows to the relevant granules without a full table scan. Secondary data-skipping indexes: bloom filter indexes on non-primary columns allow skipping entire data parts when a column value is provably absent. Partitioning by date (PARTITION BY toYYYYMM(event_date)) enables dropping an entire month's partition in milliseconds instead of deleting rows. Old partitions are dropped or moved to cold storage (S3-backed disks) automatically via TTL policies.
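The sparse-index mechanics are easy to demonstrate: keep only the first sort-key value of each granule, then binary-search those marks to find candidate granules for a range query. A simplified Python sketch (ClickHouse's real index also stores a mark per compressed block and handles compound keys):

```python
import bisect

GRANULE = 8192  # rows per granule (ClickHouse default)

def build_sparse_index(sorted_keys):
    """One mark per granule: the first key value in each granule."""
    return sorted_keys[::GRANULE]

def candidate_granules(index, lo, hi):
    """Granule indices that may contain sort-key values in [lo, hi]."""
    start = max(bisect.bisect_right(index, lo) - 1, 0)
    stop = bisect.bisect_right(index, hi)
    return range(start, stop)

keys = list(range(100_000))            # data already sorted by ORDER BY key
index = build_sparse_index(keys)       # ~13 marks instead of 100,000 keys
grans = candidate_granules(index, 20_000, 21_000)
# Only the granule covering rows 16384..24575 needs to be read.
```

The index stays tiny (one entry per 8,192 rows), so it fits in memory even for billion-row tables, which is the whole point of keeping it sparse.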
Multi-Tenant Analytics
Snowflake and similar platforms serve multiple customers from shared infrastructure. Isolation is achieved with virtual warehouses — each tenant or workload gets a compute cluster that scales independently of storage. Storage (S3) is shared but encrypted per-tenant; compute is separate. Query quotas and resource pools prevent noisy-neighbor problems: a slow, expensive query from one tenant does not affect others. Row-level security (RLS) ensures tenants see only their own data even in shared tables. This architecture enables compute/storage separation: storage costs are low and shared; compute scales to zero when idle and scales up on demand, charged per second.
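Row-level security boils down to a mandatory tenant predicate that the engine applies to every query on a shared table, regardless of what the tenant asks for. A toy Python sketch of that filter — the table shape and `tenant_id` column are illustrative, not any vendor's schema:

```python
def visible_rows(rows, tenant_id):
    """RLS as a mandatory predicate: the engine appends this filter to
    every query on the shared table; a tenant can never see other rows."""
    return [r for r in rows if r["tenant_id"] == tenant_id]

shared_table = [
    {"tenant_id": "acme",   "revenue": 10},
    {"tenant_id": "globex", "revenue": 20},
    {"tenant_id": "acme",   "revenue": 5},
]

acme_view = visible_rows(shared_table, "acme")  # globex rows are invisible
```

In a real warehouse this predicate is defined once as a policy attached to the table, so application queries cannot forget (or bypass) it.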
Interview Tips
- Columnar storage is the first thing to mention — know WHY (compression, vectorization, column pruning)
- Distinguish OLTP vs OLAP clearly — interviewers test whether you know when to use each
- Pre-aggregation vs ad hoc: explain the tradeoff, mention materialized views
- Lambda architecture (batch + speed layers) is the standard answer for mixed latency requirements
- For multi-tenant: compute/storage separation is the key insight from Snowflake
Frequently Asked Questions
What is the difference between OLTP and OLAP databases?
OLTP (Online Transaction Processing) databases like PostgreSQL and MySQL are optimized for high-frequency reads and writes to individual rows — user registrations, order placements, payment processing. They use row-oriented storage, normalized schemas, and B-tree indexes for point lookups. OLAP (Online Analytical Processing) databases like ClickHouse, Snowflake, and BigQuery are optimized for aggregate queries across billions of rows — total revenue by product category, DAU trend over 12 months. They use columnar storage where each column is stored contiguously, enabling vectorized processing and high compression ratios. OLAP queries touch a few columns out of hundreds, reading only the relevant data. The performance gap is stark: a full scan of 1 billion rows in PostgreSQL can take minutes to hours; ClickHouse aggregates the same data in seconds.
How does columnar storage improve query performance?
Columnar storage stores each column as a separate contiguous byte sequence on disk, rather than storing complete rows together. This improves query performance in three ways: (1) Column pruning: a query that selects 3 of 100 columns reads only 3% of the data on disk. Row storage always reads full rows, even for unused columns. (2) Compression: values in the same column are homogeneous (all integers, all timestamps, all country codes) and often low-cardinality. Dictionary encoding for a country column reduces 8-byte strings to 1-byte codes; run-length encoding for status columns stores repeated values with extreme efficiency. Typical compression ratios are 5-10x vs row storage. (3) Vectorized execution: the query engine reads column chunks into CPU cache and applies SIMD instructions that process many values per instruction (e.g., 16 32-bit integers in a 512-bit register), computing aggregates orders of magnitude faster than row-at-a-time processing.
What is a Lambda architecture and when do you use it?
Lambda architecture is a data processing pattern that handles both historical batch data and real-time streaming data by running two parallel pipelines: (1) Batch layer: periodically (hourly or daily) reprocesses the complete historical dataset to produce accurate, fully corrected aggregates. Handles late-arriving data and data corrections. High latency but high accuracy. (2) Speed layer: processes real-time data as it arrives, producing approximate or partial aggregates for recent data. Low latency but may be slightly inaccurate (missing late events). Query layer: merges results from both layers — historical data from the batch layer, recent data from the speed layer. Lambda architecture is used when you need both sub-second query latency for recent data AND accurate historical aggregates — common in analytics dashboards, fraud detection, and real-time monitoring. Its main drawback is operational complexity: two codebases for the same logic.
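The query layer's merge rule is the crux of Lambda: serve days at or before the batch cutoff from the batch layer, and newer days from the speed layer. A minimal Python sketch, with both layers as plain dicts standing in for a warehouse and a streaming store:

```python
from datetime import date

def merged_daily_counts(batch, speed, cutoff):
    """Batch layer is authoritative up to `cutoff`; speed layer covers
    everything after it. Speed-layer numbers may be approximate
    (late-arriving events missing) until the next batch run catches up."""
    result = {}
    for day, count in batch.items():
        if day <= cutoff:
            result[day] = count
    for day, count in speed.items():
        if day > cutoff:
            result[day] = result.get(day, 0) + count
    return result

batch = {date(2024, 1, 1): 100, date(2024, 1, 2): 120}
speed = {date(2024, 1, 2): 5, date(2024, 1, 3): 40}   # Jan 2 overlap ignored
daily = merged_daily_counts(batch, speed, cutoff=date(2024, 1, 2))
```

Note the overlap handling: the speed layer's partial count for the cutoff day is discarded because the batch layer already has the corrected figure — this is how the architecture self-heals as the batch boundary advances.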