Analytics Data Pipeline System Design

Analytics pipelines move raw events from user actions into queryable dashboards. The core challenge: volume is enormous (billions of events per day at scale), latency requirements vary (real-time counters vs. nightly reports), and correctness matters (double-counting ad clicks costs money). This design walks through each layer.

Event Ingestion

Clients (web, mobile, server) send events through a thin SDK. The SDK batches events locally (typically 50 events or 5 seconds, whichever comes first) and POSTs them to an ingestion service over HTTPS. Key concerns at this layer:

  • Schema registry: Events are serialized with Avro or Protobuf. The schema registry (Confluent Schema Registry or AWS Glue Schema Registry) enforces compatibility. Producers register schemas; consumers look them up by the schema ID embedded in each message. This prevents schema drift from breaking downstream consumers.
  • Deduplication: Clients assign a UUID event_id to each event before batching. The ingestion service writes to Kafka with event_id as the idempotency key. A Redis set (with 24-hour TTL) or a Bloom filter rejects duplicates at the gateway before they reach Kafka.
  • Kafka topics: Partition by user_id or session_id to keep related events ordered. Replication factor 3, retention 7 days (serves as the replayable source of truth).
  • Backpressure: If Kafka is slow, the ingestion service returns HTTP 429. The SDK retries with exponential backoff. Events are buffered in local storage (IndexedDB on web, SQLite on mobile) to survive network outages.
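The deduplication path above can be sketched in a few lines. This is an illustrative stand-in, not production code: a plain dict with expiry timestamps plays the role of the Redis set with 24-hour TTL (in a real gateway the check-and-record would be a single atomic Redis `SET key NX EX 86400` call).

```python
import time
import uuid

class DedupGateway:
    """Reject duplicate events by event_id before they reach Kafka.

    An in-memory dict stands in for the Redis set described above;
    the TTL bounds memory the same way Redis key expiry would.
    """

    def __init__(self, ttl_seconds=86_400):
        self.ttl = ttl_seconds
        self.seen = {}  # event_id -> expiry timestamp

    def accept(self, event):
        now = time.time()
        # Lazily evict expired entries (Redis does this via key TTLs).
        self.seen = {eid: exp for eid, exp in self.seen.items() if exp > now}
        eid = event["event_id"]
        if eid in self.seen:
            return False  # duplicate (e.g., SDK retry): drop at the gateway
        self.seen[eid] = now + self.ttl
        return True  # first sighting: forward to Kafka

gateway = DedupGateway()
evt = {"event_id": str(uuid.uuid4()), "type": "click"}
assert gateway.accept(evt) is True   # first delivery accepted
assert gateway.accept(evt) is False  # retried delivery rejected
```

Because the SDK retries on 429s and network failures, at-least-once delivery into the gateway is the norm; this check is what turns it into effectively-once delivery into Kafka.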

Stream Processing

Kafka Streams or Apache Flink reads from Kafka topics and computes real-time aggregations. Examples: count clicks per page per minute, compute 99th percentile latency per API endpoint per 5-minute window, detect anomalies (spike in error rates).

  • Windowed aggregations: Tumbling windows (non-overlapping, e.g., 1-minute buckets), sliding windows (overlapping, e.g., last 5 minutes updated every 1 minute), session windows (gap-based, close after N minutes of inactivity).
  • Stateful operators: Joins (enrich click events with user profile data from a side input), aggregations (sum, count, HyperLogLog for unique counts), and CEP (complex event processing for detecting sequences like “add to cart then abandon”).
  • Exactly-once semantics: Flink’s checkpointing + Kafka transactions. Without this, a worker crash causes duplicate processing. At-least-once is acceptable if the sink is idempotent.
  • Output sinks: Stream processor writes to Redis (real-time counters), Druid or ClickHouse (OLAP queries), and a compacted Kafka topic (for downstream consumers).
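The tumbling-window case can be hand-rolled in a few lines — this is not Flink, just the windowing arithmetic, to show why each event lands in exactly one non-overlapping bucket when keyed by event time:

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def window_start(event_time_ms):
    # Tumbling windows are non-overlapping: rounding the event timestamp
    # down to the window size assigns each event to exactly one bucket.
    return event_time_ms - (event_time_ms % WINDOW_MS)

def count_clicks(events):
    """Count clicks per (page_id, window start), keyed by event time."""
    counts = defaultdict(int)
    for e in events:
        if e["type"] == "click":
            counts[(e["page_id"], window_start(e["ts"]))] += 1
    return dict(counts)

events = [
    {"type": "click", "page_id": 1, "ts": 10_000},
    {"type": "view",  "page_id": 1, "ts": 20_000},   # not a click: ignored
    {"type": "click", "page_id": 1, "ts": 59_999},   # same window as ts=10s
    {"type": "click", "page_id": 1, "ts": 60_000},   # first event of next window
]
assert count_clicks(events) == {(1, 0): 2, (1, 60_000): 1}
```

A sliding window would assign each event to several overlapping buckets, and a session window would have no fixed size at all — the bucket closes only after a gap of inactivity.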

Batch ETL

Nightly Spark jobs read raw events from object storage (landed there by Kafka Connect sinks to S3/GCS) or, for smaller volumes, directly from Kafka. They perform heavy transformations – sessionization, attribution modeling, cohort analysis – and load results into the data warehouse.

  • Idempotency: Partition output by date. On rerun, overwrite the partition. Never append. This makes reruns safe.
  • Partition pruning: Store raw events in Parquet partitioned by date/event_type. Spark reads only relevant partitions, dramatically cutting I/O.
  • Shuffle optimization: Broadcast small dimension tables (users, pages) to avoid expensive sort-merge joins. Use bucketing on high-cardinality join keys.
  • SLA: Batch jobs must complete before dashboards open in the morning. Alert if runtime exceeds 80% of the window.
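The overwrite-on-rerun rule can be illustrated with a dict standing in for a date-partitioned warehouse table (in Spark, the same idea roughly corresponds to overwriting a single date partition, e.g. `df.write.mode("overwrite").partitionBy("date")` with dynamic partition overwrite enabled):

```python
def run_etl(warehouse, raw_events, run_date):
    """Recompute one date partition from raw events and overwrite it wholesale.

    `warehouse` maps partition date -> aggregated rows. Overwriting (never
    appending) means a rerun after a partial failure converges to the same
    end state instead of double-counting.
    """
    partition = [e for e in raw_events if e["date"] == run_date]
    events_per_user = {}
    for e in partition:
        events_per_user[e["user_id"]] = events_per_user.get(e["user_id"], 0) + 1
    warehouse[run_date] = events_per_user  # overwrite, never append

warehouse = {}
raw = [
    {"date": "2024-01-01", "user_id": 1},
    {"date": "2024-01-01", "user_id": 1},
    {"date": "2024-01-02", "user_id": 2},
]
run_etl(warehouse, raw, "2024-01-01")
run_etl(warehouse, raw, "2024-01-01")  # rerun: identical result, no doubling
assert warehouse["2024-01-01"] == {1: 2}
```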

Lambda Architecture

Lambda splits processing into two parallel paths that are merged at query time:

  • Speed layer: Stream processing (Flink/Kafka Streams) produces approximate real-time views. Low latency (seconds), but may miss late-arriving events or have minor inaccuracies.
  • Batch layer: Periodic full reprocessing of all historical data produces accurate, complete views. High latency (hours), but correct.
  • Serving layer: Merges speed layer and batch layer results at query time. When a fresh batch run completes, it supersedes the speed layer for that time range.
  • Tradeoff: You maintain two codebases implementing the same logic. Bugs often exist in one but not the other. Keeping them in sync is operationally expensive. This drove the adoption of Kappa.
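Query-time merging in the serving layer can be sketched as follows — assuming hourly buckets and a `batch_high_watermark` marking the last hour the batch layer has fully processed (both names are illustrative, not from any particular framework):

```python
def serve(query_hours, batch_view, speed_view, batch_high_watermark):
    """Merge the two Lambda layers at query time.

    Hours at or before the batch high watermark come from the accurate
    batch view; newer hours fall back to the approximate speed view.
    """
    result = {}
    for h in query_hours:
        if h <= batch_high_watermark:
            result[h] = batch_view.get(h, 0)
        else:
            result[h] = speed_view.get(h, 0)
    return result

batch = {1: 100, 2: 205}  # complete, corrected counts (hours fully batched)
speed = {2: 198, 3: 57}   # approximate real-time counts
# Batch has processed through hour 2, so it supersedes speed for hour 2.
assert serve([1, 2, 3], batch, speed, batch_high_watermark=2) == \
    {1: 100, 2: 205, 3: 57}
```

Note that hour 2 exists in both views with different values — exactly the divergence the tradeoff bullet warns about.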

Kappa Architecture

Kappa eliminates the batch layer. Everything goes through streaming. Kafka’s long retention (weeks or months) acts as the replayable log. When you need to reprocess historical data (e.g., to fix a bug or backfill a new metric), you spin up a new streaming job reading from the beginning of the topic, write to a new output table, and then swap the pointer in the serving layer.
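A toy sketch of the replay-and-swap mechanic, with Python structures standing in for the Kafka log, the output tables, and the serving-layer pointer:

```python
def reprocess(log, transform):
    """Kappa-style backfill: replay the full log into a fresh output table."""
    table = {}
    for event in log:  # i.e., consume from offset 0 of the retained topic
        transform(table, event)
    return table

serving = {"clicks": None}  # the pointer dashboards read through

def count_by_page(table, event):
    table[event["page_id"]] = table.get(event["page_id"], 0) + 1

log = [{"page_id": 1}, {"page_id": 1}, {"page_id": 2}]
serving["clicks"] = reprocess(log, count_by_page)  # v1 of the job

# A bug fix changes the logic (say, v2 filters a known bot page):
def count_by_page_v2(table, event):
    if event["page_id"] != 2:
        table[event["page_id"]] = table.get(event["page_id"], 0) + 1

# Replay from the beginning into a new table, then swap the pointer.
serving["clicks"] = reprocess(log, count_by_page_v2)
assert serving["clicks"] == {1: 2}
```

The old table stays queryable until the new job catches up, so the swap is atomic from the dashboard's point of view.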

When to choose Kappa: Your streaming framework (Flink) can handle batch workloads efficiently. You value operational simplicity over the marginal accuracy benefit of Lambda. Most modern pipelines at mid-scale use Kappa.

When to stick with Lambda: Your batch jobs use complex transformations that are hard to express in a streaming framework. Your SLA requires bit-perfect historical accuracy. You have dedicated teams for each layer.

Data Warehouse Schema Design

Analytics warehouses use dimensional modeling (Kimball). Facts are numeric measurements of business events. Dimensions are descriptive attributes.

-- Fact table: one row per event, foreign keys to dimensions
CREATE TABLE fact_events (
    event_id       BIGINT PRIMARY KEY,
    user_id        BIGINT,
    page_id        BIGINT,
    event_type     VARCHAR(64),  -- 'click', 'view', 'purchase'
    event_ts       TIMESTAMP,
    session_id     BIGINT,
    revenue_cents  INT           -- NULL if not a purchase
)
DISTKEY(user_id)
SORTKEY(event_ts);

-- Dimension table: slowly changing, denormalized for query speed
CREATE TABLE dim_users (
    user_id        BIGINT PRIMARY KEY,
    country        VARCHAR(2),
    signup_date    DATE,
    plan_type      VARCHAR(32),  -- 'free', 'pro', 'enterprise'
    is_active      BOOLEAN
);

CREATE TABLE dim_pages (
    page_id        BIGINT PRIMARY KEY,
    page_url       VARCHAR(2048),
    category       VARCHAR(64),
    author_id      BIGINT
);

Star schema: fact table at center, dimensions directly joined. One join per dimension. Snowflake schema: dimensions are normalized (e.g., dim_users references dim_countries). Star is faster for queries; snowflake saves storage and eases updates.
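The star schema in action — a minimal sketch using Python's built-in sqlite3 as a stand-in warehouse (Redshift-specific DISTKEY/SORTKEY omitted, columns trimmed to what the query needs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_events (
        event_id INTEGER PRIMARY KEY, user_id INTEGER, page_id INTEGER,
        event_type TEXT, revenue_cents INTEGER);
    CREATE TABLE dim_users (
        user_id INTEGER PRIMARY KEY, country TEXT, plan_type TEXT);
""")
conn.executemany("INSERT INTO dim_users VALUES (?, ?, ?)",
                 [(1, "US", "pro"), (2, "DE", "free")])
conn.executemany("INSERT INTO fact_events VALUES (?, ?, ?, ?, ?)",
                 [(10, 1, 7, "purchase", 500),
                  (11, 1, 7, "purchase", 300),
                  (12, 2, 8, "purchase", 200)])

# Star-schema query: one join from the fact table to each dimension needed.
rows = conn.execute("""
    SELECT u.country, SUM(f.revenue_cents) AS revenue
    FROM fact_events f
    JOIN dim_users u ON u.user_id = f.user_id
    WHERE f.event_type = 'purchase'
    GROUP BY u.country
    ORDER BY revenue DESC
""").fetchall()
assert rows == [("US", 800), ("DE", 200)]
```

In a snowflake schema the same query would need a second hop (dim_users to a normalized dim_countries) to get country attributes.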

Column-Store vs Row-Store

Property        | Row Store (Postgres, MySQL)        | Column Store (Redshift, BigQuery, Snowflake, ClickHouse)
Storage layout  | All columns of a row together      | Each column stored separately
Analytics scan  | Reads unused columns               | Reads only queried columns
Compression     | Moderate (mixed types per block)   | High (same type per block, run-length encoding)
INSERT/UPDATE   | Fast (single I/O to write row)     | Slow (must update each column file)
Best for        | OLTP, low-latency point lookups    | OLAP, full-table scans with aggregations

Rule of thumb: if your query touches fewer than roughly 20% of a table's columns, a column store wins by a large margin. For transactional workloads (reading a full user record by ID), a row store is better.
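A toy illustration of the two layouts in the table above, with plain Python lists rather than a real storage engine:

```python
# Row layout: each record keeps every column together (one disk block
# per row in a real engine).
rows = [
    {"event_id": 1, "event_type": "click", "url": "/a", "ts": 100},
    {"event_id": 2, "event_type": "view",  "url": "/b", "ts": 101},
    {"event_id": 3, "event_type": "click", "url": "/a", "ts": 102},
]

# Column layout: one array per column; a scan touches only what it queries.
columns = {
    "event_id":   [1, 2, 3],
    "event_type": ["click", "view", "click"],
    "url":        ["/a", "/b", "/a"],
    "ts":         [100, 101, 102],
}

# SELECT COUNT(*) WHERE event_type = 'click':
# the row store must load whole records to inspect one field,
# while the column store reads a single (well-compressed) array.
row_count = sum(1 for r in rows if r["event_type"] == "click")
col_count = columns["event_type"].count("click")
assert row_count == col_count == 2
```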

Real-Time Dashboards

Sub-second dashboard queries require a separate serving store optimized for aggregation:

  • Apache Druid / ClickHouse: Ingest from Kafka in real time. Pre-aggregate at ingest time (roll up to minute-level granularity). Support fast GROUP BY queries over billions of rows because data is pre-sorted by time and dimension.
  • Redis for simple counters: For high-write, simple metrics (page views per minute), use Redis INCR with a time-bucketed key like views:page_id:2024010115. Expire after 48 hours. Dashboard reads with MGET. Sub-millisecond reads, massive write throughput.
  • Materialized views: Pre-compute expensive joins (e.g., daily active users by country) and store results. Refresh on a schedule (every 5 minutes) or incrementally.
  • Query result caching: Cache dashboard query results in Redis with a 60-second TTL. Acceptable staleness for most metrics dashboards.
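The Redis counter pattern above, sketched with a plain dict standing in for Redis (the real commands are noted in comments; the bucket format matches the views:page_id:2024010115 key shown in the bullet):

```python
from datetime import datetime, timezone

counters = {}  # stands in for Redis

def bucket_key(page_id, ts):
    # Hourly bucket, matching the views:{page_id}:{YYYYMMDDHH} key pattern.
    return f"views:{page_id}:{ts.strftime('%Y%m%d%H')}"

def record_view(page_id, ts):
    key = bucket_key(page_id, ts)
    counters[key] = counters.get(key, 0) + 1  # Redis: INCR key; EXPIRE key 172800

def read_buckets(keys):
    # Dashboard read over the chart's bucket range (Redis: one MGET call).
    return [counters.get(k, 0) for k in keys]

t = datetime(2024, 1, 1, 15, 30, tzinfo=timezone.utc)
record_view(42, t)
record_view(42, t)
assert counters["views:42:2024010115"] == 2
assert read_buckets(["views:42:2024010115", "views:42:2024010116"]) == [2, 0]
```

Missing buckets read as zero, which is exactly what a dashboard wants for quiet intervals.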

Data Quality

  • Schema validation on ingest: Reject events that fail schema validation at the gateway. Log rejections to a dead-letter queue for debugging. Never let malformed data pollute the pipeline silently.
  • Anomaly detection on metrics: Alert when a key metric drops more than 3 standard deviations below its rolling average. Common causes: SDK bug in a new app release, network partition preventing event delivery, bot traffic spike.
  • SLA monitoring on pipeline lag: Track consumer group lag in Kafka. Alert if the stream processor falls more than 5 minutes behind. Track Spark job duration and alert if it exceeds the SLA window.
  • Reconciliation: Periodically compare the stream layer totals with batch layer totals for the same time window. Discrepancy above a threshold triggers an alert and a manual review.
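The 3-sigma drop check from the anomaly-detection bullet is a few lines with the standard library (window size and values here are illustrative):

```python
import statistics

def is_anomalous_drop(history, latest, sigmas=3.0):
    """Flag a reading more than `sigmas` standard deviations below
    the rolling mean of recent readings."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return latest < mean - sigmas * stdev

# Ten recent readings of a steady metric, then two candidate new values.
history = [1000, 1010, 990, 1005, 995, 1002, 998, 1001, 1003, 997]
assert not is_anomalous_drop(history, 990)  # normal fluctuation
assert is_anomalous_drop(history, 400)      # likely SDK bug or outage: alert
```

A one-sided check (drops only) avoids paging on healthy traffic spikes; spikes deserve their own rule since they often indicate bots rather than breakage.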

Scale Numbers

To anchor the design in reality:

  • Meta’s pipeline processes 100B+ events per day – roughly 1.2 million events per second on average, with substantially higher peaks.
  • Kafka cluster: hundreds of brokers, petabytes of storage, millions of messages/second throughput.
  • Flink cluster: thousands of task managers, terabytes of managed state.
  • ClickHouse at Cloudflare: 10 trillion rows, 200TB of data, queries return in under a second.
  • A mid-sized SaaS (10M daily active users, 50 events per user): 500M events/day ≈ 6K events/second on average. Manageable with a modest Kafka cluster (5 brokers), a small Flink deployment, and a single ClickHouse node.
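The arithmetic behind the mid-sized SaaS estimate, spelled out (the 3x peak-to-average factor is an assumption for illustration, not a figure from the text):

```python
# Back-of-envelope sizing for a 10M-DAU product emitting 50 events/user/day.
dau = 10_000_000
events_per_user_per_day = 50

daily_events = dau * events_per_user_per_day     # 500M events/day
avg_per_second = daily_events / 86_400           # ~5.8K events/second average
peak_per_second = avg_per_second * 3             # assumed 3x peak-to-average

assert daily_events == 500_000_000
assert 5_500 < avg_per_second < 6_000
assert peak_per_second < 20_000  # still comfortably within a small cluster
```

Provisioning for peak rather than average is what turns "6K/second" into a 5-broker Kafka cluster instead of a single node.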

The architecture scales horizontally at every layer: more Kafka partitions for ingestion throughput, more Flink parallelism for stream processing, more ClickHouse shards for query throughput. Start small and add capacity as metrics show bottlenecks.

