System Design: Design Real-Time Analytics Platform — Clickstream Processing, Funnel Analysis, Cohort Analysis, OLAP

Real-time analytics platforms like Amplitude, Mixpanel, and Google Analytics process billions of events per day to provide instant insights: funnels, cohorts, retention, and custom queries. Designing an analytics platform tests your understanding of event ingestion at scale, real-time aggregation, OLAP query patterns, and the storage engines that make sub-second analytics possible on petabytes of data.

Event Ingestion Pipeline

Analytics events flow from client SDKs to the analytics backend. Event format: {event_type: "page_view", user_id: "u123", timestamp: "2026-04-20T15:30:00Z", properties: {page: "/pricing", referrer: "google.com", device: "mobile", country: "US"}}. Ingestion pipeline: (1) Client SDK (JavaScript, iOS, Android) batches events locally and sends them to the ingestion API every 30 seconds or when the batch reaches 10 events. (2) The ingestion API validates events (schema validation, required fields), enriches them (IP geolocation -> country/city, user agent parsing -> device/browser), and publishes to Kafka. (3) Kafka partitions by user_id for per-user ordering. Retention: 7 days for replay. (4) A stream processor (Flink) consumes from Kafka for: real-time aggregations (event counts per minute for live dashboards), sessionization (group events into user sessions based on a 30-minute inactivity timeout), and funnel step matching (identify which funnel step each event corresponds to). (5) Events are written to the analytics database (ClickHouse, Druid, or BigQuery) for ad-hoc queries. Scale: a mid-size SaaS product generates 100M events/day. A large platform (Spotify, Uber): 10B+ events/day. The ingestion API must handle 100K+ events/sec at peak. Kafka and ClickHouse handle this throughput with commodity hardware.
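The sessionization step in (4) reduces to a simple rule: start a new session whenever the gap since the previous event exceeds the inactivity timeout. A minimal in-memory sketch for one user's events, with the hypothetical `sessionize` helper standing in for the Flink job:

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(events):
    """Group one user's (timestamp, event_type) pairs into sessions.

    A new session starts whenever the gap since the previous event
    exceeds the 30-minute inactivity timeout.
    """
    sessions = []
    current = []
    last_ts = None
    for ts, event_type in sorted(events):
        if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
            sessions.append(current)   # close the previous session
            current = []
        current.append((ts, event_type))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

A real stream processor keeps this per-user state in a keyed window rather than sorting in memory, but the boundary rule is the same.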

Analytics Query Patterns

Analytics queries differ from transactional queries: they scan millions of rows, aggregate by dimensions, and filter by time ranges. Core query types: (1) Event count with breakdown — "How many page views per day, broken down by country, for the last 30 days?" SELECT date, country, COUNT(*) FROM events WHERE event_type = 'page_view' AND timestamp >= 30_days_ago GROUP BY date, country. (2) Funnel analysis — "What percentage of users who viewed the pricing page also started a trial and then subscribed?" This requires: finding users who performed step 1 (page_view /pricing), then checking which of those also performed step 2 (trial_start), and which of those also performed step 3 (subscription). The steps must occur in order within a time window (e.g., 7 days). (3) Cohort analysis — "For users who signed up in January, what is their retention rate by week?" Group users by signup date (cohort), then for each cohort compute the percentage active in week 1, week 2, …, week 12. (4) User segmentation — "How many users performed action X more than 3 times in the last 7 days?" Requires per-user aggregation: GROUP BY user_id HAVING COUNT(*) > 3, then COUNT the resulting users. These queries scan millions to billions of rows. Sub-second response times require a columnar OLAP database optimized for aggregate queries.
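The per-user aggregation behind query type (4) is a count-then-filter: count events per user, then keep the users over the threshold. A minimal in-memory sketch (the `segment_users` helper is illustrative, not a real API):

```python
from collections import Counter

def segment_users(events, action, min_count):
    """Count users who performed `action` more than `min_count` times.

    `events` is an iterable of (user_id, event_type) pairs, assumed
    already filtered to the time window of interest — the equivalent
    of GROUP BY user_id HAVING COUNT(*) > min_count, then COUNT(*).
    """
    per_user = Counter(uid for uid, etype in events if etype == action)
    return sum(1 for c in per_user.values() if c > min_count)
```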

Storage: ClickHouse and OLAP Engines

ClickHouse (Yandex, open-source) is the most popular self-hosted OLAP database for analytics. Why ClickHouse for analytics: (1) Columnar storage — reads only the columns needed for the query. A query on event_type and country reads 2 columns, not the entire 20-column row. (2) Compression — columns with similar values compress extremely well. event_type with 50 unique values across 1B rows compresses to a fraction of raw size. LZ4 compression provides 5-10x compression with near-zero CPU overhead. (3) Vectorized execution — processes data in batches using SIMD instructions. Millions of rows per second per core. (4) MergeTree engine — the primary storage engine. Data is sorted by a primary key (typically timestamp + dimensions). Range queries on the sort key are extremely fast (skip irrelevant data blocks). Partitioning by date enables efficient time-range queries and data lifecycle management. (5) Materialized views — pre-aggregate data at ingestion time. When an event is inserted into the raw table, a materialized view automatically updates an aggregation table (events per hour per country). Dashboard queries read the pre-aggregated table: instant response. Alternatives: Apache Druid (real-time ingestion + historical), Apache Pinot (LinkedIn, real-time OLAP), BigQuery (serverless, pay-per-query), and Snowflake (for batch analytics). For real-time dashboards: ClickHouse or Druid. For batch/ad-hoc: BigQuery or Snowflake.

Funnel and Cohort Implementation

Funnel implementation in ClickHouse: the windowFunnel() function. This function takes a time window and a series of conditions, and for each user computes how far they progressed through the funnel. SELECT level, COUNT() FROM (SELECT user_id, windowFunnel(604800)(timestamp, event = 'page_view', event = 'trial_start', event = 'subscribe') AS level FROM events WHERE timestamp >= 30_days_ago GROUP BY user_id) GROUP BY level. This returns: level 0 (did not start), level 1 (page view only), level 2 (page view + trial), level 3 (complete funnel). Conversion rates: level_N_count / level_1_count. Cohort retention in ClickHouse: (1) Determine each user's cohort (signup week) from their first event timestamp. (2) For each subsequent week: count active users from each cohort. (3) Divide by cohort size for the retention rate. This requires a self-join: the events table joined with a user_cohorts table (user_id, cohort_week). Pre-compute user_cohorts as a materialized view for performance. Real-time funnels: for live dashboards showing funnel conversion updating every minute, maintain a streaming funnel computation in Flink. For each user: track their current funnel state. On each new event: advance the state if the event matches the next funnel step. Emit the updated funnel distribution to a real-time dashboard via WebSocket.
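The windowFunnel() semantics can be made concrete in plain Python. `window_funnel` below is an illustrative approximation (the ClickHouse function also supports strict modes this sketch ignores): for each candidate step-1 event, walk forward matching the remaining steps in order, all within `window` seconds of the step-1 timestamp, and keep the deepest level reached:

```python
def window_funnel(events, steps, window):
    """Approximate ClickHouse windowFunnel() for one user.

    `events` is a list of (timestamp, event) pairs with numeric
    timestamps (seconds); `steps` is the ordered list of funnel
    events; returns the deepest level (0..len(steps)) reached
    within `window` seconds of a step-1 event.
    """
    events = sorted(events)
    best = 0
    for i, (t0, e0) in enumerate(events):
        if e0 != steps[0]:
            continue              # only start chains at step 1
        level, next_step = 1, 1
        for t, e in events[i + 1:]:
            if t - t0 > window:
                break             # outside the funnel window
            if next_step < len(steps) and e == steps[next_step]:
                level += 1
                next_step += 1
        best = max(best, level)
    return best
```

Grouping these per-user levels and counting, as in the SQL above, yields the funnel distribution.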

Multi-Tenancy and Query Isolation

An analytics SaaS serves thousands of customers. Each customer has their own events, dashboards, and queries. Isolation: (1) Data isolation — all events include an org_id column. Every query includes WHERE org_id = X. A customer cannot access another customer's events. In ClickHouse: partition by org_id for efficient per-customer queries and data lifecycle management (delete an org's data by dropping its partitions). (2) Query isolation — a single customer's heavy query should not degrade other customers' experience. ClickHouse settings: max_execution_time (kill queries running longer than 30 seconds), max_memory_usage (per-query memory limit), and max_concurrent_queries (per-user concurrency limit). Query queuing: if a customer exceeds their concurrency limit, queue additional queries and execute them when a slot opens. (3) Storage isolation — per-customer storage quotas. Track ingested event volume per org. Alert and throttle when approaching the limit. Tiered pricing: free (100K events/month), growth (10M events/month), enterprise (unlimited with dedicated resources). (4) Dedicated instances — for enterprise customers: a dedicated ClickHouse cluster with guaranteed resources. No noisy-neighbor risk. Higher cost but a guaranteed performance SLA.
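The query queuing in (2) is essentially a per-org semaphore in front of the database: a customer at their concurrency limit waits for one of their own slots, while other tenants are unaffected. A minimal sketch with a hypothetical `OrgQueryGate` (a real deployment would enforce this in the query gateway, with timeouts and fairness):

```python
import threading

class OrgQueryGate:
    """Per-org concurrency limiter: callers from an org that is at
    its max_concurrent_queries limit block (queue) until one of
    that org's in-flight queries finishes."""

    def __init__(self, max_concurrent=4):
        self.max_concurrent = max_concurrent
        self.sems = {}                 # org_id -> Semaphore
        self.lock = threading.Lock()   # guards self.sems

    def _sem(self, org_id):
        with self.lock:
            if org_id not in self.sems:
                self.sems[org_id] = threading.Semaphore(self.max_concurrent)
            return self.sems[org_id]

    def run(self, org_id, query_fn):
        sem = self._sem(org_id)
        with sem:                      # blocks only this org's callers
            return query_fn()
```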
