Cohort Analysis System Low-Level Design: User Grouping, Retention Funnel, and Time-Series Metrics

Requirements and Constraints

A cohort analysis system groups users by a shared characteristic or event at a point in time (the cohort definition) and then tracks behavioral metrics for those groups over subsequent time periods. Classic examples include retention curves (what fraction of users who signed up in week N were still active in week N+k?) and revenue cohorts (how much did users acquired in Q1 contribute in each subsequent quarter?).

Functional requirements include defining cohorts by event type, time range, user properties, or computed segment; computing retention funnels with configurable time buckets; tracking arbitrary metrics per cohort over time; and supporting both pre-computed cohort tables for dashboards and ad-hoc queries for exploration.

Non-functional requirements: Dashboard queries must complete in under 3 seconds for up to 12 months of retention data across millions of users. Ad-hoc cohort queries must complete in under 60 seconds. Cohort definitions must be replayable — adding a new metric to an existing cohort should not require reprocessing all raw events.

Scale Assumptions

  • 50 million users; 10 million new users per month (cohort size)
  • Raw event volume: 1 billion events per day
  • Cohort retention computed at daily, weekly, and monthly granularities
  • Funnel steps: up to 10 events in sequence

Core Data Model

-- Cohort definitions
CREATE TABLE cohorts (
  id              UUID PRIMARY KEY,
  name            VARCHAR(256) UNIQUE NOT NULL,
  description     TEXT,
  entry_event     VARCHAR(128) NOT NULL,  -- event that qualifies a user for this cohort
  entry_property  JSONB,                  -- filter on event properties
  time_window_start TIMESTAMPTZ NOT NULL,
  time_window_end   TIMESTAMPTZ NOT NULL,
  created_by      VARCHAR(128),
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Cohort membership (computed once per cohort definition)
CREATE TABLE cohort_members (
  cohort_id       UUID NOT NULL REFERENCES cohorts(id),
  user_id         VARCHAR(256) NOT NULL,
  cohort_date     DATE NOT NULL,  -- the date/period bucket the user belongs to
  first_event_at  TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (cohort_id, user_id)
);

-- Pre-aggregated retention metrics
CREATE TABLE cohort_retention (
  cohort_id       UUID NOT NULL,
  cohort_period   DATE NOT NULL,   -- e.g. 2025-01-01 for the Jan 2025 weekly cohort
  period_offset   SMALLINT NOT NULL, -- periods since cohort entry at the given granularity (0 = entry period, 1 = next day/week/month)
  granularity     VARCHAR(16) NOT NULL,  -- day, week, month
  active_users    BIGINT NOT NULL,
  cohort_size     BIGINT NOT NULL,
  retention_rate  DOUBLE PRECISION NOT NULL,  -- active_users / cohort_size
  metric_sum      DOUBLE PRECISION,           -- e.g. total revenue for this bucket
  metric_avg      DOUBLE PRECISION,
  metric_sketch   BYTEA,                      -- serialized percentile sketch (t-digest centroids)
  PRIMARY KEY (cohort_id, cohort_period, period_offset, granularity)
);
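Dashboard rendering typically pivots these rows into the familiar lower-triangle retention chart. A minimal sketch of that pivot, assuming rows of (cohort_period, period_offset, retention_rate) for a single granularity (the helper name is illustrative, not part of the schema above):

```python
def retention_matrix(rows):
    """Pivot cohort_retention rows into {cohort_period: [rate@0, rate@1, ...]}.

    rows: iterable of (cohort_period, period_offset, retention_rate),
    all at one granularity. Offsets a cohort has not yet reached are
    left as None, producing the lower-triangle shape of a retention chart.
    """
    by_period = {}
    max_offset = 0
    for period, offset, rate in rows:
        by_period.setdefault(period, {})[offset] = rate
        max_offset = max(max_offset, offset)
    return {
        period: [cells.get(k) for k in range(max_offset + 1)]
        for period, cells in sorted(by_period.items())
    }
```

Younger cohorts naturally have shorter filled prefixes, which is why the chart tapers: the January 13 cohort below has no week-1 data yet.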

-- Funnel step definitions
CREATE TABLE funnel_steps (
  id              UUID PRIMARY KEY,
  cohort_id       UUID NOT NULL REFERENCES cohorts(id),
  step_order      SMALLINT NOT NULL,
  event_name      VARCHAR(128) NOT NULL,
  property_filter JSONB,
  max_interval_seconds BIGINT  -- max time from previous step (NULL = no limit)
);

Cohort Computation Pipeline

Cohort membership is computed by a batch job triggered when a cohort definition is created or updated. The job runs on Spark or BigQuery and executes in two phases:

Phase 1 — Member identification: Scan the raw events table filtered to the cohort's time window and entry event type. Apply property filters (e.g., country = 'US', plan = 'pro'). Select the first qualifying event per user. Insert results into cohort_members with the cohort_date bucketed to the desired granularity (e.g., truncate to ISO week).

Phase 2 — Retention computation: Join cohort_members against the full events table for the return event (typically any active session event, or a specific conversion event). For each member, determine the offset in days/weeks/months between their first_event_at and each subsequent qualifying event. Group by (cohort_id, cohort_period, period_offset, granularity) and count distinct users. Write results to cohort_retention.

Both phases use partition pruning on the events table (partitioned by event date) to limit scan range. Incremental updates append new period offsets daily without reprocessing historical offsets.
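The two phases above can be sketched in plain Python over (user_id, event_name, timestamp) tuples. This is an illustration of the logic, not the Spark/BigQuery implementation; the weekly cohort_period bucketing (truncate to the Monday of the ISO week) and the 7-day offset are assumptions for the example:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def identify_members(events, entry_event, window_start, window_end):
    """Phase 1: earliest qualifying entry event per user within the window."""
    first_seen = {}
    for user_id, name, ts in events:
        if name == entry_event and window_start <= ts < window_end:
            if user_id not in first_seen or ts < first_seen[user_id]:
                first_seen[user_id] = ts
    return first_seen  # user_id -> first_event_at

def compute_retention(members, events, return_event, granularity_days=7):
    """Phase 2: distinct active users per (cohort_period, period_offset)."""
    active = defaultdict(set)  # (cohort_period, offset) -> {user_id}
    for user_id, name, ts in events:
        if name != return_event or user_id not in members:
            continue
        entry = members[user_id]
        offset = (ts.date() - entry.date()).days // granularity_days
        if offset < 0:
            continue  # ignore return events before cohort entry
        # truncate the entry date to the Monday of its ISO week
        cohort_period = entry.date() - timedelta(days=entry.date().weekday())
        active[(cohort_period, offset)].add(user_id)
    size = len(members)
    return {k: (len(v), size, len(v) / size) for k, v in active.items()}
```

The returned tuples map directly onto the (active_users, cohort_size, retention_rate) columns of cohort_retention.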

Funnel Computation

Funnel analysis requires ordered event matching with time constraints between steps. For each cohort member, the algorithm processes their event stream and applies a greedy left-to-right match:

  1. Find the earliest occurrence of step 1 event at or after first_event_at.
  2. Find the earliest occurrence of step 2 event after step 1, within max_interval_seconds.
  3. Continue for each subsequent step.
  4. Record how far each user progressed through the funnel.

In SQL this is implemented with window functions and self-joins, or more efficiently with a Spark UDF that processes each user's sorted event array. Funnel results are aggregated per (cohort_period, step_order) and stored for dashboard rendering.

Time-Series Metric Aggregation

Beyond retention (binary active/inactive), cohorts track continuous metrics like revenue, session length, and feature adoption. For each cohort member and each period offset, the pipeline computes:

  • Sum, count, and sum-of-squares of the metric value for variance computation
  • Distinct user count (for average-per-user calculations)
  • Percentiles (p50, p90, p99) using approximate algorithms (t-digest or KLL sketch)

Sketch data (t-digest centroids) is stored in a dedicated binary column on the cohort_retention table rather than in metric_sum (a DOUBLE PRECISION column cannot hold a serialized blob), enabling sketches to be merged when rolling up daily cohort periods into weekly or monthly views without reprocessing raw events.
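The mergeability property can be illustrated with a greatly simplified centroid model. This is not a faithful t-digest (a real implementation uses a scale function to keep the tails high-resolution, and production systems would use a library rather than hand-rolled code); it only shows why two shards' digests can be combined without touching raw values:

```python
def merge_digests(a, b, max_centroids=100):
    """Merge two centroid lists [(mean, weight), ...] and re-compress.

    Simplified stand-in for t-digest merging: concatenate, sort by mean,
    then greedily fuse the closest adjacent pair until under the budget.
    Total weight is preserved, so merged digests stay query-able.
    """
    merged = sorted(a + b)
    while len(merged) > max_centroids:
        i = min(range(len(merged) - 1), key=lambda j: merged[j + 1][0] - merged[j][0])
        (m1, w1), (m2, w2) = merged[i], merged[i + 1]
        fused = ((m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2)
        merged[i:i + 2] = [fused]
    return merged

def quantile(centroids, q):
    """Approximate quantile from weighted centroids (midpoint rule)."""
    total = sum(w for _, w in centroids)
    target = q * total
    cum = 0.0
    for mean, w in centroids:
        cum += w
        if cum >= target:
            return mean
    return centroids[-1][0]
```

Merging daily digests into a weekly digest this way is what lets the rollup jobs answer p50/p90/p99 without a second pass over the billion-events-per-day stream.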

API Design

POST /v1/cohorts                  -- define cohort (triggers async computation)
GET  /v1/cohorts/{id}/status      -- computation job status
GET  /v1/cohorts/{id}/retention   -- retention curve, ?granularity=week&metric=revenue
GET  /v1/cohorts/{id}/funnel      -- funnel drop-off per cohort period
GET  /v1/cohorts/{id}/timeseries  -- arbitrary metric time series per cohort period
POST /v1/cohorts/compare          -- overlay retention curves for multiple cohorts

Scalability Considerations

  • Pre-aggregation: Dashboard queries read only from cohort_retention (hundreds of rows per cohort), never from raw events. Query latency is dominated by the join across a small result set.
  • Ad-hoc queries: Run against a columnar store (BigQuery, ClickHouse, or Redshift) with the raw events table. Partition pruning on event date limits scan to the relevant window. Materialized cohort membership tables eliminate the re-identification scan.
  • Incremental updates: Each nightly job appends only new period offsets (e.g., day 30 for cohorts that are 30 days old). It never rewrites existing offsets unless the cohort definition changes.
  • Cohort size limits: Very large cohorts (all users in a month = 10M rows) are processed in sharded Spark jobs. The cohort_members table is indexed on (cohort_id, cohort_date) for efficient per-period retrieval during retention computation.
  • Late event handling: Events arriving more than 24 hours late trigger a re-computation of the affected period offset using a merge-insert pattern (INSERT ... ON CONFLICT DO UPDATE) rather than a full reprocess.
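The merge-insert pattern for late events can be demonstrated end to end with SQLite, which shares the INSERT ... ON CONFLICT DO UPDATE syntax with PostgreSQL (the schema below is a trimmed copy of cohort_retention; the row values are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cohort_retention (
        cohort_id      TEXT NOT NULL,
        cohort_period  TEXT NOT NULL,
        period_offset  INTEGER NOT NULL,
        granularity    TEXT NOT NULL,
        active_users   INTEGER NOT NULL,
        cohort_size    INTEGER NOT NULL,
        retention_rate REAL NOT NULL,
        PRIMARY KEY (cohort_id, cohort_period, period_offset, granularity)
    )
""")

UPSERT = """
    INSERT INTO cohort_retention VALUES (?, ?, ?, ?, ?, ?, ?)
    ON CONFLICT (cohort_id, cohort_period, period_offset, granularity)
    DO UPDATE SET active_users   = excluded.active_users,
                  retention_rate = excluded.retention_rate
"""

# initial nightly write for week 1 of a January cohort
conn.execute(UPSERT, ("c1", "2025-01-06", 1, "week", 4200, 10000, 0.42))
# late events raise the recomputed count; the same statement overwrites in place
conn.execute(UPSERT, ("c1", "2025-01-06", 1, "week", 4207, 10000, 0.4207))

row = conn.execute(
    "SELECT active_users, retention_rate FROM cohort_retention"
).fetchone()
print(row)  # -> (4207, 0.4207)
```

Because the primary key covers the full aggregation grain, the recomputation job is idempotent: replaying it for an affected offset converges to the same row instead of duplicating it.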


