What is funnel analysis and how is it used to measure user conversion?

Funnel analysis measures the percentage of users who complete each sequential step in a defined workflow, such as landing page → product view → add to cart → checkout → purchase. By computing the drop-off rate between consecutive steps, product and growth teams identify where users abandon the flow and prioritize optimizations. A funnel is defined as an ordered sequence of event names with an optional conversion window (e.g., all steps must occur within 7 days). The funnel analysis service queries the event store to find, for each user, the earliest timestamp at which they completed each step in order within the window, then counts unique users at each step to produce a conversion funnel chart.

How does a funnel analysis service match event sequences with a conversion window?

A funnel analysis service matches event sequences by joining the event table to itself once per funnel step. For each user it finds the earliest timestamp for step 1, then finds the earliest timestamp for step 2 that is after step 1 and within the conversion window, and so on for all steps. In SQL this is expressed as a series of self-joins on user_id with timestamp ordering constraints. In a columnar OLAP engine like ClickHouse, specialized window functions (windowFunnel) or sequence match operators perform this more efficiently without explicit joins. For very high-cardinality event tables, the computation is parallelized by partitioning on user_id so each worker processes a subset of users independently, and results are aggregated at the end.

How does real-time funnel analysis differ from batch funnel computation?

Batch funnel computation runs periodic SQL queries over the full historical event table (hourly or daily) and is simple to implement but produces stale results. Real-time funnel analysis maintains per-user state in a stream processor: as each event arrives, the processor looks up the user's current funnel progress (stored in a state store like RocksDB or Redis), advances their position if the event matches the next expected step, and emits a conversion or drop-off record when the user completes or exceeds the window. Aggregated funnel counts are updated incrementally and made available with seconds of latency. The tradeoff is complexity: real-time funnels require managing state expiry (the conversion window), handling out-of-order events, and dealing with late data, all of which batch queries handle automatically by re-scanning history.

How are cohort segments applied to funnel analysis?

Cohort segments are applied to funnel analysis by filtering the user population to only those who match a segment definition before running the funnel sequence match. A segment can be a static list of user IDs (e.g., users who signed up in January), a behavioral filter (users who have ever completed a purchase), or a demographic filter (users in the US on iOS). At query time the funnel service joins the event table with the segment membership table or evaluates segment conditions as additional WHERE predicates. This allows product teams to compare funnel conversion rates across cohorts — for example, comparing new vs. returning users or free vs. paid subscribers — to determine whether differences in conversion are attributable to user characteristics or to the product experience.

Low Level Design: Funnel Analysis Service

⏱ 12 min read

Overview

A funnel analysis service determines what fraction of users complete an ordered sequence of steps — for example, landing page → signup → onboarding → first purchase. It must handle arbitrary event sequences, configurable conversion windows, step-level filtering, cohort segmentation, and both real-time and batch computation modes. This low level design covers the data model, matching algorithm, storage schema, query API, and the key trade-offs between real-time and batch approaches.

Requirements

Functional

Define funnels as an ordered list of event steps, each with optional property filters.
Compute per-step user counts and conversion rates.
Support conversion windows (e.g., user must complete all steps within 7 days of step 1).
Support strict ordering (steps must occur in exact order) and loose ordering (steps can have intervening events).
Segment results by dimensions (country, device, plan, etc.).
Support cohort analysis (group users by the date they entered the funnel).
Provide both near-real-time results (data updated within 5 minutes) and historical results.

Non-functional

Support funnels with up to 10 steps.
Query response time < 5 s for last-30-day funnels on tenants with up to 10 M users.
Event throughput: shares the 1 M events/sec ingestion pipeline with the broader analytics system.
Support up to 1 000 active funnel definitions per tenant.

Funnel Definition Schema

{
  funnel_id:   "f_abc123",
  tenant_id:   "acme",
  name:        "Signup to Purchase",
  steps: [
    {
      step_index: 0,
      event_type: "landing_page_view",
      filters: []
    },
    {
      step_index: 1,
      event_type: "signup_completed",
      filters: [
        {field: "properties.source", op: "eq", value: "organic"}
      ]
    },
    {
      step_index: 2,
      event_type: "onboarding_completed",
      filters: []
    },
    {
      step_index: 3,
      event_type: "purchase",
      filters: [
        {field: "properties.plan", op: "in", value: ["pro","enterprise"]}
      ]
    }
  ],
  conversion_window_hours: 168,   // 7 days
  order_mode: "strict",           // or loose
  created_at: "2024-04-01T00:00:00Z"
}

Event Sequence Matching Algorithm

Core Problem

Given a user's event history (ordered by timestamp), determine which funnel steps they completed in order within the conversion window. The algorithm must be correct under both strict and loose ordering modes.

Strict Ordering (most common)

In strict mode, each step must be the immediately next matching event after the previous step (no intervening events of later step types are allowed between steps).

function match_strict(events, steps, window_hours):
    step_idx = 0
    entry_ts = null
    results = []          // (step_index, matched_event_id, ts)

    for event in events (sorted by ts asc):
        if step_idx == 0:
            if matches(event, steps[0]):
                entry_ts = event.ts
                results.append((0, event.event_id, event.ts))
                step_idx = 1
        else:
            // Check conversion window
            if event.ts - entry_ts > window_hours * 3600:
                break
            // In strict mode: if event matches a LATER step, invalidate
            // intermediate steps that were skipped
            if matches(event, steps[step_idx]):
                results.append((step_idx, event.event_id, event.ts))
                step_idx += 1
                if step_idx == len(steps):
                    break
            elif any_later_step_matches(event, steps, step_idx):
                // Skip: do not advance, strict ordering violated
                pass

    return results  // user converted through step_idx - 1

This is O(E) per user where E is the number of events in the time window — typically a few hundred at most.

Loose Ordering

In loose mode, steps must occur in order but can have any intervening events. The algorithm simplifies: scan events in time order, advance the step pointer whenever the current event matches the current step.

function match_loose(events, steps, window_hours):
    step_idx = 0
    entry_ts = null

    for event in events (sorted by ts asc):
        if step_idx == 0 and matches(event, steps[0]):
            entry_ts = event.ts
            step_idx = 1
        elif step_idx > 0:
            if event.ts - entry_ts > window_hours * 3600:
                break
            if matches(event, steps[step_idx]):
                step_idx += 1
                if step_idx == len(steps):
                    break

    return step_idx  // number of steps completed

Conversion Window

The conversion window is anchored to step 0 (funnel entry). A user who completes step 0 at time T must complete all subsequent steps before T + window. This means:

A user who completes steps 0-2 in 5 days but does not complete step 3 within 7 days is counted as converted through step 2, not step 3.
If the same user later completes step 3 (beyond the window), they are counted as a new funnel entry starting from the next step-0 event.
Users can enter the funnel multiple times. The service counts each entry separately by default; a "unique users" mode deduplifies on user_id, counting only the first completed step.

Step Ordering Nuance: Re-entry

If a user completes step 0 multiple times (e.g., visits the landing page twice), the service creates multiple funnel entries, one per step-0 occurrence. Each entry is evaluated independently. The "entered funnel" count at step 0 is the number of entries, not unique users. This is the standard Mixpanel/Amplitude behavior.

Batch Computation

When to Use Batch

Historical funnel queries (date ranges in the past) are computed by a batch Spark job. The job runs on demand (triggered by the query API for cache misses) and on a nightly schedule to pre-warm the cache for common date ranges.

Spark Job

// Pseudo-code
events_df = spark.read.clickhouse(
    "SELECT user_id, event_type, received_at, properties FROM events "
    + "WHERE tenant_id = 'acme' "
    + "AND received_at BETWEEN start AND end + conversion_window"
)

// Add funnel step index to each event (-1 if does not match any step)
labeled_df = events_df
    .withColumn("step_index", match_steps_udf(col("event_type"), col("properties")))
    .filter(col("step_index") >= 0)

// Per user, apply sequence matching
results_df = labeled_df
    .groupBy("user_id")
    .agg(match_funnel_udf(
        collect_list(struct("step_index", "received_at")).orderBy("received_at"),
        conversion_window_hours,
        order_mode
    ).alias("max_step"))

// Aggregate across users
funnel_df = results_df
    .groupBy("max_step")
    .count()
    .orderBy("max_step")

The match_funnel_udf is a Python or Scala UDF implementing the matching algorithm above. Grouping by user_id before applying the UDF means the per-user event list fits in memory for all realistic users.

Cohort Segmentation in Batch

To segment by cohort (entry date), the batch job partitions funnel entries by toDate(entry_ts) and computes the funnel for each date separately. This produces a time series of daily cohort conversion rates without re-scanning all events — only the entry date partitioning changes.

Real-Time Computation

When to Use Real-Time

For the "last 24 hours" or "today" time range, batch latency is unacceptable. The real-time path uses a Flink stateful job that maintains per-user funnel state as events arrive.

Flink Stateful Funnel Job

State per user (keyed stream on user_id):

UserFunnelState {
  active_entries: List[FunnelEntry]
  // FunnelEntry = {entry_ts, max_step_completed, last_step_ts}
}

On each incoming event:

Determine which funnel steps the event matches (an event can match steps in multiple funnel definitions — the job handles all active funnels for the tenant).
For each active entry, attempt to advance the step pointer.
Expire entries where now() – entry_ts > conversion_window_hours (no longer eligible).
If the event matches step 0, create a new entry.
Emit a rollup update to a Redis sorted set whenever an entry advances or expires.

Real-Time State Storage

Per-funnel live counts are stored in Redis sorted sets, one per funnel per step:

Key:   funnel:{tenant}:{funnel_id}:step:{step_idx}:date:{YYYYMMDD}
Type:  PFADD (HyperLogLog for unique user counts)
       INCR for total event counts

The Query API reads from Redis for real-time ranges and from the batch result cache (stored in ClickHouse or S3) for historical ranges. The handoff point is configurable (default: use real-time for ranges ending within the last 6 hours).

Query API

Request

POST /v1/funnels/query
{
  "funnel_id":   "f_abc123",
  "tenant_id":   "acme",
  "time_range": {
    "start": "2024-04-06T00:00:00Z",
    "end":   "2024-04-13T00:00:00Z"
  },
  "segment_by":  "country",
  "cohort_by":   "entry_date",
  "filters": [
    {field: "country", op: "eq", value: "US"}
  ]
}

Response

{
  "funnel_id": "f_abc123",
  "steps": [
    {step_index: 0, event_type: "landing_page_view", count: 10000, conversion_rate: 1.0},
    {step_index: 1, event_type: "signup_completed",  count: 3200,  conversion_rate: 0.32},
    {step_index: 2, event_type: "onboarding_completed", count: 2100, conversion_rate: 0.656},
    {step_index: 3, event_type: "purchase",          count: 840,   conversion_rate: 0.4}
  ],
  "cohorts": [
    {
      "date": "2024-04-06",
      "steps": [...]
    }
  ],
  "segments": [
    {"country": "US", "steps": [...]}
  ],
  "computed_at": "2024-04-13T10:05:00Z",
  "data_freshness": "batch"   // or realtime
}

Funnel Result Storage (Batch Cache)

Completed batch funnel results are stored in ClickHouse:

CREATE TABLE funnel_results (
  tenant_id        String,
  funnel_id        String,
  time_range_start Date,
  time_range_end   Date,
  segment_key      String,         -- e.g. "country:US" or "__all__"
  cohort_date      Date,
  step_index       UInt8,
  user_count       UInt64,
  computed_at      DateTime
) ENGINE = ReplacingMergeTree(computed_at)
  PARTITION BY (tenant_id, toYYYYMM(time_range_start))
  ORDER BY (tenant_id, funnel_id, time_range_start, time_range_end, segment_key, cohort_date, step_index);

The Query API first checks this table. On cache miss, it enqueues a batch Spark job and returns HTTP 202 with a job_id. The client polls GET /v1/funnels/jobs/{job_id} until status = completed, then re-issues the original query.

Handling Large Tenant Volumes

For tenants with > 10 M users in the query window, the batch Spark job uses a sampling approach for exploratory queries: compute the funnel on a 10% sample (stratified by user_id hash) and scale up the counts. The response includes a sampled: true flag and a confidence interval. Full (unsampled) results are available with a longer SLA (up to 60 s).

Correctness Edge Cases

Clock skew: events arriving out of order due to mobile offline buffering. The batch job uses received_at for ordering; the real-time job uses a bounded watermark. Historical batch results are authoritative.
User merge: when two anonymous IDs are merged into one user_id (e.g., sign in on a second device), the funnel entry created under the old anonymous_id may now be attributed to the canonical user_id. The nightly identity resolution job rewrites affected funnel_results rows.
Funnel definition changes: changing a funnel definition (adding a step, changing filters) invalidates all cached results for that funnel. The service stores a funnel_version hash in funnel_results and invalidates on mismatch.
Partial completions near the time range boundary: a user who entered the funnel at the end of the query window may not have had time to complete later steps. The service excludes entries where entry_ts + conversion_window > time_range_end to avoid undercounting conversions for recent entrants.

Interview Tips

The interviewer will often ask about the strict vs loose ordering trade-off. Strict ordering is more intuitive for product funnels (no skipping steps) but harder to implement correctly. Be prepared to code the matching algorithm on a whiteboard.
Explain the re-entry semantics clearly: most analytics tools count funnel entries, not unique users at step 0. Know how to switch between modes.
Discuss when to use real-time Flink vs batch Spark: real-time state per user is expensive to maintain for thousands of concurrent funnel definitions. A common approach is to limit real-time to tenant-default funnels and use batch for ad-hoc queries.
The conversion window boundary problem (excluding incomplete entries near the end of the time range) is a subtle correctness issue that demonstrates depth of understanding.
Know ClickHouse FINAL keyword: SELECT … FROM funnel_results FINAL is needed when reading from ReplacingMergeTree before background merges complete, to guarantee deduplication of concurrent batch writes.