Low Level Design: Funnel Analysis Service

Overview

A funnel analysis service determines what fraction of users complete an ordered sequence of steps — for example, landing page → signup → onboarding → first purchase. It must handle arbitrary event sequences, configurable conversion windows, step-level filtering, cohort segmentation, and both real-time and batch computation modes. This low level design covers the data model, matching algorithm, storage schema, query API, and the key trade-offs between real-time and batch approaches.

Requirements

Functional

  • Define funnels as an ordered list of event steps, each with optional property filters.
  • Compute per-step user counts and conversion rates.
  • Support conversion windows (e.g., user must complete all steps within 7 days of step 1).
  • Support strict ordering (steps must occur in exact order) and loose ordering (steps can have intervening events).
  • Segment results by dimensions (country, device, plan, etc.).
  • Support cohort analysis (group users by the date they entered the funnel).
  • Provide both near-real-time results (data updated within 5 minutes) and historical results.

Non-functional

  • Support funnels with up to 10 steps.
  • Query response time < 5 s for last-30-day funnels on tenants with up to 10 M users.
  • Event throughput: shares the 1 M events/sec ingestion pipeline with the broader analytics system.
  • Support up to 1 000 active funnel definitions per tenant.

Funnel Definition Schema

{
  funnel_id:   "f_abc123",
  tenant_id:   "acme",
  name:        "Signup to Purchase",
  steps: [
    {
      step_index: 0,
      event_type: "landing_page_view",
      filters: []
    },
    {
      step_index: 1,
      event_type: "signup_completed",
      filters: [
        {field: "properties.source", op: "eq", value: "organic"}
      ]
    },
    {
      step_index: 2,
      event_type: "onboarding_completed",
      filters: []
    },
    {
      step_index: 3,
      event_type: "purchase",
      filters: [
        {field: "properties.plan", op: "in", value: ["pro","enterprise"]}
      ]
    }
  ],
  conversion_window_hours: 168,   // 7 days
  order_mode: "strict",           // or loose
  created_at: "2024-04-01T00:00:00Z"
}

Event Sequence Matching Algorithm

Core Problem

Given a user's event history (ordered by timestamp), determine which funnel steps they completed in order within the conversion window. The algorithm must be correct under both strict and loose ordering modes.

Strict Ordering (most common)

In strict mode, each step must be the immediately next matching event after the previous step (no intervening events of later step types are allowed between steps).

function match_strict(events, steps, window_hours):
    step_idx = 0
    entry_ts = null
    results = []          // (step_index, matched_event_id, ts)

    for event in events (sorted by ts asc):
        if step_idx == 0:
            if matches(event, steps[0]):
                entry_ts = event.ts
                results.append((0, event.event_id, event.ts))
                step_idx = 1
        else:
            // Check conversion window
            if event.ts - entry_ts > window_hours * 3600:
                break
            // In strict mode: if event matches a LATER step, invalidate
            // intermediate steps that were skipped
            if matches(event, steps[step_idx]):
                results.append((step_idx, event.event_id, event.ts))
                step_idx += 1
                if step_idx == len(steps):
                    break
            elif any_later_step_matches(event, steps, step_idx):
                // Skip: do not advance, strict ordering violated
                pass

    return results  // user converted through step_idx - 1

This is O(E) per user where E is the number of events in the time window — typically a few hundred at most.

Loose Ordering

In loose mode, steps must occur in order but can have any intervening events. The algorithm simplifies: scan events in time order, advance the step pointer whenever the current event matches the current step.

function match_loose(events, steps, window_hours):
    step_idx = 0
    entry_ts = null

    for event in events (sorted by ts asc):
        if step_idx == 0 and matches(event, steps[0]):
            entry_ts = event.ts
            step_idx = 1
        elif step_idx > 0:
            if event.ts - entry_ts > window_hours * 3600:
                break
            if matches(event, steps[step_idx]):
                step_idx += 1
                if step_idx == len(steps):
                    break

    return step_idx  // number of steps completed

Conversion Window

The conversion window is anchored to step 0 (funnel entry). A user who completes step 0 at time T must complete all subsequent steps before T + window. This means:

  • A user who completes steps 0-2 in 5 days but does not complete step 3 within 7 days is counted as converted through step 2, not step 3.
  • If the same user later completes step 3 (beyond the window), they are counted as a new funnel entry starting from the next step-0 event.
  • Users can enter the funnel multiple times. The service counts each entry separately by default; a "unique users" mode deduplifies on user_id, counting only the first completed step.

Step Ordering Nuance: Re-entry

If a user completes step 0 multiple times (e.g., visits the landing page twice), the service creates multiple funnel entries, one per step-0 occurrence. Each entry is evaluated independently. The "entered funnel" count at step 0 is the number of entries, not unique users. This is the standard Mixpanel/Amplitude behavior.

Batch Computation

When to Use Batch

Historical funnel queries (date ranges in the past) are computed by a batch Spark job. The job runs on demand (triggered by the query API for cache misses) and on a nightly schedule to pre-warm the cache for common date ranges.

Spark Job

// Pseudo-code
events_df = spark.read.clickhouse(
    "SELECT user_id, event_type, received_at, properties FROM events "
    + "WHERE tenant_id = 'acme' "
    + "AND received_at BETWEEN start AND end + conversion_window"
)

// Add funnel step index to each event (-1 if does not match any step)
labeled_df = events_df
    .withColumn("step_index", match_steps_udf(col("event_type"), col("properties")))
    .filter(col("step_index") >= 0)

// Per user, apply sequence matching
results_df = labeled_df
    .groupBy("user_id")
    .agg(match_funnel_udf(
        collect_list(struct("step_index", "received_at")).orderBy("received_at"),
        conversion_window_hours,
        order_mode
    ).alias("max_step"))

// Aggregate across users
funnel_df = results_df
    .groupBy("max_step")
    .count()
    .orderBy("max_step")

The match_funnel_udf is a Python or Scala UDF implementing the matching algorithm above. Grouping by user_id before applying the UDF means the per-user event list fits in memory for all realistic users.

Cohort Segmentation in Batch

To segment by cohort (entry date), the batch job partitions funnel entries by toDate(entry_ts) and computes the funnel for each date separately. This produces a time series of daily cohort conversion rates without re-scanning all events — only the entry date partitioning changes.

Real-Time Computation

When to Use Real-Time

For the "last 24 hours" or "today" time range, batch latency is unacceptable. The real-time path uses a Flink stateful job that maintains per-user funnel state as events arrive.

Flink Stateful Funnel Job

State per user (keyed stream on user_id):

UserFunnelState {
  active_entries: List[FunnelEntry]
  // FunnelEntry = {entry_ts, max_step_completed, last_step_ts}
}

On each incoming event:

  1. Determine which funnel steps the event matches (an event can match steps in multiple funnel definitions — the job handles all active funnels for the tenant).
  2. For each active entry, attempt to advance the step pointer.
  3. Expire entries where now() – entry_ts > conversion_window_hours (no longer eligible).
  4. If the event matches step 0, create a new entry.
  5. Emit a rollup update to a Redis sorted set whenever an entry advances or expires.

Real-Time State Storage

Per-funnel live counts are stored in Redis sorted sets, one per funnel per step:

Key:   funnel:{tenant}:{funnel_id}:step:{step_idx}:date:{YYYYMMDD}
Type:  PFADD (HyperLogLog for unique user counts)
       INCR for total event counts

The Query API reads from Redis for real-time ranges and from the batch result cache (stored in ClickHouse or S3) for historical ranges. The handoff point is configurable (default: use real-time for ranges ending within the last 6 hours).

Query API

Request

POST /v1/funnels/query
{
  "funnel_id":   "f_abc123",
  "tenant_id":   "acme",
  "time_range": {
    "start": "2024-04-06T00:00:00Z",
    "end":   "2024-04-13T00:00:00Z"
  },
  "segment_by":  "country",
  "cohort_by":   "entry_date",
  "filters": [
    {field: "country", op: "eq", value: "US"}
  ]
}

Response

{
  "funnel_id": "f_abc123",
  "steps": [
    {step_index: 0, event_type: "landing_page_view", count: 10000, conversion_rate: 1.0},
    {step_index: 1, event_type: "signup_completed",  count: 3200,  conversion_rate: 0.32},
    {step_index: 2, event_type: "onboarding_completed", count: 2100, conversion_rate: 0.656},
    {step_index: 3, event_type: "purchase",          count: 840,   conversion_rate: 0.4}
  ],
  "cohorts": [
    {
      "date": "2024-04-06",
      "steps": [...]
    }
  ],
  "segments": [
    {"country": "US", "steps": [...]}
  ],
  "computed_at": "2024-04-13T10:05:00Z",
  "data_freshness": "batch"   // or realtime
}

Funnel Result Storage (Batch Cache)

Completed batch funnel results are stored in ClickHouse:

CREATE TABLE funnel_results (
  tenant_id        String,
  funnel_id        String,
  time_range_start Date,
  time_range_end   Date,
  segment_key      String,         -- e.g. "country:US" or "__all__"
  cohort_date      Date,
  step_index       UInt8,
  user_count       UInt64,
  computed_at      DateTime
) ENGINE = ReplacingMergeTree(computed_at)
  PARTITION BY (tenant_id, toYYYYMM(time_range_start))
  ORDER BY (tenant_id, funnel_id, time_range_start, time_range_end, segment_key, cohort_date, step_index);

The Query API first checks this table. On cache miss, it enqueues a batch Spark job and returns HTTP 202 with a job_id. The client polls GET /v1/funnels/jobs/{job_id} until status = completed, then re-issues the original query.

Handling Large Tenant Volumes

For tenants with > 10 M users in the query window, the batch Spark job uses a sampling approach for exploratory queries: compute the funnel on a 10% sample (stratified by user_id hash) and scale up the counts. The response includes a sampled: true flag and a confidence interval. Full (unsampled) results are available with a longer SLA (up to 60 s).

Correctness Edge Cases

  • Clock skew: events arriving out of order due to mobile offline buffering. The batch job uses received_at for ordering; the real-time job uses a bounded watermark. Historical batch results are authoritative.
  • User merge: when two anonymous IDs are merged into one user_id (e.g., sign in on a second device), the funnel entry created under the old anonymous_id may now be attributed to the canonical user_id. The nightly identity resolution job rewrites affected funnel_results rows.
  • Funnel definition changes: changing a funnel definition (adding a step, changing filters) invalidates all cached results for that funnel. The service stores a funnel_version hash in funnel_results and invalidates on mismatch.
  • Partial completions near the time range boundary: a user who entered the funnel at the end of the query window may not have had time to complete later steps. The service excludes entries where entry_ts + conversion_window > time_range_end to avoid undercounting conversions for recent entrants.

Interview Tips

  • The interviewer will often ask about the strict vs loose ordering trade-off. Strict ordering is more intuitive for product funnels (no skipping steps) but harder to implement correctly. Be prepared to code the matching algorithm on a whiteboard.
  • Explain the re-entry semantics clearly: most analytics tools count funnel entries, not unique users at step 0. Know how to switch between modes.
  • Discuss when to use real-time Flink vs batch Spark: real-time state per user is expensive to maintain for thousands of concurrent funnel definitions. A common approach is to limit real-time to tenant-default funnels and use batch for ad-hoc queries.
  • The conversion window boundary problem (excluding incomplete entries near the end of the time range) is a subtle correctness issue that demonstrates depth of understanding.
  • Know ClickHouse FINAL keyword: SELECT … FROM funnel_results FINAL is needed when reading from ReplacingMergeTree before background merges complete, to guarantee deduplication of concurrent batch writes.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering

See also: Uber Interview Guide 2026: Dispatch Systems, Geospatial Algorithms, and Marketplace Engineering

Scroll to Top