User Journey Tracker: Low-Level Design
A user journey tracker records every touchpoint across devices and channels, resolves them to a single user identity, and provides attribution analysis to answer which marketing or product interactions led to conversion. The hardest problems are cross-device identity stitching and privacy-compliant tracking.
Identity Graph
The core data structure is an identity graph mapping all known identifiers for a person to a single canonical user_id:
- anonymous_id: generated on first visit, stored in a first-party cookie or localStorage
- user_id: on login, the anonymous_id is linked to the authenticated user_id (identity merge)
- device_id: per-device persistent identifier
- email hash: used as a stable cross-device key when the user provides email on multiple devices
The graph is stored as an adjacency list: each node is an identifier, edges represent known equivalences. A union-find structure resolves all connected identifiers to a single canonical node.
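The union-find resolution described above can be sketched as follows. This is a minimal illustration, not the tracker's actual implementation: the IdentityGraph class, the "anon:" / "device:" / "user:" / "email:" prefixes, and the rule that an authenticated user_id is preferred as the canonical root are all assumptions made for the example.

```python
class IdentityGraph:
    """Union-find over identifier strings (hypothetical helper for illustration)."""

    def __init__(self):
        self.parent = {}

    def find(self, identifier):
        # Register unseen identifiers as their own root.
        self.parent.setdefault(identifier, identifier)
        root = identifier
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walk directly at the root.
        while self.parent[identifier] != root:
            self.parent[identifier], identifier = root, self.parent[identifier]
        return root

    def link(self, a, b):
        """Record that identifiers a and b belong to the same person."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        # Assumption: an authenticated "user:" identifier wins as canonical root.
        if root_b.startswith("user:") and not root_a.startswith("user:"):
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        return root_a


graph = IdentityGraph()
graph.link("anon:a1", "device:mobile")    # first visit on mobile
graph.link("user:42", "anon:a1")          # login on mobile (identity merge)
graph.link("device:desktop", "user:42")   # login on desktop
graph.link("user:42", "email:3f9c")       # hashed email provided
canonical = graph.find("device:mobile")
```

After these four links, every identifier resolves to the same canonical node, so events recorded under any of them can be assigned to one journey.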
Deterministic Matching
Deterministic identity resolution uses exact-match signals with 100% confidence:
- User logs in on mobile → device_id_mobile linked to user_id
- User logs in on desktop → device_id_desktop linked to same user_id
- Both device graphs are merged: all events on either device now belong to one journey
- Email address (hashed) shared across apps (e.g., loyalty app and e-commerce site) links the two identity graphs
Probabilistic Matching
When deterministic signals are unavailable, probabilistic matching infers identity from behavioral signals:
- Same IP address + identical user-agent fingerprint within a short time window → high-confidence match
- Each probabilistic link carries a confidence score (0.0–1.0)
- Links scoring below a threshold (e.g., 0.7) are flagged as low-confidence and excluded from high-stakes attribution by default
- Fingerprint features: IP, browser version, screen resolution, timezone, language, installed fonts
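One simple way to turn the fingerprint features above into a confidence score is a weighted agreement count. The sketch below assumes hand-picked weights that sum to 1.0; the weights, the function names, and the dictionary-based fingerprint format are hypothetical, and a production system would more likely fit weights from labeled pairs and also check the time window between the two observations.

```python
# Hypothetical feature weights (sum to 1.0); not taken from the source.
FEATURE_WEIGHTS = {
    "ip": 0.35,
    "browser_version": 0.15,
    "screen_resolution": 0.15,
    "timezone": 0.10,
    "language": 0.10,
    "installed_fonts": 0.15,
}
MATCH_THRESHOLD = 0.7  # example threshold from the text


def match_confidence(fp_a, fp_b):
    """Return the weighted share of fingerprint features that agree (0.0 to 1.0)."""
    score = 0.0
    for feature, weight in FEATURE_WEIGHTS.items():
        value_a, value_b = fp_a.get(feature), fp_b.get(feature)
        # A feature missing on either side contributes nothing.
        if value_a is not None and value_a == value_b:
            score += weight
    return round(score, 4)


def is_attribution_eligible(score, threshold=MATCH_THRESHOLD):
    """Links below the threshold are kept but excluded from high-stakes attribution."""
    return score >= threshold


home = {"ip": "203.0.113.7", "browser_version": "Firefox 126",
        "screen_resolution": "1920x1080", "timezone": "Europe/Berlin",
        "language": "de-DE", "installed_fonts": "hash:ab12"}
same_device = dict(home)
other_network = dict(home, ip="198.51.100.9")

score_same = match_confidence(home, same_device)
score_other = match_confidence(home, other_network)
```

With these example weights, a fingerprint that differs only in IP address falls just under the 0.7 threshold, which illustrates how sensitive the outcome is to the chosen weights.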
Journey Storage
Each resolved identity has an ordered event sequence stored in a columnar data warehouse (e.g., BigQuery or Redshift):
[{event_type, timestamp, channel, device, page_url, campaign_id}, ...]
Events are partitioned by date and clustered or sorted by user_id (for example, BigQuery clustering or a Redshift sort key), which keeps per-user lookups efficient. A separate journey summary table stores pre-aggregated path data for the dashboard layer.
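As an in-memory illustration of the event record and the per-user ordering, the sketch below mirrors the fields listed above. The JourneyEvent class, the ordered_journey helper, and the sample values are assumptions for the example; the real storage layer is the warehouse table, not Python objects.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class JourneyEvent:
    """One touchpoint row; field names follow the event schema in the text."""
    user_id: str
    event_type: str
    timestamp: datetime
    channel: str
    device: str
    page_url: str
    campaign_id: Optional[str] = None


def ordered_journey(events, user_id):
    """Return one user's events sorted by timestamp (oldest first)."""
    return sorted(
        (event for event in events if event.user_id == user_id),
        key=lambda event: event.timestamp,
    )


events = [
    JourneyEvent("user:42", "purchase", datetime(2024, 6, 15, 18, 0),
                 "paid_retargeting", "desktop", "/checkout", "cmp_9"),
    JourneyEvent("user:42", "page_view", datetime(2024, 6, 1, 9, 30),
                 "organic_search", "mobile", "/product/123"),
    JourneyEvent("user:7", "page_view", datetime(2024, 6, 2, 12, 0),
                 "email", "mobile", "/home", "cmp_3"),
]
journey = ordered_journey(events, "user:42")
```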
Path Analysis
SQL or Spark jobs compute the most common paths to conversion:
- Paths are deduplicated (consecutive duplicate events collapsed)
- Attribution windows define which events are eligible — default 30-day lookback
- Sankey diagrams and funnel visualizations are built from pre-aggregated path frequency tables
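The path computation above can be sketched in plain Python. This is a small in-memory illustration rather than the SQL or Spark job itself; the function names are hypothetical, and using the channel as the event label when collapsing duplicates is an assumption.

```python
from collections import Counter
from datetime import datetime, timedelta
from itertools import groupby

LOOKBACK = timedelta(days=30)  # default attribution window from the text


def conversion_path(touches, conversion_time, lookback=LOOKBACK):
    """Build a deduplicated channel path from (timestamp, channel) tuples."""
    # Keep only touches inside the lookback window, oldest first.
    eligible = sorted(
        (ts, channel) for ts, channel in touches
        if conversion_time - lookback <= ts <= conversion_time
    )
    # groupby collapses runs of consecutive identical channels.
    return tuple(channel for channel, _ in groupby(ch for _, ch in eligible))


def path_frequencies(journeys):
    """Count how often each path occurs; journeys is a list of (touches, conversion_time)."""
    return Counter(conversion_path(touches, when) for touches, when in journeys)


conversion = datetime(2024, 6, 30)
touches = [
    (datetime(2024, 5, 1), "display"),           # outside the 30-day window
    (datetime(2024, 6, 10), "organic_search"),
    (datetime(2024, 6, 12), "organic_search"),   # consecutive duplicate
    (datetime(2024, 6, 20), "email"),
    (datetime(2024, 6, 28), "paid_retargeting"),
]
path = conversion_path(touches, conversion)
frequencies = path_frequencies([(touches, conversion), (touches, conversion)])
```

The resulting counter is the kind of path frequency table a Sankey or funnel view would read from.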
Multi-Touch Attribution Models
Multiple attribution models are computed in parallel and exposed via the dashboard:
- First-touch: 100% credit to the first channel that acquired the user
- Last-touch: 100% credit to the channel immediately before conversion
- Linear: equal credit split across all touchpoints in the journey
- Time-decay: more credit assigned to touchpoints closer to conversion (half-life weighting)
- Data-driven: a model fitted on observed conversion outcomes (e.g., Shapley values or logistic regression) estimates each touchpoint's marginal contribution to conversion
Each model is stored as a separate column in the attribution results table, enabling side-by-side comparison.
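The four rule-based models can be expressed as different weight vectors over the same ordered touchpoints. The sketch below assumes touches arrive as (timestamp, channel) tuples sorted oldest first, and the 7-day half-life is a hypothetical default, not a value from the source. The data-driven model is omitted because it needs training data.

```python
from datetime import datetime, timedelta


def _credit(touches, weights):
    """Normalize weights to sum to 1 and accumulate credit per channel."""
    total = sum(weights)
    credit = {}
    for (_, channel), weight in zip(touches, weights):
        credit[channel] = credit.get(channel, 0.0) + weight / total
    return credit


def first_touch(touches):
    return _credit(touches, [1.0] + [0.0] * (len(touches) - 1))


def last_touch(touches):
    return _credit(touches, [0.0] * (len(touches) - 1) + [1.0])


def linear(touches):
    return _credit(touches, [1.0] * len(touches))


def time_decay(touches, conversion_time, half_life=timedelta(days=7)):
    # A touch's weight halves for every half_life elapsed before conversion.
    weights = [0.5 ** ((conversion_time - ts) / half_life) for ts, _ in touches]
    return _credit(touches, weights)


conversion = datetime(2024, 6, 15)
touches = [
    (datetime(2024, 6, 1), "organic_search"),    # 14 days before conversion
    (datetime(2024, 6, 8), "email"),             # 7 days before conversion
    (datetime(2024, 6, 15), "paid_retargeting"), # at conversion
]
results = {
    "first_touch": first_touch(touches),
    "last_touch": last_touch(touches),
    "linear": linear(touches),
    "time_decay": time_decay(touches, conversion),
}
```

Keeping each model's output under its own key mirrors the one-column-per-model layout of the attribution results table.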
UTM Parameter Tracking
All inbound traffic is tagged with UTM parameters: utm_source, utm_medium, utm_campaign, utm_content, utm_term. These are captured on landing and persisted in the session cookie, then attached to every event in that session. UTMs are the primary signal for channel attribution when no login event is present.
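Capturing UTMs on landing and attaching them to later events might look like the sketch below. It uses the standard library's urllib.parse; the function names, the example URL, and the dictionary-shaped events are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlsplit

UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term")


def extract_utms(landing_url):
    """Return the UTM parameters present on a landing URL (first value of each)."""
    query = parse_qs(urlsplit(landing_url).query)
    return {key: query[key][0] for key in UTM_KEYS if key in query}


def tag_event(event, session_utms):
    """Attach the session's persisted UTMs to an event without mutating it."""
    return {**event, **session_utms}


session_utms = extract_utms(
    "https://shop.example.com/landing"
    "?utm_source=newsletter&utm_medium=email&utm_campaign=spring_sale&ref=abc"
)
event = tag_event({"event_type": "page_view", "page_url": "/product/123"}, session_utms)
```

Non-UTM query parameters such as ref are ignored, and UTM keys that are absent from the URL are simply left out of the result.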
Cross-Channel Journey Example
A typical journey might look like: organic search → product page view → email campaign click → return visit → paid retargeting ad → purchase. Each step can occur on a different device and channel, at a different time. The identity graph ensures all steps are stitched to one user, and the attribution model distributes conversion credit across the five touchpoints that precede the purchase.
Privacy Compliance
- Consent tracking — no events collected before consent is granted; consent state stored per user_id
- GDPR deletion — on erasure request, all events for the resolved identity are deleted or anonymized within 30 days
- IP anonymization — last octet of IPv4 (last 80 bits of IPv6) zeroed before storage
- Probabilistic links removed from identity graph on deletion request
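The IP anonymization rule above (zero the last octet of IPv4, the last 80 bits of IPv6) can be sketched with the standard library's ipaddress module. The anonymize_ip function name is hypothetical, and the sample addresses come from documentation ranges.

```python
import ipaddress


def anonymize_ip(raw_ip):
    """Zero the last 8 bits of an IPv4 address or the last 80 bits of an IPv6 address."""
    address = ipaddress.ip_address(raw_ip)
    # Keeping 24 of 32 bits (IPv4) or 48 of 128 bits (IPv6) zeroes the rest.
    prefix = 24 if address.version == 4 else 48
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)


anonymized_v4 = anonymize_ip("203.0.113.77")
anonymized_v6 = anonymize_ip("2001:db8:85a3:8d3:1319:8a2e:370:7348")
```

Applying this before the event is written means the full address never reaches storage.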
Cookieless Tracking Alternatives
As third-party cookies are deprecated, the tracker relies on first-party cookies set server-side on the customer's own domain through a CNAME-proxied collection endpoint, so the cookie remains first-party and can persist across browser sessions. Where even first-party cookies are restricted or short-lived (e.g., under Safari ITP), the anonymous_id may have to be regenerated, in the strictest case once per session, and probabilistic stitching provides continuity between the resulting identifiers.
Summary
The user journey tracker combines deterministic and probabilistic identity resolution into a unified graph, stores ordered event sequences in a columnar warehouse, and computes multiple attribution models to support channel optimization decisions — all while enforcing consent and privacy requirements at every layer.
Frequently Asked Questions
How does deterministic identity stitching work across devices?
Deterministic stitching links device-level anonymous IDs to a canonical user ID the moment a user authenticates on any device, creating an edge in an identity graph keyed by (user_id, device_id). All historical events previously attributed to the anonymous device ID are retroactively re-attributed to the user ID via a backfill job or by joining on the identity graph at query time.
How do multi-touch attribution models differ?
First-touch credits 100% of the conversion to the earliest touchpoint; last-touch credits the final touchpoint; linear attribution distributes credit evenly across all touchpoints in the journey; time-decay assigns exponentially more credit to touchpoints closer to conversion. Data-driven attribution uses a Shapley value or logistic regression model trained on observed conversion outcomes to assign credit based on each channel's marginal contribution.
What is the attribution lookback window and how is it set?
The lookback window is the maximum time before a conversion event within which a prior touchpoint is eligible for attribution credit; common values are 1 day for view-through and 30 days for click-through. The window is calibrated by analyzing the empirical distribution of time-to-conversion in historical data, and is typically set at the 90th–95th percentile of that distribution to capture most journeys without polluting attribution with stale touches.
How is cross-channel journey data stored for analysis?
Cross-channel journey events are written to a time-partitioned event table in a columnar data warehouse (e.g., BigQuery or Redshift) with a canonical schema: user_id, timestamp, channel, event_type, and a JSON properties bag. An identity graph table joins device-level IDs to user IDs at query time, allowing analysts to reconstruct ordered journeys per user using window functions without requiring pre-computed session tables.