Recommendation Feedback Loop Low-Level Design: Implicit Signals, Click-Through Tracking, and Model Retraining

Recommendation Feedback Loop: Overview

A recommendation system without a feedback loop goes stale. The feedback loop closes the cycle: collect implicit signals from user interactions, aggregate them, and use them to retrain or incrementally update the recommendation model. This low-level design covers signal-collection schemas, aggregation pipelines, CTR computation, and retraining cadence.

Implicit Signals and Their Weights

Explicit signals (star ratings) are rare. Implicit signals are abundant but noisy:

  • impression — item was shown to user. Not a signal by itself; it is the denominator for CTR.
  • click — user selected the item. Weight: 0.3 (moderate positive).
  • dwell_time — seconds spent on item. Weight: 0.2 if > 30s (indicates engagement). Short dwell after click is a negative signal.
  • skip — user scrolled past item without click. Weight: -0.1 (mild negative).
  • save — user bookmarked item. Weight: 1.0 (strong positive intent).
  • share — user shared item externally. Weight: 1.0 (strong positive, on par with save).

Signal weights are tuned via A/B experiments comparing offline AUC and online CTR across weight configurations.
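As a sketch (not the production scoring code), the weights above combine per (user, item) like this; the -0.1 penalty for short dwell follows the dwell_time note above:

```python
# Illustrative only: combine implicit signals for one (user, item) pair
# using the weights from the table above.
SIGNAL_WEIGHTS = {'click': 0.3, 'dwell_time': 0.2, 'skip': -0.1,
                  'save': 1.0, 'share': 1.0, 'impression': 0.0}

def score_interaction(signal_type: str, value: float = 1.0) -> float:
    """Map one raw signal to its weighted contribution."""
    if signal_type == 'dwell_time' and value < 30:
        return -0.1  # short dwell after a click reads as a mild negative
    return SIGNAL_WEIGHTS.get(signal_type, 0.0)

# A session with a click, a 45-second dwell, and a save:
session = [('click', 1.0), ('dwell_time', 45.0), ('save', 1.0)]
total = sum(score_interaction(s, v) for s, v in session)
# total == 0.3 + 0.2 + 1.0 == 1.5
```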

Click-Through Rate (CTR) Computation

CTR is computed per (item_id, position) pair to account for position bias:

CTR(item, position) = clicks(item, position) / impressions(item, position)

Raw CTR is noisy for items with few impressions. Apply Bayesian smoothing with a global prior:

smoothed_CTR = (clicks + prior_clicks) / (impressions + prior_impressions)

Where prior values come from the global average CTR across all items. This prevents cold-start items from ranking at extremes.
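A minimal, database-free sketch of the smoothing formula (the prior values here are illustrative; in practice they come from the global average CTR):

```python
def smoothed_ctr(clicks: int, impressions: int,
                 prior_clicks: float, prior_impressions: float) -> float:
    """Bayesian-smoothed CTR: shrinks low-volume items toward the global prior."""
    return (clicks + prior_clicks) / (impressions + prior_impressions)

# Illustrative prior: a 5% global CTR expressed over 100 virtual impressions.
PRIOR_CLICKS, PRIOR_IMPRESSIONS = 5.0, 100.0

# A cold-start item (no data) sits exactly at the prior, not at 0 or 1:
cold = smoothed_ctr(0, 0, PRIOR_CLICKS, PRIOR_IMPRESSIONS)    # 0.05
# A lucky item with 2 clicks on 2 impressions is pulled well below 1.0:
lucky = smoothed_ctr(2, 2, PRIOR_CLICKS, PRIOR_IMPRESSIONS)   # 7/102 ≈ 0.069
```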

Session Context Capture

Each impression and interaction is logged with context that enables debiasing and feature engineering:

  • Position: rank in the result list (position bias — top items get more clicks regardless of quality)
  • Query: search query or recommendation context (homepage feed vs search results)
  • Device: mobile vs desktop (affects dwell time interpretation)
  • Session ID: groups impressions and interactions within a user session
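A hypothetical impression event carrying this context might look like the following (field names and values are illustrative):

```python
import time
import uuid

# Hypothetical impression event with debiasing context attached.
impression_event = {
    "event_type": "impression",
    "user_id": 1001,
    "item_id": 42,
    "session_id": str(uuid.uuid4()),  # groups this impression with later interactions
    "position": 3,                    # rank in the list, for position-bias correction
    "context": "homepage_feed",       # vs. "search_results"
    "device": "mobile",               # changes how dwell time is interpreted
    "timestamp": time.time(),
}
```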

Near-Real-Time Aggregation Pipeline

Signal events flow through a streaming pipeline to update the feature store within minutes:

  1. Client logs events to a Kafka topic (signal-events) with structured JSON payloads.
  2. Flink (or Spark Structured Streaming) consumes the topic, applies windowed aggregations (5-minute tumbling window), and emits aggregated signal counts.
  3. Aggregated counts are upserted into the feature store (Redis or a fast OLAP DB) for use by the ranking model at inference time.
  4. The feature store exposes a low-latency API for serving: GET /features/item/{item_id} returns CTR, save rate, dwell_avg.
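Steps 3 and 4 can be sketched with an in-memory stand-in for the feature store (Redis or an OLAP DB in production); the class and field names are illustrative:

```python
from collections import defaultdict

class FeatureStore:
    """In-memory stand-in for the online feature store."""
    def __init__(self):
        self._items = defaultdict(
            lambda: {'impressions': 0, 'clicks': 0, 'dwell_sum': 0.0})

    def upsert_window(self, item_id: int, impressions: int,
                      clicks: int, dwell_sum: float = 0.0) -> None:
        """Merge one window's aggregated counts into running totals (step 3)."""
        f = self._items[item_id]
        f['impressions'] += impressions
        f['clicks'] += clicks
        f['dwell_sum'] += dwell_sum

    def get_features(self, item_id: int) -> dict:
        """Serving-time read, analogous to GET /features/item/{item_id} (step 4)."""
        f = self._items[item_id]
        ctr = f['clicks'] / f['impressions'] if f['impressions'] else 0.0
        dwell_avg = f['dwell_sum'] / f['clicks'] if f['clicks'] else 0.0
        return {'ctr': ctr, 'dwell_avg': dwell_avg}

store = FeatureStore()
store.upsert_window(42, impressions=200, clicks=24, dwell_sum=960.0)
store.upsert_window(42, impressions=100, clicks=6, dwell_sum=120.0)
features = store.get_features(42)  # ctr = 30/300 = 0.1, dwell_avg = 1080/30 = 36.0
```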

Batch Retraining Pipeline

A daily batch job (Spark or Python) retrains the core recommendation model:

  1. Materialize the interaction matrix from InteractionLog (user_id, item_id, weighted_signal) over a 90-day window.
  2. Train an ALS (Alternating Least Squares) collaborative filtering model on the matrix.
  3. Optionally train an XGBoost ranking model using features: item CTR, user-item affinity, content metadata, session context.
  4. Evaluate on a held-out validation set (last 7 days). Gate on AUC and NDCG thresholds before promotion.
  5. Deploy new model to serving layer via blue-green swap.
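Step 2 can be illustrated with a tiny dense ALS loop in NumPy. This is a sketch for intuition only: it treats unobserved entries as zeros, whereas production systems use sparse implicit-feedback ALS (Spark MLlib or the `implicit` library) with confidence weighting:

```python
import numpy as np

def als_train(R: np.ndarray, k: int = 2, reg: float = 0.1, iters: int = 20):
    """Alternate ridge-regression solves for user and item factors."""
    rng = np.random.default_rng(0)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    I = np.eye(k)
    for _ in range(iters):
        # Fix item factors, solve for users; then fix users, solve for items.
        U = R @ V @ np.linalg.inv(V.T @ V + reg * I)
        V = R.T @ U @ np.linalg.inv(U.T @ U + reg * I)
    return U, V

# Toy weighted-signal matrix materialized from InteractionLog (4 users x 3 items).
R = np.array([[1.5, 0.0, 0.3],
              [1.3, 0.2, 0.0],
              [0.0, 1.0, 1.2],
              [0.1, 0.9, 1.5]])
U, V = als_train(R)
pred = U @ V.T  # reconstructed user-item affinity scores used for ranking
```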

Online Learning for High-Confidence Signals

For fast adaptation (e.g., trending items), apply incremental model updates on save and share events (weight >= 1.0). Update item embedding or bias term in a lightweight online learning layer without full retraining. Constrain update magnitude to avoid instability from outlier signal bursts.
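A minimal sketch of such a constrained update, assuming an embedding-based ranker; the learning rate and clipping bound are illustrative:

```python
import numpy as np

def online_item_update(item_emb: np.ndarray, user_emb: np.ndarray,
                       signal_weight: float, lr: float = 0.05,
                       max_step: float = 0.1) -> np.ndarray:
    """Nudge an item embedding toward a user who saved/shared it,
    clipping each step to bound the effect of outlier signal bursts."""
    step = lr * signal_weight * (user_emb - item_emb)
    step = np.clip(step, -max_step, max_step)  # constrain update magnitude
    return item_emb + step

item = np.array([0.2, -0.1, 0.4])
user = np.array([0.9, 0.3, -0.2])
updated = online_item_update(item, user, signal_weight=1.0)
# Even a burst of strong signals moves the embedding at most max_step per dim:
burst = online_item_update(item, user, signal_weight=100.0)
```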

SQL Schema

-- Raw impression log (high volume, partitioned by date)
CREATE TABLE ImpressionLog (
    user_id     BIGINT NOT NULL,
    item_id     BIGINT NOT NULL,
    session_id  UUID NOT NULL,
    position    INT NOT NULL,
    shown_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (shown_at);
CREATE INDEX idx_impressionlog_item ON ImpressionLog(item_id, shown_at);

-- Interaction log with signal type and value
CREATE TABLE InteractionLog (
    user_id      BIGINT NOT NULL,
    item_id      BIGINT NOT NULL,
    signal_type  VARCHAR(32) NOT NULL,   -- click, dwell_time, skip, save, share
    value        DOUBLE PRECISION NOT NULL DEFAULT 1.0,  -- dwell seconds or 1.0
    occurred_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_interactionlog_item ON InteractionLog(item_id, occurred_at DESC);
CREATE INDEX idx_interactionlog_user ON InteractionLog(user_id, occurred_at DESC);

-- Materialized CTR per item (refreshed by streaming aggregation)
CREATE TABLE ItemCTR (
    item_id      BIGINT PRIMARY KEY,
    impressions  BIGINT NOT NULL DEFAULT 0,
    clicks       BIGINT NOT NULL DEFAULT 0,
    ctr          DOUBLE PRECISION GENERATED ALWAYS AS
                     (CASE WHEN impressions = 0 THEN 0 ELSE clicks::float / impressions END) STORED,
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Python Implementation

import time
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

SIGNAL_WEIGHTS = {
    'save': 1.0,
    'share': 1.0,
    'click': 0.3,
    'dwell_time': 0.2,  # applied only if value > 30
    'skip': -0.1,
    'impression': 0.0,
}

def log_impression(user_id: int, item_id: int, session_id: str, position: int) -> None:
    """Emit impression event to Kafka."""
    event = {
        "event_type": "impression",
        "user_id": user_id,
        "item_id": item_id,
        "session_id": session_id,
        "position": position,
        "timestamp": time.time()
    }
    producer.send('signal-events', value=event)

def log_interaction(user_id: int, item_id: int, signal_type: str, value: float = 1.0) -> None:
    """Emit interaction event to Kafka with weighted signal."""
    weight = SIGNAL_WEIGHTS.get(signal_type, 0.0)
    if signal_type == 'dwell_time' and value < 30:
        weight = -0.1  # short dwell after click is a mild negative signal
    event = {
        "event_type": "interaction",
        "user_id": user_id,
        "item_id": item_id,
        "signal_type": signal_type,
        "value": value,
        "weighted_score": weight,
        "timestamp": time.time()
    }
    producer.send('signal-events', value=event)

def get_smoothed_ctr(db, item_id: int, prior_clicks: float, prior_impressions: float) -> float:
    """Return Bayesian-smoothed CTR for item."""
    row = db.execute(
        "SELECT impressions, clicks FROM ItemCTR WHERE item_id=%s", (item_id,)
    ).fetchone()
    if not row:
        return prior_clicks / prior_impressions
    impressions, clicks = row
    return (clicks + prior_clicks) / (impressions + prior_impressions)

def aggregate_signals_window(window_events: list) -> dict:
    """Aggregate signal events from a streaming window into item-level stats."""
    stats = {}
    for event in window_events:
        item_id = event['item_id']
        if item_id not in stats:
            stats[item_id] = {'impressions': 0, 'clicks': 0, 'weighted_score': 0.0}
        if event['event_type'] == 'impression':
            stats[item_id]['impressions'] += 1
        else:
            if event['signal_type'] == 'click':
                stats[item_id]['clicks'] += 1
            stats[item_id]['weighted_score'] += event.get('weighted_score', 0.0)
    return stats

Key Design Decisions Summary

  • Impression logging at high volume requires date-partitioned tables and Kafka buffering to avoid write bottlenecks.
  • Bayesian CTR smoothing prevents cold-start items from appearing artificially high or low.
  • Streaming aggregation (Flink/Spark) closes the feedback loop within minutes for trending signals.
  • Daily batch retraining with quality gates prevents model degradation from noisy data.
  • Position bias must be corrected in training data (use position as a feature, or use inverse propensity weighting).
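The inverse-propensity idea in the last bullet can be sketched as follows; the propensity table is illustrative (in practice, propensities are estimated, e.g., from randomized-position experiments):

```python
# Illustrative position propensities: P(click observed | position), estimated offline.
POSITION_PROPENSITY = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}

def ips_weight(position: int, floor: float = 0.1) -> float:
    """Inverse-propensity weight for a clicked training example.
    The floor caps the weight so rarely-shown low positions don't dominate."""
    p = max(POSITION_PROPENSITY.get(position, floor), floor)
    return 1.0 / p

# A click at position 4 counts for more than one at position 1:
w1, w4 = ips_weight(1), ips_weight(4)  # 1.0 and 1/0.3 ≈ 3.33
```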

Frequently Asked Questions

How is dwell time measured as an implicit signal?
Dwell time is measured client-side by recording the timestamp when a user navigates to an item page and when they navigate away. The difference is sent as a dwell_time interaction event. A threshold (e.g., 30 seconds) distinguishes genuine engagement from accidental clicks. Short dwell after click is treated as a mild negative signal.

Why are skips treated as negative signals in a feedback loop?
A skip (scrolling past an item without clicking) indicates the item was visible but not appealing. Repeated skips on a given item by a user signal low relevance for that user. The weight is kept small (e.g., -0.1) because skips can also indicate the user was busy or already knew the content rather than genuinely disliking it.

When should near-real-time aggregation be used vs batch retraining?
Near-real-time aggregation (Kafka + Flink) is used to update feature store values — CTR, save rate, engagement score — within minutes, allowing the ranker to reflect current trending content without full retraining. Batch retraining is used for updating model weights (ALS, XGBoost) on a daily or weekly cadence, since it requires a full pass over the interaction matrix and quality validation before deployment.

How is position bias corrected in recommendation feedback?
Position bias occurs because items ranked higher receive more clicks regardless of quality. Correction techniques include: (1) including position as a feature in the model so it learns to discount top-position clicks, (2) inverse propensity scoring — weighting each training example by 1/P(click|position) to normalize for position-driven click rates, and (3) randomized exploration — occasionally inserting items at random positions to collect unbiased signal.

