Machine Learning System Design Interview: Feed Ranking, Recommendations, and Search

ML system design interviews are distinct from algorithm interviews and from traditional system design interviews. You are expected to define the problem as an ML task, choose a modeling approach, design the data pipeline, and address production concerns like feature freshness, training-serving skew, and model monitoring. This guide covers the three most common ML system design questions.

ML System Design Framework

Apply this structure to any ML system design question:

  1. Problem definition: What are we optimizing? What is the ML task (ranking, classification, regression)?
  2. Data: What training data exists? What labels? How fresh does it need to be?
  3. Features: User features, item features, context features, cross features
  4. Model: Which architecture? Tradeoffs between simplicity and capacity
  5. Training pipeline: Offline training, online learning, feedback loops
  6. Serving pipeline: Latency constraints, retrieval + ranking stages, caching
  7. Evaluation: Offline metrics (AUC-ROC, NDCG) vs online metrics (CTR, session length, revenue)
  8. Monitoring and maintenance: Data drift, concept drift, model degradation

Problem 1: News Feed Ranking (Facebook/LinkedIn/Twitter)

Problem Definition

Given a user and their social graph, rank N candidate posts to show in their feed. Optimize for engagement (long-term user satisfaction, not just next click).

ML Task

Multi-task learning: predict multiple engagement signals simultaneously:

  • P(like | user, post)
  • P(comment | user, post)
  • P(share | user, post)
  • P(report/hide | user, post) — negative signal

Final score: weighted sum of the predictions. Weights reflect business priorities (e.g., Facebook weights comments higher than likes).
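
The weighted-sum scoring above can be sketched in a few lines. The weights here are made-up illustrations, not real production values; note the negative weight on the hide/report signal:

```python
# Hypothetical task weights; real values are tuned to business priorities.
TASK_WEIGHTS = {"like": 1.0, "comment": 4.0, "share": 2.0, "hide": -10.0}

def feed_score(predictions: dict) -> float:
    """Combine per-task engagement probabilities into one ranking score."""
    return sum(TASK_WEIGHTS[task] * p for task, p in predictions.items())

score = feed_score({"like": 0.3, "comment": 0.05, "share": 0.02, "hide": 0.01})
```

Posts are then sorted by this score before the policy layer applies diversity and safety rules.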

Candidate Generation (Retrieval)

You cannot score every post in the network; it is far too slow. Generate candidates from multiple retrieval sources:

  1. Social graph retrieval: posts from friends/follows in last 48h
  2. Interest-based retrieval: posts similar to historically engaged content (embedding similarity)
  3. Merge candidates, dedup, cap at ~1,000 candidates
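
The merge-and-dedup step in item 3 might look like this minimal sketch (first occurrence wins, capped at ~1,000 candidates):

```python
def merge_candidates(social: list, interest: list, cap: int = 1000) -> list:
    """Merge retrieval sources, dedup by post id, preserve first-seen order."""
    seen, merged = set(), []
    for post_id in social + interest:
        if post_id not in seen:
            seen.add(post_id)
            merged.append(post_id)
    return merged[:cap]

candidates = merge_candidates(["p1", "p2", "p3"], ["p3", "p4"])
```

In production the sources are typically fetched in parallel and the merge may also interleave by retrieval score rather than simple concatenation.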

Feature Engineering

  • User features: account age, avg daily sessions, top engaged topics, time since last login
  • Post features: age, media type, topic embeddings, engagement velocity (likes/hr in first 30 min)
  • Author features: connection strength to user, author engagement rate, author posting frequency
  • Context features: device type, time of day, day of week, current session length
  • Cross features: user-topic affinity score, user-author historical interaction rate

Model Architecture

# Two-tower model for candidate retrieval
# Tower 1: user embedding
# Tower 2: post embedding
# Score: dot product of towers (cosine similarity)

# Full ranking model: DLRM (Deep Learning Recommendation Model)
# Input: sparse features (embeddings) + dense features
# Architecture: embedding layers → feature interaction layer → MLP → multi-task output heads
# Output: [p_like, p_comment, p_share, p_hide]
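
A minimal multi-task head in the spirit of the comments above: a shared bottom layer feeding one sigmoid output per task. Random weights and tiny layer sizes are used purely for illustration; a real model learns these end to end:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Illustrative sizes; production models use far larger, learned layers.
W_shared = rng.normal(scale=0.1, size=(64, 32))                  # shared bottom layer
heads = {t: rng.normal(scale=0.1, size=32)
         for t in ("like", "comment", "share", "hide")}          # one head per task

features = rng.normal(size=64)                 # dense + flattened embedding features
hidden = np.maximum(features @ W_shared, 0.0)  # shared ReLU representation
preds = {task: float(sigmoid(hidden @ w)) for task, w in heads.items()}
```

The shared representation lets sparse tasks (shares, reports) borrow signal from abundant ones (likes), which is the usual argument for multi-task over four separate models.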

Serving Pipeline

Feed Request
    ↓
[Candidate Generator]     ← social graph DB, interest index
    ↓ ~1000 candidates
[Light Ranker]            ← fast model (GBDT), O(1000) latency budget: 10ms
    ↓ ~200 candidates
[Heavy Ranker]            ← deep neural net, O(200) latency budget: 50ms
    ↓ ~50 posts
[Policy Layer]            ← diversity, spam filter, content safety
    ↓
[User Feed]

Problem 2: Recommendation System (YouTube/Netflix/Spotify)

Problem Definition

Recommend items (videos/movies/songs) to a user. Optimize for long-term engagement and retention, not just immediate click.

Collaborative Filtering vs Content-Based

  • Collaborative filtering: users with similar history like similar items. Cold start: problematic (no history for new users). Explainability: low
  • Content-based: recommend items similar to what the user liked. Cold start: works for new users given item metadata. Explainability: high
  • Hybrid: combine both signals. Cold start: manageable. Explainability: medium
  • Two-tower neural: learn user and item embeddings end-to-end. Cold start: partial (item metadata helps). Explainability: low

Two-Tower Architecture

User Tower:
  Input: [user_id, watch_history_embeddings, search_history, demographics]
  → MLP → 256-dim user embedding

Item Tower:
  Input: [video_id, title_embedding, category, duration, upload_date]
  → MLP → 256-dim item embedding

Training: maximize the inner product for observed (user, engaged item) pairs, typically with a softmax over in-batch sampled negatives
Retrieval: ANN (Approximate Nearest Neighbors) using FAISS or ScaNN
  → find top-K items closest to user embedding in embedding space
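
A brute-force stand-in for the retrieval step: exact top-K by inner product over precomputed item embeddings. FAISS and ScaNN approximate exactly this computation sub-linearly at catalog scale; the random embeddings here are placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)
item_embs = rng.normal(size=(10_000, 256))  # item tower outputs, precomputed offline
user_emb = rng.normal(size=256)             # user tower output at request time

# Exact maximum-inner-product search; ANN indexes trade exactness for speed.
scores = item_embs @ user_emb
k = 10
top_k = np.argpartition(-scores, k)[:k]     # unordered top-K in O(n)
top_k = top_k[np.argsort(-scores[top_k])]   # sort just the K winners
```

The offline/online split matters: item embeddings are indexed ahead of time, so only the user tower runs at request time.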

Cold Start Solutions

  • New user: onboarding flow to collect initial preferences; use demographic-based priors; fallback to popularity-based recommendations
  • New item: use content features (title, description embeddings) to place item in embedding space; show to a random sample to collect initial engagement signals

Feature Store

Features need different freshness levels:

  • Real-time features (milliseconds): current session context, last clicked item — served from Redis
  • Near-real-time features (minutes): video engagement velocity, trending scores — from Kafka stream processing
  • Batch features (hours/days): user long-term preferences, item quality scores — from Spark jobs to a feature store (Feast, Tecton)

Problem 3: Search Ranking (Google/LinkedIn/Amazon)

ML Task

Learning to Rank (LTR): given a query and a set of candidate documents, rank them by relevance. Three approaches:

  • Pointwise: predict relevance score per (query, doc) pair. Simple, ignores relative ordering
  • Pairwise: predict which of two documents is more relevant. RankNet, LambdaRank
  • Listwise: optimize list-level metrics directly (NDCG). LambdaMART. Best offline metrics but harder to train
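
The pairwise approach can be made concrete with a RankNet-style logistic loss over a (more relevant, less relevant) document pair; the loss is small when the pair is ordered correctly and large when it is inverted:

```python
import math

def pairwise_logistic_loss(score_pos: float, score_neg: float) -> float:
    """RankNet-style loss: penalize the less relevant doc outscoring the more relevant one."""
    return math.log(1.0 + math.exp(-(score_pos - score_neg)))

correct = pairwise_logistic_loss(2.0, 0.5)   # well-ordered pair, ~0.20
inverted = pairwise_logistic_loss(0.5, 2.0)  # inverted pair, ~1.70
```

LambdaRank extends this by scaling each pair's gradient by the NDCG change from swapping the pair, which is how listwise objectives sneak into pairwise training.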

Query Understanding

# Step 1: query analysis
query = "python interview questions senior"
intent = classifier.predict(query)  # navigational | informational | transactional
entities = ner.extract(query)       # ["python", "interview questions", "senior"]
expanded_query = synonym_expand(query)  # + "software engineer interview"

# Step 2: candidate retrieval
bm25_candidates = inverted_index.search(query, top_k=1000)   # lexical match
dense_candidates = ann_index.search(query_embedding, top_k=500)  # semantic match
candidates = merge_and_dedup(bm25_candidates, dense_candidates)

Ranking Features

  • Query-document relevance: BM25 score, semantic similarity, query term coverage
  • Document quality: PageRank, click-through rate history, freshness, authority
  • Personalization: user search history affinity, location, language
  • Context: time of day, device, previous queries in session

Evaluation Metrics

  • NDCG@K: quality of top-K results with graded relevance. Use for search and recommendations
  • MAP: mean average precision across queries. Use for information retrieval
  • AUC-ROC: binary classification quality. Use for CTR prediction and binary labels
  • MRR: mean reciprocal rank (position of first relevant result). Use for QA and navigational queries
  • Recall@K: fraction of relevant items in the top K. Use for retrieval quality
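
Minimal reference implementations of two of these metrics. This sketch uses linear gain for NDCG; some formulations use 2^rel - 1 to emphasize highly relevant results:

```python
import math

def dcg_at_k(rels: list, k: int) -> float:
    """Discounted cumulative gain: relevance discounted by log2 of position."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels: list, k: int) -> float:
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def mrr(relevant_flags: list) -> float:
    """Reciprocal rank of the first relevant result, 0 if none."""
    for i, rel in enumerate(relevant_flags):
        if rel:
            return 1.0 / (i + 1)
    return 0.0

ndcg = ndcg_at_k([3, 2, 0, 1], k=4)  # graded relevance in ranked order
rr = mrr([0, 0, 1, 0])               # first relevant result at rank 3
```

Averaging NDCG@K or reciprocal rank over a query set gives the reported offline numbers.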

Training-Serving Skew

The most common ML system bug: the model behaves differently in production than offline evaluation. Causes:

  • Features computed differently at training time vs serving time
  • Training data collected from a biased distribution (position bias — top results get more clicks)
  • Serving uses stale features (daily batch) but training assumes real-time freshness
  • Model trained on past distribution but deployed on shifted distribution (concept drift)

Fix: use the same feature computation code for training and serving (the feature store). Log features at serving time and use them directly for training (training on logged features). Correct for position bias with Inverse Propensity Scoring (IPS).
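
A toy illustration of IPS: each click is reweighted by the inverse of the (assumed known) probability that its position was examined, so clicks at rarely seen positions count for more. In practice examination propensities are estimated, e.g. via randomized result interleaving:

```python
def ips_weighted_click_rate(clicks: list, positions: list, propensity: list) -> float:
    """Position-bias-corrected click rate via inverse propensity weighting."""
    total = 0.0
    for click, pos in zip(clicks, positions):
        total += click / propensity[pos]  # upweight clicks at rarely examined positions
    return total / len(clicks)

# Position 2 is examined only 25% of the time, so its click counts 4x.
rate = ips_weighted_click_rate(clicks=[1, 0, 1], positions=[0, 1, 2],
                               propensity=[1.0, 0.5, 0.25])
```

Clipping or normalizing the weights is common in practice, since small propensities make the estimator high-variance.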

Model Monitoring

  • Data drift: input feature distributions shift (KL divergence, PSI — Population Stability Index)
  • Prediction drift: model output distribution shifts without obvious feature changes
  • Business metric regression: CTR, session length, revenue decline
  • Model staleness: performance degrades as world changes (retraining cadence)

# Monitoring checks (run every hour)
assert psi(training_feature_dist, serving_feature_dist) < 0.1  # feature drift (PSI)
assert live_ctr > ctr_baseline * 0.95                          # alert if CTR drops 5%+
assert p99_inference_latency_ms < 50                           # latency SLO
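
A simple PSI implementation for the drift check above, computed over pre-binned distributions (fractions summing to 1). A common rule of thumb treats PSI above 0.2 as significant shift, with 0.1 as a warning threshold:

```python
import math

def psi(expected: list, actual: list, eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

same = psi([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25])  # 0.0, no shift
shifted = psi([0.25, 0.25, 0.25, 0.25], [0.10, 0.25, 0.25, 0.40])  # mass moved between bins
```

Binning is usually by training-set quantiles so that each expected bin holds equal mass.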

Frequently Asked Questions

What is the two-stage retrieval and ranking architecture used in recommendation systems?

Two-stage architecture solves the challenge of scoring millions of items within milliseconds. Stage 1 (retrieval/candidate generation): a fast, approximate method retrieves hundreds to thousands of candidates from millions of items. Common approaches: approximate nearest neighbor search on learned embeddings (FAISS/ScaNN), collaborative filtering, BM25 for content retrieval. Stage 2 (ranking): a more expensive model (deep neural network) re-ranks the candidate set. The light pre-ranker may reduce 1,000 to 200 candidates; the heavy ranker picks the final 50. This lets you use a complex model where it matters (the final ranking) without paying its cost on millions of items.

What is training-serving skew in machine learning and how do you prevent it?

Training-serving skew occurs when the model behaves differently in production than it did during offline evaluation. Common causes: features are computed differently at training and serving time (different code paths), training data was collected with a biased distribution (position bias — top results get more clicks regardless of relevance), or serving uses stale features while training assumed fresh ones. Prevention: use a feature store that provides the same feature values in both training and serving, log features at serving time and use those exact logged values for training (training on logged features), and correct for position bias with inverse propensity scoring. Monitor for skew by comparing feature distributions between training data and live traffic.

How do you evaluate a machine learning recommendation system?

Offline metrics measure model quality on held-out data: NDCG@K (normalized discounted cumulative gain at K positions — measures ranking quality with graded relevance), AUC-ROC (ranking quality for binary labels like click/no-click), Recall@K (fraction of relevant items retrieved in top K). Online metrics measure business impact: CTR (click-through rate), session length, return visit rate, revenue per session. Offline metrics do not always correlate with online metrics — a model with higher NDCG may produce lower user retention. Always A/B test before fully launching. Watch for Goodhart's Law: optimizing a proxy metric (CTR) can harm the true goal (long-term user satisfaction) if users click clickbait.

