ML system design interviews are distinct from algorithm interviews and from traditional system design interviews. You are expected to define the problem as an ML task, choose a modeling approach, design the data pipeline, and address production concerns like feature freshness, training-serving skew, and model monitoring. This guide covers the three most common ML system design questions.
ML System Design Framework
Apply this structure to any ML system design question:
- Problem definition: What are we optimizing? What is the ML task (ranking, classification, regression)?
- Data: What training data exists? What labels? How fresh does it need to be?
- Features: User features, item features, context features, cross features
- Model: Which architecture? Tradeoffs between simplicity and capacity
- Training pipeline: Offline training, online learning, feedback loops
- Serving pipeline: Latency constraints, retrieval + ranking stages, caching
- Evaluation: Offline metrics (AUC-ROC, NDCG) vs online metrics (CTR, session length, revenue)
- Monitoring and maintenance: Data drift, concept drift, model degradation
Problem 1: News Feed Ranking (Facebook/LinkedIn/Twitter)
Problem Definition
Given a user and their social graph, rank N candidate posts to show in their feed. Optimize for engagement (long-term user satisfaction, not just next click).
ML Task
Multi-task learning: predict multiple engagement signals simultaneously:
- P(like | user, post)
- P(comment | user, post)
- P(share | user, post)
- P(report/hide | user, post) — negative signal
Final score: weighted sum of predictions. Weights reflect business priorities (comments weighted higher than likes at Facebook).
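The weighted combination can be sketched in a few lines. The weight values below are illustrative placeholders, not real production numbers; note the negative weight on the hide signal:

```python
# Hypothetical task weights; real values are tuned against business priorities.
WEIGHTS = {"like": 1.0, "comment": 4.0, "share": 2.0, "hide": -10.0}

def final_score(predictions: dict) -> float:
    """Combine per-task probabilities into a single ranking score."""
    return sum(WEIGHTS[task] * p for task, p in predictions.items())

posts = [
    {"like": 0.30, "comment": 0.05, "share": 0.02, "hide": 0.001},
    {"like": 0.10, "comment": 0.15, "share": 0.01, "hide": 0.002},
]
# The comment-heavy post wins despite a lower like probability.
ranked = sorted(posts, key=final_score, reverse=True)
```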
Candidate Generation (Retrieval)
You cannot score all posts in the network — too slow. Two-stage retrieval:
- Social graph retrieval: posts from friends/follows in last 48h
- Interest-based retrieval: posts similar to historically engaged content (embedding similarity)
- Merge candidates, dedup, cap at ~1,000 candidates
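The merge-and-dedup step is simple but worth stating precisely; a minimal sketch (post IDs as strings, sources as plain lists):

```python
def merge_candidates(*sources, cap=1000):
    """Merge retrieval sources, dedup by post id, preserve first-seen order."""
    seen, merged = set(), []
    for source in sources:
        for post_id in source:
            if post_id not in seen:
                seen.add(post_id)
                merged.append(post_id)
    return merged[:cap]

social = ["p1", "p2", "p3"]     # from social graph retrieval
interest = ["p2", "p4"]         # from interest-based retrieval
candidates = merge_candidates(social, interest)  # ["p1", "p2", "p3", "p4"]
```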
Feature Engineering
| Category | Examples |
|---|---|
| User features | Account age, avg daily sessions, top engaged topics, time since last login |
| Post features | Age, media type, topic embeddings, engagement velocity (likes/hr in first 30 min) |
| Author features | Connection strength to user, author engagement rate, author posting frequency |
| Context features | Device type, time of day, day of week, current session length |
| Cross features | User-topic affinity score, user-author historical interaction rate |
Model Architecture
# Two-tower model for candidate retrieval
# Tower 1: user embedding
# Tower 2: post embedding
# Score: dot product of towers (cosine similarity)
# Full ranking model: DLRM (Deep Learning Recommendation Model)
# Input: sparse features (embeddings) + dense features
# Architecture: embedding layers → feature interaction layer → MLP → multi-task output heads
# Output: [p_like, p_comment, p_share, p_hide]
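A minimal numpy sketch of the two-tower scoring idea (not the production DLRM; untrained linear layers stand in for the MLP towers):

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """A single linear layer standing in for an MLP tower; output is
    L2-normalized so the dot product between towers is cosine similarity."""
    emb = features @ weights
    return emb / np.linalg.norm(emb)

user_feats = rng.normal(size=8)            # stand-in for a user feature vector
post_feats = rng.normal(size=(5, 8))       # 5 candidate posts
W_user = rng.normal(size=(8, 4))           # random weights, illustration only
W_post = rng.normal(size=(8, 4))

user_emb = tower(user_feats, W_user)
post_embs = np.stack([tower(p, W_post) for p in post_feats])
scores = post_embs @ user_emb              # one retrieval score per post
best = int(np.argmax(scores))
```

In the real system the towers are deep networks trained end-to-end, and the full ranking model adds feature-interaction layers plus one output head per engagement task.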
Serving Pipeline
Feed Request
↓
[Candidate Generator] ← social graph DB, interest index
↓ ~1000 candidates
[Light Ranker] ← fast model (GBDT), scores ~1,000 candidates, latency budget: 10ms
↓ ~200 candidates
[Heavy Ranker] ← deep neural net, scores ~200 candidates, latency budget: 50ms
↓ ~50 posts
[Policy Layer] ← diversity, spam filter, content safety
↓
[User Feed]
Problem 2: Recommendation System (YouTube/Netflix/Spotify)
Problem Definition
Recommend items (videos/movies/songs) to a user. Optimize for long-term engagement and retention, not just immediate click.
Collaborative Filtering vs Content-Based
| Approach | How It Works | Cold Start? | Explainability |
|---|---|---|---|
| Collaborative filtering | Users with similar history like similar items | Problem (no history for new users) | Low |
| Content-based | Recommend items similar to what user liked | Works for new users with item metadata | High |
| Hybrid | Combine both signals | Manageable | Medium |
| Two-tower neural | Learn user and item embeddings end-to-end | Partial (metadata for items) | Low |
Two-Tower Architecture
User Tower:
Input: [user_id, watch_history_embeddings, search_history, demographics]
→ MLP → 256-dim user embedding
Item Tower:
Input: [video_id, title_embedding, category, duration, upload_date]
→ MLP → 256-dim item embedding
Training: maximize the inner product for observed (user, watched item) pairs against sampled negatives (e.g., sampled softmax)
Retrieval: ANN (Approximate Nearest Neighbors) using FAISS or ScaNN
→ find top-K items closest to user embedding in embedding space
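At small scale the retrieval step is just an exact top-K inner-product search; a numpy sketch of what FAISS or ScaNN approximate at millions of items (random embeddings here, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
item_embs = rng.normal(size=(10_000, 256)).astype(np.float32)
item_embs /= np.linalg.norm(item_embs, axis=1, keepdims=True)
user_emb = rng.normal(size=256).astype(np.float32)
user_emb /= np.linalg.norm(user_emb)

def top_k(user_emb, item_embs, k=10):
    """Exact top-K by inner product; FAISS/ScaNN trade exactness for speed."""
    scores = item_embs @ user_emb
    idx = np.argpartition(-scores, k)[:k]   # k best, unordered
    return idx[np.argsort(-scores[idx])]    # sort the k winners by score

candidates = top_k(user_emb, item_embs, k=10)
```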
Cold Start Solutions
- New user: onboarding flow to collect initial preferences; use demographic-based priors; fallback to popularity-based recommendations
- New item: use content features (title, description embeddings) to place item in embedding space; show to a random sample to collect initial engagement signals
Feature Store
Features need different freshness levels:
- Real-time features (milliseconds): current session context, last clicked item — served from Redis
- Near-real-time features (minutes): video engagement velocity, trending scores — from Kafka stream processing
- Batch features (hours/days): user long-term preferences, item quality scores — from Spark jobs to a feature store (Feast, Tecton)
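Serving then joins the three tiers into one feature vector. A toy sketch where plain dicts stand in for Redis, a stream-materialized view, and the batch feature store (all names and values hypothetical):

```python
# Illustrative stand-ins for Redis, a Kafka-materialized view, and a batch store.
realtime_store = {"u1": {"last_clicked_item": "v42", "session_length_s": 310}}
near_rt_store  = {"v42": {"engagement_velocity": 12.5}}
batch_store    = {"u1": {"topic_affinities": {"ml": 0.9, "cooking": 0.2}}}

def assemble_features(user_id, item_id):
    """Join feature tiers into one dict for the ranker; a missing tier
    falls back to {} rather than failing the request."""
    feats = {}
    feats.update(batch_store.get(user_id, {}))
    feats.update(near_rt_store.get(item_id, {}))
    feats.update(realtime_store.get(user_id, {}))
    return feats

features = assemble_features("u1", "v42")
```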
Problem 3: Search Ranking (Google/LinkedIn/Amazon)
ML Task
Learning to Rank (LTR): given a query and a set of candidate documents, rank them by relevance. Three approaches:
- Pointwise: predict relevance score per (query, doc) pair. Simple, ignores relative ordering
- Pairwise: predict which of two documents is more relevant. RankNet, LambdaRank
- Listwise: optimize list-level metrics directly (NDCG). LambdaMART. Best offline metrics but harder to train
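The pairwise idea reduces to a small loss function. A sketch of the RankNet loss for a pair where document i is known to be more relevant than document j:

```python
import math

def ranknet_loss(s_i: float, s_j: float) -> float:
    """RankNet pairwise loss when doc i is more relevant than doc j:
    -log P(i ranked above j), with P = sigmoid(s_i - s_j)."""
    p = 1.0 / (1.0 + math.exp(-(s_i - s_j)))
    return -math.log(p)

# Loss shrinks as the model separates the relevant doc from the irrelevant one.
bad_order = ranknet_loss(0.1, 0.9)   # relevant doc scored lower -> high loss
good_order = ranknet_loss(0.9, 0.1)  # relevant doc scored higher -> low loss
```

LambdaRank keeps this loss but scales each pair's gradient by the NDCG change from swapping the two documents, which is what pushes it toward list-level quality.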
Query Understanding
# Illustrative pseudocode: classifier, ner, synonym_expand, and the index objects are stand-ins
# Step 1: query analysis
query = "python interview questions senior"
intent = classifier.predict(query) # navigational | informational | transactional
entities = ner.extract(query) # ["python", "interview questions", "senior"]
expanded_query = synonym_expand(query) # + "software engineer interview"
# Step 2: candidate retrieval
bm25_candidates = inverted_index.search(query, top_k=1000) # lexical match
dense_candidates = ann_index.search(query_embedding, top_k=500) # semantic match
candidates = merge_and_dedup(bm25_candidates, dense_candidates)
Ranking Features
- Query-document relevance: BM25 score, semantic similarity, query term coverage
- Document quality: PageRank, click-through rate history, freshness, authority
- Personalization: user search history affinity, location, language
- Context: time of day, device, previous queries in session
Evaluation Metrics
| Metric | What It Measures | When to Use |
|---|---|---|
| NDCG@K | Quality of top-K results with graded relevance | Search, recommendations |
| MAP | Mean average precision across queries | Information retrieval |
| AUC-ROC | Binary classification quality | CTR prediction, binary labels |
| MRR | Mean reciprocal rank — position of first relevant result | QA, navigational queries |
| Recall@K | Fraction of relevant items in top-K | Retrieval quality |
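NDCG@K is worth being able to compute by hand in an interview. A minimal implementation using the common gain formula (2^rel - 1) with graded relevance labels:

```python
import math

def dcg_at_k(relevances, k):
    """DCG: sum of (2^rel - 1) / log2(position + 1) over the top-K positions."""
    return sum((2**rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """Normalize DCG by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of results in model order (3 = perfect, 0 = irrelevant).
# Positions 3 and 4 are swapped relative to the ideal order, so NDCG < 1.
ranked_rels = [3, 2, 0, 1]
score = ndcg_at_k(ranked_rels, k=4)
```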
Training-Serving Skew
The most common ML system bug: the model behaves differently in production than offline evaluation. Causes:
- Features computed differently at training time vs serving time
- Training data collected from a biased distribution (position bias — top results get more clicks)
- Serving uses stale features (daily batch) but training assumes real-time freshness
- Model trained on past distribution but deployed on shifted distribution (concept drift)
Fix: use the same feature computation code for training and serving (the feature store). Log features at serving time and use them directly for training (training on logged features). Correct for position bias with Inverse Propensity Scoring (IPS).
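The IPS correction itself is one line: weight each click by the inverse of the probability the user examined that position. The examination probabilities below are hypothetical; in practice they are estimated, e.g. from randomized result interleaving:

```python
# Hypothetical examination probabilities per rank position.
examine_prob = {1: 0.9, 2: 0.6, 3: 0.3}

def ips_weight(position: int) -> float:
    """Inverse propensity weight: clicks at low-visibility positions count
    more in training, offsetting the position bias in logged data."""
    return 1.0 / examine_prob[position]

# A click at position 3 gets ~3x the training weight of a click at position 1.
w1, w3 = ips_weight(1), ips_weight(3)
```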
Model Monitoring
- Data drift: input feature distributions shift (KL divergence, PSI — Population Stability Index)
- Prediction drift: model output distribution shifts without obvious feature changes
- Business metric regression: CTR, session length, revenue decline
- Model staleness: performance degrades as world changes (retraining cadence)
# Monitoring checks (run every hour)
assert psi(training_feature_dist, serving_feature_dist) < 0.1  # feature drift check
assert live_ctr > ctr_baseline * 0.95  # alert if CTR drops 5%+
assert p99_inference_latency_ms < 50 # SLO
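The `psi` check above can be implemented directly from the standard formula; a sketch over matching histogram buckets (the epsilon guards against empty buckets):

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index over matching histogram buckets.
    expected/actual are bucket proportions that each sum to 1.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

train_dist = [0.25, 0.25, 0.25, 0.25]    # feature distribution at training time
serve_same = [0.25, 0.25, 0.25, 0.25]    # unchanged in serving
serve_shift = [0.10, 0.20, 0.30, 0.40]   # drifted in serving
```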
Frequently Asked Questions
What is the two-stage retrieval and ranking architecture used in recommendation systems?
Two-stage architecture solves the challenge of scoring millions of items within milliseconds. Stage 1 (retrieval/candidate generation): a fast, approximate method retrieves hundreds to thousands of candidates from millions of items. Common approaches: approximate nearest neighbor search on learned embeddings (FAISS/ScaNN), collaborative filtering, BM25 for content retrieval. Stage 2 (ranking): a more expensive model (deep neural network) re-ranks the candidate set. The light pre-ranker may reduce 1,000 to 200 candidates; the heavy ranker picks the final 50. This lets you use a complex model where it matters (the final ranking) without paying its cost on millions of items.
What is training-serving skew in machine learning and how do you prevent it?
Training-serving skew occurs when the model behaves differently in production than it did during offline evaluation. Common causes: features are computed differently at training and serving time (different code paths), training data was collected with a biased distribution (position bias — top results get more clicks regardless of relevance), or serving uses stale features while training assumed fresh ones. Prevention: use a feature store that provides the same feature values in both training and serving, log features at serving time and use those exact logged values for training (training on logged features), and correct for position bias with inverse propensity scoring. Monitor for skew by comparing feature distributions between training data and live traffic.
How do you evaluate a machine learning recommendation system?
Offline metrics measure model quality on held-out data: NDCG@K (normalized discounted cumulative gain at K positions — measures ranking quality with graded relevance), AUC-ROC (ranking quality for binary labels like click/no-click), Recall@K (fraction of relevant items retrieved in top K). Online metrics measure business impact: CTR (click-through rate), session length, return visit rate, revenue per session. Offline metrics do not always correlate with online metrics — a model with higher NDCG may produce lower user retention. Always A/B test before fully launching. Watch for Goodhart's Law: optimizing a proxy metric (CTR) can harm the true goal (long-term user satisfaction) if users click clickbait.