Recommendation systems power product discovery at Netflix, Spotify, Amazon, and LinkedIn. Designing one at scale involves a multi-stage architecture that balances relevance quality against latency constraints while serving millions of users.
Requirements
Functional: Return N personalized item recommendations per user, support real-time signals (recent clicks), handle cold start for new users/items, support multiple contexts (homepage, item detail, email).
Non-functional: < 100ms p99 for recommendation serving, handle 10M DAU, update models daily (batch) and recommendations in near-real-time (stream), support A/B testing of recommendation algorithms.
Multi-Stage Recommendation Architecture
All items (millions)
│
▼
[Candidate Generation] ← retrieval: narrow to ~1000 candidates
│ (ANN search, collaborative filtering, content rules)
│
▼
[Pre-Ranking / Filtering] ← remove seen, blocked, low-quality
│
▼
[Ranking Model] ← ML model scores each candidate (~50ms budget)
│ (gradient boosted trees, deep learning ranker)
│
▼
[Post-Processing] ← diversity injection, business rules, A/B
│
▼
Final N recommendations
Candidate Generation: Collaborative Filtering
Matrix Factorization
User-Item interaction matrix R (sparse):
Items →
U [5, ?, 3, ?, 1]
s [?, 4, ?, 5, ?]
e [1, ?, ?, 3, 5]
r ...
s
↓
Factorize: R ≈ U × Vᵀ
U: user embeddings (n_users × k)
V: item embeddings (n_items × k)
k = 64-256 latent factors
Prediction: r̂_ij = u_i · v_j (dot product of user i's and item j's embeddings)
Training: minimize Σ_(i,j) observed (r_ij - u_i · v_j)² + λ(‖u_i‖² + ‖v_j‖²)
via ALS (Alternating Least Squares) or SGD
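The objective above fits in a few lines of SGD. A toy sketch using the example matrix, where NaN marks an unobserved rating (k, learning rate, and step count are illustrative, not production settings):

```python
import numpy as np

def factorize(R, k=2, steps=500, lr=0.05, lam=0.01, seed=0):
    """SGD on the regularized squared error over observed entries of R.
    NaN marks an unobserved ('?') rating."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, k))   # user embeddings
    V = 0.1 * rng.standard_normal((n_items, k))   # item embeddings
    observed = [(i, j) for i in range(n_users) for j in range(n_items)
                if not np.isnan(R[i, j])]
    for _ in range(steps):
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]           # r_ij - u_i . v_j
            U[i] += lr * (err * V[j] - lam * U[i])
            V[j] += lr * (err * U[i] - lam * V[j])
    return U, V

R = np.array([[5, np.nan, 3, np.nan, 1],
              [np.nan, 4, np.nan, 5, np.nan],
              [1, np.nan, np.nan, 3, 5]])
U, V = factorize(R)
R_hat = U @ V.T      # dense predictions fill in the '?' entries
```

ALS would instead alternate closed-form least-squares solves for U and V, which parallelizes better over large matrices.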
Two-Tower Neural Model
User Tower Item Tower
──────────── ──────────
user_id embedding item_id embedding
+ user features + item features
+ recent history + content features
│ │
Dense layers Dense layers
│ │
user vector (256d) ──── item vector (256d)
cosine similarity → score
Training: contrastive loss (positive pairs from interactions,
negative sampling from random items)
Two-tower models allow pre-computing item embeddings offline. At serving time, only the user tower runs, and ANN search finds nearest item vectors in milliseconds.
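A minimal sketch of that serving pattern, with untrained random weights and toy layer sizes (the tower architecture here is illustrative): item vectors are computed once offline, and a query needs only one user-tower forward pass plus a similarity lookup.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, W1, W2):
    """Tiny MLP: dense -> ReLU -> dense -> L2-normalize."""
    h = np.maximum(x @ W1, 0.0)
    z = h @ W2
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-12)

d_user, d_item, d_emb = 8, 12, 4     # toy feature/embedding widths
Wu1, Wu2 = rng.standard_normal((d_user, 16)), rng.standard_normal((16, d_emb))
Wi1, Wi2 = rng.standard_normal((d_item, 16)), rng.standard_normal((16, d_emb))

# Offline: precompute and index every item's embedding once
item_vecs = tower(rng.standard_normal((1000, d_item)), Wi1, Wi2)

# Online: one user-tower forward pass, then a similarity lookup
user_vec = tower(rng.standard_normal((1, d_user)), Wu1, Wu2)
scores = item_vecs @ user_vec.ravel()    # cosine (both sides L2-normalized)
top5 = np.argsort(-scores)[:5]
```

In production the exact `item_vecs @ user_vec` scan is replaced by an ANN index, which is what the next section covers.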
Approximate Nearest Neighbor (ANN) Search
Problem: find top-K closest item vectors to user vector
from 10M item embeddings in < 10ms
Algorithms:
HNSW (Hierarchical Navigable Small World):
- Multi-layer graph; traverse from top layer to bottom
- ~95% recall at 10x speedup vs brute force
- Memory: n × d × 4 bytes = 10M × 256 × 4 ≈ 10GB for raw vectors (plus graph-link overhead)
FAISS (Facebook AI Similarity Search):
- Supports IVF (inverted file index): cluster items, search top clusters
- IVFFlat: exact within clusters, approximate overall
- GPU acceleration for batch serving
Services: Pinecone, Weaviate, Milvus, Elasticsearch kNN
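The IVF idea can be illustrated end to end: cluster the item vectors with a few k-means iterations, then search only the `nprobe` clusters nearest the query. A toy sketch with synthetic clustered data (sizes are illustrative; real systems use FAISS or hnswlib, not hand-rolled k-means):

```python
import numpy as np

rng = np.random.default_rng(1)
true_centers = rng.standard_normal((50, 32))
# Toy item embeddings with genuine cluster structure
items = np.repeat(true_centers, 100, axis=0) + 0.1 * rng.standard_normal((5000, 32))

# Build: coarse quantizer (a few Lloyd iterations) + inverted lists
n_clusters = 50
centroids = items[rng.choice(len(items), n_clusters, replace=False)].copy()
for _ in range(10):
    assign = np.argmin(((items[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(n_clusters):
        members = items[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
inverted = {c: np.where(assign == c)[0] for c in range(n_clusters)}

def ivf_search(q, k=10, nprobe=5):
    """Exact search restricted to the nprobe clusters nearest to q."""
    near = np.argsort(((centroids - q) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([inverted[c] for c in near])
    d = ((items[cand] - q) ** 2).sum(-1)
    return cand[np.argsort(d)[:k]]

query = items[7] + 0.05 * rng.standard_normal(32)
approx = ivf_search(query, k=10)
exact = np.argsort(((items - query) ** 2).sum(-1))[:10]
```

Raising `nprobe` trades latency for recall, which is the central IVF tuning knob.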
Ranking Model
Input features per (user, item) pair:
User features: age, country, device, account_age, preferences
Item features: category, popularity, recency, avg_rating
Interaction: historical CTR for this user × category
Context: time_of_day, surface (homepage vs search)
Cross features: user_category_affinity, item_user_overlap
Model: LightGBM (fast, interpretable) or
Deep & Cross Network (DCN) for feature crosses
Output: P(click), P(purchase), P(watch_completion)
→ weighted combination = ranking score
Serving: pre-score at candidate generation time (batch for
returning users), real-time for personalization signals
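The multi-head fusion step can be sketched as below; the head names match the outputs above, but the weights are hypothetical (real systems tune them against long-term online metrics):

```python
# Hypothetical head weights; production weights are tuned via A/B testing.
WEIGHTS = {"click": 1.0, "purchase": 5.0, "watch_completion": 2.0}

def ranking_score(probs):
    """Fuse per-head probabilities P(click), P(purchase), ... into one score."""
    return sum(w * probs[head] for head, w in WEIGHTS.items())

candidates = [
    {"item": "a", "click": 0.10, "purchase": 0.02, "watch_completion": 0.30},
    {"item": "b", "click": 0.08, "purchase": 0.05, "watch_completion": 0.20},
]
ranked = sorted(candidates, key=ranking_score, reverse=True)
```

Weighting purchase highest reflects its business value; a pure-CTR objective tends to favor clickbait.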
Cold Start Problem
New User Cold Start
- Onboarding signals: ask users to rate seed items or select interests
- Demographic-based: serve popular items in user’s country/age cohort
- Session-based: update recommendations after first 2-3 interactions using session context model (RNN/Transformer over click sequence)
New Item Cold Start
- Content-based: embed item using text/image features; find nearest existing items in embedding space
- Exploration injection: insert new items in ranked list with boosted score for first N impressions to gather interaction data
- Warm-up period: use content features for ranking; switch to collaborative features after 100+ interactions
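The content-based fallback above can be sketched as: embed the new item with the same content encoder used for the catalog, then look up its nearest existing items. Random unit vectors stand in for real content embeddings here:

```python
import numpy as np

rng = np.random.default_rng(2)
catalog = rng.standard_normal((500, 16))
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)  # existing item content embeddings

def neighbors_for_new_item(content_vec, k=5):
    """A brand-new item has zero interactions, but its content embedding
    lets it inherit collaborative signals from similar existing items."""
    v = content_vec / np.linalg.norm(content_vec)
    return np.argsort(-(catalog @ v))[:k]

# New item whose content closely resembles catalog item 42
new_item = catalog[42] + 0.01 * rng.standard_normal(16)
nn = neighbors_for_new_item(new_item)
```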
Real-Time Personalization
User clicks item → Kafka event
→ Feature pipeline: update user session features (Redis, TTL 30min)
→ Stream processor: rerank current recommendation slate
→ A/B framework: route to correct model variant
→ Cache invalidation: bust user's cached recommendations
Architecture:
Flink consumer → user session store (Redis)
→ online feature store (Feast/Tecton)
→ trigger re-ranking if session changed significantly
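The session store's behavior (30-minute TTL, fresh session after expiry) can be sketched in plain Python; Redis with SETEX/EXPIRE plays this role in production, and the feature schema below is illustrative:

```python
import time

class SessionStore:
    """In-memory stand-in for a Redis session-feature store with a
    per-user TTL (Redis SETEX/EXPIRE provides this in production)."""

    def __init__(self, ttl_seconds=30 * 60):
        self.ttl = ttl_seconds
        self._data = {}   # user_id -> (expires_at, session_features)

    def update(self, user_id, event, now=None):
        now = time.time() if now is None else now
        expires, feats = self._data.get(user_id, (0.0, None))
        if feats is None or now >= expires:     # expired: start a fresh session
            feats = {"clicks": []}
        feats["clicks"].append(event)
        self._data[user_id] = (now + self.ttl, feats)   # sliding TTL

    def get(self, user_id, now=None):
        now = time.time() if now is None else now
        expires, feats = self._data.get(user_id, (0.0, None))
        return feats if now < expires else None
```

The sliding TTL keeps a session alive while the user stays active, and expiry naturally resets session features for the next visit.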
Diversity and Serendipity
Pure relevance optimization creates filter bubbles and repetitive recommendations. Post-processing injects diversity:
- MMR (Maximal Marginal Relevance): iteratively select next item maximizing λ × relevance – (1-λ) × similarity to already selected items
- Category caps: max 2 items per category in top-10
- Freshness injection: reserve slots for items newer than 7 days
- Exploration slots: ε-greedy: 5% of slots serve exploratory items to combat popularity bias
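MMR itself is a short greedy loop. A sketch assuming a relevance vector and a pairwise item-similarity matrix are already available from the ranker and embeddings:

```python
import numpy as np

def mmr(relevance, sim, k, lam=0.7):
    """Greedy Maximal Marginal Relevance re-ranking.
    relevance: (n,) relevance scores; sim: (n, n) item-item similarity."""
    selected = [int(np.argmax(relevance))]      # seed with the most relevant item
    remaining = set(range(len(relevance))) - set(selected)
    while len(selected) < k and remaining:
        def marginal(i):
            # relevance, penalized by similarity to what is already picked
            return lam * relevance[i] - (1 - lam) * max(sim[i][j] for j in selected)
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.discard(best)
    return selected

relevance = [1.0, 0.95, 0.5]
sim = [[1.0, 0.99, 0.10],
       [0.99, 1.0, 0.20],
       [0.10, 0.20, 1.0]]
slate = mmr(relevance, sim, k=2, lam=0.5)   # item 2 beats near-duplicate item 1
```

With λ = 1 this degenerates to pure relevance ordering; lowering λ trades relevance for diversity.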
Offline Evaluation vs Online Metrics
| Metric | Stage | Measures |
|---|---|---|
| Recall@K, NDCG@K | Offline | Ranking quality on held-out interactions |
| AUC-ROC | Offline | Discriminative power of ranker |
| CTR, CVR | Online A/B | User engagement |
| Session length, D7 retention | Online A/B | Long-term user satisfaction |
| Catalog coverage | Online | Diversity of items surfaced |
Selection bias: offline metrics evaluate only items the logging policy actually showed users, so a new ranker is judged on the old ranker's exposure distribution. Use inverse propensity scoring or counterfactual evaluation to correct for this.
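Recall@K and NDCG@K from the table above are simple to compute on held-out interactions. A sketch with binary relevance:

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of held-out relevant items that appear in the top-k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain of the ranking vs. the ideal."""
    dcg = sum(1.0 / np.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / np.log2(pos + 2) for pos in range(min(len(relevant), k)))
    return dcg / ideal

ranked = ["a", "b", "c", "d", "e"]      # model's ordering
relevant = {"b", "e"}                   # held-out interactions
r3 = recall_at_k(ranked, relevant, 3)   # only "b" makes the top-3
n3 = ndcg_at_k(ranked, relevant, 3)
```

NDCG rewards placing relevant items earlier, which Recall@K cannot distinguish.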
System Design Summary
- Scale: Two-tower model for retrieval (ANN over 10M item embeddings), LightGBM/DCN for ranking ~1000 candidates
- Latency: ANN search < 10ms, ranking < 50ms, post-processing < 10ms ≈ 70ms total, within the 100ms p99 budget
- Freshness: Flink pipeline for real-time user session features; model retrained daily
- Infrastructure: Kafka → Flink → Feast (feature store) → Serving layer (TF Serving / Triton)
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is a two-tower model and why is it used for recommendation retrieval?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A two-tower model (dual encoder) uses two separate neural networks, one for users and one for items, that each produce a fixed-dimension embedding vector. Similarity between user and item is computed as a dot product or cosine similarity. The key advantage: item embeddings can be pre-computed offline and indexed in an Approximate Nearest Neighbor (ANN) index (e.g., HNSW, FAISS). At serving time, only the user tower runs on live features, then ANN search retrieves the top-K closest item vectors in milliseconds; there is no need to score millions of items at query time."
      }
    },
    {
      "@type": "Question",
      "name": "How do recommendation systems solve the cold start problem for new users?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "New user cold start is addressed through multiple strategies: (1) Onboarding: ask users to select interests or rate seed items to bootstrap preferences. (2) Demographic fallback: serve popular items segmented by country, device, or referral source. (3) Session-based models: after the first 2-3 interactions, a session context model (RNN or Transformer over the click sequence) generates personalized candidates without requiring historical data. (4) Cross-domain transfer: if the user has a connected account (e.g., Facebook login), import signals from other platforms with consent."
      }
    },
    {
      "@type": "Question",
      "name": "What is the difference between collaborative filtering and content-based filtering?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Collaborative filtering recommends items based on the behavior of similar users ('users who liked X also liked Y') using the interaction matrix (ratings, clicks, purchases). It requires no item content understanding but suffers from cold start and data sparsity. Content-based filtering recommends items similar to ones the user has interacted with, using item features (genre, description, tags); it works for new items with no interactions but creates filter bubbles (it only recommends similar items). Production systems combine both: collaborative filtering for personalization, content-based for cold start and diversity, with a learning-to-rank model that fuses the signals."
      }
    }
  ]
}