Recommendation System Low-Level Design
Recommendation systems are one of the highest-ROI components in consumer products. They solve the discovery problem: users can’t browse a catalog of millions of items, so the system must surface the right items at the right time.
Why Recommendations Matter
The business case is well-documented. Netflix reports that approximately 80% of content watched on the platform is discovered through its recommendation system rather than search. Amazon attributes roughly 35% of its revenue to its recommendation engine. These numbers make recommendation systems a top engineering priority and a frequent system design interview topic at FAANG-level companies.
Collaborative Filtering – User-Based
User-based collaborative filtering finds users similar to the target user and recommends items those similar users liked. The core data structure is a user-item matrix where rows are users, columns are items, and values are ratings or implicit signals (clicks, watch time, purchases).
Similarity between users is computed using cosine similarity or Pearson correlation:
- Cosine similarity: treats each user’s rating vector as a point in item-space; measures the angle between vectors. Works well for implicit feedback.
- Pearson correlation: accounts for differences in rating scales (one user rates everything 4-5, another uses the full 1-5 range). Better for explicit ratings.
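The two measures can be sketched in a few lines. This is a minimal illustration on a pair of hypothetical rating vectors; `u1` and `u2` are made-up users who agree on relative preferences but use different rating scales:

```python
import numpy as np

def cosine_similarity(a, b):
    # angle between the raw rating vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson_correlation(a, b):
    # mean-center first, so each user's personal rating scale cancels out
    a_c, b_c = a - a.mean(), b - b.mean()
    return np.dot(a_c, b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))

# One user rates everything 4-5, the other uses the full 1-5 range
u1 = np.array([4.0, 5.0, 4.0, 5.0])
u2 = np.array([1.0, 5.0, 1.0, 5.0])
print(cosine_similarity(u1, u2))
print(pearson_correlation(u1, u2))  # 1.0: identical taste once scales are removed
```

Pearson reports perfect agreement here because mean-centering removes each user's offset, which is exactly why it suits explicit ratings.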
Limitations: user-based CF doesn’t scale well. With 100M users, computing pairwise similarity is O(n^2) in users. Item catalogs are also typically smaller and more stable than user bases, which motivates item-based CF.
Collaborative Filtering – Item-Based
Item-based CF flips the approach: find items similar to ones the user already liked, then recommend those similar items. Item similarity is computed once offline and cached – this is the key scalability advantage over user-based CF.
Item similarity is more stable over time than user similarity. A user’s taste can change quickly; a relationship like “users who liked The Dark Knight also liked Inception” changes slowly. Amazon pioneered item-based CF at scale with their 2003 IEEE Internet Computing paper.
The recommendation for user u is: for each item i the user has interacted with, find the k most similar items to i (excluding items already seen), aggregate scores weighted by item similarity, and surface the top results.
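The scoring step above can be sketched as follows. This is a toy version, assuming a small dense item-item similarity matrix (`item_sim` and the example values are hypothetical; production systems store a sparse, precomputed version):

```python
import numpy as np

def item_based_scores(user_items, item_sim, k=2):
    """Score unseen items via the k most similar seen items.

    user_items: dict item_id -> rating for items the user interacted with
    item_sim:   (n_items x n_items) similarity matrix, precomputed offline
    """
    n = item_sim.shape[0]
    scores = {}
    for j in range(n):
        if j in user_items:
            continue  # never re-recommend items already seen
        # k most similar *seen* items to candidate j
        sims = sorted(((item_sim[i, j], r) for i, r in user_items.items()),
                      reverse=True)[:k]
        denom = sum(s for s, _ in sims)
        if denom > 0:
            # similarity-weighted average of the user's own ratings
            scores[j] = sum(s * r for s, r in sims) / denom
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical 4-item catalog; user loved item 0, was lukewarm on item 1
sim = np.array([[1.0, 0.2, 0.9, 0.1],
                [0.2, 1.0, 0.1, 0.8],
                [0.9, 0.1, 1.0, 0.3],
                [0.1, 0.8, 0.3, 1.0]])
recs = item_based_scores({0: 5.0, 1: 3.0}, sim)
```

Item 2 ranks first here because it is most similar to the item the user rated highest.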
Matrix Factorization
Matrix factorization decomposes the sparse user-item matrix R (m users x n items) into two dense matrices: user embeddings U (m x k) and item embeddings V (n x k), such that R is approximately U * V^T. The latent dimension k is typically 50-300.
Training methods:
- ALS (Alternating Least Squares): fix U, solve for V analytically, then fix V, solve for U. Parallelizes well – used by Spark MLlib. Good for implicit feedback.
- SGD (Stochastic Gradient Descent): update both matrices simultaneously on each observed rating. Handles explicit ratings naturally.
Matrix factorization handles sparse matrices better than neighborhood methods because it learns dense representations that generalize across the full item space. The learned embeddings capture latent factors – for movies these might correspond to genre, tone, era, etc., even though they’re never explicitly labeled.
Content-Based Filtering
Content-based filtering uses item attributes to find similar items. Instead of “users who liked X also liked Y,” it reasons “X and Y share similar attributes, so if you liked X you’ll probably like Y.”
Item feature representation:
- Categorical features: genre, tags, category – one-hot encoded or embedded
- Text features: title, description, reviews – TF-IDF vectors or dense embeddings from a text encoder
- Structured features: duration, release year, price – normalized scalars
User profile is constructed from the items they’ve interacted with – typically a weighted average of the item feature vectors, with more recent interactions weighted higher.
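The recency-weighted average can be sketched directly. `build_user_profile` and the 30-day half-life are illustrative assumptions; any monotone decay works:

```python
import numpy as np

def build_user_profile(item_vectors, timestamps_days, half_life_days=30.0):
    """Recency-weighted average of item feature vectors. Weights halve every
    `half_life_days`; the most recent interaction gets weight 1.0."""
    t = np.asarray(timestamps_days, dtype=float)
    age_days = t.max() - t
    w = 0.5 ** (age_days / half_life_days)
    return (w[:, None] * np.asarray(item_vectors)).sum(axis=0) / w.sum()

# Two one-hot items: the older one (30 days ago) gets half the weight
profile = build_user_profile([[1.0, 0.0], [0.0, 1.0]], [0.0, 30.0])
```

The resulting profile vector lives in the same feature space as items, so recommending reduces to a nearest-neighbor query against item vectors.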
Strength: no cold start for new items (as long as you have their attributes). Weakness: over-specialization – it can only recommend items similar to what the user already knows, missing serendipitous discoveries.
Two-Tower Neural Model
The two-tower architecture has become the industry standard for large-scale recommendations (YouTube, Pinterest, TikTok, LinkedIn). Two separate neural networks – one for users, one for items – each output a dense embedding vector. Recommendation score = dot product of user embedding and item embedding.
import torch
import torch.nn as nn

class UserTower(nn.Module):
    def __init__(self, user_feature_dim, embedding_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(user_feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embedding_dim),
        )

    def forward(self, user_features):
        # L2-normalize so the dot product behaves like cosine similarity
        return nn.functional.normalize(self.net(user_features), dim=-1)

class ItemTower(nn.Module):
    def __init__(self, item_feature_dim, embedding_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(item_feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embedding_dim),
        )

    def forward(self, item_features):
        return nn.functional.normalize(self.net(item_features), dim=-1)

# Training: maximize similarity for positive pairs, minimize for negative pairs
def contrastive_loss(user_emb, pos_item_emb, neg_item_emb, temperature=0.1):
    pos_score = torch.sum(user_emb * pos_item_emb, dim=-1) / temperature
    neg_score = torch.sum(user_emb * neg_item_emb, dim=-1) / temperature
    return -torch.log(torch.sigmoid(pos_score - neg_score)).mean()
Why two towers? The user and item towers are independent at inference time. You precompute all item embeddings offline and build an ANN (approximate nearest neighbor) index. At query time, you only need to run the user tower forward pass once, then do an ANN lookup – this is fast enough to serve in real time.
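The serving-time lookup can be sketched with exact search standing in for the ANN index. This is a simplification: `retrieve` below scans every item, whereas production systems would replace that one line with a FAISS or ScaNN query; the random embeddings are placeholders for real item-tower output:

```python
import numpy as np

# Offline: stack and L2-normalize all item embeddings (random stand-ins here)
rng = np.random.default_rng(0)
item_emb = rng.normal(size=(10_000, 128))
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

def retrieve(user_emb, item_emb, k=500):
    """Top-k by dot product. Exact brute-force search as a stand-in for an
    ANN index lookup."""
    scores = item_emb @ user_emb
    top = np.argpartition(-scores, k)[:k]   # unordered top-k in O(n)
    return top[np.argsort(-scores[top])]    # sort only the k winners

user_emb = item_emb[42]  # pretend the user tower produced this vector
candidates = retrieve(user_emb, item_emb, k=500)
```

Because both embeddings are unit-normalized, the dot product is cosine similarity, so a user whose embedding matches item 42 retrieves that item first.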
Candidate Generation vs Ranking
Production recommendation systems use a two-stage pipeline because you can’t run an expensive model over millions of items in 100ms:
Stage 1 – Candidate Generation: quickly retrieve 100-1000 candidates from millions of items. Methods: ANN search on embeddings (FAISS, ScaNN, Annoy), collaborative filtering lookup, popular items, rule-based retrieval. Speed matters more than precision here.
Stage 2 – Ranking: score the candidates with a more expensive model that uses richer features. This model can use features that are too costly to compute for every item – user-item interaction history, contextual features, real-time signals. Output: a relevance score per candidate. Sort and serve top-k.
The funnel: millions of items -> candidate generation -> 100-1000 candidates -> ranking model -> 10-50 final recommendations.
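The funnel can be expressed as a thin orchestration layer. `recommend`, `retrieve_fn`, and `rank_fn` are hypothetical hooks, standing in for the ANN lookup and the heavyweight ranking model respectively:

```python
def recommend(user, retrieve_fn, rank_fn, n_candidates=500, k=20):
    """Two-stage funnel: cheap retrieval over the full catalog, then an
    expensive ranker applied only to the small candidate set."""
    candidates = retrieve_fn(user, n_candidates)              # millions -> hundreds
    scored = [(rank_fn(user, item), item) for item in candidates]
    return [item for _, item in sorted(scored, reverse=True)[:k]]

# Toy wiring: retrieval returns item ids, the "ranker" prefers smaller ids
top = recommend(user="u1",
                retrieve_fn=lambda user, n: list(range(n)),
                rank_fn=lambda user, item: -item,
                n_candidates=10, k=3)
```

The key property is that `rank_fn`, however expensive, runs at most `n_candidates` times per request, which is what keeps total latency bounded.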
Feature Engineering
The quality of your features often matters more than model architecture. Standard feature categories:
- User features: demographics (age, location), long-term preferences (genre affinity, price sensitivity), engagement history (click-through rate, watch-through rate), recency of last interaction
- Item features: popularity (global CTR, trending score), recency (days since publication), category, quality signals (average rating, completion rate), price
- User-item cross features: has user interacted with this item’s author before? Is this item in a category the user frequently engages with?
- Context features: time of day (morning vs evening content preferences differ), day of week, device type (mobile vs desktop affects content format preferences), session depth (early vs late in session)
A/B Testing for Recommendations
Recommendation changes are always validated through A/B tests before full rollout. The setup:
- Traffic split: randomly assign users to treatment or control groups (typically 50/50 or 90/10 for high-risk changes)
- Metrics: primary metrics include CTR (click-through rate), engagement rate (watch time, read time), conversion rate; guardrail metrics include session abandonment rate, time to first interaction
- Statistical significance: run until you have enough power to detect the minimum detectable effect (MDE) at p < 0.05 with 80%+ power; pre-calculate required sample size
- Holdout groups: maintain a permanent holdout (1-5% of users who never see recommendation changes) to measure cumulative long-term impact vs. baseline
Common mistake: stopping the test too early when you see positive results (p-hacking). Commit to a minimum runtime before looking at results.
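The pre-calculated sample size mentioned above can be estimated with the standard two-proportion power formula. `samples_per_group` is an illustrative helper; the baseline CTR and lift in the example are made up:

```python
from math import ceil, sqrt
from statistics import NormalDist

def samples_per_group(p_baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-proportion z-test:
    detect an absolute lift of `mde_abs` over baseline rate `p_baseline`."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_b = NormalDist().inv_cdf(power)          # desired statistical power
    p1, p2 = p_baseline, p_baseline + mde_abs
    p_bar = (p1 + p2) / 2
    n = ((z_a * sqrt(2 * p_bar * (1 - p_bar))
          + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde_abs ** 2
    return ceil(n)

# e.g. baseline CTR 5%, detect a 0.5-percentage-point absolute lift
n = samples_per_group(0.05, 0.005)
```

Note how quickly n grows as the MDE shrinks: halving the detectable lift roughly quadruples the required sample, which is why small-effect tests need long runtimes.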
Cold Start Problem
Cold start affects both new users and new items:
New users: you have no behavioral data. Strategies: use demographics to find similar existing users and borrow their preferences; show popular items across categories to gather initial signals; use an explicit onboarding survey to capture stated preferences; model the signup flow context (what brought them here?).
New items: no behavioral data means collaborative filtering can’t surface them. Strategies: use content-based filtering on item attributes until behavioral data accumulates (typically 50-100 interactions); manually boost new items from trusted creators; set a “freshness” multiplier that decays as the item ages and behavioral data arrives.
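The decaying freshness multiplier can be sketched as below. The function name, the 2x starting boost, the 7-day half-life, and the 100-interaction threshold are all illustrative assumptions:

```python
def freshness_boost(age_days, n_interactions, base=2.0,
                    age_half_life=7.0, data_threshold=100):
    """Hypothetical score multiplier for new items: starts at `base` and
    decays toward 1.0 as the item ages AND as behavioral data accumulates."""
    age_decay = 0.5 ** (age_days / age_half_life)           # halves weekly
    data_decay = max(0.0, 1.0 - n_interactions / data_threshold)
    return 1.0 + (base - 1.0) * age_decay * data_decay
```

A brand-new item with no interactions gets the full 2x boost; once it is old or has accumulated enough interactions for collaborative filtering to take over, the multiplier returns to 1.0.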
Serving Architecture
The recommendation serving pipeline separates offline computation from online serving to meet latency requirements (sub-100ms):
Offline pipeline (runs daily or hourly):
- Train or fine-tune the recommendation model on recent interaction data
- Generate item embeddings for all items using the trained item tower
- Build ANN index (FAISS or ScaNN) over item embeddings
- Precompute item-item similarity for collaborative filtering
- Write results to feature store and ANN index store
Online serving (per request, target < 100ms):
- Retrieve user features from feature store (pre-materialized)
- Run user tower forward pass to get user embedding (~5ms)
- ANN search over item embedding index to get 500 candidates (~10ms)
- Retrieve item features and cross-features for candidates (~20ms)
- Run ranking model over candidates (~30ms)
- Apply business rules (diversity, deduplication, content policy)
- Return top-k results (~5ms serialization)