System Design Interview: Recommendation System (Netflix / Spotify)

Why Recommendations Matter

Netflix reports that 80% of content watched is discovered through recommendations. Spotify generates 30% of its streams through algorithmic playlists (Discover Weekly, Daily Mix). A well-designed recommendation system is the single highest-leverage product feature for engagement-driven platforms. This question tests ML system design, data pipeline architecture, and the ability to balance offline model training with real-time personalization.

Three Phases: Retrieval, Ranking, Reranking

Recommendation pipelines have three stages that trade off speed for accuracy as the candidate set shrinks:

  • Retrieval (candidate generation): from millions of items, quickly narrow to hundreds of candidates. Must complete in under 50ms. Uses fast approximate methods — ANN (Approximate Nearest Neighbor) search, user-based collaborative filtering, or item-item similarity lookup.
  • Ranking: score and sort the hundreds of candidates. A more expensive ML model runs here — typically a deep learning model with user/item features, context, and interaction history. Completes in 100-200ms.
  • Reranking: a business-logic layer on top of the ML ranking. Boost new releases, penalize recently watched content, enforce diversity (no more than 3 items from the same genre in the top-10), apply A/B test variants. Sub-millisecond rule execution.
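The reranking stage is typically a single pass of cheap rules over the already-ranked list. A minimal sketch, assuming candidates arrive as (item_id, score, genre) tuples — the field names and the specific rules are illustrative, not a real production rule set:

```python
def rerank(ranked, recently_watched, max_per_genre=3, top_k=10):
    """Apply business rules on top of an ML-ranked candidate list."""
    genre_counts = {}
    result = []
    for item_id, score, genre in ranked:
        if item_id in recently_watched:
            continue  # penalize (here: drop) recently watched content
        if genre_counts.get(genre, 0) >= max_per_genre:
            continue  # diversity constraint: cap items per genre
        genre_counts[genre] = genre_counts.get(genre, 0) + 1
        result.append(item_id)
        if len(result) == top_k:
            break
    return result
```

Because the rules only scan a few hundred pre-scored candidates once, this layer runs in well under a millisecond.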

Collaborative Filtering

User-based collaborative filtering: find users similar to the target user (same viewing history), then recommend items those users liked that the target has not seen yet. Item-based collaborative filtering: find items similar to what the user watched, recommend those. Item-based scales better — item-item similarity is precomputed offline and reused for all users; user-user similarity requires recomputation as users interact.


# Item-item collaborative filtering
# Precomputed offline: item -> {similar item: similarity score}
item_similarity = {
    "Stranger Things": {"Dark": 0.92, "Dark Tourist": 0.71},  # ... more similar items
    # ... more items
}

def recommend(user_id, n=10):
    watched = get_user_watch_history(user_id)
    seen = set(watched)  # O(1) membership checks
    candidates = {}
    for item in watched[-10:]:  # use recent history only
        # .get() guards against items with no precomputed similarities
        for similar_item, score in item_similarity.get(item, {}).items():
            if similar_item not in seen:
                candidates[similar_item] = candidates.get(similar_item, 0.0) + score
    # sort candidate items by accumulated similarity score, highest first
    return sorted(candidates, key=candidates.get, reverse=True)[:n]

Matrix Factorization

Matrix factorization decomposes the user-item interaction matrix (users x items) into two lower-dimensional matrices: user embeddings (users x k) and item embeddings (items x k), where k is the latent dimension (typically 64-256). The predicted rating for user u and item i is the dot product of their embedding vectors. Training: minimize the difference between predicted and actual ratings using SGD or ALS (Alternating Least Squares). The embeddings capture latent factors — genre preferences, mood, style — without needing explicit feature engineering. Neural Collaborative Filtering (NCF) extends this idea by replacing the dot product with a neural network for more expressive interaction modeling.
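The SGD training loop described above fits in a few lines. A toy sketch with NumPy — the dimensions, learning rate, and synthetic ratings are illustrative, and a real system would train on millions of interactions with ALS or a mini-batched optimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 2           # k = latent dimension (64-256 in practice)
U = rng.normal(scale=0.1, size=(n_users, k))   # user embeddings
V = rng.normal(scale=0.1, size=(n_items, k))   # item embeddings

# Observed (user, item, rating) triples; the interaction matrix is sparse
ratings = [(0, 1, 5.0), (0, 2, 1.0), (1, 1, 4.0), (2, 3, 2.0), (3, 0, 5.0)]

lr, reg = 0.05, 0.01
for epoch in range(500):
    for u, i, r in ratings:
        err = r - U[u] @ V[i]                   # predicted rating is a dot product
        U[u] += lr * (err * V[i] - reg * U[u])  # gradient step with L2 regularization
        V[i] += lr * (err * U[u] - reg * V[i])

pred = U[0] @ V[1]  # predicted rating for user 0, item 1
```

After training, each observed rating is approximated by the corresponding dot product, and unobserved (user, item) pairs get predictions for free from the learned embeddings.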

Two-Tower Model

The two-tower architecture is the dominant deep learning approach for recommendation retrieval. Two separate neural networks (towers) encode users and items into the same embedding space. The user tower takes user features (watch history, demographics, time of day, device) and outputs a user embedding. The item tower takes item features (genre, actors, duration, language) and outputs an item embedding. Similarity is the dot product. During training, contrastive learning pushes user embeddings close to items they liked and away from items they did not. At serving time: precompute all item embeddings offline, build an ANN index (FAISS, ScaNN), compute the user embedding online, then run ANN search to retrieve the nearest 200 items in under 10ms.
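The serving side of this design reduces to a dot-product search over precomputed item embeddings. A sketch of that step, with exact brute-force search standing in for an ANN index like FAISS or ScaNN (the embeddings here are random placeholders for trained tower outputs):

```python
import numpy as np

rng = np.random.default_rng(1)

# Precomputed offline by the item tower; normalized so dot product = cosine
item_embeddings = rng.normal(size=(1000, 64))
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

def retrieve(user_embedding, k=200):
    """Return the k item indices nearest to the user embedding."""
    scores = item_embeddings @ user_embedding   # one dot product per item
    top = np.argpartition(scores, -k)[-k:]      # unordered top-k, O(n)
    return top[np.argsort(scores[top])[::-1]]   # sort just the top-k, best first

user_emb = rng.normal(size=64)  # computed online by the user tower per request
candidates = retrieve(user_emb, k=200)
```

An ANN index replaces the exhaustive `item_embeddings @ user_embedding` scan with an approximate sublinear lookup, which is what makes under-10ms retrieval over millions of items feasible.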

Feature Store

A feature store is a centralized system for computing, storing, and serving ML features consistently between training and serving. The training-serving skew problem: if features are computed differently offline (training) vs online (serving), model accuracy degrades silently in production. The feature store solves this by defining features once and reusing the same computation everywhere. Architecture: online store (Redis/DynamoDB) serves precomputed features at low latency for real-time requests; offline store (S3/Hive) stores historical feature values for training. Features are versioned — experiments can use different feature versions. Point-in-time correct joins in the offline store ensure training data does not leak future information (a common source of evaluation overfit).

Real-Time vs Batch Recommendations

Batch recommendations are precomputed for all users on a schedule (nightly). Simple to operate — recommendation results are stored in a key-value store and looked up at serving time. Works well for stable preferences. Real-time recommendations update instantly based on the current session — if a user just watched a horror movie, the next recommendation should be horror, not comedy. Real-time requires: (1) streaming the user action to a feature update pipeline (Kafka), (2) updating user session features in Redis, (3) re-running the retrieval and ranking with fresh features. Netflix uses a hybrid: batch precomputation for the row-by-row page layout, with real-time context (the video just watched) used to re-rank within rows.
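The hybrid approach can be sketched as a session-aware adjustment over a batch-precomputed list: boost candidates that match the genre of the video just watched. The boost weight and tuple fields are illustrative:

```python
def session_rerank(candidates, last_watched_genre, boost=0.3):
    """Re-rank (item_id, score, genre) tuples using fresh session context."""
    def adjusted(c):
        item_id, score, genre = c
        # additive boost for items matching the just-watched genre
        return score + (boost if genre == last_watched_genre else 0.0)
    return [c[0] for c in sorted(candidates, key=adjusted, reverse=True)]
```

The batch scores come from the nightly pipeline; only the lightweight adjustment runs online, so the request path stays a key-value lookup plus a small sort.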

Evaluation

  • Precision@K: fraction of recommended items the user actually liked (out of top-K)
  • Recall@K: fraction of items the user likes that appear in top-K
  • NDCG (Normalized Discounted Cumulative Gain): relevance-weighted ranking metric
  • A/B test: online experiment measuring clicks, engagement time, subscription renewal
  • Offline metrics predict but do not guarantee online improvement — always A/B test
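The offline metrics above are straightforward to compute for a single user, given the ranked top-K list and the set of items the user actually liked. A sketch with binary relevance (liked / not liked):

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations the user actually liked."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all liked items that appear in the top-k."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """DCG with log2 position discount, normalized by the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)  # position i=0 gets discount 1/log2(2)=1
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```

System-level numbers average these per-user scores over a held-out evaluation set; NDCG is preferred when position matters, since a hit at rank 1 counts more than a hit at rank 10.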

Interview Tips

  • Three-phase pipeline (retrieval, ranking, reranking) is the expected structure
  • Two-tower model + FAISS ANN search is the modern industry answer for retrieval
  • Feature store addresses training-serving skew — mention this explicitly
  • Diversity and freshness constraints in reranking show product thinking
  • Cold start problem: new users get popularity-based recommendations; new items get content-based similarity

Frequently Asked Questions

What are the three stages of a recommendation system pipeline?

Production recommendation systems use three sequential stages: (1) Retrieval (candidate generation): reduce millions of items to hundreds of candidates in under 50ms. Uses fast approximate methods — ANN (Approximate Nearest Neighbor) search on learned embeddings, collaborative filtering, or popular items. Speed is more important than perfect accuracy here. (2) Ranking: score and sort the hundreds of candidates using a more expressive ML model (deep neural network) that can incorporate rich user and item features, contextual signals, and interaction history. Runs in 100-200ms. (3) Reranking: apply business logic on top of ML ranking — boost new releases, enforce diversity (no more than 3 same-genre items in top-10), apply A/B experiment variants, penalize recently watched content. Sub-millisecond rule execution. Each stage filters the candidate set further while increasing accuracy.

How does collaborative filtering work?

Collaborative filtering recommends items based on the behavior of similar users or the similarity of items, without requiring explicit item features. User-based CF: find users whose interaction history is most similar to the target user (using cosine similarity or Pearson correlation), then recommend items those similar users liked that the target has not seen. Item-based CF: find items similar to what the user already liked, then recommend those similar items. Item-based scales better — item similarities are precomputed offline and reused for all users, while user-user similarity requires recomputation as new users interact. Matrix factorization (SVD, ALS) is a more powerful form of collaborative filtering that learns latent embeddings for users and items by decomposing the user-item interaction matrix — predicted rating is the dot product of user and item embeddings.

How do you handle the cold start problem in recommendation systems?

Cold start occurs when there is no interaction data: (1) New user cold start: the system has no history for this user. Solutions: ask onboarding questions (Spotify asks about genres, Netflix about favorite shows), show popularity-based recommendations (most watched globally or in the user region), use demographic signals (age, location) if available, or use session-based recommendations that adapt after just a few interactions in the current session. (2) New item cold start: a newly added item has no ratings. Solutions: content-based similarity (use item features like genre, actors, description to find similar items with existing ratings and borrow their audience), manual editorial placement (Netflix manually curates new show placements), or exploration budgets (show new items to a small random fraction of users to collect initial interaction data quickly).

