Design a recommendation engine like Netflix’s, Spotify’s Discover Weekly, or Amazon’s “Customers also bought.” Recommendation systems are one of the most impactful ML applications in production — Netflix attributes 80% of content watched to recommendations. This problem tests both system design depth and ML knowledge.
Requirements Clarification
- Goal: Surface content the user will enjoy but hasn’t discovered. Maximize engagement (watch time, not just clicks).
- Scale: 300M users, 20,000 titles (Netflix). 1B recommendation requests/day.
- Latency: Homepage recommendations: <200ms. Real-time “up next”: <50ms.
- Signals: Watch history, ratings, search queries, time-of-day, device type, content metadata (genre, cast, director), social signals.
- Cold start: New users have no history. New content has no engagement data.
Two-Stage Architecture: The Industry Standard
Serving personalized recommendations from 20,000 titles in 200ms is impossible if you score every title with a complex ML model for every user at request time. The solution used by every major recommendation system is a two-stage pipeline (candidate generation, then ranking), followed by a lightweight re-ranking pass for business rules.
Stage 1 — Candidate Generation (Retrieval)
- Goal: narrow 20,000 titles → ~500 candidates, fast
- Method: ANN search on learned embeddings, collaborative filtering, rule-based filters
- Latency budget: ~20ms
Stage 2 — Ranking
- Goal: score the ~500 candidates, return top 25 for display
- Method: deep neural network with many features
- Latency budget: ~100ms
Stage 3 — Re-ranking (business rules)
- Goal: apply diversity, freshness, licensing, and A/B test constraints
- Latency budget: ~10ms
Stage 1: Candidate Generation
Collaborative Filtering
The core insight: users who behaved similarly in the past will like similar content. “Users like you also watched Squid Game.”
Matrix Factorization (ALS): Decompose the user-item interaction matrix into user embeddings (U) and item embeddings (V). User-item affinity = U_user · V_item. Learn embeddings by minimizing reconstruction error on observed interactions.
from implicit import als

# user_item_matrix: sparse CSR matrix, shape (n_users, n_items)
# Value = watch percentage (0 to 1), or binary (watched / not watched)
model = als.AlternatingLeastSquares(factors=128, iterations=20, regularization=0.01)
model.fit(user_item_matrix)

# Top 500 candidates for user 42, sorted by predicted score.
# implicit >= 0.5 takes the user's row and returns an (item_ids, scores) pair.
item_ids, scores = model.recommend(42, user_item_matrix[42], N=500)
ALS (Alternating Least Squares) scales to Netflix-size matrices on a Spark cluster. Netflix trains this weekly and serves recommendations from pre-computed candidate sets stored in Redis.
Two-Tower Neural Network
Modern candidate generation: a two-tower model with a user tower (encodes user history and context into an embedding) and an item tower (encodes item metadata into an embedding). Trained with contrastive loss — watched items should have high dot product with user embedding, unwatched items should have low dot product.
import torch
import torch.nn as nn

class UserTower(nn.Module):
    def __init__(self, user_vocab_size, history_vocab_size, embed_dim=128):
        super().__init__()
        self.user_embed = nn.Embedding(user_vocab_size, embed_dim)
        # Watch-history IDs are item IDs, so they need their own embedding table
        self.history_embed = nn.Embedding(history_vocab_size, embed_dim)
        self.history_encoder = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim)
        )

    def forward(self, user_id, watch_history_ids):
        user_emb = self.user_embed(user_id)
        history_emb = self.history_embed(watch_history_ids).mean(dim=1)  # mean pooling
        return self.history_encoder(user_emb + history_emb)

class ItemTower(nn.Module):
    def __init__(self, n_genres, embed_dim=128):
        super().__init__()
        # The "+ 10" assumes 10 dense features (popularity, content age, etc.)
        self.layers = nn.Sequential(
            nn.Linear(n_genres + 10, 256), nn.ReLU(),
            nn.Linear(256, embed_dim)
        )

    def forward(self, genre_features, other_features):
        x = torch.cat([genre_features, other_features], dim=-1)
        return self.layers(x)
At serving time: pre-compute item tower embeddings for all titles offline. Store them in a vector index or vector database (Faiss, Qdrant). At request time: compute the user tower embedding (fast, since it needs only user features), run ANN search against the item embeddings → 500 candidates in ~10ms.
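The retrieval step can be sketched with a brute-force dot-product search standing in for Faiss: the mechanics are identical, and an ANN index (IVF, HNSW) just makes the top-k search sublinear. All shapes and names below are illustrative.

```python
import numpy as np

def build_index(item_embeddings: np.ndarray) -> np.ndarray:
    # Production would use faiss.IndexFlatIP / IVF / HNSW;
    # a row-normalized matrix is enough to show the mechanics.
    norms = np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    return item_embeddings / np.clip(norms, 1e-9, None)

def retrieve_candidates(user_embedding: np.ndarray, index: np.ndarray, k: int = 500):
    user_embedding = user_embedding / max(np.linalg.norm(user_embedding), 1e-9)
    sims = index @ user_embedding            # cosine similarity to every item
    top_k = np.argsort(-sims)[:k]            # exact top-k; ANN approximates this
    return top_k, sims[top_k]

rng = np.random.default_rng(0)
index = build_index(rng.normal(size=(20_000, 128)))   # 20K item-tower embeddings
user = rng.normal(size=128)                           # user-tower output
candidate_ids, candidate_scores = retrieve_candidates(user, index, k=500)
```

Swapping the exact `argsort` for a Faiss index changes only the search call, not the surrounding pipeline.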
Stage 2: Ranking
The ranking model scores each candidate with richer features that are too expensive to compute at retrieval scale:
- User features: account age, device, country, time of day, recent searches
- Item features: age of content, genre, cast, trailer completion rate, global popularity
- User-item interaction features: did they watch the trailer? Did they add it to their list?
- Context features: is this for the homepage or “similar titles”?
Architecture: a deep neural net with wide (memorization via cross-product features) and deep (generalization via embedding layers) components — Google’s Wide & Deep architecture, widely adopted for recommendation ranking.
Training objective: multi-task — simultaneously predict probability of click, watch time, rating, and share. The final score is a weighted combination that reflects business priorities (watch time weighted more than clicks to avoid clickbait).
Stage 3: Re-ranking and Business Rules
- Diversity: Don’t show 10 thrillers in a row — ensure genre and mood variety in the top 25.
- Freshness: Boost recently added content to drive discovery. Netflix injects new releases into recommendations even at some cost to predicted CTR.
- Licensing constraints: Content may not be licensed in the user’s country. Filter at this stage.
- A/B test variants: Some users get an experimental ranking algorithm — apply the variant’s re-ranking logic.
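The diversity rule can be sketched as a greedy re-rank over the ranked candidates: walk the list best-first and skip items whose genre is already saturated in the slate. The cap of 3 per genre is an illustrative choice, and real systems balance several such constraints at once.

```python
def rerank_with_diversity(ranked_items, max_per_genre=3, slate_size=25):
    """ranked_items: list of (item_id, genre, score), best first.
    Greedily fills the slate, skipping items whose genre is saturated."""
    slate, genre_counts, overflow = [], {}, []
    for item_id, genre, score in ranked_items:
        if len(slate) == slate_size:
            break
        if genre_counts.get(genre, 0) < max_per_genre:
            slate.append(item_id)
            genre_counts[genre] = genre_counts.get(genre, 0) + 1
        else:
            overflow.append(item_id)
    # Backfill from the skipped items if too few genres exist to fill the slate
    for item_id in overflow:
        if len(slate) == slate_size:
            break
        slate.append(item_id)
    return slate

# 10 thrillers ranked above 10 comedies: the cap forces comedies into the slate
ranked = [(i, "thriller" if i < 10 else "comedy", 1.0 - i / 100) for i in range(20)]
slate = rerank_with_diversity(ranked, max_per_genre=3, slate_size=6)
```

Licensing and A/B variant filters slot in as additional predicates in the same loop.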
Handling Cold Start
New user cold start: Ask for genre preferences during onboarding. Use demographic signals (country, device, referral source). Fall back to globally popular content segmented by content type. After 3–5 interactions, collaborative filtering can start contributing.
New item cold start: Item tower embedding is available immediately from content metadata. Collaborative filtering component contributes zero until the item has engagement history. Actively promote new content through editorial placements and “new releases” carousels to accumulate signal quickly.
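One common way to phase collaborative filtering in for a new user is a confidence blend: weight the CF score by how much history the user has, falling back to a popularity prior otherwise. The 5-interaction ramp below is an illustrative choice matching the 3–5 interaction range above.

```python
def blended_score(cf_score, popularity_score, n_interactions, ramp=5):
    """Linearly shift trust from the popularity prior to the CF score
    as the user accumulates interactions (full CF weight after `ramp`)."""
    w_cf = min(n_interactions / ramp, 1.0)
    return w_cf * cf_score + (1.0 - w_cf) * popularity_score

# Brand-new user: pure popularity prior. After 5+ interactions: pure CF.
assert blended_score(0.9, 0.4, n_interactions=0) == 0.4
assert blended_score(0.9, 0.4, n_interactions=5) == 0.9
```

The same idea applies on the item side: blend the metadata-based item-tower score with the CF score as engagement accumulates.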
Training and Serving Infrastructure
Offline (daily batch):
User interaction logs → Spark → Train two-tower model + ALS
→ Compute user embeddings (all 300M users) → Redis
→ Compute item embeddings (20K items) → Faiss index
Online (request time, <200ms):
User ID → fetch pre-computed user embedding from Redis (~1ms)
→ ANN search in Faiss index → 500 candidates (~10ms)
→ Fetch candidate features from feature store (~20ms)
→ Ranking model inference (GPU) → top 25 (~100ms)
→ Re-ranking rules → return
Evaluation
- Offline metrics: Hit rate @ K (is the item the user watched in the top K?), NDCG (did the items the user liked appear near the top?), precision/recall
- Online metrics: Click-through rate, watch time per session, long-term retention — the metrics that actually matter for the business
- A/B testing: Never deploy a recommendation change without an A/B test. Offline metrics are imperfect proxies for online business metrics
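The offline metrics are simple to compute from held-out interactions. A minimal sketch with binary relevance (one held-out watched item per user for hit rate; the titles are illustrative):

```python
import math

def hit_rate_at_k(recommended, held_out_item, k=10):
    """1.0 if the held-out watched item appears in the top-k, else 0.0."""
    return 1.0 if held_out_item in recommended[:k] else 0.0

def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG: rewards placing relevant items near the top."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

recs = ["stranger_things", "dark", "the_crown", "squid_game"]
hit = hit_rate_at_k(recs, "dark", k=3)               # 1.0: "dark" is in the top 3
ndcg = ndcg_at_k(recs, {"dark", "squid_game"}, k=4)  # < 1.0: relevant items not at ranks 1-2
```

Averaging these over all held-out users gives the offline numbers compared between model versions, which is exactly why the A/B test still matters: they are proxies, not the business metric.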
Interview Follow-ups
- How would you design the “Because you watched X” explanation for recommendations?
- How do you prevent popularity bias — the model always recommending the same 100 blockbusters?
- How does the system handle a user who shares an account with their family? (Multiple distinct taste profiles on one account.)
- How would you design recommendations for a brand-new platform with no interaction data at all?
- How do you measure and mitigate filter bubbles — where recommendations trap users in a narrow content niche?
Related System Design Topics
- Embeddings and Vector Databases — two-tower models produce user and item embeddings; ANN search (HNSW/IVF) retrieves candidates in the retrieval stage
- Caching Strategies — pre-computed recommendation feeds cached in Redis sorted sets with score = ranking score; invalidated on new interaction events
- Message Queues — Kafka streams click/watch events to the feature store for real-time signal computation in the ranking stage
- Design an Ad Click Aggregation System — same Lambda architecture pattern: real-time stream for immediate signal, batch retraining for model updates
- Database Sharding — user-item interaction matrix sharded by user_id; item embeddings replicated to all shards for local candidate scoring