Recommendation System Low-Level Design

Recommendation systems are one of the highest-ROI components in consumer products. They solve the discovery problem: users can’t browse a catalog of millions of items, so the system must surface the right items at the right time.

Why Recommendations Matter

The business case is well documented. Netflix has reported that approximately 80% of what is watched on the platform is discovered through its recommendation system rather than search. A widely cited industry estimate attributes roughly 35% of Amazon’s purchases to its recommendation engine. Figures like these make recommendation systems a top engineering priority and a frequent system design interview topic at FAANG-level companies.

Collaborative Filtering – User-Based

User-based collaborative filtering finds users similar to the target user and recommends items those similar users liked. The core data structure is a user-item matrix where rows are users, columns are items, and values are ratings or implicit signals (clicks, watch time, purchases).

Similarity between users is computed using cosine similarity or Pearson correlation:

  • Cosine similarity: treats each user’s rating vector as a point in item-space; measures the angle between vectors. Works well for implicit feedback.
  • Pearson correlation: accounts for differences in rating scales (one user rates everything 4-5, another uses the full 1-5 range). Better for explicit ratings.
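
As a minimal sketch of the two measures, the snippet below compares two hypothetical users who share the same taste but use different parts of the rating scale. The vectors and function names are illustrative assumptions, and the vectors are assumed to cover co-rated items only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a vector with no signal is treated as similar to nothing
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    """Cosine similarity after subtracting each user's own mean rating."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

# Hypothetical ratings on the same four items.
generous = [4, 5, 4, 5]  # rates everything 4-5
harsh = [1, 3, 1, 3]     # same preferences, lower part of the scale

cos_sim = cosine_similarity(generous, harsh)
pearson = pearson_correlation(generous, harsh)
print(f"cosine={cos_sim:.3f} pearson={pearson:.3f}")
```

Mean-centering removes the scale difference, so Pearson treats the two users as perfectly correlated while raw cosine similarity is slightly lower.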

Limitations: user-based CF doesn’t scale well. With n = 100M users, computing all pairwise similarities is O(n^2), roughly 5 x 10^15 user pairs. Item catalogs are also typically smaller and more stable than user bases, which motivates item-based CF.

Collaborative Filtering – Item-Based

Item-based CF flips the approach: find items similar to ones the user already liked, then recommend those similar items. Item similarity is computed once offline and cached – this is the key scalability advantage over user-based CF.

Item similarity is more stable over time than user similarity. A user’s taste can change, but a relationship such as “users who liked The Dark Knight also liked Inception” changes slowly. Amazon pioneered item-based CF at scale and described the approach in a 2003 paper.

The recommendation for user u is: for each item i the user has interacted with, find the k most similar items to i (excluding items already seen), aggregate scores weighted by item similarity, and surface the top results.
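
The scoring procedure above can be sketched as follows. The neighbor table, its similarity values, and the function name `recommend_item_based` are hypothetical; in a real system the table would come from the offline item-similarity job.

```python
from collections import defaultdict

def recommend_item_based(user_items, item_neighbors, k=2, top_n=3):
    """Score unseen items by similarity to the items a user interacted with.

    user_items:     {item_id: interaction weight}
    item_neighbors: {item_id: [(neighbor_id, similarity), ...]} sorted by
                    similarity, highest first (precomputed offline)
    """
    scores = defaultdict(float)
    for item, weight in user_items.items():
        # Keep the k most similar neighbors the user has not already seen.
        unseen = [(n, s) for n, s in item_neighbors.get(item, [])
                  if n not in user_items][:k]
        for neighbor, sim in unseen:
            scores[neighbor] += weight * sim
    # Highest aggregate score first; ties broken by item id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical precomputed neighbor lists.
item_neighbors = {
    "dark_knight": [("inception", 0.9), ("batman_begins", 0.8), ("joker", 0.5)],
    "batman_begins": [("dark_knight", 0.8), ("joker", 0.6), ("inception", 0.4)],
}
user_items = {"dark_knight": 1.0, "batman_begins": 1.0}

recs = recommend_item_based(user_items, item_neighbors)
print(recs)  # ['inception', 'joker']
```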

Matrix Factorization

Matrix factorization decomposes the sparse user-item matrix R (m users x n items) into two dense matrices: user embeddings U (m x k) and item embeddings V (n x k), such that R is approximately U * V^T. The latent dimension k is typically 50-300.

Training methods:

  • ALS (Alternating Least Squares): fix U, solve for V analytically, then fix V, solve for U. Parallelizes well – used by Spark MLlib. Good for implicit feedback.
  • SGD (Stochastic Gradient Descent): update both matrices simultaneously on each observed rating. Handles explicit ratings naturally.

Matrix factorization handles sparse matrices better than neighborhood methods because it learns dense representations that generalize across the full item space. The learned embeddings capture latent factors – for movies these might correspond to genre, tone, era, etc., even though they’re never explicitly labeled.
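
A minimal pure-Python sketch of the SGD variant follows, assuming a tiny hand-made rating set and illustrative hyperparameters (`lr`, `reg`, `epochs`). Real systems would use a vectorized library, and bias terms are omitted for brevity.

```python
import math
import random

def predict(U, V, u, i):
    """Predicted rating: dot product of user u's and item i's embeddings."""
    return sum(a * b for a, b in zip(U[u], V[i]))

def rmse(U, V, ratings):
    """Root-mean-square error over the observed ratings."""
    return math.sqrt(sum((r - predict(U, V, u, i)) ** 2
                         for u, i, r in ratings) / len(ratings))

def train_mf_sgd(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02,
                 epochs=300, seed=0):
    """Factorize observed (user, item, rating) triples into U (m x k), V (n x k)."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    data = list(ratings)  # copy so the caller's list is not shuffled
    for _ in range(epochs):
        rng.shuffle(data)
        for u, i, r in data:
            err = r - predict(U, V, u, i)
            for f in range(k):
                u_f, v_f = U[u][f], V[i][f]
                # Gradient step on squared error with L2 regularization.
                U[u][f] += lr * (err * v_f - reg * u_f)
                V[i][f] += lr * (err * u_f - reg * v_f)
    return U, V

# Hypothetical sparse ratings: 10 observed cells of a 4 x 4 matrix.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5), (2, 2, 5),
           (2, 3, 4), (3, 2, 4), (3, 3, 5), (0, 2, 1), (2, 0, 1)]

U0, V0 = train_mf_sgd(ratings, 4, 4, epochs=0)  # untrained baseline
U, V = train_mf_sgd(ratings, 4, 4)
print(f"RMSE before={rmse(U0, V0, ratings):.2f} after={rmse(U, V, ratings):.2f}")
```

Only observed cells contribute to the loss, which is why the method copes with sparsity: unobserved cells are filled in by the learned embeddings rather than treated as zeros.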

Content-Based Filtering

Content-based filtering uses item attributes to find similar items. Instead of “users who liked X also liked Y,” it reasons “X and Y share similar attributes, so if you liked X you’ll probably like Y.”

Item feature representation:

  • Categorical features: genre, tags, category – one-hot encoded or embedded
  • Text features: title, description, reviews – TF-IDF vectors or dense embeddings from a text encoder
  • Structured features: duration, release year, price – normalized scalars

User profile is constructed from the items they’ve interacted with – typically a weighted average of the item feature vectors, with more recent interactions weighted higher.
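
One way to implement the recency weighting is an exponential decay, sketched below. The 30-day half-life, the two-dimensional feature vectors, and the function name are assumptions for illustration only.

```python
def build_user_profile(interactions, item_vectors, now_day, half_life_days=30.0):
    """Recency-weighted average of the feature vectors of interacted items.

    interactions: [(item_id, day_of_interaction)]
    item_vectors: {item_id: [feature values]}
    """
    dim = len(next(iter(item_vectors.values())))
    profile = [0.0] * dim
    total_weight = 0.0
    for item_id, day in interactions:
        age_days = now_day - day
        weight = 0.5 ** (age_days / half_life_days)  # halves every half-life
        for d, value in enumerate(item_vectors[item_id]):
            profile[d] += weight * value
        total_weight += weight
    if total_weight == 0:
        return profile  # no interactions yet: all-zero profile
    return [p / total_weight for p in profile]

# Hypothetical two-feature items: [is_action, is_comedy].
item_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
interactions = [("a", 0), ("b", 30)]  # "a" is 30 days old, "b" is from today

profile = build_user_profile(interactions, item_vectors, now_day=30)
print(profile)
```

Here the older item gets weight 0.5 and the newer one weight 1.0, so the profile leans two-to-one toward the recent interaction.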

Strength: no cold start for new items (as long as you have their attributes). Weakness: over-specialization – it can only recommend items similar to what the user already knows, missing serendipitous discoveries.

Two-Tower Neural Model

The two-tower architecture has become the industry standard for large-scale recommendations (YouTube, Pinterest, TikTok, LinkedIn). Two separate neural networks – one for users, one for items – each output a dense embedding vector. The recommendation score is the dot product of the user embedding and the item embedding; because both towers in the code below L2-normalize their outputs, this score equals cosine similarity.


import torch
import torch.nn as nn

class UserTower(nn.Module):
    def __init__(self, user_feature_dim, embedding_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(user_feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embedding_dim)
        )

    def forward(self, user_features):
        return nn.functional.normalize(self.net(user_features), dim=-1)

class ItemTower(nn.Module):
    def __init__(self, item_feature_dim, embedding_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(item_feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embedding_dim)
        )

    def forward(self, item_features):
        return nn.functional.normalize(self.net(item_features), dim=-1)

# Training: pairwise (BPR-style) loss that pushes each positive pair's score
# above the score of its sampled negative; temperature scales the score gap.
def contrastive_loss(user_emb, pos_item_emb, neg_item_emb, temperature=0.1):
    pos_score = torch.sum(user_emb * pos_item_emb, dim=-1) / temperature
    neg_score = torch.sum(user_emb * neg_item_emb, dim=-1) / temperature
    # logsigmoid is numerically more stable than log(sigmoid(x))
    return -nn.functional.logsigmoid(pos_score - neg_score).mean()

Why two towers? The user and item towers are independent at inference time. You precompute all item embeddings offline and build an ANN (approximate nearest neighbor) index. At query time, you only need to run the user tower forward pass once, then do an ANN lookup – this is fast enough to serve in real time.
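
The serving-time lookup can be sketched without any ANN library. In the snippet below an exact brute-force scan stands in for the ANN index, and the embeddings are hand-written placeholders rather than real tower outputs; an index such as FAISS or ScaNN would replace `top_k_by_dot` once the catalog is too large to scan.

```python
import heapq
import math

def l2_normalize(vec):
    """Scale a vector to unit length, as the towers do before scoring."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_by_dot(user_emb, item_index, k):
    """Exact top-k by dot product; an ANN index approximates this faster."""
    return heapq.nlargest(k, item_index,
                          key=lambda item_id: dot(user_emb, item_index[item_id]))

# Offline: item embeddings are computed once and stored (placeholder values).
item_index = {
    "a": l2_normalize([1.0, 0.0]),
    "b": l2_normalize([1.0, 1.0]),
    "c": l2_normalize([0.0, 1.0]),
    "d": l2_normalize([-1.0, 0.0]),
}

# Online: one user-tower forward pass (placeholder vector), then one lookup.
user_emb = l2_normalize([2.0, 1.0])
nearest = top_k_by_dot(user_emb, item_index, k=2)
print(nearest)  # ['b', 'a']
```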

Candidate Generation vs Ranking

Production recommendation systems use a two-stage pipeline because you can’t run an expensive model over millions of items in 100ms:

Stage 1 – Candidate Generation: quickly retrieve 100-1000 candidates from millions of items. Methods: ANN search on embeddings (FAISS, ScaNN, Annoy), collaborative filtering lookup, popular items, rule-based retrieval. Speed matters more than precision here.

Stage 2 – Ranking: score the candidates with a more expensive model that uses richer features. This model can use features that are too costly to compute for every item – user-item interaction history, contextual features, real-time signals. Output: a relevance score per candidate. Sort and serve top-k.

The funnel: millions of items -> (candidate generation) -> 100-1000 candidates -> (ranking model) -> 10-50 final recommendations.
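
A toy version of the two-stage pipeline is sketched below. The catalog, both scoring functions, and the weights are made up for illustration; the point is only that the expensive scorer runs on the shortlisted candidates, not on the whole catalog.

```python
def recommend(user, catalog, cheap_score, expensive_score,
              n_candidates=3, n_final=2):
    """Stage 1 narrows the catalog cheaply; stage 2 ranks only the survivors."""
    candidates = sorted(catalog, key=lambda item: cheap_score(user, item),
                        reverse=True)[:n_candidates]
    ranked = sorted(candidates, key=lambda item: expensive_score(user, item),
                    reverse=True)
    return ranked[:n_final]

expensive_calls = []  # records which items reached the ranking stage

def cheap_score(user, item):
    # Hypothetical retrieval heuristic: genre match plus popularity.
    return (1.0 if item["genre"] == user["genre"] else 0.0) + item["popularity"]

def expensive_score(user, item):
    # Stand-in for a heavy ranking model with richer features.
    expensive_calls.append(item["id"])
    return 0.3 * item["popularity"] + 0.7 * item["quality"]

catalog = [
    {"id": "a", "genre": "scifi", "popularity": 0.9, "quality": 0.2},
    {"id": "b", "genre": "scifi", "popularity": 0.5, "quality": 0.9},
    {"id": "c", "genre": "drama", "popularity": 0.8, "quality": 0.95},
    {"id": "d", "genre": "scifi", "popularity": 0.1, "quality": 0.7},
    {"id": "e", "genre": "drama", "popularity": 0.3, "quality": 0.1},
]
user = {"genre": "scifi"}

final_ids = [item["id"] for item in recommend(user, catalog,
                                              cheap_score, expensive_score)]
print(final_ids, "ranked", len(expensive_calls), "of", len(catalog), "items")
```

Note that item "c" has the highest quality but never reaches the ranker, which illustrates the cost of the funnel: stage 2 can only be as good as the candidates stage 1 retrieves.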

Feature Engineering

The quality of your features often matters more than model architecture. Standard feature categories:

  • User features: demographics (age, location), long-term preferences (genre affinity, price sensitivity), engagement history (click-through rate, watch-through rate), recency of last interaction
  • Item features: popularity (global CTR, trending score), recency (days since publication), category, quality signals (average rating, completion rate), price
  • User-item cross features: has user interacted with this item’s author before? Is this item in a category the user frequently engages with?
  • Context features: time of day (morning vs evening content preferences differ), day of week, device type (mobile vs desktop affects content format preferences), session depth (early vs late in session)
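
As an illustration of how these four categories come together into one model input, the sketch below assembles a flat feature dictionary for a single (user, item, request) triple. Every field name and the evening and weekend cutoffs are assumptions, not a prescribed schema.

```python
from datetime import datetime

def build_features(user, item, request_time, device, session_depth):
    """Flatten user, item, cross, and context signals into one feature dict."""
    hour = request_time.hour
    return {
        # User feature: long-term affinity for this item's category.
        "user_category_affinity": user["category_affinity"].get(item["category"], 0.0),
        # Item features: popularity and recency.
        "item_global_ctr": item["clicks"] / max(item["impressions"], 1),
        "item_age_days": (request_time - item["published_at"]).days,
        # User-item cross feature: prior interaction with the author.
        "cross_seen_author": 1.0 if item["author"] in user["authors_seen"] else 0.0,
        # Context features.
        "ctx_is_evening": 1.0 if 18 <= hour < 24 else 0.0,
        "ctx_is_weekend": 1.0 if request_time.weekday() >= 5 else 0.0,
        "ctx_is_mobile": 1.0 if device == "mobile" else 0.0,
        "ctx_session_depth": float(session_depth),
    }

# Hypothetical records.
user = {"category_affinity": {"cooking": 0.8}, "authors_seen": {"chef_ana"}}
item = {"category": "cooking", "author": "chef_ana", "clicks": 30,
        "impressions": 1000, "published_at": datetime(2024, 6, 1)}

features = build_features(user, item, datetime(2024, 6, 8, 20, 30),
                          device="mobile", session_depth=3)
print(features)
```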

A/B Testing for Recommendations

Recommendation changes are always validated through A/B tests before full rollout. The setup:

  • Traffic split: randomly assign users to control or treatment groups (typically 50/50; for high-risk changes, 90/10 with only 10% of users in treatment)
  • Metrics: primary metrics include CTR (click-through rate), engagement rate (watch time, read time), conversion rate; guardrail metrics include session abandonment rate, time to first interaction
  • Statistical significance: pre-calculate the sample size needed to detect the minimum detectable effect (MDE) at a significance level of 0.05 with at least 80% power, and run until that sample size is reached
  • Holdout groups: maintain a permanent holdout (1-5% of users who never see recommendation changes) to measure cumulative long-term impact vs. baseline
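
The traffic split and the sample-size calculation can be sketched with the standard library alone. The hashing scheme, the experiment name, and the 5% baseline CTR are assumptions; the sample-size formula is the usual normal approximation for comparing two proportions with equal group sizes, so an uneven split such as 90/10 needs a different calculation.

```python
import hashlib
import math
from statistics import NormalDist

def assign_variant(user_id, experiment, treatment_fraction=0.5):
    """Deterministic assignment: the same user always lands in the same group."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"

def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Users needed per group to detect an absolute lift of mde_abs (two-sided)."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)

# Hypothetical test: 5% baseline CTR, detect a lift of 0.5 percentage points.
n_per_group = sample_size_per_group(baseline_rate=0.05, mde_abs=0.005)
print(n_per_group, assign_variant("user-42", "ranker-v2"))
```

Salting the hash with the experiment name keeps assignments independent across experiments, so a user in treatment for one test is not systematically in treatment for the next.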

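Common mistake: checking results repeatedly and stopping the test as soon as they look positive (the peeking problem, a form of p-hacking), which inflates the false-positive rate. Commit to the sample size and a minimum runtime before the test starts, and make the ship decision only once both are reached.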

Cold Start Problem

Cold start affects both new users and new items:

New users: you have no behavioral data. Strategies: use demographics to find similar existing users and borrow their preferences; show popular items across categories to gather initial signals; use an explicit onboarding survey to capture stated preferences; model the signup flow context (what brought them here?).

New items: no behavioral data means collaborative filtering can’t surface them. Strategies: use content-based filtering on item attributes until behavioral data accumulates (typically 50-100 interactions); manually boost new items from trusted creators; set a “freshness” multiplier that decays as the item ages and behavioral data arrives.
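
The new-item strategies can be sketched as two small scoring helpers. The linear ramp over 100 interactions, the 1.5x boost, and the 7-day half-life are illustrative assumptions, not recommended values.

```python
def blended_score(cf_score, content_score, n_interactions, ramp=100):
    """Shift from content-based to collaborative scores as data accumulates."""
    alpha = min(n_interactions / ramp, 1.0)  # 0 = no data, 1 = fully trusted CF
    return alpha * cf_score + (1 - alpha) * content_score

def freshness_multiplier(age_days, boost=1.5, half_life_days=7.0):
    """Boost for new items that decays toward 1.0 (no boost) as the item ages."""
    return 1.0 + (boost - 1.0) * 0.5 ** (age_days / half_life_days)

# A brand-new item relies on content similarity and gets the full boost.
new_item = blended_score(0.8, 0.4, n_interactions=0) * freshness_multiplier(0)

# An established item relies on collaborative signal with almost no boost.
old_item = blended_score(0.8, 0.4, n_interactions=500) * freshness_multiplier(90)

print(f"new={new_item:.3f} old={old_item:.3f}")
```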

Serving Architecture

The recommendation serving pipeline separates offline computation from online serving to meet latency requirements (sub-100ms):

Offline pipeline (runs daily or hourly):

  • Train or fine-tune the recommendation model on recent interaction data
  • Generate item embeddings for all items using the trained item tower
  • Build ANN index (FAISS or ScaNN) over item embeddings
  • Precompute item-item similarity for collaborative filtering
  • Write results to feature store and ANN index store

Online serving (per request, target < 100ms):

  • Retrieve user features from feature store (pre-materialized)
  • Run user tower forward pass to get user embedding (~5ms)
  • ANN search over item embedding index to get 500 candidates (~10ms)
  • Retrieve item features and cross-features for candidates (~20ms)
  • Run ranking model over candidates (~30ms)
  • Apply business rules (diversity, deduplication, content policy)
  • Return top-k results (~5ms serialization)
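
The business-rules step is usually a simple pass over the ranked list. A minimal sketch follows, assuming a per-category cap as the diversity rule and a blocklist as the content-policy rule; both rules and all field names are hypothetical.

```python
def apply_business_rules(ranked, blocked_ids, max_per_category=2, top_k=4):
    """Walk the ranked list in order, dropping duplicates, blocked items,
    and items beyond the per-category cap, until top_k items are kept."""
    seen_ids = set()
    per_category = {}
    result = []
    for item in ranked:
        if item["id"] in seen_ids or item["id"] in blocked_ids:
            continue  # deduplication and content policy
        category = item["category"]
        if per_category.get(category, 0) >= max_per_category:
            continue  # diversity: too many items from this category already
        seen_ids.add(item["id"])
        per_category[category] = per_category.get(category, 0) + 1
        result.append(item)
        if len(result) == top_k:
            break
    return result

# Hypothetical ranking-model output, best first.
ranked = [
    {"id": "a", "category": "news"},
    {"id": "b", "category": "news"},
    {"id": "a", "category": "news"},    # duplicate
    {"id": "c", "category": "news"},    # exceeds the category cap
    {"id": "d", "category": "sports"},
    {"id": "e", "category": "music"},   # blocked by policy
    {"id": "f", "category": "music"},
]

served = [item["id"] for item in apply_business_rules(ranked, blocked_ids={"e"})]
print(served)  # ['a', 'b', 'd', 'f']
```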
