Core Entities and Schema
The recommendation engine centers on four entities:
- User: user_id, preferences JSONB, created_at.
- Item: item_id, category, features JSONB, embedding VECTOR(256) – the 256-dim embedding captures semantic content.
- Interaction: user_id, item_id, interaction_type (VIEW/LIKE/PURCHASE/SKIP), weight, created_at – PURCHASE weight=3.0, LIKE=2.0, VIEW=1.0, SKIP=-0.5.
- UserEmbedding: user_id, embedding VECTOR(256), updated_at – precomputed user taste vector, refreshed nightly or incrementally.
```sql
CREATE TABLE items (
    item_id    BIGINT PRIMARY KEY,
    category   VARCHAR(64),
    features   JSONB,
    embedding  VECTOR(256),              -- pgvector
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE interactions (
    user_id          BIGINT,
    item_id          BIGINT,
    interaction_type VARCHAR(16),        -- VIEW/LIKE/PURCHASE/SKIP
    weight           FLOAT,
    created_at       TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (user_id, item_id, interaction_type)
);

CREATE TABLE user_embeddings (
    user_id    BIGINT PRIMARY KEY,
    embedding  VECTOR(256),
    updated_at TIMESTAMPTZ
);

-- ANN index for fast similarity search
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops);
```
Collaborative Filtering (Item-Based)
Item-based CF finds the items a user has interacted with, then recommends items similar to those, where similarity comes from interaction patterns rather than content. The co-occurrence matrix M[i][j] counts the users who interacted with both item i and item j; normalize by item popularity (Jaccard, or cosine on interaction vectors) so popular items don't dominate.
ALS (Alternating Least Squares) is the standard matrix factorization for implicit feedback. Decomposes the interaction matrix into user factors U (n_users x k) and item factors V (n_items x k). Alternately fixes U and solves for V, then fixes V and solves for U – each step is a least-squares problem with a closed-form solution. Spark MLlib’s ALS scales this to billions of interactions.
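The alternating step has a closed-form solution, which a minimal dense-matrix sketch makes concrete. This toy version (`solve_factor` and `als` are illustrative names, not a library API) works on a small dense matrix; production systems use Spark MLlib or implicit-feedback ALS with per-observation confidence weighting:

```python
import numpy as np

def solve_factor(R, fixed, reg=0.1):
    # Closed-form ridge least squares: each row x of the free factor solves
    # (F^T F + reg*I) x = F^T r, where F is the fixed factor matrix.
    k = fixed.shape[1]
    A = fixed.T @ fixed + reg * np.eye(k)
    return np.linalg.solve(A, fixed.T @ R.T).T

def als(R, k=8, iters=20, reg=0.1, seed=0):
    # R: (n_users, n_items) interaction matrix, R ~= U @ V.T
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(R.shape[0], k))
    V = rng.normal(scale=0.1, size=(R.shape[1], k))
    for _ in range(iters):
        U = solve_factor(R, V, reg)    # fix V, solve for U
        V = solve_factor(R.T, U, reg)  # fix U, solve for V
    return U, V
```

Each half-step is an independent ridge regression per user (or per item), which is why ALS parallelizes so well across a cluster.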
```python
from collections import defaultdict

def build_cooccurrence(interactions):
    # interactions: list of (user_id, item_id, weight)
    user_items = defaultdict(dict)
    for user_id, item_id, weight in interactions:
        user_items[user_id][item_id] = weight

    # cooccur[i][j] = number of users who interacted with both i and j
    cooccur = defaultdict(lambda: defaultdict(float))
    for user_id, items in user_items.items():
        item_list = list(items.keys())
        for i in range(len(item_list)):
            for j in range(i + 1, len(item_list)):
                a, b = item_list[i], item_list[j]
                cooccur[a][b] += 1
                cooccur[b][a] += 1
    return cooccur

def recommend_cf(user_id, user_items, cooccur, top_k=10):
    seen = set(user_items[user_id].keys())
    scores = defaultdict(float)
    for item_id in seen:
        for candidate, score in cooccur[item_id].items():
            if candidate not in seen:
                scores[candidate] += score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```
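The raw counts above over-reward popular items. The normalization mentioned earlier can be sketched as cosine-style scaling by item popularity (`normalize_cooccurrence` is an illustrative helper, not from a library; `item_counts` maps item_id to the number of users who interacted with it):

```python
import math
from collections import defaultdict

def normalize_cooccurrence(cooccur, item_counts):
    # sim(i, j) = cooccur(i, j) / sqrt(count(i) * count(j))
    # Cosine normalization over binary interaction vectors; swap the
    # denominator for count(i) + count(j) - cooccur(i, j) to get Jaccard.
    sims = defaultdict(dict)
    for a, row in cooccur.items():
        for b, c in row.items():
            denom = math.sqrt(item_counts[a] * item_counts[b])
            if denom > 0:
                sims[a][b] = c / denom
    return sims
```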
Content-Based Filtering with Item Embeddings
Represent each item as a feature vector: category (one-hot or learned embedding), tags (TF-IDF or learned), metadata. The user profile is the weighted average of the embeddings of items they interacted with (weight = interaction weight). Score candidates by cosine similarity between user profile and item embedding.
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def build_user_profile(user_id, interactions, item_embeddings):
    # interactions: list of (item_id, weight) for this user
    weighted_sum = np.zeros(256)
    total_weight = 0.0
    for item_id, weight in interactions:
        if item_id in item_embeddings:
            weighted_sum += weight * item_embeddings[item_id]
            total_weight += weight
    if total_weight == 0:
        return None  # cold start: no usable history
    return weighted_sum / total_weight

def content_based_recommend(user_profile, item_embeddings, seen_items, top_k=10):
    scores = {}
    for item_id, embedding in item_embeddings.items():
        if item_id not in seen_items:
            scores[item_id] = cosine_similarity(user_profile, embedding)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Incremental user profile update on a new interaction
def update_user_profile(current_profile, current_weight, new_embedding, new_weight):
    total = current_weight + new_weight
    return (current_profile * current_weight + new_embedding * new_weight) / total
```
Two-Stage Architecture: Candidate Generation and Ranking
Running a full ML ranking model over millions of items is too slow. The solution is a two-stage pipeline:
- Stage 1 – Candidate Generation (recall-oriented): fast ANN (Approximate Nearest Neighbor) search over item embeddings using the user embedding as query. Returns ~1000 candidates in <10ms. Uses HNSW or IVFFlat index (pgvector, Faiss, Weaviate). High recall is critical here – it’s OK to include some irrelevant items.
- Stage 2 – Ranking (precision-oriented): ML model (LightGBM, two-tower neural net) with rich cross-features (user x item interactions, context, freshness, diversity penalty). Runs on 1000 candidates, returns top 20. Can take 50-100ms.
Why two stages? If ranking took 10ms per item, scoring 1M items would take 10,000 seconds. ANN retrieval narrows to 1,000 candidates in ~10ms; ranking 1,000 items at 0.1ms each adds 100ms, for roughly 110ms end to end. This is the standard architecture used at Netflix, YouTube, LinkedIn, and Spotify.
```python
import numpy as np

class TwoStageRecommender:
    def __init__(self, ann_index, ranking_model, item_embeddings):
        self.ann_index = ann_index          # Faiss or pgvector
        self.ranking_model = ranking_model
        self.item_embeddings = item_embeddings

    def recommend(self, user_embedding, user_context, n_candidates=1000, top_k=20):
        # Stage 1: ANN retrieval (recall-oriented)
        candidate_ids = self.ann_index.search(user_embedding, k=n_candidates)
        # Stage 2: feature extraction + ranking (precision-oriented)
        features = []
        for item_id in candidate_ids:
            feat = self.extract_features(user_embedding, item_id, user_context)
            features.append(feat)
        scores = self.ranking_model.predict(features)
        ranked = sorted(zip(candidate_ids, scores), key=lambda x: x[1], reverse=True)
        return [item_id for item_id, _ in ranked[:top_k]]

    def extract_features(self, user_emb, item_id, context):
        item_emb = self.item_embeddings[item_id]
        return {
            'cosine_sim': float(np.dot(user_emb, item_emb)),
            'item_popularity': context['item_popularity'][item_id],
            'recency_hours': context['item_age_hours'][item_id],
            'category_match': context['category_affinity'][item_id],
        }
```
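The `ann_index.search` call above is abstract. A brute-force exact-search stand-in shows what HNSW/IVFFlat approximate (`BruteForceIndex` is illustrative, O(n) per query, so fine for tests but not for serving):

```python
import numpy as np

class BruteForceIndex:
    # Exact nearest-neighbor search by cosine similarity; a test-time
    # stand-in for Faiss/pgvector. Takes a dict of item_id -> vector.
    def __init__(self, item_embeddings):
        self.ids = list(item_embeddings)
        M = np.stack([item_embeddings[i] for i in self.ids])
        # Pre-normalize rows so a dot product equals cosine similarity
        self.M = M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-9)

    def search(self, query, k=10):
        q = query / (np.linalg.norm(query) + 1e-9)
        sims = self.M @ q
        top = np.argsort(-sims)[:k]
        return [self.ids[i] for i in top]
```

Swapping this for a real index changes only recall and latency, not the interface, which is what makes the two-stage design easy to test end to end.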
Real-Time vs Batch Processing
Batch (nightly): a Spark job recomputes all item embeddings from scratch (ALS matrix factorization over the full interaction history), writes the updated embeddings to the items table, and rebuilds the ANN index. This batch pass delivers the bulk of model quality.
Real-time: On each user interaction, update user embedding incrementally (weighted moving average – no need to reprocess all history). Store in Redis for low-latency serving. Update interaction counts in streaming (Kafka + Flink). The serving layer reads precomputed candidate lists from Redis, applies real-time re-ranking with the latest user embedding.
Serving layer: user hits API -> fetch user embedding from Redis (1ms) -> ANN search on item index (5ms) -> ranking model inference (20ms) -> return top 20. Total p99 < 100ms. Cache results per user for 5 minutes to handle high QPS.
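The 5-minute per-user cache can be sketched with an in-process dict standing in for Redis (`CachedServer` and the TTL handling are illustrative; production would use Redis with SETEX and the real two-stage pipeline as `recommend_fn`):

```python
import time

class CachedServer:
    # Serving-layer sketch: per-user recommendation cache with a TTL.
    # recommend_fn runs the full retrieval + ranking pipeline on a miss.
    def __init__(self, recommend_fn, ttl_seconds=300):
        self.recommend_fn = recommend_fn
        self.ttl = ttl_seconds
        self.cache = {}  # user_id -> (expires_at, recommendations)

    def get_recommendations(self, user_id):
        now = time.time()
        hit = self.cache.get(user_id)
        if hit and hit[0] > now:
            return hit[1]                  # cache hit: skip the pipeline
        recs = self.recommend_fn(user_id)  # miss: run the expensive path
        self.cache[user_id] = (now + self.ttl, recs)
        return recs
```

The TTL is the staleness/QPS trade-off knob: 5 minutes means a burst of requests from one user costs a single pipeline run, at the price of recommendations lagging the user's latest interactions by up to that window.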
Cold Start Handling
Cold start is a core recommendation challenge:
- New user: no interaction history, no embedding. Show popularity-based recommendations segmented by signup context (category selected at onboarding, location, device type). After 5 interactions, switch to hybrid. After 20, full personalization.
- New item: no co-occurrence signal, excluded from CF. Use content-based only: compute embedding from item metadata immediately at ingestion. Eligible for content-based recommendations and ANN retrieval from day 0. Build CF signal after ~50 interactions (typically 1-3 days for popular items).
- Explore/exploit tradeoff: epsilon-greedy or Thompson sampling to occasionally serve non-personalized items even to warm users, building signal on new content and preventing filter bubbles.
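The epsilon-greedy mixing above can be sketched as a slate builder (`epsilon_greedy_slate` is an illustrative name; Thompson sampling would replace the coin flip with a posterior draw per item):

```python
import random

def epsilon_greedy_slate(personalized, exploratory, epsilon=0.1, top_k=20, rng=None):
    # Build a slate where each slot is exploratory (new or non-personalized
    # content) with probability epsilon, otherwise the next personalized item.
    rng = rng or random.Random()
    p = list(personalized)
    e = [x for x in exploratory if x not in set(p)]  # avoid duplicates
    slate = []
    while len(slate) < top_k and (p or e):
        if e and (not p or rng.random() < epsilon):
            slate.append(e.pop(0))
        else:
            slate.append(p.pop(0))
    return slate
```

With epsilon around 0.05-0.1, most slots stay personalized while new items still accumulate the interactions they need to enter the CF signal.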
Interview checklist: draw the two-stage funnel (candidates -> ranking). Name ALS for CF, cosine on embeddings for content-based. Quantify the latency math for why two stages. State cold start solutions for both new users and new items. Mention the batch/real-time split and where Redis fits as the serving cache.