AI/ML Interview: Recommendation Systems — Collaborative Filtering, Content-Based, Embeddings, Two-Tower, Cold Start

Recommendation systems drive engagement at Netflix, Spotify, Amazon, YouTube, and TikTok — generating billions of dollars in revenue. Understanding how recommendations work is essential for ML interviews at any company with a content or product catalog. This guide covers the algorithms, architectures, and practical challenges of building recommendation systems at scale.

Collaborative Filtering

Collaborative filtering recommends items based on user behavior patterns: users who agreed in the past (liked the same items) are likely to agree in the future. Two approaches: (1) User-based CF: find users similar to the target user based on rating/interaction overlap, then recommend items those similar users liked but the target user has not seen. Problem: the user-user similarity matrix is enormous (O(N^2) for N users) and changes constantly as users interact. (2) Item-based CF: find items similar to those the target user has liked (similarity computed from which users liked each item), then recommend them. This is more stable than user-based CF because item similarities change less frequently; Amazon's "customers who bought this also bought" is item-based CF. Matrix factorization (MF): decompose the user-item interaction matrix R (N users x M items) into two low-rank matrices, R ≈ U V^T, where U (N x K) holds user embeddings and V (M x K) holds item embeddings; K is the embedding dimension (typically 50-200). The predicted rating for user i and item j is the dot product of their embeddings: r_ij = u_i · v_j. Training minimizes squared error on observed interactions via SGD or ALS. Missing entries are not zero, they are unknown, so train only on observed entries. MF captures latent factors: a movie embedding might implicitly encode genre, era, and mood without explicit labels.
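As a concrete illustration, here is a minimal matrix-factorization sketch in NumPy: a toy 4x5 rating matrix, K=3, and SGD on observed entries only. All ratings, dimensions, and hyperparameters are made-up example values, not a production recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy interaction matrix: 4 users x 5 items; np.nan marks unobserved entries.
R = np.array([
    [5.0, 3.0, np.nan, 1.0, np.nan],
    [4.0, np.nan, np.nan, 1.0, 1.0],
    [1.0, 1.0, np.nan, 5.0, 4.0],
    [np.nan, 1.0, 5.0, 4.0, np.nan],
])
n_users, n_items = R.shape
K = 3  # embedding dimension (toy scale; typically 50-200 in practice)

U = 0.1 * rng.standard_normal((n_users, K))  # user embeddings
V = 0.1 * rng.standard_normal((n_items, K))  # item embeddings

# Train ONLY on observed entries: missing ratings are unknown, not zero.
observed = [(i, j) for i in range(n_users) for j in range(n_items)
            if not np.isnan(R[i, j])]

lr, reg = 0.05, 0.01
for _ in range(200):
    for i, j in observed:
        err = R[i, j] - U[i] @ V[j]             # prediction error on one rating
        u_i = U[i].copy()
        U[i] += lr * (err * V[j] - reg * U[i])  # SGD step with L2 regularization
        V[j] += lr * (err * u_i - reg * V[j])

pred = U @ V.T  # predicted rating r_ij = dot(u_i, v_j) for every user-item pair
```

After training, `pred` fills in the unobserved cells; in a real system those filled-in scores drive the recommendations.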

Content-Based Filtering

Content-based filtering recommends items whose content resembles what the user has previously liked. It uses item features (genre, cast, description for movies; tempo, key, artist for music) rather than other users' behavior. Approach: build a user profile from the features of items the user has interacted with; a user who watches many sci-fi movies gets a profile weighted toward sci-fi features. Score new items by the similarity between their features and the user profile. Advantages: (1) No cold start for new items: as long as an item has features, it can be recommended (unlike CF, which needs interaction data). (2) Explainable: "recommended because you watched other sci-fi movies." (3) No popularity bias: niche items with features matching the user profile can still be recommended. Disadvantages: (1) Limited discovery: only recommends items similar to what the user already likes, so no serendipity. (2) Feature engineering: requires meaningful item features, which is hard in some domains (e.g., music audio features). Modern approach: instead of hand-crafted features, use learned embeddings from a neural network. A pre-trained image model generates visual embeddings for products; a pre-trained language model generates text embeddings for descriptions. These embeddings capture richer semantic information than manual features.
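The profile-building step can be sketched in a few lines. The titles and one-hot genre features below are invented for the example; the technique (mean of watched-item features, cosine similarity to score candidates) is the point.

```python
import numpy as np

# Hypothetical item feature vectors (one-hot genres: sci-fi, drama, comedy).
items = {
    "dune":       np.array([1.0, 0.0, 0.0]),
    "arrival":    np.array([1.0, 1.0, 0.0]),
    "the_office": np.array([0.0, 0.0, 1.0]),
    "moonlight":  np.array([0.0, 1.0, 0.0]),
}

watched = ["dune", "arrival"]  # the user's interaction history

# User profile = mean of the feature vectors of watched items.
profile = np.mean([items[t] for t in watched], axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Score unseen items by similarity between their features and the profile.
candidates = {t: cosine(profile, v) for t, v in items.items() if t not in watched}
ranked = sorted(candidates, key=candidates.get, reverse=True)
```

Here "moonlight" outranks "the_office" because its drama feature overlaps the profile (via "arrival"), while the comedy has zero overlap: recommendations stay close to the user's known tastes, which is exactly the limited-discovery trade-off described above.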

Two-Tower Model (Deep Retrieval)

The two-tower model is the standard architecture for large-scale recommendation at Google, YouTube, and Meta. Architecture: two separate neural networks (towers): (1) User tower — takes user features (demographics, interaction history, context) as input. Outputs a user embedding vector (128-256 dimensions). (2) Item tower — takes item features (title, category, metadata, content embeddings) as input. Outputs an item embedding vector (same dimensions). Scoring: the relevance of an item for a user is the dot product (or cosine similarity) of their embeddings. Training: given positive pairs (user, item they interacted with) and negative pairs (user, random item they did not interact with), train to maximize the similarity for positive pairs and minimize for negative pairs. Contrastive loss or cross-entropy loss on the dot product scores. Serving: (1) Pre-compute item embeddings for all items (offline, stored in a vector database). (2) At request time: compute the user embedding (online, fast — single forward pass through the user tower). (3) Approximate nearest neighbor (ANN) search in the vector database: find the K items with embeddings closest to the user embedding. (4) Return the top-K as candidates for the ranking stage. Why two towers: the user and item towers are independent. Item embeddings are pre-computed. Only the user tower runs at request time (fast). Adding new items requires only computing their embeddings (no retraining).
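The serving path above can be sketched with NumPy. The tower weights here are random stand-ins for trained networks (real towers are multi-layer and trained with a contrastive loss), and the exact dot-product search stands in for an ANN library such as FAISS or ScaNN; dimensions and catalog size are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8  # toy scale; 128-256 dimensions in production

# Stand-in "trained" tower weights (random here; learned in practice).
W_user = rng.standard_normal((5, EMB_DIM))  # maps 5 user features -> embedding
W_item = rng.standard_normal((6, EMB_DIM))  # maps 6 item features -> embedding

def user_tower(x):  # one fast forward pass at request time
    return x @ W_user

def item_tower(x):  # run offline over the full catalog
    return x @ W_item

# Offline: pre-compute embeddings for all items, store in the vector index.
item_features = rng.standard_normal((1000, 6))
item_index = item_tower(item_features)  # shape (1000, EMB_DIM)

# Online: embed the user, then nearest-neighbor search by dot product.
# (Exact search on 1000 items; production uses ANN over millions.)
u = user_tower(rng.standard_normal(5))
scores = item_index @ u
top_k = np.argsort(-scores)[:10]  # candidate item IDs for the ranking stage
```

Note that adding a new item only requires one `item_tower` call and an index update; neither tower is retrained, which is the independence property the section describes.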

Ranking Stage

Recommendation typically has two stages: (1) Retrieval (candidate generation) — the two-tower model retrieves 100-1000 candidates from millions of items. Fast but coarse (dot product similarity). (2) Ranking — a more complex model scores each candidate with richer features. The ranking model is a neural network (or gradient boosted tree) with features: user features (demographics, activity, context), item features (category, age, quality signals), user-item interaction features (has the user seen this item before? time since last interaction with similar items), and contextual features (time of day, device, location). The ranking model predicts: click probability, engagement probability (watch time, read time), conversion probability, and negative signals (skip, dislike, report). The final score is a weighted combination: score = w1*P(click) + w2*P(engagement) + w3*P(purchase) - w4*P(skip). These weights encode business objectives: maximize engagement but do not show content users will skip. The ranking model is more expensive per item than retrieval (it uses cross-features between user and item) but processes only 100-1000 candidates, not millions. This two-stage pipeline balances coverage (retrieval finds all relevant items) with precision (ranking selects the best).
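The weighted combination is simple to write down. The weights and per-head probabilities below are invented example numbers; in practice the weights are tuned against business metrics.

```python
# Hypothetical weights encoding business objectives (assumed values).
weights = {"click": 1.0, "engagement": 2.0, "purchase": 5.0, "skip": 3.0}

def final_score(p):
    """Combine the ranking model's per-head probabilities into one score."""
    return (weights["click"] * p["click"]
            + weights["engagement"] * p["engagement"]
            + weights["purchase"] * p["purchase"]
            - weights["skip"] * p["skip"])  # penalize predicted skips

# Example head outputs for two candidates (made-up numbers).
candidates = {
    "item_a": {"click": 0.30, "engagement": 0.20, "purchase": 0.02, "skip": 0.10},
    "item_b": {"click": 0.25, "engagement": 0.40, "purchase": 0.01, "skip": 0.05},
}
ranked = sorted(candidates, key=lambda c: final_score(candidates[c]), reverse=True)
```

With these weights, `item_b` wins despite its lower click probability: its higher engagement and lower skip risk outweigh the click difference, which is how the weights trade one objective against another.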

Cold Start and Practical Challenges

Cold start: new users have no interaction history; new items have no engagement data. Solutions: (1) New user: ask for preferences during onboarding (select categories, rate sample items), use demographic features to map to similar users, and start with popular items, diversifying as interactions accumulate. (2) New item: use content features (content-based filtering), boost new items in the retrieval stage (exploration) to collect initial engagement data, and use the item tower of the two-tower model (item embeddings computed from features, not interactions). Exploration vs exploitation: showing only items the user is predicted to like (exploitation) creates filter bubbles. Inject diverse items (exploration) to discover new user interests, collect data on new items, and prevent monotony. Epsilon-greedy: with probability 1 - epsilon, show the predicted best item; with probability epsilon, show a random item. Feedback loops: recommendations influence user behavior. If the system recommends action movies and the user clicks them (because they were recommended, not because they are preferred), the system recommends more action movies. Mitigation: track causal engagement (would the user have watched this without the recommendation?) and diversify recommendations. Bias: popularity bias (popular items get recommended more and become more popular), position bias (items shown first get more clicks), and selection bias (the model trains only on items it recommended). Address these with inverse propensity weighting, position-aware training, and explicit exploration.
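Epsilon-greedy exploration fits in a few lines. The item names and epsilon value here are placeholders; real systems usually explore per slot in a slate rather than per request, but the mechanism is the same.

```python
import random

def recommend(predicted_best, catalog, epsilon=0.1, rng=random):
    """Epsilon-greedy: exploit the model's top pick most of the time,
    show a uniformly random catalog item epsilon of the time."""
    if rng.random() < epsilon:
        return rng.choice(catalog)  # explore: collect data on other items
    return predicted_best           # exploit: model's predicted best

# Quick simulation: the explore rate should land near epsilon.
rng = random.Random(0)
picks = [recommend("top_pick", ["a", "b", "c"], epsilon=0.1, rng=rng)
         for _ in range(10_000)]
explore_rate = 1 - picks.count("top_pick") / len(picks)  # roughly 0.1
```

Tuning epsilon trades short-term engagement (lower epsilon) against data collection for new items and interest discovery (higher epsilon).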

{"@context":"https://schema.org","@type":"FAQPage","mainEntity":[{"@type":"Question","name":"How does the two-tower model work for recommendations at scale?","acceptedAnswer":{"@type":"Answer","text":"Two separate neural networks: the user tower takes user features (demographics, history, context) and outputs a user embedding. The item tower takes item features (title, category, metadata) and outputs an item embedding. Relevance = dot product of embeddings. Training: maximize similarity for positive pairs (user interacted with item), minimize for negative pairs (random items). Serving: pre-compute ALL item embeddings offline. At request time: compute user embedding (one fast forward pass), ANN search in the vector database for K nearest items. Why two towers: independence. Item embeddings are pre-computed. Only the user tower runs online. Adding new items requires only computing their embedding, not retraining. This retrieves 100-1000 candidates from millions. A separate ranking model with richer cross-features then scores these candidates for the final recommendation list."}},{"@type":"Question","name":"How do you solve the cold start problem in recommendation systems?","acceptedAnswer":{"@type":"Answer","text":"New users (no interaction history): ask preferences during onboarding, use demographic features to map to similar users, start with popular items, and diversify as interactions accumulate. New items (no engagement data): use content features via the item tower (embeddings from metadata, not interactions), boost new items in retrieval for exploration (collect initial data), and use content-based filtering until sufficient engagement exists. Exploration vs exploitation: showing only predicted-best items (exploitation) creates filter bubbles. Inject diverse items (exploration) to discover new interests, collect data on new items, and prevent monotony. Epsilon-greedy: show predicted best (1-epsilon) of the time, random item epsilon of the time. Feedback loops: recommendations influence behavior. If the system recommends action movies and users click them (because recommended, not preferred), it recommends more. Mitigate with causal tracking and diversification."}}]}