Search ranking is one of the most technically demanding ML system design problems. It combines information retrieval, multi-stage ranking, real-time feature computation, and online experimentation. This question appears at Google, Bing, Amazon, LinkedIn, and any company with a search product.
Step 1: Clarify the Problem
- What are we ranking? Web pages (Google), products (Amazon), job listings (LinkedIn), social posts (Twitter/X), or internal documents?
- What signals do we have? Query-document relevance, user behavior (clicks, dwell time, purchases), freshness, authority (PageRank), personalization?
- What are the constraints? Latency (100ms for web search, 500ms for e-commerce), result diversity, fairness requirements.
- How do we measure success? NDCG@10, MRR, click-through rate, conversion rate, revenue per query.
Assume: e-commerce product search (Amazon-style), 100K queries/second, personalized ranking, <200ms P99 latency.
Multi-Stage Retrieval and Ranking Architecture
The fundamental insight is that you cannot run an expensive ML model over your entire catalog for every query. You need staged filtering:
```
Query: "wireless headphones"
        ↓
Stage 1: Retrieval (~100ms)
  - BM25 full-text search over Elasticsearch → 1,000 candidates
  - ANN vector search on query embedding → 500 candidates
  - Merge and deduplicate → ~1,500 candidates
        ↓
Stage 2: Pre-ranking (~20ms)
  - Lightweight model (logistic regression or small DNN)
  - Filter to top 200 candidates
        ↓
Stage 3: Ranking (~50ms)
  - Full-feature LambdaMART or LambdaRank model
  - Personalized features, real-time signals
  - Score top 200 → return top 50
        ↓
Stage 4: Re-ranking (~10ms)
  - Business rules: promoted listings, diversity constraints, freshness boost
  - Return top 10-20 results
```
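The funnel above can be sketched as a generic staged orchestrator. This is a minimal illustration, not a production design: `retrieve` and the per-stage scoring functions are hypothetical placeholders for the BM25/ANN retrieval and pre-ranking/ranking models described above.

```python
def ranking_funnel(query, retrieve, scoring_stages):
    """
    Staged ranking funnel. `retrieve` returns the initial candidate set
    (Stage 1); each (score_fn, top_k) pair in scoring_stages scores the
    surviving candidates and keeps the best top_k, so the expensive late
    stages only ever see a small number of items.
    """
    candidates = retrieve(query)
    for score_fn, top_k in scoring_stages:
        scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
        candidates = scored[:top_k]
    return candidates
```

With dummy stages, `ranking_funnel("wireless headphones", retrieve, [(cheap_model, 200), (full_model, 50), (rerank, 20)])` reproduces the 1,500 → 200 → 50 → 20 narrowing from the diagram.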
Feature Engineering
Query features:
- Query tokens, n-grams, intent classification (navigational vs informational vs transactional)
- Query embedding (from fine-tuned sentence encoder)
- Query historical CTR and conversion rate
- Session context: previous queries, category browsing history
Document/item features:
- Title, description, category, brand, price
- Item embedding (from product encoder, same space as query encoder)
- Historical CTR, conversion rate, rating, review count
- Inventory status, freshness, seller reputation
Query-document interaction features:
- BM25 score (term-frequency-based relevance)
- Cosine similarity between query and item embeddings
- Exact match fields: title contains all query terms?
- User’s purchase history for this item or category
User personalization features:
- User’s brand affinity scores (pre-computed, updated daily)
- Price sensitivity (mean purchase price, percentile)
- Category affinity from last 30-day purchase history
- Cold start: use demographic cluster features for new users
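As a concrete example of the interaction features listed above, here is a minimal sketch computing two of them: embedding cosine similarity and a "title contains all query terms" flag. Function and field names are illustrative.

```python
import numpy as np

def interaction_features(query_emb, item_emb, query_tokens, item_title):
    """Compute two query-item interaction features (illustrative sketch)."""
    # Cosine similarity between L2-normalized embeddings
    q = query_emb / np.linalg.norm(query_emb)
    d = item_emb / np.linalg.norm(item_emb)
    cosine = float(q @ d)
    # Exact-match flag: does the title contain every query term?
    title_tokens = set(item_title.lower().split())
    all_terms_in_title = float(set(query_tokens) <= title_tokens)
    return {"cosine_sim": cosine, "title_all_terms": all_terms_in_title}
```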
Learning to Rank Models
Pointwise approach: Predict relevance score per item independently. Use regression or classification. Simple but ignores relative ordering.
Pairwise approach (RankNet, LambdaRank): Learn from pairs of items — “item A should rank above item B.” Models the ordering relationship directly.
Listwise approach (LambdaMART, ListNet): Optimize a list-level metric like NDCG directly.
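The pairwise idea can be made concrete with the RankNet loss: model the probability that item i beats item j as a sigmoid of the score difference, and minimize the negative log-likelihood over pairs where i is labeled more relevant. A minimal PyTorch sketch:

```python
import torch

def ranknet_loss(score_i, score_j):
    """
    RankNet pairwise loss for pairs where item i should outrank item j:
    P(i beats j) = sigmoid(s_i - s_j), loss = -log P.
    Uses softplus(-(s_i - s_j)), which equals -log sigmoid(s_i - s_j)
    but is numerically stable.
    """
    return torch.nn.functional.softplus(-(score_i - score_j)).mean()
```

LambdaRank keeps these pairwise gradients but rescales each pair by the NDCG change from swapping the two items, which is what the LambdaMART training below optimizes.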
```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import GroupKFold

def train_lambdamart(X, y, groups):
    """
    Train a LambdaMART ranking model using LightGBM.
    X: feature matrix [n_samples, n_features]
    y: relevance labels (0-4 scale: not relevant → perfect)
    groups: query ID per sample; rows must be sorted by query ID
            so that items of the same query are contiguous
    """
    params = {
        'objective': 'lambdarank',
        'metric': 'ndcg',
        'ndcg_eval_at': [1, 3, 5, 10],
        'learning_rate': 0.05,
        'num_leaves': 63,
        'min_child_samples': 50,
        'feature_fraction': 0.8,
        'verbose': -1,
        'lambdarank_truncation_level': 30,  # Optimize NDCG within the top 30
    }
    # Group k-fold keeps all items of a query in the same fold
    gkf = GroupKFold(n_splits=5)
    ndcg_scores = []
    model = None
    for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        groups_train = np.asarray(groups)[train_idx]
        groups_val = np.asarray(groups)[val_idx]
        # LightGBM expects group *sizes* in row order, not query IDs.
        # np.unique returns counts sorted by ID, which matches row order
        # only because the data is sorted by query ID (see docstring).
        _, train_group_sizes = np.unique(groups_train, return_counts=True)
        _, val_group_sizes = np.unique(groups_val, return_counts=True)
        dtrain = lgb.Dataset(X_train, label=y_train, group=train_group_sizes)
        dval = lgb.Dataset(X_val, label=y_val, group=val_group_sizes)
        model = lgb.train(
            params, dtrain,
            num_boost_round=500,
            valid_sets=[dval],
            callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
        )
        ndcg_scores.append(model.best_score['valid_0']['ndcg@10'])
        print(f"Fold {fold}: NDCG@10 = {ndcg_scores[-1]:.4f}")
    print(f"Mean NDCG@10: {np.mean(ndcg_scores):.4f} ± {np.std(ndcg_scores):.4f}")
    return model  # Last fold's model; retrain on the full data for production
```
Training Data: Implicit Feedback
You rarely have explicit relevance judgments at scale. You build training data from user behavior:
| Signal | Strength | Noise Source |
|---|---|---|
| Purchase | Very strong (bought = relevant) | Gifts, accidental purchases |
| Add to cart | Strong | Wishlist behavior without purchase intent |
| Click + long dwell | Moderate | Position bias (top items get clicks regardless) |
| Click only | Weak | High position bias, curiosity clicks |
| Impression without click | Weak negative | Item may not have been seen (below the fold) |
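One common way to collapse these signals into the 0-4 graded labels used in training is a simple priority cascade. The thresholds below (e.g. 30 seconds of dwell) are illustrative, not canonical:

```python
def implicit_label(purchased, added_to_cart, clicked, dwell_seconds):
    """Map implicit feedback signals to a graded relevance label (0-4)."""
    if purchased:
        return 4  # strongest signal
    if added_to_cart:
        return 3
    if clicked and dwell_seconds >= 30:  # long dwell; threshold is illustrative
        return 2
    if clicked:
        return 1
    return 0  # impression without click: weak negative, not a true negative
```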
Position bias correction: Users click top results more because they’re on top, not necessarily because they’re better. Use Inverse Propensity Scoring (IPS) to debias:
```python
import numpy as np

def compute_ips_weights(positions, exponent=0.5, clip_range=(0.1, 10.0)):
    """
    Inverse Propensity Scoring weights for position-bias correction.
    Propensity = estimated probability that the result at position k
    was examined; here a simple position-based 1/k^exponent curve.
    In practice the propensity curve is estimated from randomization
    experiments or an EM-style click model.
    """
    positions = np.asarray(positions, dtype=float)
    # Examination probability decays with position
    exam_prob = 1.0 / (positions ** exponent)
    # IPS weight = 1 / propensity; clip so that rare deep positions
    # cannot dominate the loss with extreme weights
    return np.clip(1.0 / exam_prob, *clip_range)
```
Two-Tower Model for Retrieval
The retrieval stage uses a two-tower neural network to enable ANN search:
```python
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    def __init__(self, query_dim=768, item_dim=256, embedding_dim=128):
        super().__init__()
        # Query tower: encode user query
        self.query_tower = nn.Sequential(
            nn.Linear(query_dim, 256),
            nn.ReLU(),
            nn.LayerNorm(256),
            nn.Linear(256, embedding_dim),
        )
        # Item tower: encode product features
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 256),
            nn.ReLU(),
            nn.LayerNorm(256),
            nn.Linear(256, embedding_dim),
        )

    def forward(self, query_features, item_features=None):
        query_emb = self.query_tower(query_features)
        query_emb = nn.functional.normalize(query_emb, dim=-1)
        if item_features is not None:
            item_emb = self.item_tower(item_features)
            item_emb = nn.functional.normalize(item_emb, dim=-1)
            # Dot product similarity
            return (query_emb * item_emb).sum(dim=-1)
        return query_emb  # For pre-computing item embeddings

    def encode_items(self, item_features):
        """Pre-compute item embeddings for the ANN index."""
        item_emb = self.item_tower(item_features)
        return nn.functional.normalize(item_emb, dim=-1)
```
Item embeddings are precomputed nightly and indexed in a vector database (Faiss, Qdrant) for fast ANN retrieval at query time.
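At serving time, retrieval is a top-k search of the query embedding against the precomputed item matrix. A brute-force NumPy stand-in shows the data flow; a real system would swap in an ANN index such as Faiss or Qdrant instead of exact search:

```python
import numpy as np

def retrieve_topk(query_emb, item_matrix, k=500):
    """
    Exact inner-product search over L2-normalized item embeddings.
    Stands in for an ANN index (Faiss/Qdrant) to illustrate the flow.
    Returns item indices sorted by similarity, best first.
    """
    scores = item_matrix @ query_emb                      # [n_items]
    # argpartition finds the k best in O(n), then we sort just those k
    topk = np.argpartition(-scores, min(k, len(scores) - 1))[:k]
    return topk[np.argsort(-scores[topk])]
```

Because both towers L2-normalize their outputs, inner product equals cosine similarity, which is why a dot-product index suffices.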
Evaluation
```python
import numpy as np

def ndcg_at_k(relevance_scores, k=10):
    """
    Normalized Discounted Cumulative Gain at k.
    relevance_scores: relevance labels in ranked order (0-4 scale)
    """
    def dcg(scores, k):
        scores = np.array(scores[:k], dtype=float)
        gains = (2 ** scores - 1) / np.log2(np.arange(2, len(scores) + 2))
        return gains.sum()

    actual_dcg = dcg(relevance_scores, k)
    ideal_dcg = dcg(sorted(relevance_scores, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return actual_dcg / ideal_dcg

def mean_reciprocal_rank(relevant_positions):
    """
    MRR: 1/rank of the first relevant result. Good for navigational queries.
    relevant_positions: positions (1-indexed) where relevant results appear
    """
    if not relevant_positions:
        return 0.0
    return 1.0 / min(relevant_positions)
```
Online Experimentation
- A/B testing: Route 10% of queries to new ranking model; measure CTR, conversion, revenue per query
- Interleaved evaluation: Merge results from two models in a single SERP; track which model’s results get clicked — faster signal, smaller traffic needed
- Guardrail metrics: Query abandonment rate, time-to-click must not regress even if primary metric improves
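The interleaving idea can be sketched concretely with team-draft interleaving, the most common scheme: the two models take turns contributing their next unused result, and clicks are credited to whichever "team" placed the clicked item. This sketch simplifies by alternating strictly instead of flipping a coin each round:

```python
def team_draft_interleave(ranking_a, ranking_b):
    """
    Merge two rankings into one result list, tracking which model
    contributed each slot. Clicks on team-A items credit model A.
    Simplified: strict alternation instead of a random coin flip.
    """
    merged, teams, used = [], [], set()
    i = j = 0
    turn_a = True
    while i < len(ranking_a) or j < len(ranking_b):
        if turn_a:
            while i < len(ranking_a) and ranking_a[i] in used:
                i += 1  # skip items the other team already placed
            if i < len(ranking_a):
                merged.append(ranking_a[i])
                teams.append('A')
                used.add(ranking_a[i])
                i += 1
        else:
            while j < len(ranking_b) and ranking_b[j] in used:
                j += 1
            if j < len(ranking_b):
                merged.append(ranking_b[j])
                teams.append('B')
                used.add(ranking_b[j])
                j += 1
        turn_a = not turn_a
    return merged, teams
```

For example, interleaving `[1, 2, 3]` and `[2, 4, 3]` yields `[1, 2, 3, 4]` with teams `['A', 'B', 'A', 'B']`; the model whose items attract more clicks across many queries wins the comparison.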
Depth Levels
Junior: Describe multi-stage pipeline, explain BM25 vs embedding retrieval, choose a ranking model.
Senior: Implement LambdaMART training, handle position bias, design online evaluation.
Staff: Personalization at scale (user embedding freshness, cold start), fairness constraints (seller diversity), multi-objective ranking (relevance vs revenue vs diversity), latency vs quality trade-off across pipeline stages.
Related ML Topics
- Embeddings and Vector Databases — two-tower retrieval encodes queries and items into the same embedding space; HNSW ANN search retrieves candidates in Stage 1
- Design a Recommendation Engine — search ranking and recommendation share the same two-stage retrieval+ranking architecture; key difference is explicit query vs. inferred intent
- Classification Metrics — NDCG, MRR, and MAP are ranking-specific metrics that extend precision/recall for ordered result sets
- Handling Imbalanced Datasets — implicit feedback training data is highly skewed; non-clicked impressions are weak negative signals, not true negatives
- How to Detect Model Drift in Production — ranking model quality drifts with catalog changes, user behavior shifts, and seasonal patterns; monitor NDCG on held-out query sets