ML System Design: Build a Search Ranking System

Search ranking is one of the most technically demanding ML system design problems: it combines information retrieval, multi-stage ranking, real-time feature computation, and online experimentation. The question comes up at Google, Microsoft (Bing), Amazon, LinkedIn, and any company with a search product.

Step 1: Clarify the Problem

  • What are we ranking? Web pages (Google), products (Amazon), job listings (LinkedIn), social posts (Twitter/X), or internal documents?
  • What signals do we have? Query-document relevance, user behavior (clicks, dwell time, purchases), freshness, authority (PageRank), personalization?
  • What are the constraints? Latency (100ms for web search, 500ms for e-commerce), result diversity, fairness requirements.
  • How do we measure success? NDCG@10, MRR, click-through rate, conversion rate, revenue per query.

Assume: e-commerce product search (Amazon-style), 100K queries/second, personalized ranking, <200ms P99 latency.

Multi-Stage Retrieval and Ranking Architecture

The fundamental insight is that you cannot run an expensive ML model over your entire catalog for every query. You need staged filtering:

Query: "wireless headphones"
       ↓
Stage 1: Retrieval (~100ms)
  - BM25 full-text search over Elasticsearch → 1,000 candidates
  - ANN vector search on query embedding → 500 candidates
  - Merge and deduplicate → ~1,500 candidates
       ↓
Stage 2: Pre-ranking (~20ms)
  - Lightweight model (logistic regression or small DNN)
  - Filter to top 200 candidates
       ↓
Stage 3: Ranking (~50ms)
  - Full-feature LambdaMART or LambdaRank model
  - Personalized features, real-time signals
  - Score top 200 → return top 50
       ↓
Stage 4: Re-ranking (~10ms)
  - Business rules: promoted listings, diversity constraints, freshness boost
  - Return top 10-20 results
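
The staged funnel above can be sketched as a single orchestration function. Everything here is illustrative: `retrieval_fns`, `prerank_fn`, `rank_fn`, and `rerank_fn` stand in for real services, and the candidate counts mirror the stage diagram.

```python
import time

def run_search_pipeline(query, user_id, retrieval_fns, prerank_fn,
                        rank_fn, rerank_fn):
    """Staged funnel: each stage narrows the candidate set so the
    expensive model only ever scores a few hundred items."""
    start = time.monotonic()

    # Stage 1: union of lexical (BM25) and vector (ANN) retrieval, deduplicated
    candidates = {}
    for retrieve in retrieval_fns:
        for item_id, score in retrieve(query):
            candidates[item_id] = max(candidates.get(item_id, 0.0), score)

    # Stage 2: lightweight pre-ranker cuts ~1,500 candidates to ~200
    shortlist = prerank_fn(query, list(candidates), top_k=200)

    # Stage 3: full-feature ranker scores the shortlist with personalization
    ranked = rank_fn(query, user_id, shortlist, top_k=50)

    # Stage 4: business rules (promotions, diversity) shape the final page
    results = rerank_fn(ranked, top_k=20)

    elapsed_ms = (time.monotonic() - start) * 1000  # compare against 200ms budget
    return results, elapsed_ms
```

In production each stage is a separate service with its own latency SLO, so a slow stage can be degraded (e.g. skip personalization) rather than failing the whole query.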

Feature Engineering

Query features:

  • Query tokens, n-grams, intent classification (navigational vs informational vs transactional)
  • Query embedding (from fine-tuned sentence encoder)
  • Query historical CTR and conversion rate
  • Session context: previous queries, category browsing history

Document/item features:

  • Title, description, category, brand, price
  • Item embedding (from product encoder, same space as query encoder)
  • Historical CTR, conversion rate, rating, review count
  • Inventory status, freshness, seller reputation

Query-document interaction features:

  • BM25 score (term-frequency-based relevance)
  • Cosine similarity between query and item embeddings
  • Exact match fields: title contains all query terms?
  • User’s purchase history for this item or category

User personalization features:

  • User’s brand affinity scores (pre-computed, updated daily)
  • Price sensitivity (mean purchase price, percentile)
  • Category affinity from last 30-day purchase history
  • Cold start: use demographic cluster features for new users
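
Combining the four groups, one training row is a single query-item feature vector. A minimal sketch, with illustrative field names (production rankers use hundreds of features, many served from a real-time feature store):

```python
import numpy as np

def build_feature_vector(query_f, item_f, user_f):
    """
    Assemble one query-item feature vector from the groups above.
    Field names are illustrative, not a real schema.
    """
    # Interaction features: embeddings are assumed L2-normalized,
    # so the dot product is cosine similarity
    emb_similarity = float(np.dot(query_f["embedding"], item_f["embedding"]))
    full_title_match = float(all(tok in item_f["title"].lower()
                                 for tok in query_f["tokens"]))

    return np.array([
        query_f["historical_ctr"],
        item_f["historical_ctr"],
        item_f["rating"],
        np.log1p(item_f["review_count"]),           # log-scale heavy-tailed count
        item_f["price"] / max(user_f["mean_purchase_price"], 1.0),  # price fit
        user_f["category_affinity"].get(item_f["category"], 0.0),
        emb_similarity,
        full_title_match,
    ])
```

Note the normalization choices: log-scaling review counts and expressing price relative to the user's mean purchase price keep features on comparable scales across items and users.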

Learning to Rank Models

Pointwise approach: Predict relevance score per item independently. Use regression or classification. Simple but ignores relative ordering.

Pairwise approach (RankNet, LambdaRank): Learn from pairs of items — “item A should rank above item B.” Models the ordering relationship directly.

Listwise approach (LambdaMART, ListNet): Optimize a list-level metric such as NDCG directly. LambdaMART, trainable via LightGBM's lambdarank objective, is the industry workhorse:

import lightgbm as lgb
import numpy as np
from sklearn.model_selection import GroupKFold

def train_lambdamart(X, y, groups):
    """
    Train LambdaMART ranking model using LightGBM.

    X: feature matrix [n_samples, n_features]
    y: relevance labels (0-4 scale: not relevant → perfect)
    groups: query IDs for grouping (items within same query)
    """
    params = {
        'objective': 'lambdarank',
        'metric': 'ndcg',
        'ndcg_eval_at': [1, 3, 5, 10],
        'learning_rate': 0.05,
        'num_leaves': 63,
        'min_child_samples': 50,
        'feature_fraction': 0.8,
        'verbose': -1,
        'lambdarank_truncation_level': 30  # Optimize NDCG@30
    }

    # Group k-fold to keep queries together across folds
    gkf = GroupKFold(n_splits=5)
    ndcg_scores = []

    for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        groups_train = [groups[i] for i in train_idx]
        groups_val = [groups[i] for i in val_idx]

        # LightGBM needs per-query group sizes (not IDs), ordered as the
        # rows appear. np.unique returns counts sorted by group ID, so
        # this assumes each split's rows are sorted by query ID.
        _, train_group_sizes = np.unique(groups_train, return_counts=True)
        _, val_group_sizes = np.unique(groups_val, return_counts=True)

        dtrain = lgb.Dataset(X_train, label=y_train, group=train_group_sizes)
        dval = lgb.Dataset(X_val, label=y_val, group=val_group_sizes)

        model = lgb.train(
            params, dtrain,
            num_boost_round=500,
            valid_sets=[dval],
            callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
        )

        ndcg_scores.append(model.best_score['valid_0']['ndcg@10'])
        print(f"Fold {fold}: NDCG@10 = {ndcg_scores[-1]:.4f}")

    print(f"Mean NDCG@10: {np.mean(ndcg_scores):.4f} ± {np.std(ndcg_scores):.4f}")
    return model  # model from the last fold; retrain on the full dataset for production

Training Data: Implicit Feedback

You rarely have explicit relevance judgments at scale. You build training data from user behavior:

| Signal | Strength | Noise source |
| --- | --- | --- |
| Purchase | Very strong (bought = relevant) | Gifts, accidental purchases |
| Add to cart | Strong | Wishlist behavior without purchase intent |
| Click + long dwell | Moderate | Position bias (top items get clicks regardless) |
| Click only | Weak | High position bias, curiosity clicks |
| Impression, no click | Weak negative | Item may never have been seen below the fold |
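
These signals map onto the 0-4 graded labels the ranker trains on. A minimal sketch of that mapping; the 30-second dwell threshold is an illustrative choice, not a standard:

```python
def relevance_label(event):
    """
    Map an implicit-feedback event to the 0-4 graded relevance scale,
    following the signal-strength ordering above.
    """
    if event.get("purchased"):
        return 4
    if event.get("added_to_cart"):
        return 3
    if event.get("clicked"):
        # Long dwell distinguishes genuine interest from curiosity clicks
        return 2 if event.get("dwell_seconds", 0) >= 30 else 1
    return 0  # impression without click: weak negative, not a true negative
```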

Position bias correction: Users click top results more because they’re on top, not necessarily because they’re better. Use Inverse Propensity Scoring (IPS) to debias:

import numpy as np

def compute_ips_weights(positions):
    """
    Inverse Propensity Scoring for position bias correction.
    Propensity: probability that the result at position k was examined.
    positions: 1-indexed ranks at which the training examples appeared.
    """
    positions = np.asarray(positions, dtype=float)

    # Examination probability decays with position (a simple cascade-style
    # assumption); in practice, estimate propensities from randomization
    # experiments or a click model rather than a fixed curve.
    exam_prob = 1.0 / np.sqrt(positions)

    # IPS weight: 1 / propensity.
    # Clip to prevent extreme weights from rarely examined positions.
    return np.clip(1.0 / exam_prob, 0.1, 10.0)

The resulting weights are applied as per-example weights when building the training set (e.g. the weight argument of lgb.Dataset), so clicks at low positions count more than clicks at the top.

Two-Tower Model for Retrieval

The retrieval stage uses a two-tower neural network to enable ANN search:

import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    def __init__(self, query_dim=768, item_dim=256, embedding_dim=128):
        super().__init__()

        # Query tower: encode user query
        self.query_tower = nn.Sequential(
            nn.Linear(query_dim, 256),
            nn.ReLU(),
            nn.LayerNorm(256),
            nn.Linear(256, embedding_dim),
        )

        # Item tower: encode product features
        self.item_tower = nn.Sequential(
            nn.Linear(item_dim, 256),
            nn.ReLU(),
            nn.LayerNorm(256),
            nn.Linear(256, embedding_dim),
        )

    def forward(self, query_features, item_features=None):
        query_emb = self.query_tower(query_features)
        query_emb = nn.functional.normalize(query_emb, dim=-1)

        if item_features is not None:
            item_emb = self.item_tower(item_features)
            item_emb = nn.functional.normalize(item_emb, dim=-1)
            # Dot product similarity
            return (query_emb * item_emb).sum(dim=-1)

        return query_emb  # For pre-computing item embeddings

    def encode_items(self, item_features):
        """Pre-compute item embeddings for the ANN index (inference only)."""
        with torch.no_grad():
            item_emb = self.item_tower(item_features)
            return nn.functional.normalize(item_emb, dim=-1)
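
The towers are usually trained contrastively; a common recipe is in-batch sampled softmax, where each query's engaged item is the positive and the other items in the batch supply negatives. A minimal sketch (batching, hard-negative mining, and the optimizer loop are omitted; the 0.05 temperature is an illustrative choice):

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(query_emb, item_emb, temperature=0.05):
    """
    In-batch sampled softmax for two-tower training. Row i of each
    tensor is a (query, engaged-item) pair; every other item in the
    batch serves as a negative. Embeddings are assumed L2-normalized.
    """
    # [batch, batch] similarity matrix; the diagonal holds the positives
    logits = query_emb @ item_emb.T / temperature
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)
```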

Item embeddings are precomputed nightly and indexed in a vector database (Faiss, Qdrant) for fast ANN retrieval at query time.
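
A minimal stand-in for that index, using exact inner-product search; a production system would swap in an approximate index (Faiss IVF/HNSW, Qdrant) to stay within the retrieval stage's latency budget at catalog scale:

```python
import numpy as np

class ExactVectorIndex:
    """
    Minimal stand-in for the nightly-built vector index: exact
    inner-product search over L2-normalized embeddings.
    """
    def __init__(self, item_ids, item_embeddings):
        self.item_ids = np.asarray(item_ids)
        norms = np.linalg.norm(item_embeddings, axis=1, keepdims=True)
        # Normalize so inner product equals cosine similarity
        self.vectors = item_embeddings / np.clip(norms, 1e-12, None)

    def search(self, query_emb, k=500):
        query_emb = query_emb / max(np.linalg.norm(query_emb), 1e-12)
        scores = self.vectors @ query_emb
        top = np.argsort(-scores)[:k]
        return self.item_ids[top], scores[top]
```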

Evaluation

import numpy as np

def ndcg_at_k(relevance_scores, k=10):
    """
    Normalized Discounted Cumulative Gain at k.
    relevance_scores: list of relevance labels in ranked order (0-4 scale)
    """
    def dcg(scores, k):
        scores = np.array(scores[:k], dtype=float)
        gains = (2 ** scores - 1) / np.log2(np.arange(2, len(scores) + 2))
        return gains.sum()

    actual_dcg = dcg(relevance_scores, k)
    ideal_dcg = dcg(sorted(relevance_scores, reverse=True), k)

    if ideal_dcg == 0:
        return 0.0
    return actual_dcg / ideal_dcg

def mean_reciprocal_rank(relevant_positions):
    """
    MRR: 1/rank of first relevant result. Good for navigational queries.
    relevant_positions: list of positions (1-indexed) where relevant results appear
    """
    if not relevant_positions:
        return 0.0
    return 1.0 / min(relevant_positions)

Online Experimentation

  • A/B testing: Route 10% of queries to new ranking model; measure CTR, conversion, revenue per query
  • Interleaved evaluation: Merge results from two models in a single SERP; track which model’s results get clicked — faster signal, smaller traffic needed
  • Guardrail metrics: Query abandonment rate, time-to-click must not regress even if primary metric improves
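
Interleaving can be sketched with the team-draft variant: models A and B alternately "draft" their highest-ranked unused result into one list, and each click is credited to the model that contributed the item. A simplified sketch:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=0):
    """Team-draft interleaving (simplified): alternate drafts from two
    rankings, remembering which model contributed each shown item."""
    rng = random.Random(seed)
    remaining_a, remaining_b = list(ranking_a), list(ranking_b)
    interleaved, credit, used = [], {}, set()
    count_a = count_b = 0

    while remaining_a or remaining_b:
        # Team with fewer picks drafts next; coin flip on ties
        a_turn = count_a < count_b or (count_a == count_b and rng.random() < 0.5)
        team, remaining = (("A", remaining_a) if a_turn and remaining_a
                           else ("B", remaining_b))
        if not remaining:  # chosen team exhausted; the other drafts instead
            team, remaining = (("A", remaining_a) if remaining_a
                               else ("B", remaining_b))
        item = remaining.pop(0)
        if item in used:  # already drafted by the other team
            continue
        used.add(item)
        interleaved.append(item)
        credit[item] = team
        count_a += team == "A"
        count_b += team == "B"
    return interleaved, credit

def interleaving_winner(credit, clicked_items):
    """Credit each click to the model that contributed the clicked item."""
    wins_a = sum(1 for item in clicked_items if credit.get(item) == "A")
    wins_b = sum(1 for item in clicked_items if credit.get(item) == "B")
    return "A" if wins_a > wins_b else "B" if wins_b > wins_a else "tie"
```

Because both models contribute to every results page, interleaving detects preference differences with far less traffic than a traditional A/B split.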

Depth Levels

Junior: Describe multi-stage pipeline, explain BM25 vs embedding retrieval, choose a ranking model.

Senior: Implement LambdaMART training, handle position bias, design online evaluation.

Staff: Personalization at scale (user embedding freshness, cold start), fairness constraints (seller diversity), multi-objective ranking (relevance vs revenue vs diversity), latency vs quality trade-off across pipeline stages.

Related ML Topics

  • Embeddings and Vector Databases — two-tower retrieval encodes queries and items into the same embedding space; HNSW ANN search retrieves candidates in Stage 1
  • Design a Recommendation Engine — search ranking and recommendation share the same two-stage retrieval+ranking architecture; key difference is explicit query vs. inferred intent
  • Classification Metrics — NDCG, MRR, and MAP are ranking-specific metrics that extend precision/recall for ordered result sets
  • Handling Imbalanced Datasets — implicit feedback training data is highly skewed; non-clicked impressions are weak negative signals, not true negatives
  • How to Detect Model Drift in Production — ranking model quality drifts with catalog changes, user behavior shifts, and seasonal patterns; monitor NDCG on held-out query sets