Search Ranking System Overview
A search ranking system determines the order in which retrieved documents are presented to users. The pipeline has four distinct stages: retrieval, first-pass scoring, reranking, and serving. Each stage progressively narrows and reorders results, balancing quality against latency constraints.
Ranking Pipeline
- Retrieval: Full-text index lookup returns thousands of candidate documents matching the query.
- First-pass scoring: Lightweight scorer (BM25) orders candidates quickly.
- Reranking: Expensive learning-to-rank model scores the top-N candidates.
- Serving: Final ranked list returned to user within latency budget.
BM25 as Baseline Retrieval Scorer
BM25 is the standard baseline for text relevance. It extends TF-IDF with term frequency saturation (high TF contributes diminishing returns) and document length normalization. BM25 scores each document given query terms, producing an initial ranked list before any learned model is applied. It is fast, interpretable, and requires no training data.
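The saturation and length-normalization behavior described above can be seen in a small scorer. This is a minimal sketch, not a production implementation: the function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            # Saturation: this ratio approaches (k1 + 1) as tf grows.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus, already tokenized.
docs = [
    "cheap flights to paris".split(),
    "paris paris paris paris travel guide and hotel listings".split(),
    "laptop reviews".split(),
]
scores = bm25_scores(["paris"], docs)
```

In this toy corpus the second document mentions the query term four times, yet its score is well under four times that of the first document, which is the diminishing-returns effect of term frequency saturation.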
Learning-to-Rank Models
Learning-to-rank (LTR) uses labeled training data to train models that order documents better than BM25 alone. Three main paradigms:
- Pointwise: Train a regression or classification model to predict relevance score per document independently. Simple but ignores relative ordering.
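- Pairwise (LambdaMART): Optimize a loss function over pairs of documents, preferring document A over document B when A is more relevant. LambdaMART is a widely used algorithm in this family. It combines gradient-boosted trees with LambdaRank gradients, which weight each pair by the change in NDCG that swapping the two documents would cause, so it is sometimes classified as listwise.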
- Listwise: Optimize a list-level metric such as NDCG directly. More principled but harder to train stably.
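To make the pairwise and listwise objectives concrete, the sketch below computes a pairwise logistic loss and NDCG for a single query. The function names are hypothetical, and the gain formula `2^rel - 1` is one common convention assumed here, not something this overview specifies.

```python
import math

def pairwise_logistic_loss(scores, labels):
    """Sum of log(1 + exp(-(s_i - s_j))) over pairs where i is more relevant than j."""
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                loss += math.log1p(math.exp(-(scores[i] - scores[j])))
    return loss

def dcg(relevances):
    """Discounted cumulative gain for relevance labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    k = len(relevances) if k is None else k
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```

A pairwise model is trained to reduce the first quantity, while a listwise model targets a metric like the second one (or a smooth surrogate of it, since NDCG itself is not differentiable in the scores).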
Feature Engineering
Features are the inputs to the LTR model. They fall into three groups:
- Query features: Query length (short queries are more ambiguous), query frequency (popular queries have better signal), entity detection (is this a named entity like a product or person?).
- Document features: Title match score, body match score, document freshness (recency decay), page authority (inbound link count, domain trust).
- Interaction features: Click-through rate (CTR) for query-document pair, dwell time (time spent on page after click), previous clicks by this user on this document.
Training Data
LTR models require labeled data. Two sources:
- Human judgments: Annotators rate document relevance for sampled queries on a scale (e.g., perfect / excellent / good / fair / bad). High quality but expensive.
- Implicit feedback: Clicks, purchases, and time-on-page logged from production traffic. Cheap and abundant but noisy — clicks reflect position bias, not just relevance.
Position Bias Correction
Documents shown at position 1 receive far more clicks than equally relevant documents at position 5, purely due to position. Inverse propensity scoring (IPS) corrects for this: each click is weighted by the inverse of the probability that a result at its position is examined, so clicks at top positions count for less and clicks at lower positions count for more. Propensity can be estimated via randomization experiments (swap top results randomly for a fraction of traffic).
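A minimal sketch of IPS weighting follows. The propensity values, the log format, and the function name `ips_weighted_clicks` are hypothetical; in practice the propensities would come from a randomization experiment like the one described above.

```python
def ips_weighted_clicks(click_log, propensity):
    """Sum clicks per document, weighting each click by 1 / propensity[position]."""
    totals = {}
    for doc_id, position, clicked in click_log:
        if clicked:
            totals[doc_id] = totals.get(doc_id, 0.0) + 1.0 / propensity[position]
    return totals

# Hypothetical examination propensities per display position.
propensity = {1: 1.0, 2: 0.6, 5: 0.2}

# Hypothetical log entries: (doc_id, position shown, clicked?).
click_log = [
    ("a", 1, True),
    ("a", 1, True),
    ("b", 5, True),
    ("b", 5, False),
]
weights = ips_weighted_clicks(click_log, propensity)
```

Here document "b" received a single click at position 5, but that click carries more weight than the two clicks document "a" received at position 1. Production systems often clip very large weights to limit variance, which this sketch omits.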
Serving Constraints
Full reranking of thousands of candidates with a complex LTR model cannot complete in time. The solution is a two-stage approach:
- Stage 1 — Fast retrieval: BM25 retrieves top-1000 documents in a few milliseconds.
- Stage 2 — Expensive reranking: LambdaMART or neural reranker scores top-100 documents from stage 1.
The full two-stage pipeline must complete in under 50ms end-to-end. Feature lookup, model inference, and result serialization must all fit within this budget.
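The two-stage flow can be sketched as below. The function `two_stage_rank` and its scorer callbacks are hypothetical placeholders: `cheap_score` stands in for BM25 and `expensive_score` for a LambdaMART or neural reranker.

```python
import heapq

def two_stage_rank(candidates, cheap_score, expensive_score,
                   n_retrieve=1000, n_rerank=100):
    """Keep the top n_retrieve by the cheap scorer, then rerank only the top n_rerank."""
    # Stage 1: nlargest returns results sorted by cheap_score, best first.
    stage1 = heapq.nlargest(n_retrieve, candidates, key=cheap_score)
    head, tail = stage1[:n_rerank], stage1[n_rerank:]
    # Stage 2: the expensive scorer only sees the head of the list.
    reranked = sorted(head, key=expensive_score, reverse=True)
    # Documents below the rerank cutoff keep their stage-1 order.
    return reranked + tail

# Toy example with integers as documents and deliberately opposed scorers.
result = two_stage_rank(
    range(10),
    cheap_score=lambda d: d,
    expensive_score=lambda d: -d,
    n_retrieve=5,
    n_rerank=3,
)
```

The expensive scorer is called `n_rerank` times regardless of how many candidates were retrieved, which is what keeps the latency of stage 2 bounded.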
Feature Store for Serving
Precomputing features is essential to meet latency targets:
- Document features (page authority, body length, CTR history) are precomputed at index time and stored in a key-value store (Redis or RocksDB).
- Query features (query frequency, entity detection result) are computed at query time, often cached for popular queries.
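As a rough illustration of this split, the sketch below assembles a feature vector at serving time. A plain dictionary stands in for the key-value store, and `DOC_FEATURES`, `query_features`, and `build_feature_vector` are hypothetical names; a real deployment would issue lookups against Redis or RocksDB instead.

```python
from functools import lru_cache

# Stand-in for the key-value store populated at index time (hypothetical values).
DOC_FEATURES = {
    "doc1": {"page_authority": 0.8, "body_length": 1200},
}

@lru_cache(maxsize=10_000)
def query_features(query):
    """Compute query-side features once; repeated queries are served from the cache."""
    tokens = query.split()
    return (float(len(tokens)),)

def build_feature_vector(query, doc_id):
    """Concatenate cached query features with precomputed document features."""
    doc = DOC_FEATURES.get(doc_id, {})
    return list(query_features(query)) + [
        float(doc.get("page_authority", 0.0)),  # default when the document is unknown
        float(doc.get("body_length", 0.0)),
    ]

vec = build_feature_vector("red shoes", "doc1")
```

Only the dictionary lookup and the cached query computation happen on the request path, so the per-document cost at serving time stays small.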
Model Update Cadence and Cold Start
Models are retrained daily on the most recent interaction logs. Freshness matters because user behavior and content evolve. New documents have no interaction signals — the model falls back to content features (title match, body match, freshness) until enough clicks accumulate to use interaction features reliably.
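One simple way to express the cold-start fallback is to gate interaction features on the amount of evidence available. This is a sketch under assumptions: the threshold of 100 impressions, the function name, and the indicator feature `has_interaction_signal` are illustrative choices, not part of the system described here.

```python
def interaction_features(clicks, impressions, min_impressions=100):
    """Return CTR only when enough impressions exist; otherwise emit neutral defaults."""
    if impressions < min_impressions:
        # Too little data: zero out CTR and tell the model the signal is absent,
        # so it leans on content features instead.
        return {"ctr": 0.0, "has_interaction_signal": 0.0}
    return {"ctr": clicks / impressions, "has_interaction_signal": 1.0}

new_doc = interaction_features(clicks=3, impressions=10)
established_doc = interaction_features(clicks=50, impressions=1000)
```

The explicit indicator lets a tree-based model learn different behavior for documents with and without reliable click data, instead of treating a missing CTR as a genuinely low one.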
Online Evaluation
Ranking changes are validated through A/B tests measuring NDCG (ranking quality), CTR, and revenue per search. Statistical significance is required before full rollout. Shadow mode (the new model runs in parallel, but its results are not shown to users) can validate correctness before A/B exposure.
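For a metric like CTR, the significance check can be sketched as a two-proportion z-test. This is one common choice and an assumption here; the overview does not say which test is used, and the traffic numbers below are invented for illustration.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for the difference in CTR between arms A and B."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled rate under the null hypothesis that both arms share one CTR.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: control CTR 10.0%, treatment CTR 11.0%.
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
```

With these invented counts the p-value falls below 0.05, so the lift would pass a conventional significance threshold. Revenue per search is a continuous metric and would need a different test.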