Search Ranking System Overview
A search ranking system determines the order in which retrieved documents are presented to users. The pipeline has four distinct stages: retrieval, first-pass scoring, reranking, and serving. Each stage progressively narrows and reorders results, balancing quality against latency constraints.
Ranking Pipeline
- Retrieval: Full-text index lookup returns thousands of candidate documents matching the query.
- First-pass scoring: A lightweight scorer such as BM25 orders candidates quickly.
- Reranking: Expensive learning-to-rank model scores the top-N candidates.
- Serving: Final ranked list returned to user within latency budget.
BM25 as Baseline Retrieval Scorer
BM25 is the standard baseline for text relevance. It extends TF-IDF with term frequency saturation (high TF contributes diminishing returns) and document length normalization. BM25 scores each document given query terms, producing an initial ranked list before any learned model is applied. It is fast, interpretable, and requires no training data.
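To make the saturation and length-normalization behavior concrete, here is a minimal BM25 sketch over pre-tokenized documents. The `k1` and `b` defaults are the commonly cited values; the IDF variant (with the `+ 1` inside the log) is one of several in use, so treat this as illustrative rather than canonical.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against query terms with BM25.

    `corpus` is a list of tokenized documents, used to estimate
    IDF and the average document length.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        # Document frequency: how many documents contain the term.
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        # Saturating TF with document length normalization.
        denom = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * (tf[term] * (k1 + 1)) / denom
    return score
```

Because the TF term saturates, a document repeating a query term twenty times scores only marginally higher than one repeating it five times, which is the property that distinguishes BM25 from raw TF-IDF.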
Learning-to-Rank Models
Learning-to-rank (LTR) uses labeled training data to train models that order documents better than BM25 alone. Three main paradigms:
- Pointwise: Train a regression or classification model to predict relevance score per document independently. Simple but ignores relative ordering.
- Pairwise (LambdaMART): Optimize a loss function over pairs of documents — prefer document A over document B when A is more relevant. LambdaMART is the industry standard pairwise algorithm, combining gradient boosted trees with LambdaRank gradients.
- Listwise: Optimize a list-level metric such as NDCG directly. More principled but harder to train stably.
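The pairwise idea can be captured in one line: given a pair where document A is labeled more relevant than B, penalize the model when it scores A below B. Below is the RankNet-style logistic pairwise loss, which LambdaMART builds on (LambdaRank additionally scales each pair's gradient by the NDCG change from swapping the pair; that weighting is omitted here for brevity).

```python
import math

def pairwise_logistic_loss(score_a, score_b):
    """RankNet-style pairwise loss for a pair where A is labeled
    more relevant than B. The loss shrinks as score_a - score_b
    grows, i.e., as the model orders the pair correctly."""
    return math.log(1.0 + math.exp(-(score_a - score_b)))
```

Note that the loss depends only on the score *difference*, not the absolute scores — this is what lets pairwise methods learn an ordering without calibrating relevance values.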
Feature Engineering
Features are the inputs to the LTR model. They fall into three groups:
- Query features: Query length (short queries are more ambiguous), query frequency (popular queries have better signal), entity detection (is this a named entity like a product or person?).
- Document features: Title match score, body match score, document freshness (recency decay), page authority (inbound link count, domain trust).
- Interaction features: Click-through rate (CTR) for query-document pair, dwell time (time spent on page after click), previous clicks by this user on this document.
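The three feature groups above come together in a single vector per (query, document) pair. A minimal sketch, with dict-based stores standing in for real lookups and the feature names chosen for illustration:

```python
def build_feature_vector(query, doc, interaction_store):
    """Assemble one (query, document) feature vector.

    `doc` is a dict with precomputed document attributes;
    `interaction_store` maps (query, doc_id) to historical CTR.
    Both are placeholder interfaces for real feature lookups.
    """
    return {
        # Query features
        "query_len": len(query.split()),
        # Document features
        "title_match": float(query.lower() in doc["title"].lower()),
        "freshness_days": doc["age_days"],
        # Interaction features (0.0 when no history exists)
        "ctr": interaction_store.get((query, doc["id"]), 0.0),
    }
```

The `.get(..., 0.0)` default matters: most (query, document) pairs have no click history, so interaction features need an explicit missing-value convention.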
Training Data
LTR models require labeled data. Two sources:
- Human judgments: Annotators rate document relevance for sampled queries on a scale (e.g., perfect / excellent / good / fair / bad). High quality but expensive.
- Implicit feedback: Clicks, purchases, and time-on-page logged from production traffic. Cheap and abundant but noisy — clicks reflect position bias, not just relevance.
Position Bias Correction
Documents shown at position 1 receive far more clicks than equally relevant documents at position 5, purely due to position. Inverse propensity scoring (IPS) corrects for this: clicks at higher positions are down-weighted by the propensity of being clicked at that position. Propensity can be estimated via randomization experiments (swap top results randomly for a fraction of traffic).
Serving Constraints
Full reranking of thousands of candidates with a complex LTR model cannot complete within the serving latency budget. The solution is a two-stage approach:
- Stage 1 — Fast retrieval: BM25 retrieves top-1000 documents in a few milliseconds.
- Stage 2 — Expensive reranking: LambdaMART or neural reranker scores top-100 documents from stage 1.
The end-to-end request must complete in under 50ms: feature lookup, model inference, and result serialization all have to fit within this budget.
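The cascade can be sketched as follows. The `index.bm25_topk` and `rerank_model.score` interfaces are hypothetical placeholders; the key point is that the expensive model touches only the head of the candidate list while the tail keeps its BM25 order.

```python
def rank(query, index, rerank_model, n_retrieve=1000, n_rerank=100):
    """Two-stage cascade: cheap retrieval over the full index,
    then expensive reranking of only the top candidates."""
    candidates = index.bm25_topk(query, k=n_retrieve)   # Stage 1: fast
    head, tail = candidates[:n_rerank], candidates[n_rerank:]
    # Stage 2: expensive model rescores only the head.
    reranked = sorted(head,
                      key=lambda d: rerank_model.score(query, d),
                      reverse=True)
    return reranked + tail  # Tail retains its first-pass order.
```

The cost asymmetry is the whole design: if the reranker costs 0.4ms per document, scoring 100 documents fits a 50ms budget while scoring 1000 does not.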
Feature Store for Serving
Precomputing features is essential to meet latency targets:
- Document features (page authority, body length, CTR history) are precomputed at index time and stored in a key-value store (Redis or RocksDB).
- Query features (query frequency, entity detection result) are computed at query time, often cached for popular queries.
Model Update Cadence and Cold Start
Models are retrained daily on the most recent interaction logs. Freshness matters because user behavior and content evolve. New documents have no interaction signals — the model falls back to content features (title match, body match, freshness) until enough clicks accumulate to use interaction features reliably.
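The cold-start fallback described above can be implemented as a simple gate: interaction features are included only once a document clears a minimum-click threshold. The threshold value and feature names here are illustrative assumptions.

```python
MIN_CLICKS = 50  # assumed threshold; tune per system and traffic volume

def select_features(doc_stats):
    """Gate interaction features behind a minimum-click threshold,
    so new documents are scored on content features alone until
    their CTR estimate is statistically meaningful."""
    feats = {
        "title_match": doc_stats["title_match"],
        "freshness": doc_stats["freshness"],
    }
    if doc_stats.get("clicks", 0) >= MIN_CLICKS:
        feats["ctr"] = doc_stats["clicks"] / doc_stats["impressions"]
    return feats
```

Omitting the feature entirely (rather than emitting `ctr = 0.0`) lets the model distinguish "no data yet" from "shown often but never clicked", which are very different relevance signals.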
Online Evaluation
Ranking changes are validated through A/B tests measuring NDCG (ranking quality), CTR, and revenue per search. Statistical significance is required before full rollout. Shadow mode (new model runs in parallel but results not shown) can validate correctness before A/B exposure.
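Since NDCG is the headline quality metric in both offline and online evaluation, here is a compact implementation using the common exponential-gain form (2^rel - 1); some systems use linear gain instead, so the exact formula is a convention to fix per team.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k for a ranked list of graded relevance labels
    (e.g., 0-4), given in the order the system ranked them."""
    def dcg(rels):
        # Exponential gain, logarithmic position discount.
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```

A perfect ordering scores 1.0 by construction; any inversion among documents with different labels lowers the score, with mistakes near the top penalized most because of the logarithmic discount.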