Search Ranking Pipeline Low-Level Design: Query Understanding, Scoring Signals, and Re-ranking

Search Ranking Pipeline Overview

A search ranking pipeline transforms a raw user query into an ordered list of results that maximize relevance, engagement, and business value. The system operates across multiple stages: query understanding, candidate retrieval, feature scoring, and machine-learning re-ranking. Each stage filters and refines the candidate set before passing it downstream.

Requirements

Functional Requirements

  • Accept free-text queries and return ranked result lists within 200 ms at p99.
  • Support first-stage retrieval via BM25 and dense vector search (ANN).
  • Compute multi-dimensional scoring signals: textual relevance, freshness, popularity, and personalization.
  • Apply a learned ranking model (LambdaMART or a two-tower neural ranker) in the re-ranking stage.
  • Support A/B experiment buckets that swap ranking models without code deployments.
  • Log feature vectors and labels for continuous model training.

Non-Functional Requirements

  • Throughput: 50,000 queries per second at peak.
  • Availability: 99.99% uptime; retrieval and ranking are decoupled so a ranker outage degrades gracefully to BM25 order.
  • Feature freshness: document signals updated within 60 seconds of ingestion.

Data Model

The Document Index stores inverted postings (term → doc IDs with TF-IDF weights) alongside a dense vector field for embedding-based retrieval. Each document record carries: doc_id, url, title, body_tokens, embedding float[768], freshness_ts, click_rate_7d, and quality_score.
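The document record above can be sketched as a simple typed structure. This is an illustrative shape only; field names follow the list in the text, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DocumentRecord:
    """Illustrative shape of one entry in the Document Index."""
    doc_id: str
    url: str
    title: str
    body_tokens: List[str]
    embedding: List[float]   # dense vector, 768 dimensions
    freshness_ts: int        # epoch seconds of last ingestion
    click_rate_7d: float     # trailing 7-day click rate
    quality_score: float

doc = DocumentRecord(
    doc_id="d42", url="https://example.com/a", title="ANN search",
    body_tokens=["ann", "search"], embedding=[0.0] * 768,
    freshness_ts=1_700_000_000, click_rate_7d=0.12, quality_score=0.87,
)
```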

The Feature Store maps (query_id, doc_id) pairs to a sparse feature vector at scoring time. Features are grouped into namespaces: query-doc textual features (BM25 score, title match, anchor text match), document-level signals (PageRank, engagement rate, spam score), and user-context features (geographic relevance, historical CTR for this user-query pair).

The Experiment Config table maps experiment bucket IDs to ranker model versions, feature sets, and blending weights, enabling live traffic splits without restarts.
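A minimal sketch of how the Experiment Config table might be consulted, assuming a deterministic hash of the user ID picks the bucket. The `EXPERIMENT_CONFIG` contents, bucket names, and model-version strings here are hypothetical.

```python
import hashlib

# Hypothetical in-memory view of the Experiment Config table.
EXPERIMENT_CONFIG = {
    "control":   {"model_version": "lambdamart_v12", "blend_weight": 0.0},
    "treatment": {"model_version": "lambdamart_v13", "blend_weight": 0.3},
}

def assign_bucket(user_id: str, treatment_pct: int = 50) -> str:
    """Deterministically map a user to a bucket via a stable hash,
    so the same user always sees the same ranker."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "treatment" if h < treatment_pct else "control"

def get_experiment_config(user_id: str) -> dict:
    return EXPERIMENT_CONFIG[assign_bucket(user_id)]
```

Because assignment is a pure function of the user ID, changing a model version is a config write, not a deployment.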

Core Algorithms

First-Stage Retrieval

BM25 retrieval runs against the inverted index to produce up to 1,000 candidates. In parallel, an ANN search (HNSW graph over document embeddings) returns the top 500 semantically similar documents. The union is deduplicated and capped at 1,000 candidates before scoring.
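The union-dedupe-cap step can be sketched as follows, assuming each retriever returns `(doc_id, score)` pairs in rank order:

```python
def merge_candidates(bm25_hits, ann_hits, cap=1000):
    """Union BM25 and ANN candidates, dedupe by doc_id, and cap the
    result size. BM25 hits are consumed first, so a document found by
    both retrievers keeps its lexical-ranked position."""
    seen, merged = set(), []
    for doc_id, score in list(bm25_hits) + list(ann_hits):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append((doc_id, score))
        if len(merged) == cap:
            break
    return merged
```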

Feature-Based Scoring

Each candidate receives a score vector assembled from the Feature Store. Textual signals are computed on the fly using precomputed term statistics. Document signals are read from a low-latency Redis shard keyed by doc_id. The assembled vector is passed to the re-ranker.
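The assembly step might look like the sketch below. The `DOC_SIGNALS` dict stands in for the Redis shard, `bm25_stub` stands in for on-the-fly textual scoring, and all feature names are illustrative.

```python
# Stand-in for the low-latency Redis shard keyed by doc_id.
DOC_SIGNALS = {"d1": {"pagerank": 0.4, "engagement": 0.7, "spam": 0.01}}

def bm25_stub(query: str, doc_id: str) -> float:
    """Placeholder for on-the-fly scoring from precomputed term stats."""
    return 1.0  # assumed value for illustration

def assemble_features(query: str, doc_id: str) -> dict:
    """Build the score vector for one (query, doc) candidate pair."""
    feats = {"bm25": bm25_stub(query, doc_id)}
    doc = DOC_SIGNALS.get(doc_id, {})               # Redis GET in production
    feats["pagerank"] = doc.get("pagerank", 0.0)    # defaults on a miss
    feats["engagement"] = doc.get("engagement", 0.0)
    feats["spam"] = doc.get("spam", 0.0)
    return feats
```

Missing document signals default to zero so a partial feature fetch never blocks scoring.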

ML Re-ranking

A LambdaMART gradient-boosted tree model (or optionally a cross-encoder transformer) scores each candidate using the feature vector. The model is loaded as a shared in-process artifact, updated via a blue-green model swap triggered by the experiment config. Output scores are linearly blended with a diversity penalty to avoid result-set clustering around a single sub-topic.
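One way to realize the blend is a greedy MMR-style selection, shown below as a sketch: each pick trades the model score against a penalty for sub-topics already represented. The `topic_of` mapping and the weight `lam` are assumptions, not the production formula.

```python
def rerank_with_diversity(scored, topic_of, lam=0.8):
    """Greedy MMR-style re-rank: blend model score with a penalty for
    picking another result from an already-covered sub-topic.

    scored:   list of (doc_id, model_score)
    topic_of: dict mapping doc_id -> sub-topic label
    lam:      blend weight; 1.0 means pure model score
    """
    remaining = dict(scored)
    picked, topic_counts = [], {}
    while remaining:
        best = max(
            remaining,
            key=lambda d: lam * remaining[d]
                          - (1 - lam) * topic_counts.get(topic_of[d], 0),
        )
        picked.append(best)
        topic_counts[topic_of[best]] = topic_counts.get(topic_of[best], 0) + 1
        del remaining[best]
    return picked
```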

API Design

The ranking pipeline exposes an internal gRPC service:

  • RankQuery(QueryRequest) → RankedResults — orchestrates retrieval, scoring, and re-ranking for a single query.
  • BatchRank(BatchQueryRequest) → BatchRankedResults — used by offline evaluation jobs.
  • GetExperimentConfig(BucketId) → ExperimentConfig — returns active model and feature-set metadata for a given A/B bucket.

An upstream API gateway handles authentication and rate limiting, then routes each request to the correct experiment bucket before calling RankQuery.

Scalability and Fault Tolerance

The retrieval layer is sharded by document partition (each shard holds ~50 M documents). Fan-out queries hit all shards in parallel; a scatter-gather aggregator merges and deduplicates results. Shard replicas handle read failover with no writes hitting the hot path.
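The aggregator's merge step can be sketched as below, assuming each shard returns its hits already sorted by descending score:

```python
import heapq

def scatter_gather(shard_results, k=1000):
    """Merge per-shard result lists (each sorted by descending score)
    into one globally sorted, deduplicated top-k list."""
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[1])
    seen, out = set(), []
    for doc_id, score in merged:
        if doc_id not in seen:       # a doc can appear on replica reads
            seen.add(doc_id)
            out.append((doc_id, score))
        if len(out) == k:
            break
    return out
```

`heapq.merge` streams the shard lists lazily, so the aggregator never materializes all candidates when only the top k are needed.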

The scoring and re-ranking service is stateless and horizontally scalable behind a load balancer. Feature Store reads use a local L1 cache (process-level LRU, 128 MB) backed by a regional Redis cluster. On cache miss the service falls back to direct feature computation, which adds ~5 ms latency.
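A minimal sketch of the L1 cache with fallback, assuming eviction by entry count for brevity (the real cache is byte-bounded at 128 MB):

```python
from collections import OrderedDict

class FeatureL1Cache:
    """Tiny process-level LRU keyed by doc_id, with a fallback callable
    that stands in for direct feature computation on a miss (~5 ms)."""
    def __init__(self, max_entries, fallback):
        self._lru = OrderedDict()
        self._max = max_entries
        self._fallback = fallback

    def get(self, doc_id):
        if doc_id in self._lru:
            self._lru.move_to_end(doc_id)      # mark as most recently used
            return self._lru[doc_id]
        value = self._fallback(doc_id)         # miss: compute directly
        self._lru[doc_id] = value
        if len(self._lru) > self._max:
            self._lru.popitem(last=False)      # evict least recently used
        return value
```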

Model updates are deployed with zero downtime: the new artifact is loaded into a standby slot, warmed with shadow traffic, then promoted atomically. Rollback is a single config write.
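The standby/promote mechanics reduce to an atomic reference flip, sketched here with strings standing in for loaded model artifacts:

```python
class ModelSlots:
    """Sketch of a blue-green swap: load the new artifact into standby,
    warm it, then promote by flipping a reference. Rolling back is the
    same flip in reverse."""
    def __init__(self, active_model):
        self.active = active_model
        self.standby = None

    def load_standby(self, model):
        self.standby = model   # warm here with shadow traffic

    def promote(self):
        assert self.standby is not None, "nothing loaded in standby"
        self.active, self.standby = self.standby, self.active
```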

Monitoring and Observability

  • Emit latency histograms per pipeline stage (retrieval, feature fetch, inference).
  • Track feature coverage rate: fraction of candidates that received all feature namespaces.
  • Log sampled feature vectors and served rankings to an offline store for model retraining.
  • Alert on NDCG degradation greater than 2% relative in online A/B metrics within one hour of a model swap.
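The NDCG guardrail in the last bullet can be sketched as follows, using the standard DCG formulation over graded relevance labels:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one ranked list of graded relevance labels:
    DCG of the served order divided by DCG of the ideal order."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def should_alert(control_ndcg, treatment_ndcg, threshold=0.02):
    """Fire when treatment NDCG drops more than 2% relative to control."""
    return (control_ndcg - treatment_ndcg) / control_ndcg > threshold
```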

Frequently Asked Questions

How does BM25 first-stage retrieval work in a search ranking pipeline?

BM25 scores each document against a query using term frequency, inverse document frequency, and a document-length normalization factor. In a two-stage pipeline it acts as a fast candidate-retrieval layer, fetching the top-K documents from an inverted index before heavier re-ranking models are applied.

What feature-based scoring signals are commonly used in search ranking?

Common signals include BM25 score, click-through rate, freshness, document authority (PageRank-style), query-document semantic similarity, user engagement metrics, and position-bias corrections. These features are fed into a learned ranking model alongside raw text signals.

What is LambdaMART and why is it used for re-ranking?

LambdaMART is a gradient-boosted tree algorithm that directly optimizes ranking metrics such as NDCG. It is used for re-ranking because it can combine hundreds of heterogeneous features efficiently and produces well-ordered ranked lists without requiring a neural inference budget at query time.

How do you configure an A/B experiment for a search ranking change?

An A/B experiment config specifies the traffic split (e.g., 50/50), the treatment (new ranking model or feature), the primary metric (NDCG, CTR, session satisfaction), the minimum detectable effect, and a guardrail set of metrics that must not regress. The config is stored in a feature-flag service so rollout can be paused without a code deploy.

