Question 1

How do you design a learning-to-rank pipeline for a large-scale search system?

Accepted Answer

A learning-to-rank pipeline typically has three stages: candidate retrieval (ANN or inverted index), a scoring model trained with a pointwise, pairwise, or listwise objective (e.g., LambdaMART or a neural ranker), and a final reranking layer that applies business rules and diversity constraints. Feature engineering covers query-document signals (BM25, TF-IDF, semantic similarity via dense embeddings), document-level signals (PageRank, freshness, click-through rate), and user-context signals (session history, geo, device). Labels are generated from click logs using techniques such as position-bias correction or counterfactual learning-to-rank. The model is trained offline and served behind a low-latency feature store that holds precomputed document features, while query-time features are computed on the fly.
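To make the three stages concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production design: term overlap stands in for an ANN or inverted-index lookup, a hand-weighted linear function stands in for a trained ranker such as LambdaMART, and the `Doc` fields, the weights, and the per-domain cap are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    domain: str
    text: str
    freshness: float  # hypothetical document-level signal in [0, 1]


def retrieve(query: str, corpus: list, limit: int = 1000) -> list:
    """Stage 1: cheap candidate retrieval (term overlap stands in for ANN / inverted index)."""
    terms = set(query.lower().split())
    hits = [d for d in corpus if terms & set(d.text.lower().split())]
    return hits[:limit]


def score(query: str, doc: Doc) -> float:
    """Stage 2: stand-in for a trained ranker; the weights are made up for illustration."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    overlap = len(terms & set(doc.text.lower().split())) / len(terms)
    return 0.8 * overlap + 0.2 * doc.freshness


def rerank(scored: list, k: int = 10, max_per_domain: int = 2) -> list:
    """Stage 3: a simple diversity rule that caps how many results one domain may take."""
    out, per_domain = [], {}
    for _, doc in sorted(scored, key=lambda pair: pair[0], reverse=True):
        if per_domain.get(doc.domain, 0) < max_per_domain:
            out.append(doc)
            per_domain[doc.domain] = per_domain.get(doc.domain, 0) + 1
        if len(out) == k:
            break
    return out


def search(query: str, corpus: list, k: int = 10) -> list:
    candidates = retrieve(query, corpus)
    scored = [(score(query, d), d) for d in candidates]
    return rerank(scored, k=k)
```

Calling `search("red shoes", corpus)` returns at most `k` documents, best score first, with no more than two from any single domain. In a real system each stage would be a separate service or library call, but the data flow between stages has the same shape.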

Question 2

What features matter most when building a relevance scoring model, and how do you handle feature freshness?

Accepted Answer

High-signal features include query-document semantic similarity (dense retrieval scores), historical CTR normalized by position, document authority (PageRank or domain trust score), recency for time-sensitive queries, and exact-match signals (title match, URL match). Feature freshness is handled by separating features into static (indexed offline, updated daily or weekly) and dynamic (computed at query time or updated in near-real-time via a streaming pipeline). A feature store like Feast or a Redis-backed system serves dynamic features with sub-millisecond latency. Staleness budgets are defined per feature: CTR can tolerate hourly staleness, while trending signals require sub-minute updates via a Kafka-driven aggregation pipeline.
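A per-feature staleness budget can be enforced at read time. The sketch below assumes a plain dictionary in place of Redis or Feast, and the feature names, budget values, and default values are hypothetical; only the idea of an hourly budget for CTR and a sub-minute budget for trending signals comes from the answer above.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Hypothetical staleness budgets in seconds: CTR tolerates an hour,
# trending signals must be under a minute old.
STALENESS_BUDGET_S = {
    "ctr_position_normalized": 3600.0,
    "trending_score": 60.0,
}

# Hypothetical neutral defaults used when a value is missing or too old.
DEFAULTS = {"ctr_position_normalized": 0.0, "trending_score": 0.0}


@dataclass
class FeatureValue:
    value: float
    updated_at: float  # Unix timestamp of the last write


def read_feature(store: dict, doc_id: str, name: str, now: Optional[float] = None) -> float:
    """Return the stored value if it is within its staleness budget, else the default."""
    now = time.time() if now is None else now
    entry = store.get((doc_id, name))
    if entry is None:
        return DEFAULTS[name]
    age = now - entry.updated_at
    return entry.value if age <= STALENESS_BUDGET_S[name] else DEFAULTS[name]
```

Falling back to a neutral default is one possible policy; another is to serve the stale value and emit a metric, which trades accuracy for coverage.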

Question 3

How do you evaluate search ranking quality beyond NDCG, and what offline/online metrics do you track?

Accepted Answer

Offline metrics include NDCG@k, MRR, MAP, and Expected Reciprocal Rank (ERR), which models user stopping behavior more directly. For diversity, you track intra-list diversity (ILD) and subtopic recall. Online metrics carry more weight because they reflect real user behavior: CTR, time-to-first-click, abandonment rate (sessions with no click), and long-click rate (dwell time > 30s as a proxy for satisfaction). You also measure query reformulation rate, since a high rate signals poor ranking. A/B tests evaluate these metrics with statistical significance guardrails (typically p < 0.05, with the minimum detectable effect defined upfront). Position bias in click data is corrected with inverse propensity scoring, with propensities commonly estimated from a result randomization experiment, so that clicks approximate true relevance.
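The offline metrics are short enough to write out. The sketch below uses the common exponential-gain forms of NDCG@k and ERR and computes the reciprocal rank for a single query (MRR is its mean over queries). It assumes `gains` holds the graded relevance labels of one ranked list in rank order, that the ideal ordering for NDCG is derived from those same labels, and that grades run from 0 to `max_grade`; the function names are illustrative.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain with exponential gain and a log2 position discount."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(gains):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, g in enumerate(gains, start=1):
        if g > 0:
            return 1.0 / rank
    return 0.0


def err(gains, max_grade=4):
    """Expected Reciprocal Rank under a cascade model: the user scans down the
    list and stops at a result with a probability that grows with its grade."""
    p_reach, total = 1.0, 0.0
    for rank, g in enumerate(gains, start=1):
        p_stop = (2 ** g - 1) / (2 ** max_grade)
        total += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return total
```

ERR differs from NDCG in that a highly relevant result near the top reduces the credit given to everything below it, which is how it captures stopping behavior.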

Question 4

How do you architect the serving layer of a search ranking system to meet sub-100ms latency SLAs?

Accepted Answer

The serving layer uses a cascade architecture: a fast first-stage ranker (lightweight linear model or quantized neural net) scores thousands of candidates in under 20ms, then a heavier second-stage model rescores the top 200-500. Document features are precomputed and stored in a low-latency key-value store (Redis or RocksDB) to avoid on-the-fly computation. Query-time feature extraction (query embedding, entity tagging) runs in parallel using async I/O. The scoring model is typically exported to ONNX and served on CPU with SIMD optimizations, or compiled with TensorRT and served on GPU for neural rerankers at high QPS. Circuit breakers fall back to a simpler BM25 ranker if the ML service exceeds its latency budget. Caching popular query results with a TTL of a few minutes takes head queries off the ranking path, which lowers backend load and helps keep p99 latency down.
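The latency-budget fallback can be sketched with `asyncio.wait_for`. This is a minimal illustration, not a hardened circuit breaker: the class name, the 50ms budget, the failure threshold, and the cooldown are assumptions, and `ml_ranker` and `fallback_ranker` are hypothetical callables standing in for the ML service client and a local BM25 scorer.

```python
import asyncio
import time


class LatencyCircuitBreaker:
    """Opens after `max_failures` consecutive budget misses, then skips the
    ML ranker entirely until `cooldown_s` has elapsed."""

    def __init__(self, budget_s=0.05, max_failures=3, cooldown_s=5.0):
        self.budget_s = budget_s
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # monotonic time at which the breaker opened

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            # Cooldown is over: close the breaker and allow a trial call.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    async def rank(self, query, candidates, ml_ranker, fallback_ranker):
        if self.is_open():
            return fallback_ranker(query, candidates)
        try:
            result = await asyncio.wait_for(
                ml_ranker(query, candidates), timeout=self.budget_s
            )
        except asyncio.TimeoutError:
            # Budget missed: count it, maybe open the breaker, serve the fallback.
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback_ranker(query, candidates)
        self.failures = 0
        return result
```

A single timeout only affects the current request; the breaker opens after repeated misses so that a degraded ML service stops consuming the latency budget of every request. A fuller version would also treat connection errors as failures and share breaker state across workers.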

Search Ranking System Low-Level Design: Relevance Scoring, Learning to Rank, and Feature Engineering

Search Ranking System Overview

Ranking Pipeline

BM25 as Baseline Retrieval Scorer

Learning-to-Rank Models

Feature Engineering

Training Data

Position Bias Correction

Serving Constraints

Feature Store for Serving

Model Update Cadence and Cold Start

Online Evaluation