Low Level Design: Search Relevance Ranking

Search relevance ranking determines the order in which results are presented for a given query. Poor ranking makes a search engine useless even with correct retrieval. The ranking pipeline applies a series of increasingly sophisticated signals: lexical matching (BM25), semantic similarity, and learned ranking models trained on user engagement data.

TF-IDF

Term Frequency-Inverse Document Frequency scores a document by how often a query term appears in it (TF), weighted by how rare that term is across the corpus (IDF). TF(t,d) = count of term t in document d / total terms in d. IDF(t) = log(total documents / documents containing t). TF-IDF(t,d) = TF(t,d) * IDF(t). Rare terms (high IDF) are more discriminating; common words (low IDF) contribute little. TF-IDF is the foundation of classical information retrieval.
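The formulas above can be sketched directly; the toy corpus here is illustrative:

```python
import math

def tfidf(term, doc_tokens, corpus):
    """TF-IDF for one term in one document, per the formulas above."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "dog"],
]
# "the" appears in every document, so its IDF (and TF-IDF) is 0;
# "cat" appears in 2 of 3 documents, so it scores higher.
score_cat = tfidf("cat", corpus[0], corpus)
score_the = tfidf("the", corpus[0], corpus)
```

Note how the common word "the" contributes nothing to relevance while the rarer "cat" does, which is exactly the discriminating behavior described above.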

BM25

BM25 (Best Match 25) improves on TF-IDF with two enhancements: term frequency saturation (additional occurrences of a term yield diminishing returns, controlled by parameter k1) and document length normalization (penalizes long documents that accumulate term counts through repetition, controlled by parameter b). BM25 is the default relevance algorithm in Elasticsearch and Solr. Standard starting points are k1 = 1.2–2.0 and b = 0.75, tuned per corpus.
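A minimal BM25 scorer, showing how k1 saturates term frequency and b normalizes for document length (the corpus and query are illustrative; the IDF variant shown is the Lucene-style smoothed form):

```python
import math

def bm25(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one document for a query using the BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    dl = len(doc_tokens)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        tf = doc_tokens.count(t)
        # saturation via k1; length normalization via b
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

corpus = [
    ["car", "crash", "report"],
    ["cooking", "pasta"],
    ["car", "maintenance", "tips", "and", "car", "care"],
]
scores = [bm25(["car", "crash"], d, corpus) for d in corpus]
```

The short document matching both query terms outscores the longer document that merely repeats "car" — repetition alone cannot win because of saturation and length normalization.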

Query Understanding

Before ranking, the pipeline interprets query intent. Query classification: is the query navigational (the user wants a specific page), informational (the user wants to learn), or transactional (the user wants to do something)? Query expansion: add synonyms (phone → mobile, smartphone). Spell correction: fix typos before retrieval. Entity recognition: identify named entities in the query (Paris → the city, not a person). Query understanding improves both recall (more relevant documents retrieved) and precision (fewer irrelevant results).
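The spell-correction and expansion steps can be sketched with a hypothetical vocabulary and synonym table (real systems use query logs and learned models; `difflib` here stands in for an edit-distance corrector):

```python
import difflib

# Hypothetical vocabulary and synonym table for illustration
VOCAB = ["phone", "camera", "laptop"]
SYNONYMS = {"phone": ["mobile", "smartphone"]}

def correct(token):
    """Snap a possibly misspelled token to the closest vocabulary word."""
    match = difflib.get_close_matches(token, VOCAB, n=1, cutoff=0.7)
    return match[0] if match else token

def expand(tokens):
    """Append synonyms so lexical retrieval also matches related wording."""
    out = list(tokens)
    for t in tokens:
        out.extend(SYNONYMS.get(t, []))
    return out

query = [correct(t) for t in ["phnoe"]]  # typo fixed before retrieval
expanded = expand(query)
```

Correction runs before expansion: fixing "phnoe" to "phone" first is what lets the synonym table fire at all.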

Learning to Rank (LTR)

Learning to Rank trains a model on (query, document, relevance_label) triples. Relevance labels come from human raters (explicit relevance judgments) or user behavior (implicit signals: clicks, dwell time, purchases). Feature vector per (query, document) pair: BM25 score, query-document semantic similarity, document freshness, document authority (PageRank-like), click-through rate. Models: LambdaMART (gradient-boosted trees), RankNet (neural), or LambdaRank. These models optimize ranking metrics (NDCG, MAP) directly.
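A toy pairwise-LTR sketch under stated assumptions: the feature values and hinge-style perceptron below are illustrative stand-ins, not LambdaMART, but they show the core idea of learning weights so that relevant documents score above irrelevant ones for the same query:

```python
def score(w, features):
    """Linear ranking score: dot product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w, features))

def train_pairwise(pairs, n_features, lr=0.1, epochs=50):
    """pairs: (better_doc_features, worse_doc_features) for the same query."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            # update when the better doc fails to beat the worse by a margin
            if score(w, better) - score(w, worse) < 1.0:
                w = [wi + lr * (b - ws) for wi, b, ws in zip(w, better, worse)]
    return w

# feature order: [bm25, semantic_sim, freshness, authority, ctr] (toy values)
pairs = [
    ([2.1, 0.8, 0.5, 0.6, 0.3], [0.4, 0.2, 0.9, 0.1, 0.0]),
    ([1.7, 0.9, 0.2, 0.7, 0.2], [0.6, 0.3, 0.8, 0.2, 0.1]),
]
w = train_pairwise(pairs, n_features=5)
```

Production systems replace the linear model with gradient-boosted trees (LambdaMART) and weight each pairwise update by its impact on NDCG, but the training signal — preference pairs derived from labels or clicks — is the same.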

Semantic Search

Lexical search misses semantic matches: the query “automobile accident” doesn’t match a document containing “car crash.” Semantic search encodes the query and documents as dense vectors (BERT, sentence-transformers) and retrieves documents whose embeddings are similar to the query embedding. Hybrid search combines BM25 (exact lexical match) with dense retrieval (semantic similarity) via Reciprocal Rank Fusion or learned weighting. Elasticsearch and Weaviate support hybrid search natively.
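Reciprocal Rank Fusion needs only the rank positions from each retriever, which is why it combines lexical and dense results without score calibration. A minimal sketch (doc ids and rankings are illustrative; k = 60 is the commonly used constant):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids: score = sum of 1/(k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d3", "d2"]   # lexical retriever's order
dense_ranking = ["d3", "d2", "d1"]  # embedding retriever's order
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Here "d3" wins the fused ranking because it places near the top of both lists, even though neither retriever ranked it first and second alike.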

Freshness and Authority

For news and time-sensitive content, freshness is a ranking signal: newer documents rank higher for queries where recency matters (sports scores, breaking news). Authority measures the quality and trustworthiness of the source: PageRank uses link graph analysis; for e-commerce, seller rating and review count serve as authority signals. Both freshness and authority are features in the LTR model, learned to be appropriately weighted for each query intent class.
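One common way to turn document age into an LTR feature is exponential decay; the half-life value below is a hypothetical choice, in practice tuned (or learned) per query intent class:

```python
def freshness_score(age_hours, half_life_hours=24.0):
    """Exponential decay: the document loses half its freshness per half-life."""
    return 0.5 ** (age_hours / half_life_hours)
```

With a 24-hour half-life a breaking-news article scores 1.0 at publication and 0.5 a day later; for evergreen queries the LTR model simply learns a near-zero weight for this feature.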

Ranking Evaluation

Measure ranking quality with: NDCG (Normalized Discounted Cumulative Gain, measures whether highly-relevant results appear at the top), MRR (Mean Reciprocal Rank, for single-answer queries), P@K (precision at K results). Offline evaluation uses human-labeled query-document pairs. Online evaluation uses A/B tests measuring click-through rate, dwell time, and task completion rate. Offline and online metrics often diverge — offline NDCG improvements may not translate to online CTR gains if human relevance labels don’t capture real user preferences.
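The offline metrics above are short formulas; a minimal sketch of NDCG and MRR (relevance grades and ranks in the usage lines are illustrative):

```python
import math

def dcg(rels):
    """Discounted cumulative gain: relevance discounted by log2 of position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg(rels):
    """DCG normalized by the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

def mrr(first_relevant_ranks):
    """Mean reciprocal rank: one rank-of-first-relevant-result per query."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

perfect = ndcg([3, 2, 1])    # already in ideal order -> 1.0
inverted = ndcg([1, 2, 3])   # best result last -> below 1.0
```

The log2 discount is what makes NDCG position-sensitive: moving a highly relevant document from rank 3 to rank 1 raises the score even though the set of retrieved documents is unchanged.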
