Synonym Expansion System: Overview
Synonym expansion augments search queries with equivalent or related terms so that a search for “automobile” also retrieves documents about “car” and “vehicle.” This low-level design covers the synonym graph data model, expansion strategies, asymmetric directionality, domain specificity, the admin edit interface, and measuring the impact of expansion on search quality.
Synonym Types
Not all synonyms are equal. The system must distinguish:
- Exact synonyms (bidirectional): car = automobile. Expansion in both directions is equally valid.
- Directional synonyms (one-way): “TV show” → “series” (expanding “TV show” to include “series” is useful, but “series” should not automatically expand to “TV show” as it would add noise for unrelated series like book series).
- Hyponyms (is-a relationships): sedan is-a car. Searching for “car” should optionally include “sedan,” but searching for “sedan” should not expand to “car” (too broad).
- Acronyms and abbreviations: ML = machine learning, NLP = natural language processing. Usually bidirectional with high confidence.
- Brand synonyms: “Kleenex” = “tissue” in consumer goods domains.
Synonym Graph
The synonym graph has terms as nodes and synonym relationships as directed edges. Each edge carries:
- direction: bidirectional or one_way (from term to synonym only)
- domain: general, tech, medical, legal, ecommerce — enables domain-specific overrides
- weight: confidence score 0.0-1.0. High-weight edges always expand. Low-weight edges expand only in boosting mode.
- active: flag for soft-disabling pairs without deletion
A synonym group (SynonymGroup) allows associating multiple terms as a cluster rather than pairwise edges, reducing the number of rows needed for large equivalence sets.
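The edge attributes and the group shortcut can be sketched as a small data model. This is a minimal illustration, not the system's actual code: the names SynonymEdge, should_expand, and group_to_edges are hypothetical, and the 0.8 cutoff for a "high-weight" edge is an assumption, since the text does not say where that boundary sits.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List

# Hypothetical cutoff: the design does not state where "high weight" begins.
HIGH_WEIGHT_THRESHOLD = 0.8


@dataclass(frozen=True)
class SynonymEdge:
    term: str
    synonym: str
    direction: str = "bidirectional"  # "bidirectional" or "one_way"
    domain: str = "general"
    weight: float = 1.0
    active: bool = True


def should_expand(edge: SynonymEdge, strategy: str) -> bool:
    """High-weight edges always expand; low-weight edges only when boosting."""
    if not edge.active:
        return False
    return edge.weight >= HIGH_WEIGHT_THRESHOLD or strategy == "boost"


def group_to_edges(terms: List[str], domain: str = "general") -> List[SynonymEdge]:
    """Unroll one group into the bidirectional pairwise edges it stands for."""
    return [
        SynonymEdge(a, b, "bidirectional", domain)
        for a, b in combinations(terms, 2)
    ]


# A four-term group replaces six pairwise rows.
edges = group_to_edges(["car", "automobile", "auto", "motorcar"])
print(len(edges))  # 6
```

A group of n terms stands in for n(n-1)/2 bidirectional pairs, which is where the row savings come from for large equivalence sets.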
Query-Time Expansion Strategies
OR Expansion
Replace each query term with an OR clause of itself and its synonyms:
original query: "automobile repair"
expanded: "(automobile OR car OR vehicle) AND (repair OR fix OR maintenance)"
Pros: maximizes recall. Cons: can reduce precision if synonyms are noisy.
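As a rough sketch of OR expansion, the helper below turns a list of query terms and a synonym lookup into the expanded string shown above. The function name build_or_query and the plain-dict lookup are assumptions for illustration; a real engine would emit its own query objects rather than a string.

```python
from typing import Dict, List


def build_or_query(terms: List[str], synonyms: Dict[str, List[str]]) -> str:
    """Join per-term OR groups with AND, keeping the original term first."""
    clauses = []
    for term in terms:
        alternatives = [term] + [s for s in synonyms.get(term, []) if s != term]
        if len(alternatives) == 1:
            # No synonyms known: keep the bare term without parentheses.
            clauses.append(term)
        else:
            clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)


synonyms = {"automobile": ["car", "vehicle"], "repair": ["fix", "maintenance"]}
print(build_or_query(["automobile", "repair"], synonyms))
# (automobile OR car OR vehicle) AND (repair OR fix OR maintenance)
```

Multi-word synonyms would need phrase quoting before being joined; that step is left out here.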
Boost Expansion
Keep original terms at full score, add synonyms with a lower score weight:
automobile^2 OR car^1 OR vehicle^0.5
Pros: original term results rank highest, synonyms fill in where original has no match. Cons: more complex query plan.
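A boost clause in the same caret notation could be assembled as follows. This is a sketch under assumptions: build_boost_clause is a hypothetical helper, the default original-term boost of 2.0 mirrors the example above, and clamping synonym boosts to the original's boost is an added safeguard that the text does not prescribe.

```python
from typing import List, Tuple


def build_boost_clause(term: str, synonyms: List[Tuple[str, float]],
                       original_boost: float = 2.0) -> str:
    """Render 'term^boost OR synonym^weight ...' for one query term."""
    parts = [f"{term}^{original_boost:g}"]
    for synonym, weight in synonyms:
        # Assumed safeguard: a synonym never outranks the original term.
        boost = min(weight, original_boost)
        parts.append(f"{synonym}^{boost:g}")
    return " OR ".join(parts)


print(build_boost_clause("automobile", [("car", 1.0), ("vehicle", 0.5)]))
# automobile^2 OR car^1 OR vehicle^0.5
```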
Index-Time Expansion
Expand synonyms at index time when documents are ingested. A document containing “automobile” also indexes “car” as if the document contained both.
Pros: simpler query; no query-time expansion logic.
Cons: index bloat; updating synonyms requires re-indexing all documents. Query-time expansion is preferred for agility.
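For contrast, index-time expansion can be illustrated with a toy inverted index. The whitespace tokenizer and the function names below are assumptions made for the sketch; the point is only that synonyms are written into the index, so changing the synonym map means rebuilding it.

```python
from typing import Dict, List, Set


def expand_tokens_for_index(tokens: List[str],
                            synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the tokens to index: originals plus synonyms, without duplicates."""
    seen: Set[str] = set()
    indexed: List[str] = []
    for token in tokens:
        for candidate in [token] + synonyms.get(token, []):
            if candidate not in seen:
                seen.add(candidate)
                indexed.append(candidate)
    return indexed


def build_inverted_index(docs: Dict[str, str],
                         synonyms: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each indexed token to the set of document ids containing it."""
    index: Dict[str, Set[str]] = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive tokenizer, for illustration only
        for token in expand_tokens_for_index(tokens, synonyms):
            index.setdefault(token, set()).add(doc_id)
    return index


docs = {"d1": "automobile repair manual", "d2": "bicycle repair"}
index = build_inverted_index(docs, {"automobile": ["car"]})
print(sorted(index["car"]))  # ['d1']
```

The document d1 never contains the word "car", yet an unexpanded query for "car" finds it. The extra postings are the index bloat noted above.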
Asymmetric Expansion
The one-way direction field enforces asymmetric expansion. Example: “TV” expands to {“television”, “show”, “series”} but querying for “series” does not expand to “TV” because the relationship is not reversible without adding noise from unrelated meanings of “series.”
Implementation: when building the expansion set for a query term q, follow an edge if SynonymPair.term = q (whatever its direction), or if SynonymPair.synonym = q and direction = 'bidirectional'. A one_way edge is never traversed from its synonym back to its term.
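The traversal rule can be sketched directly over rows shaped like the SynonymPair table. The tuple layout and the function name expansion_set are illustrative assumptions.

```python
from typing import List, Set, Tuple

# (term, synonym, direction) rows, shaped like the SynonymPair table.
Pair = Tuple[str, str, str]


def expansion_set(query_term: str, pairs: List[Pair]) -> Set[str]:
    """Collect synonyms reachable from query_term under the direction rule."""
    result: Set[str] = set()
    for term, synonym, direction in pairs:
        if term == query_term:
            # Forward traversal is allowed for one_way and bidirectional edges.
            result.add(synonym)
        elif synonym == query_term and direction == "bidirectional":
            # Reverse traversal is allowed only for bidirectional edges.
            result.add(term)
    return result


pairs = [
    ("tv", "television", "bidirectional"),
    ("tv", "show", "one_way"),
    ("tv", "series", "one_way"),
]
print(sorted(expansion_set("tv", pairs)))          # ['series', 'show', 'television']
print(sorted(expansion_set("series", pairs)))      # []
print(sorted(expansion_set("television", pairs)))  # ['tv']
```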
Domain-Specific Synonyms
General corpus synonyms can conflict with domain-specific meanings. Example: in programming, “Python” is a language, not a snake. In medical context, “cold” means illness, not temperature.
The domain field on SynonymPair and SynonymGroup allows the expansion engine to select synonyms matching the current search context. Context is inferred from:
- The product vertical the search is running in (set at the API call level)
- A domain classifier applied to the query or session
Domain-specific synonyms override general ones when both match the same term.
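One way to read the override rule is that a term with any domain-specific entries ignores its general entries entirely. The sketch below takes that strict reading, which is an assumption; merging both lists and preferring the domain entry on conflict is another reasonable reading. The synonym values in the sample graph are made up for illustration.

```python
from typing import Dict, List, Tuple

# {(term, domain): [synonym, ...]}
Graph = Dict[Tuple[str, str], List[str]]


def resolve_synonyms(term: str, domain: str, graph: Graph) -> List[str]:
    """Prefer domain-specific synonyms; use general ones only if none exist."""
    specific = graph.get((term, domain), [])
    if specific:
        return specific
    return graph.get((term, "general"), [])


graph: Graph = {
    ("python", "general"): ["snake"],
    ("python", "tech"): ["py", "cpython"],   # illustrative values
    ("cold", "general"): ["chilly"],
    ("cold", "medical"): ["common cold"],    # illustrative values
}
print(resolve_synonyms("python", "tech", graph))     # ['py', 'cpython']
print(resolve_synonyms("python", "general", graph))  # ['snake']
print(resolve_synonyms("cold", "tech", graph))       # ['chilly']
```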
Admin Edit Interface
An admin UI lets curators add, edit, deactivate, and test synonym pairs:
- Add pair: term, synonym, direction, domain, weight → INSERT into SynonymPair
- Deactivate: toggle active=false without deletion (preserves audit history)
- Test: enter a query, see the expanded version with highlighted synonym substitutions
- Bulk import: CSV upload of term-synonym pairs for large dictionary migrations
Changes take effect at the next synonym graph reload, which is driven by a configurable TTL on the in-memory cache (every 5 minutes in this design).
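A TTL-driven reload might look like the sketch below. The class name TTLGraphCache and its loader and clock parameters are hypothetical; the injectable clock is there only so the behaviour can be tested without waiting five minutes.

```python
import time
from typing import Callable, Optional


class TTLGraphCache:
    """Holds the synonym graph and reloads it once the TTL has elapsed."""

    def __init__(self, loader: Callable[[], dict], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._graph: Optional[dict] = None
        self._loaded_at: Optional[float] = None

    def get(self) -> dict:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            # Swap in the freshly loaded graph and restart the TTL window.
            self._graph = self._loader()
            self._loaded_at = now
        return self._graph


# Usage with a fake clock and a loader that counts its calls.
loads = []
fake_now = [0.0]


def fake_loader() -> dict:
    loads.append(1)
    return {"version": len(loads)}


cache = TTLGraphCache(fake_loader, ttl_seconds=300.0, clock=lambda: fake_now[0])
first = cache.get()    # initial load
fake_now[0] = 299.0
second = cache.get()   # still within the TTL, no reload
fake_now[0] = 300.0
third = cache.get()    # TTL elapsed, reload
```

With this approach an edit made in the admin UI can take up to one TTL period to reach queries, which is the trade-off for not hitting the database on every search.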
SQL Schema
-- Pairwise synonym relationships
CREATE TABLE SynonymPair (
    id         BIGSERIAL PRIMARY KEY,
    term       VARCHAR(256) NOT NULL,
    synonym    VARCHAR(256) NOT NULL,
    direction  VARCHAR(16) NOT NULL DEFAULT 'bidirectional', -- bidirectional / one_way
    domain     VARCHAR(64) NOT NULL DEFAULT 'general',
    weight     DOUBLE PRECISION NOT NULL DEFAULT 1.0,
    active     BOOLEAN NOT NULL DEFAULT TRUE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (term, synonym, domain)
);

CREATE INDEX idx_synonympair_term ON SynonymPair(term, domain, active);
CREATE INDEX idx_synonympair_synonym ON SynonymPair(synonym, domain, active);

-- Group-based synonym sets (many-to-many cluster)
CREATE TABLE SynonymGroup (
    group_name VARCHAR(128) NOT NULL,
    terms      TEXT[] NOT NULL, -- array of equivalent terms
    domain     VARCHAR(64) NOT NULL DEFAULT 'general',
    PRIMARY KEY (group_name, domain)
);

-- Expansion audit log for A/B impact measurement
CREATE TABLE ExpansionLog (
    query_id    UUID NOT NULL,
    original    TEXT NOT NULL,
    expanded    TEXT NOT NULL,
    domain      VARCHAR(64) NOT NULL,
    expanded_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Python Implementation
from typing import Dict, List

# `db` is assumed to be a database handle configured elsewhere in the
# application (for example a connection wrapper whose execute() returns a cursor).

# In-memory synonym graph loaded from DB at startup / refreshed every 5 min
# Structure: {(term, domain): [(synonym, direction, weight), ...]}
synonym_graph: Dict[tuple, List[tuple]] = {}

def load_synonym_graph() -> None:
    """Load active synonym pairs from DB into in-memory graph."""
    global synonym_graph
    rows = db.execute(
        "SELECT term, synonym, direction, domain, weight"
        " FROM SynonymPair WHERE active = TRUE"
    ).fetchall()
    graph: Dict[tuple, List[tuple]] = {}
    for term, synonym, direction, domain, weight in rows:
        # Forward edge: always stored, whatever the direction
        key_fwd = (term.lower(), domain)
        if key_fwd not in graph:
            graph[key_fwd] = []
        graph[key_fwd].append((synonym.lower(), direction, weight))
        # Reverse edge: stored only for bidirectional pairs
        if direction == 'bidirectional':
            key_rev = (synonym.lower(), domain)
            if key_rev not in graph:
                graph[key_rev] = []
            graph[key_rev].append((term.lower(), direction, weight))
    synonym_graph = graph

def get_synonyms(term: str, direction: str = 'both', domain: str = 'general') -> List[dict]:
    """Return synonyms for a term filtered by direction and domain."""
    term = term.lower()
    # Domain-specific entries come first, followed by the general fallback
    entries = synonym_graph.get((term, domain), [])
    if domain != 'general':
        entries = entries + synonym_graph.get((term, 'general'), [])
    results = []
    for synonym, edge_direction, weight in entries:
        # The graph only stores edges that may be followed from `term`, so
        # 'both' and 'forward' accept every edge; any other value keeps
        # bidirectional edges only.
        if direction in ('both', 'forward') or edge_direction == 'bidirectional':
            results.append({
                "synonym": synonym,
                "direction": edge_direction,
                "weight": weight
            })
    # Deduplicate by synonym, keeping highest weight
    seen: dict = {}
    for r in results:
        s = r["synonym"]
        if s not in seen or seen[s]["weight"] < r["weight"]:
            seen[s] = r
    return list(seen.values())


# The default values for `domain` and `strategy` are assumed.
def expand_query(query_terms: List[str], domain: str = 'general',
                 strategy: str = 'or') -> dict:
    """Expand query terms with synonyms. Returns expanded clauses per term."""
    expansion = {}
    for term in query_terms:
        synonyms = get_synonyms(term, domain=domain)
        if strategy == 'or':
            # All terms equally weighted
            expansion[term] = [term] + [s["synonym"] for s in synonyms]
        elif strategy == 'boost':
            # Original term gets weight 2.0, synonyms get their edge weight
            clauses = [(term, 2.0)]
            for s in synonyms:
                clauses.append((s["synonym"], s["weight"]))
            expansion[term] = clauses
    return expansion

def build_synonym_graph_report() -> dict:
    """Return stats on current synonym graph for monitoring."""
    total_pairs = sum(len(v) for v in synonym_graph.values())
    domains = set(k[1] for k in synonym_graph.keys())
    return {
        "unique_terms": len(synonym_graph),
        "total_edges": total_pairs,
        "domains": list(domains)
    }

def measure_expansion_impact(query_id: str, original: str,
                             expanded: str, domain: str) -> None:
    """Log expansion for A/B analysis: compare CTR between expanded and control."""
    db.execute(
        "INSERT INTO ExpansionLog(query_id, original, expanded, domain, expanded_at)"
        " VALUES(%s, %s, %s, %s, NOW())",
        (query_id, original, expanded, domain)
    )
A/B Testing Expansion Impact
To measure whether synonym expansion improves search quality:
- Split traffic: 50% use expansion, 50% use raw query (control).
- Measure: CTR on search results, zero-results rate, session abandonment rate.
- Log expanded query text in ExpansionLog for offline analysis.
- Run for 2 weeks minimum to cover weekly traffic patterns. Gate on statistically significant CTR lift (p < 0.05).
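The split and the significance gate can be sketched as follows. Everything named here is an assumption for illustration: assign_bucket uses a salted hash so a user stays in the same arm across sessions, and ctr_lift_p_value applies a pooled two-proportion z-test, which treats impressions as independent and is only one of several tests a team might choose.

```python
import hashlib
import math


def assign_bucket(user_id: str, salt: str = "synonym-expansion-v1") -> str:
    """Deterministically place a user in the expansion or control arm (50/50)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return "expansion" if int(digest, 16) % 2 == 0 else "control"


def ctr_lift_p_value(clicks_control: int, views_control: int,
                     clicks_treat: int, views_treat: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test on CTR."""
    p_control = clicks_control / views_control
    p_treat = clicks_treat / views_treat
    pooled = (clicks_control + clicks_treat) / (views_control + views_treat)
    std_err = math.sqrt(pooled * (1 - pooled)
                        * (1 / views_control + 1 / views_treat))
    if std_err == 0:
        return 1.0  # no variance at all, so no evidence of a difference
    z = (p_treat - p_control) / std_err
    # Two-sided tail: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up counts: 10.0% CTR in control versus 12.0% with expansion.
p = ctr_lift_p_value(1000, 10000, 1200, 10000)
print(p < 0.05)  # True
```

A significant p-value says only that the arms differ; the direction of the lift still has to be checked before shipping, along with zero-results rate and abandonment.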
Key Design Decisions Summary
- Directional edges prevent noise from one-way synonym expansion (TV → series but not reverse).
- Domain specificity resolves conflicts where the same term has different meanings across verticals.
- Query-time expansion (over index-time) allows synonym updates without re-indexing documents.
- Boost expansion preserves precision by ranking original-term matches above synonym matches.
- A/B testing is mandatory — synonym quality varies widely and must be measured, not assumed.