Synonym Expansion System: Overview
Synonym expansion augments search queries with equivalent or related terms so that a search for “automobile” also retrieves documents about “car” and “vehicle.” This low-level design covers the synonym graph data model, expansion strategies, asymmetric directionality, domain specificity, the admin edit interface, and measurement of expansion's impact on search quality.
Synonym Types
Not all synonyms are equal. The system must distinguish:
- Exact synonyms (bidirectional): car = automobile. Expansion in both directions is equally valid.
- Directional synonyms (one-way): “TV show” → “series”. Expanding “TV show” to include “series” is useful, but “series” should not automatically expand to “TV show”, because that would add noise for unrelated series such as book series.
- Hyponyms (is-a relationships): sedan is-a car. Searching for “car” should optionally include “sedan,” but searching for “sedan” should not expand to “car” (too broad).
- Acronyms and abbreviations: ML = machine learning, NLP = natural language processing. Usually bidirectional with high confidence.
- Brand synonyms: “Kleenex” = “tissue” in consumer goods domains.
Synonym Graph
The synonym graph has terms as nodes and synonym relationships as directed edges. Each edge carries:
- direction: bidirectional or one_way (from term to synonym only)
- domain: general, tech, medical, legal, ecommerce — enables domain-specific overrides
- weight: confidence score 0.0-1.0. High-weight edges always expand. Low-weight edges expand only in boosting mode.
- active: flag for soft-disabling pairs without deletion
A synonym group (SynonymGroup) associates multiple terms as one cluster instead of as pairwise edges. A set of n mutually equivalent terms needs n(n-1)/2 bidirectional pair rows but only one group row, which keeps large equivalence sets manageable.
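The edge attributes and the row savings from groups can be sketched in a few lines. This is a minimal illustration, not the system's actual model: the `SynonymEdge` class and the `group_to_edges` helper are hypothetical names, and the example terms are made up.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List


@dataclass(frozen=True)
class SynonymEdge:
    """One directed edge of the synonym graph (hypothetical class name)."""
    term: str
    synonym: str
    direction: str = "bidirectional"  # or "one_way" (term -> synonym only)
    domain: str = "general"
    weight: float = 1.0
    active: bool = True


def group_to_edges(terms: List[str], domain: str = "general") -> List[SynonymEdge]:
    """Expand one group of equivalent terms into pairwise bidirectional edges."""
    return [SynonymEdge(a, b, domain=domain) for a, b in combinations(terms, 2)]


edges = group_to_edges(["car", "automobile", "auto", "motorcar"])
print(len(edges))  # 4 terms -> 6 pairwise edges, versus a single group row
```

The gap widens quickly: a 20-term group would need 190 pairwise rows.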
Query-Time Expansion Strategies
OR Expansion
Replace each query term with an OR clause of itself and its synonyms:
original query: "automobile repair"
expanded: "(automobile OR car OR vehicle) AND (repair OR fix OR maintenance)"
Pros: maximizes recall. Cons: can reduce precision if synonyms are noisy.
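As a rough sketch of how the OR-expanded string above could be assembled, assuming the synonyms have already been looked up into a plain dictionary (the `build_or_query` helper is a hypothetical name, not part of the implementation later in this article):

```python
from typing import Dict, List


def build_or_query(terms: List[str], synonyms: Dict[str, List[str]]) -> str:
    """Join each term with its synonyms by OR, then AND the groups together."""
    groups = []
    for term in terms:
        alternatives = [term] + synonyms.get(term, [])
        if len(alternatives) == 1:
            groups.append(term)  # no synonyms: keep the bare term
        else:
            groups.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(groups)


synonyms = {
    "automobile": ["car", "vehicle"],
    "repair": ["fix", "maintenance"],
}
print(build_or_query(["automobile", "repair"], synonyms))
# (automobile OR car OR vehicle) AND (repair OR fix OR maintenance)
```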
Boost Expansion
Keep original terms at full score, add synonyms with a lower score weight:
automobile^2 OR car^1 OR vehicle^0.5
Pros: original term results rank highest, synonyms fill in where original has no match. Cons: more complex query plan.
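The boosted clause can be built the same way. The sketch below assumes the `term^boost` syntax shown above and a fixed boost of 2 for the original term; `build_boost_clause` is a hypothetical helper name.

```python
from typing import List, Tuple


def build_boost_clause(term: str, synonyms: List[Tuple[str, float]],
                       original_boost: float = 2.0) -> str:
    """Render the original term at full boost and each synonym at its own weight."""
    parts = [f"{term}^{original_boost:g}"]
    parts += [f"{synonym}^{weight:g}" for synonym, weight in synonyms]
    return " OR ".join(parts)


print(build_boost_clause("automobile", [("car", 1.0), ("vehicle", 0.5)]))
# automobile^2 OR car^1 OR vehicle^0.5
```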
Index-Time Expansion
Expand synonyms at index time when documents are ingested. A document containing “automobile” also indexes “car” as if the document contained both.
Pros: simpler query; no query-time expansion logic.
Cons: index bloat; updating synonyms requires re-indexing all documents. Query-time expansion is preferred for agility.
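To make the index-bloat tradeoff concrete, here is a minimal sketch of index-time expansion. It assumes whitespace tokenization and a plain dictionary of synonyms; `index_tokens` is a hypothetical helper, and a real analyzer chain would do considerably more.

```python
from typing import Dict, List, Set


def index_tokens(text: str, synonyms: Dict[str, List[str]]) -> Set[str]:
    """Return the set of terms to index: the document's tokens plus their synonyms."""
    tokens = text.lower().split()
    indexed = set(tokens)
    for token in tokens:
        indexed.update(synonyms.get(token, []))
    return indexed


terms = index_tokens("Automobile repair manual", {"automobile": ["car"]})
print(sorted(terms))
# ['automobile', 'car', 'manual', 'repair']
```

Every synonym added here is baked into the stored postings, which is why changing the dictionary later forces a re-index.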
Asymmetric Expansion
The one-way direction field enforces asymmetric expansion. Example: “TV” expands to {“television”, “show”, “series”} but querying for “series” does not expand to “TV” because the relationship is not reversible without adding noise from unrelated meanings of “series.”
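Implementation: when building the expansion set for a query term, follow an edge in the forward direction (query term matches the edge's term column) regardless of its direction value, and follow it in reverse (query term matches the synonym column) only when direction = 'bidirectional'. A one_way edge is therefore never traversed from synonym back to term.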
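The edge-following rule can be sketched as a small function over (term, synonym, direction) tuples. The tuple layout and the `expansion_set` name are assumptions made for this illustration.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str, str]  # (term, synonym, direction)


def expansion_set(query_term: str, edges: List[Edge]) -> Set[str]:
    """Collect the terms that query_term is allowed to expand to."""
    result: Set[str] = set()
    for term, synonym, direction in edges:
        if term == query_term:
            # Forward traversal is valid for one_way and bidirectional edges
            result.add(synonym)
        elif synonym == query_term and direction == "bidirectional":
            # Reverse traversal is valid only for bidirectional edges
            result.add(term)
    return result


edges = [
    ("tv show", "series", "one_way"),
    ("car", "automobile", "bidirectional"),
]
print(expansion_set("tv show", edges))     # {'series'}
print(expansion_set("series", edges))      # set()
print(expansion_set("automobile", edges))  # {'car'}
```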
Domain-Specific Synonyms
General-corpus synonyms can conflict with domain-specific meanings. For example, in programming, “Python” is a language, not a snake; in a medical context, “cold” means an illness, not a temperature.
The domain field on SynonymPair and SynonymGroup allows the expansion engine to select synonyms matching the current search context. Context is inferred from:
- The product vertical the search is running in (set at the API call level)
- A domain classifier applied to the query or session
Domain-specific synonyms override general ones when both match the same term.
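One way to read the override rule is "use the domain's entries when they exist, otherwise fall back to general." The sketch below assumes that reading and a graph keyed by (term, domain); `resolve_synonyms` and the sample entries are hypothetical.

```python
from typing import Dict, List, Tuple

Graph = Dict[Tuple[str, str], List[str]]


def resolve_synonyms(term: str, domain: str, graph: Graph) -> List[str]:
    """Prefer domain-specific synonyms; fall back to general only if none exist."""
    specific = graph.get((term, domain))
    if specific:
        return specific
    return graph.get((term, "general"), [])


# Made-up entries for illustration only
graph: Graph = {
    ("cold", "general"): ["chilly"],
    ("cold", "medical"): ["common cold"],
    ("car", "general"): ["automobile"],
}
print(resolve_synonyms("cold", "medical", graph))  # ['common cold']
print(resolve_synonyms("car", "medical", graph))   # ['automobile']
```

A merge-based policy (domain entries plus general entries) is also possible, but it would let the general "chilly" sense leak into medical searches.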
Admin Edit Interface
An admin UI lets curators add, edit, deactivate, and test synonym pairs:
- Add pair: term, synonym, direction, domain, weight → INSERT into SynonymPair
- Deactivate: toggle active=false without deletion (preserves audit history)
- Test: enter a query, see the expanded version with highlighted synonym substitutions
- Bulk import: CSV upload of term-synonym pairs for large dictionary migrations
Changes take effect at the next synonym graph reload. The reload interval is configurable (every 5 minutes in this design) and is enforced by a TTL on the in-memory cache.
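The TTL-based reload could look roughly like the following. This is a sketch under stated assumptions: `TTLGraphCache` is a hypothetical wrapper, the loader is any zero-argument function that returns a freshly built graph, and the injectable clock exists only to make the behavior easy to demonstrate.

```python
import time
from typing import Callable, Dict, Optional


class TTLGraphCache:
    """Reload the synonym graph lazily once the cached copy is older than the TTL."""

    def __init__(self, loader: Callable[[], Dict], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._graph: Optional[Dict] = None
        self._loaded_at = 0.0

    def get(self) -> Dict:
        now = self._clock()
        if self._graph is None or now - self._loaded_at >= self._ttl:
            self._graph = self._loader()  # e.g. re-read active pairs from the DB
            self._loaded_at = now
        return self._graph


# Demonstration with a fake loader and a fake clock
calls = []
fake_now = [0.0]


def fake_loader() -> Dict:
    calls.append(1)
    return {"version": len(calls)}


cache = TTLGraphCache(fake_loader, ttl_seconds=300.0, clock=lambda: fake_now[0])
cache.get()
cache.get()          # still fresh: no second load
fake_now[0] = 301.0  # more than 5 minutes later
cache.get()          # stale: triggers a reload
print(len(calls))    # 2
```

With this shape, an admin edit becomes visible at most one TTL after it is saved, without any explicit invalidation call.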
SQL Schema
-- Pairwise synonym relationships
CREATE TABLE SynonymPair (
    id         BIGSERIAL PRIMARY KEY,
    term       VARCHAR(256) NOT NULL,
    synonym    VARCHAR(256) NOT NULL,
    direction  VARCHAR(16) NOT NULL DEFAULT 'bidirectional', -- bidirectional / one_way
    domain     VARCHAR(64) NOT NULL DEFAULT 'general',
    weight     DOUBLE PRECISION NOT NULL DEFAULT 1.0,
    active     BOOLEAN NOT NULL DEFAULT TRUE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (term, synonym, domain)
);

CREATE INDEX idx_synonympair_term ON SynonymPair(term, domain, active);
CREATE INDEX idx_synonympair_synonym ON SynonymPair(synonym, domain, active);

-- Group-based synonym sets (many-to-many cluster)
CREATE TABLE SynonymGroup (
    group_name VARCHAR(128) NOT NULL,
    terms      TEXT[] NOT NULL, -- array of equivalent terms
    domain     VARCHAR(64) NOT NULL DEFAULT 'general',
    PRIMARY KEY (group_name, domain)
);

-- Expansion audit log for A/B impact measurement
CREATE TABLE ExpansionLog (
    query_id    UUID NOT NULL,
    original    TEXT NOT NULL,
    expanded    TEXT NOT NULL,
    domain      VARCHAR(64) NOT NULL,
    expanded_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Python Implementation
from typing import Dict, List

# `db` is assumed to be an already-open database connection that exposes
# execute() with %s placeholders (for example, a psycopg 3 connection).

# In-memory synonym graph loaded from DB at startup / refreshed every 5 min
# Structure: {(term, domain): [(synonym, direction, weight), ...]}
synonym_graph: Dict[tuple, List[tuple]] = {}

def load_synonym_graph() -> None:
    """Load active synonym pairs from DB into in-memory graph."""
    global synonym_graph
    rows = db.execute(
        "SELECT term, synonym, direction, domain, weight"
        " FROM SynonymPair WHERE active = TRUE"
    ).fetchall()
    graph: Dict[tuple, List[tuple]] = {}
    for term, synonym, direction, domain, weight in rows:
        # Forward edge: term -> synonym (stored for every pair)
        graph.setdefault((term.lower(), domain), []).append(
            (synonym.lower(), direction, weight)
        )
        # Reverse edge: synonym -> term, only for bidirectional pairs
        if direction == 'bidirectional':
            graph.setdefault((synonym.lower(), domain), []).append(
                (term.lower(), direction, weight)
            )
    # Swap in the fully built graph so readers never see a partial one
    synonym_graph = graph

def get_synonyms(term: str, direction: str = 'both', domain: str = 'general') -> List[dict]:
    """Return synonyms for a term filtered by direction and domain."""
    term = term.lower()
    # Domain-specific entries override general ones; fall back to general
    # only when the domain has no entry for this term
    entries = synonym_graph.get((term, domain)) or synonym_graph.get((term, 'general'), [])
    results = []
    for synonym, edge_direction, weight in entries:
        # The graph only stores edges that may be followed from `term`, so
        # 'both' and 'forward' keep every edge; any other value keeps
        # bidirectional edges only
        if direction in ('both', 'forward') or edge_direction == 'bidirectional':
            results.append({
                "synonym": synonym,
                "direction": edge_direction,
                "weight": weight
            })
    # Deduplicate by synonym, keeping highest weight
    seen: Dict[str, dict] = {}
    for r in results:
        s = r["synonym"]
        if s not in seen or seen[s]["weight"] < r["weight"]:
            seen[s] = r
    return list(seen.values())


def expand_query(query_terms: List[str], domain: str = 'general',
                 strategy: str = 'or') -> dict:
    """Expand query terms with synonyms. Returns expanded clauses per term."""
    expansion = {}
    for term in query_terms:
        synonyms = get_synonyms(term, domain=domain)
        if strategy == 'or':
            # All terms equally weighted
            expansion[term] = [term] + [s["synonym"] for s in synonyms]
        elif strategy == 'boost':
            # Original term gets weight 2.0, synonyms get their edge weight
            clauses = [(term, 2.0)]
            for s in synonyms:
                clauses.append((s["synonym"], s["weight"]))
            expansion[term] = clauses
    return expansion

def build_synonym_graph_report() -> dict:
    """Return stats on current synonym graph for monitoring."""
    total_edges = sum(len(v) for v in synonym_graph.values())
    domains = {k[1] for k in synonym_graph}
    return {
        "unique_terms": len(synonym_graph),
        "total_edges": total_edges,
        "domains": list(domains)
    }

def measure_expansion_impact(query_id: str, original: str,
                             expanded: str, domain: str) -> None:
    """Log expansion for A/B analysis (compare CTR between expanded and control)."""
    db.execute(
        "INSERT INTO ExpansionLog(query_id, original, expanded, domain, expanded_at)"
        " VALUES(%s, %s, %s, %s, NOW())",
        (query_id, original, expanded, domain)
    )
A/B Testing Expansion Impact
To measure whether synonym expansion improves search quality:
- Split traffic: 50% use expansion, 50% use raw query (control).
- Measure: CTR on search results, zero-results rate, session abandonment rate.
- Log expanded query text in ExpansionLog for offline analysis.
- Run for 2 weeks minimum to cover weekly traffic patterns. Gate on statistically significant CTR lift (p < 0.05).
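The significance gate on CTR lift can be checked with a standard two-proportion z-test. The sketch below is one common way to do it, not necessarily the test this system uses; the function name and the click counts are made up for illustration.

```python
import math
from typing import Tuple


def two_proportion_z_test(clicks_control: int, n_control: int,
                          clicks_treatment: int, n_treatment: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in CTR between two arms."""
    p_control = clicks_control / n_control
    p_treatment = clicks_treatment / n_treatment
    pooled = (clicks_control + clicks_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (p_treatment - p_control) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: control CTR 10.0%, expansion CTR 11.0%, 10,000 searches each
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
ship = z > 0 and p_value < 0.05
print(f"z={z:.2f}, p={p_value:.4f}, ship={ship}")
```

Zero-results rate and abandonment rate are also proportions, so the same test applies to them; for those metrics a negative z is the desired direction.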
Key Design Decisions Summary
- Directional edges prevent noise from one-way synonym expansion (TV → series but not reverse).
- Domain specificity resolves conflicts where the same term has different meanings across verticals.
- Query-time expansion (over index-time) allows synonym updates without re-indexing documents.
- Boost expansion preserves precision by ranking original-term matches above synonym matches.
- A/B testing is mandatory — synonym quality varies widely and must be measured, not assumed.