Low Level Design: Semantic Search Service

What Is a Semantic Search Service?

A Semantic Search Service finds documents that are conceptually relevant to a query even when they share no keywords. Instead of matching on term overlap, it encodes both queries and documents into dense vector embeddings and retrieves the nearest neighbors in embedding space. This approach captures paraphrase relationships, synonym equivalence, and cross-lingual similarity that keyword search misses entirely.

Data Model and Schema

The service maintains an embedding store alongside the traditional inverted index.

-- Document embedding table (stored in a vector DB such as Pinecone, Weaviate, or pgvector)
CREATE TABLE doc_embeddings (
    doc_id      BIGINT PRIMARY KEY,
    embedding   VECTOR(768),       -- dense float32 vector
    model_ver   VARCHAR(32),       -- embedding model version tag
    indexed_at  TIMESTAMP
);

-- Approximate Nearest Neighbor index metadata
CREATE TABLE ann_indexes (
    index_id    SERIAL PRIMARY KEY,
    model_ver   VARCHAR(32),
    status      VARCHAR(16),       -- 'building' | 'active' | 'deprecated'
    shard_count INT,
    created_at  TIMESTAMP
);

Embeddings are produced offline by a bi-encoder model (e.g., sentence-transformers/all-mpnet-base-v2) and stored in a vector database that supports fast ANN queries. The query embedding is computed online at request time by the same encoder.
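The offline pipeline can be sketched as a batched encode loop that emits rows for the doc_embeddings table. This is a minimal sketch with a stub encoder standing in for the real bi-encoder (a production job would call a sentence-transformers model and emit 768-dim float32 vectors); the function names and the 4-dim stub vectors are illustrative, not part of the design.

```python
import hashlib

MODEL_VER = "all-mpnet-base-v2@1"  # version tag written to the model_ver column

def encode_batch(texts):
    """Stub encoder: deterministic 4-dim pseudo-embeddings.
    In production this would be a bi-encoder forward pass (e.g. a
    sentence-transformers model) returning 768-dim float32 vectors."""
    out = []
    for t in texts:
        h = hashlib.sha256(t.encode()).digest()
        out.append([b / 255.0 for b in h[:4]])
    return out

def embed_corpus(docs, batch_size=2):
    """Yield (doc_id, embedding, model_ver) rows for doc_embeddings."""
    ids = sorted(docs)
    for i in range(0, len(ids), batch_size):
        chunk = ids[i:i + batch_size]
        vecs = encode_batch([docs[d] for d in chunk])
        for doc_id, vec in zip(chunk, vecs):
            yield doc_id, vec, MODEL_VER

docs = {1: "vector search", 2: "keyword search", 3: "hybrid retrieval"}
rows = list(embed_corpus(docs))
```

Batching matters because encoder throughput is dominated by per-call overhead; Spark or Ray would parallelize this loop across partitions.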

Core Algorithm and Workflow

Step 1 — Query encoding: The raw (or QUS-enriched) query is passed through the bi-encoder. The encoder is a Transformer whose token outputs are pooled (mean pooling for mpnet-style models; some encoders use the [CLS] token) and then L2-normalized to produce a unit vector. Inference runs on CPU in ~5 ms for a typical query length.
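The normalization step can be shown in isolation (with sentence-transformers, `model.encode(query, normalize_embeddings=True)` does this in one call); the 2-dim vector below is purely illustrative. Unit-length vectors make cosine similarity equal to a plain dot product, which is what the ANN index computes.

```python
import math

def l2_normalize(vec):
    """Scale a raw encoder output to unit length so that cosine
    similarity reduces to a dot product in the ANN index."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

q = l2_normalize([3.0, 4.0])  # unit vector along the same direction
```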

Step 2 — Approximate Nearest Neighbor (ANN) retrieval: The query vector is sent to the vector database, which uses an HNSW (Hierarchical Navigable Small World) index to retrieve the top-K most similar document vectors in sub-10 ms for corpora up to hundreds of millions of documents. HNSW trades a small recall loss (typically 95–99% recall@100) for a 100–1000x speedup over exact search.
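For intuition, here is the exact top-K search that HNSW approximates: a dot-product scan over unit vectors. This brute-force sketch is O(N) and only viable for tiny corpora; hnswlib or FAISS replace it with a graph traversal at sub-linear cost. The toy index and doc IDs are made up for illustration.

```python
import heapq

def top_k_exact(query, index, k=2):
    """Exact nearest-neighbor search by dot product over unit vectors.
    HNSW returns (approximately) this same ranking without scanning
    every vector."""
    scored = ((sum(q * x for q, x in zip(query, vec)), doc_id)
              for doc_id, vec in index.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

index = {
    1: [1.0, 0.0],
    2: [0.0, 1.0],
    3: [0.6, 0.8],
}
hits = top_k_exact([0.6, 0.8], index, k=2)  # doc 3 is an exact match
```

Recall@K for an HNSW index is measured against exactly this brute-force ranking.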

Step 3 — Hybrid fusion: Semantic results are merged with BM25 keyword results using Reciprocal Rank Fusion (RRF). RRF combines ranked lists without requiring score normalization:

RRF_score(doc) = sum over lists of 1 / (k + rank_in_list)

k = 60 is a standard default. This hybrid approach consistently outperforms either method alone.

Step 4 — Re-ranking: The fused candidate set (top 100–200 docs) is optionally passed to a cross-encoder re-ranker that attends jointly over (query, document) pairs for higher precision. Cross-encoders are slower (they cannot pre-compute document representations) but more accurate, making them suitable for the final re-ranking stage over a small candidate set.
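The re-ranking stage reduces to sorting candidates by a pair scorer that sees query and document together. The sketch below uses a token-overlap stub in place of a real cross-encoder (a deployment would call something like sentence-transformers' `CrossEncoder(...).predict` on the pairs); the scorer and example strings are illustrative only.

```python
def rerank(query, candidates, score_pair):
    """Re-rank a small fused candidate set with a cross-encoder-style
    scorer that attends over the (query, document) pair jointly."""
    return sorted(candidates,
                  key=lambda doc: score_pair(query, doc), reverse=True)

def overlap_score(query, doc):
    """Stub pair scorer (token overlap) standing in for a cross-encoder."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

ranked = rerank("fast vector search",
                ["keyword search basics", "fast approximate vector search"],
                overlap_score)
```

Because the scorer runs once per (query, candidate) pair, keeping the candidate set to 100–200 documents is what makes this stage affordable.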

Step 5 — Response assembly: Top-N documents with their scores and embedding metadata are returned to the caller.

Failure Handling and Latency

  • Vector DB timeout: If the ANN query exceeds 15 ms, fall back to BM25-only results. The service must never block on a slow vector store.
  • Model cold start: Pre-load the bi-encoder into shared memory at pod startup. Use a readiness probe that rejects traffic until the model is warm.
  • Index staleness: New documents are indexed asynchronously. Maintain a small real-time index for documents ingested in the last 60 seconds and merge results at query time.
  • Shard failure: Partition the HNSW index across shards. If one shard is unavailable, return results from the remaining shards with a degraded-recall flag in the response metadata.
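The timeout fallback in the first bullet can be sketched by racing the ANN call against its latency budget; the function names, the 15 ms constant, and the deliberately slow stub are illustrative, and a real service would use its RPC framework's deadline support rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout
import time

ANN_TIMEOUT_S = 0.015  # 15 ms budget for the vector store

def search_with_fallback(ann_search, bm25_search, query, pool):
    """Race the ANN query against its latency budget; on timeout,
    serve BM25-only results instead of blocking the request."""
    future = pool.submit(ann_search, query)
    try:
        return future.result(timeout=ANN_TIMEOUT_S), False
    except FuturesTimeout:
        future.cancel()
        return bm25_search(query), True  # degraded: keyword-only

def slow_ann(query):
    time.sleep(0.1)  # simulate a stalled vector store
    return ["sem1"]

with ThreadPoolExecutor(max_workers=1) as pool:
    results, degraded = search_with_fallback(
        slow_ann, lambda q: ["kw1"], "query", pool)
```

The boolean degraded flag mirrors the degraded-recall metadata mentioned for shard failures.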

Scalability Considerations

  • Offline embedding pipeline: Use a Spark or Ray batch job to encode all documents. For a 100 M document corpus with 768-dim embeddings, storage is ~300 GB (float32). Quantize to int8 to cut this to ~75 GB with minimal recall loss.
  • HNSW index build: Build indexes offline and swap atomically (blue-green). Index build for 100 M docs takes hours; serving must not be disrupted during rebuilds.
  • Query encoder scaling: The query encoder is stateless and CPU-bound for short queries. One pod handles ~200 QPS on a modern CPU core. Scale horizontally.
  • Model versioning: When the embedding model changes, all document embeddings must be recomputed. Use the model_ver column to serve queries against the correct index version during rollover.
  • Multilingual support: Use a multilingual bi-encoder (e.g., paraphrase-multilingual-mpnet-base-v2) to serve cross-lingual queries without per-language index shards.
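The int8 quantization mentioned above can be sketched as symmetric scalar quantization with a per-vector scale; vector databases typically do this (or product quantization) internally, so this is intuition rather than something the service would hand-roll. The example vector is arbitrary.

```python
def quantize_int8(vec):
    """Symmetric scalar quantization: map floats to [-127, 127] with a
    per-vector scale, cutting storage 4x (float32 -> int8)."""
    scale = max(abs(x) for x in vec) / 127.0 or 1.0  # avoid zero scale
    return [round(x / scale) for x in vec], scale

def dequantize(q, scale):
    """Recover approximate floats for similarity computation."""
    return [x * scale for x in q]

vec = [0.5, -0.25, 0.125]
q, scale = quantize_int8(vec)
approx = dequantize(q, scale)   # close to vec, within one scale step
```

The numbers in the bullet follow directly: 100 M x 768 dims x 4 bytes is ~307 GB in float32, and ~77 GB once each component is one byte.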

Summary

A Semantic Search Service extends keyword search with dense vector retrieval using bi-encoders and HNSW ANN indexes. Hybrid fusion with BM25 via RRF combines lexical precision with semantic recall. A cross-encoder re-ranker on a small candidate set pushes precision higher at acceptable latency cost. Reliability comes from graceful fallback to keyword search; scalability comes from offline embedding pipelines, quantized HNSW indexes, and stateless query encoder pods.


