Vector Search Service Low-Level Design: Embedding Index, ANN Algorithms, and Hybrid Search

Use Cases

Vector search powers applications where semantic similarity matters more than keyword overlap: semantic search over documents, recommendation retrieval (find items similar to user history), image similarity (find visually similar products), and deduplication (detect near-duplicate records). All reduce to the same problem: given a query vector, find the K most similar vectors in a large corpus.

Embedding Generation

Inputs must be encoded into dense vectors before indexing:

  • Text: sentence-transformers (all-MiniLM-L6-v2: 384 dims, fast), OpenAI text-embedding-3-large (3072 dims, highest quality), or custom fine-tuned encoders
  • Images: CLIP encodes both images and text into the same embedding space — enabling cross-modal search
  • Multimodal: concatenate or project modality-specific vectors into a shared space

Typical embedding dimensionality: 384–1536 float32 values. A corpus of 100M vectors at 768 dims requires ~300GB of RAM for exact search — making approximate methods mandatory at scale.
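
The arithmetic behind that estimate is worth making explicit. A minimal sketch (the function name is illustrative):

```python
# Memory footprint of a flat (exact-search) float32 embedding index.
def index_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for num_vectors embeddings of the given dimensionality."""
    return num_vectors * dims * bytes_per_value

# 100M vectors x 768 dims x 4 bytes = 307.2 GB of RAM for exact search,
# before any index overhead (HNSW graph edges, IDs, metadata).
gb = index_bytes(100_000_000, 768) / 1e9
```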

ANN Algorithms

Approximate Nearest Neighbor (ANN) algorithms trade a small recall loss for dramatic speed improvements:

  • Brute force: compute cosine similarity to every vector — O(n*d). Exact but infeasible beyond a few million vectors.
  • HNSW (Hierarchical Navigable Small World): graph-based index. O(log n) search. High recall (95%+ at reasonable ef settings). Best for high-recall requirements.
  • IVF (Inverted File Index): partition space into K clusters using k-means. At query time, search only the nprobe nearest clusters. Faster than HNSW for very large datasets with slightly lower recall.
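
The brute-force baseline above is a few lines of NumPy. A sketch (corpus shapes and the perturbed-query demo are illustrative):

```python
import numpy as np

def brute_force_top_k(query: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """Exact top-k by cosine similarity: O(n*d) per query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                      # one dot product per corpus vector
    return np.argsort(-sims)[:k]      # indices of the k most similar vectors

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64)).astype(np.float32)
query = corpus[42] + 0.01 * rng.normal(size=64).astype(np.float32)
top = brute_force_top_k(query, corpus, k=5)
# Vector 42 ranks first, since the query is a small perturbation of it.
```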

HNSW Deep Dive

HNSW builds a multi-layer proximity graph:

  • Upper layers: sparse graph with long-range connections — fast coarse navigation
  • Lower layers: dense graph with short-range connections — precise fine search
  • Construction: each new vector inserted by finding neighbors via greedy search from the entry point, then bidirectional edge insertion. O(n log n) build time.
  • Search: start at entry point on the top layer, greedily traverse to the query's neighborhood, descend to the next layer, repeat until layer 0. Candidate set size controlled by ef parameter — higher ef = higher recall, higher latency.
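
The greedy traversal within one layer can be sketched in simplified form. This toy version (chain graph, 1-D points — all illustrative) keeps at most ef candidates; real HNSW runs this per layer, descending with the best node as the next layer's entry point:

```python
import heapq

def greedy_layer_search(graph, dist, query, entry, ef=3):
    """Simplified best-first search over one HNSW layer.

    graph: node -> list of neighbor nodes; dist: (query, node) -> float.
    Returns the ef closest nodes found, sorted by distance.
    """
    visited = {entry}
    candidates = [(dist(query, entry), entry)]   # min-heap by distance
    best = [(-dist(query, entry), entry)]        # max-heap of current top-ef
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0] and len(best) >= ef:
            break                                # no candidate can improve top-ef
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                nd = dist(query, nb)
                if len(best) < ef or nd < -best[0][0]:
                    heapq.heappush(candidates, (nd, nb))
                    heapq.heappush(best, (-nd, nb))
                    if len(best) > ef:
                        heapq.heappop(best)      # evict current worst
    return sorted((-d, n) for d, n in best)

# Toy layer: a chain of 1-D points 0.0 .. 4.0; distance is absolute difference.
points = {n: float(n) for n in range(5)}
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
found = greedy_layer_search(chain, lambda q, n: abs(q - points[n]), 3.9, entry=0, ef=2)
# The search walks the chain from node 0 toward the query at 3.9,
# ending with nodes 4 and 3 as the two closest candidates.
```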

Quantization

Product Quantization (PQ) compresses vectors to reduce memory and speed up distance computation:

  • Split a 768-dim vector into M subvectors (e.g., 8 subvectors of 96 dims each)
  • Each subvector quantized to one of 256 centroids — encoded as a single byte
  • 768 float32 (3072 bytes) → 8 uint8 (8 bytes): 384x compression
  • Distance computed in compressed space using lookup tables — much faster than float32 arithmetic
  • Recall loss: 2–5% vs exact search, acceptable for most applications
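
A toy PQ encoder makes the split/encode/lookup pipeline concrete. This sketch uses random codebooks purely for illustration; real PQ trains each codebook with k-means on a corpus sample:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 768, 8, 256                 # dims, subvectors, centroids per subspace
sub = D // M                          # 96 dims per subvector

# Illustrative codebooks: K centroids per subspace (k-means-trained in practice).
codebooks = rng.normal(size=(M, K, sub)).astype(np.float32)

def pq_encode(v: np.ndarray) -> np.ndarray:
    """Map each 96-dim subvector to the index of its nearest centroid."""
    codes = np.empty(M, dtype=np.uint8)
    for m in range(M):
        chunk = v[m * sub:(m + 1) * sub]
        codes[m] = np.argmin(np.linalg.norm(codebooks[m] - chunk, axis=1))
    return codes

def pq_distance(query: np.ndarray, codes: np.ndarray) -> float:
    """Asymmetric distance: per-subspace lookup tables, then M table lookups."""
    total = 0.0
    for m in range(M):
        chunk = query[m * sub:(m + 1) * sub]
        table = np.sum((codebooks[m] - chunk) ** 2, axis=1)  # K entries
        total += table[codes[m]]
    return total

v = rng.normal(size=D).astype(np.float32)
codes = pq_encode(v)                  # 8 bytes vs 3072 bytes: 384x smaller
```

In production the lookup tables are built once per query and reused across all compressed vectors, which is where the speedup over float32 arithmetic comes from.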

Hybrid Search

Pure vector search can miss lexically exact matches. Hybrid search combines:

  • Vector similarity score: cosine or dot product from ANN search
  • BM25 keyword score: term frequency / inverse document frequency from an inverted index (Elasticsearch/OpenSearch)

Fusion strategies: Reciprocal Rank Fusion (RRF) combines rankings without score normalization: score = 1/(k + rank_vector) + 1/(k + rank_bm25), where k is a smoothing constant (typically ~60). Alternatively, train a linear combination of the two scores using click data.
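
RRF is simple enough to sketch directly (function name and toy doc IDs are illustrative):

```python
def rrf_fuse(vector_ranked, bm25_ranked, k=60):
    """Reciprocal Rank Fusion over two ranked lists of doc IDs.

    Each list is ordered best-first; ranks are 1-based. Documents that
    appear in only one list still contribute their single RRF term.
    """
    scores = {}
    for ranked in (vector_ranked, bm25_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["d1", "d2", "d3"], ["d3", "d2", "d4"])
# d2 and d3 appear in both lists, so both outrank the single-list d1 and d4.
```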

Metadata Filtering

Real applications require filtering: “find similar products in category=shoes under price=100”. Two approaches:

  • Pre-filtering: apply filter to restrict the candidate set, then run ANN on the filtered subset. Accurate but requires filtered index or brute-force scan if filtered set is small.
  • Post-filtering: run ANN to get top-K results, then apply filter. Fast but may return fewer than K results if many candidates are filtered out.
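
Post-filtering and its shortfall behavior can be sketched with a toy catalog (the item schema, IDs, and predicate are all illustrative):

```python
# Toy catalog: each item has a metadata dict; ANN results are best-first IDs.
catalog = {
    1: {"category": "shoes", "price": 80},
    2: {"category": "bags",  "price": 60},
    3: {"category": "shoes", "price": 120},
    4: {"category": "shoes", "price": 95},
}
ann_top_k = [2, 3, 1, 4]              # pretend output of an ANN search

def post_filter(results, predicate, k):
    """Apply the metadata filter after ANN; may return fewer than k hits."""
    return [i for i in results if predicate(catalog[i])][:k]

def want(m):
    return m["category"] == "shoes" and m["price"] < 100

hits = post_filter(ann_top_k, want, k=3)
# Items 2 (wrong category) and 3 (too expensive) drop out, leaving only
# [1, 4] -- fewer than the requested k=3, the classic post-filter shortfall.
```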

Weaviate, for example, implements ACORN-style filtered traversal: the filter predicate is evaluated during HNSW graph search, and the traversal expands through neighbors that fail the filter rather than stopping at them, so the graph stays navigable even under selective filters.

Index Updates

HNSW does not support efficient deletes — removing a node breaks graph connectivity:

  • Soft delete: maintain a deletion bitmap, filter deleted IDs from results at query time
  • Periodic rebuild: rebuild index nightly from the current dataset, swap atomically
  • Incremental merge: maintain a small mutable index for new vectors, periodically merge into the main index
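
The soft-delete approach can be sketched as a thin wrapper around any ANN search function (the `search_fn` interface here is hypothetical, not a specific library's API):

```python
class SoftDeleteIndex:
    """Wrap an ANN index with a tombstone set checked at query time.

    search_fn(query, k) stands in for any ANN search that returns
    vector IDs best-first.
    """
    def __init__(self, search_fn):
        self.search_fn = search_fn
        self.tombstones = set()

    def delete(self, vec_id):
        self.tombstones.add(vec_id)   # O(1); the graph is left untouched

    def search(self, query, k, overfetch=2):
        # Over-fetch to compensate for hits that will be filtered out.
        raw = self.search_fn(query, k * overfetch)
        live = [i for i in raw if i not in self.tombstones]
        return live[:k]

idx = SoftDeleteIndex(lambda q, k: list(range(k)))  # fake ANN: IDs 0..k-1
idx.delete(1)
idx.delete(3)
# idx.search(None, 3) over-fetches 6 IDs, drops the two tombstones,
# and still returns 3 live results.
```

The over-fetch factor is a tuning knob: too low and heavily-deleted indexes return short result lists; too high and query latency grows. Periodic rebuilds reset the tombstone count.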

Faiss (Facebook AI Similarity Search) is a widely used production-grade library implementing HNSW, IVF, and PQ, with GPU acceleration for index building and search.

Distributed Index

For corpora too large for a single machine:

  • Shard vectors by ID range across multiple nodes — each node holds a partition of the index
  • Query fans out to all shards in parallel
  • Each shard returns its local top-K results
  • Coordinator merges results and returns global top-K

Replication (typically 2 replicas) provides fault tolerance and read throughput scaling. Milvus and Qdrant implement this distributed architecture out of the box.
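
The coordinator's scatter-gather merge can be sketched with `heapq.merge` (shard contents and scores are illustrative):

```python
import heapq

def merge_shard_results(shard_results, k):
    """Merge per-shard top-k lists of (score, doc_id) into a global top-k.

    Each shard list is assumed sorted by descending score; negating scores
    lets heapq.merge combine the sorted streams in one pass.
    """
    merged = heapq.merge(
        *[[(-score, doc) for score, doc in shard] for shard in shard_results]
    )
    return [(-neg, doc) for neg, doc in list(merged)[:k]]

shards = [
    [(0.95, "a"), (0.70, "b")],       # shard 1 local top-2
    [(0.90, "c"), (0.85, "d")],       # shard 2 local top-2
]
top = merge_shard_results(shards, k=3)
```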

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does an HNSW index work, and why is it preferred for approximate nearest neighbor search?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Hierarchical Navigable Small World (HNSW) builds a multi-layer graph where each node represents a vector. The top layers are sparse long-range graphs; lower layers are progressively denser. During search, the algorithm enters at the top layer, greedily navigates toward the query vector, then descends to the next layer using the current best node as the entry point, repeating until the bottom layer yields the ef_search nearest candidates. Insert complexity is O(log n) and query is approximately O(log n) in practice. HNSW is preferred over IVF-based indexes because it does not require a training phase, supports incremental inserts without full index rebuilds, and achieves better recall-latency tradeoffs at high recall targets (>95%). The tradeoff is higher memory usage: HNSW stores the graph structure in RAM, typically 50–100 bytes per vector beyond the raw vector data."
      }
    },
    {
      "@type": "Question",
      "name": "How do you design the storage and serving layer for a billion-scale embedding index?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Partition the corpus into shards, each holding ~50–100M vectors, sized to fit the HNSW graph for that shard in a single machine's RAM. Route queries to all shards in parallel (scatter-gather), merge per-shard top-k results by score, and return the global top-k. Use product quantization (PQ) or scalar quantization (SQ8) to compress raw float32 vectors (128 dims × 4 bytes = 512 bytes) to ~32–64 bytes for in-memory indexes while storing full-precision vectors on SSD for re-ranking. Version the index: build new index snapshots offline, upload to object storage, and hot-swap on serving nodes with a blue-green deploy. A metadata store (e.g., Postgres) maps vector IDs to document IDs and enables filtering by structured attributes before or after ANN retrieval."
      }
    },
    {
      "@type": "Question",
      "name": "How does hybrid search combine dense vector retrieval with sparse keyword (BM25) retrieval?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Run both retrievals in parallel: the ANN index returns top-k dense candidates with cosine similarity scores; an inverted index (e.g., Elasticsearch or a custom BM25 engine) returns top-k sparse candidates with BM25 scores. Merge using Reciprocal Rank Fusion (RRF): for each document, compute RRF_score = Σ 1/(k + rank_i) where k≈60 and rank_i is the document's rank in each result list. RRF is score-scale-agnostic — no normalization needed. Alternatively, train a linear or learned combiner on a relevance-labeled dataset to weight dense vs. sparse scores per query type. For latency, cap each retrieval to top-100 candidates before merging, then re-rank the merged set with a cross-encoder model if query latency budget allows. Hybrid consistently outperforms either modality alone on keyword-heavy or out-of-distribution queries."
      }
    },
    {
      "@type": "Question",
      "name": "How do you handle embedding index updates when source documents are edited or deleted?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "HNSW does not support true deletes — removed nodes leave tombstones that degrade recall over time. Handle this with a two-phase approach: (1) maintain a delete log (a bloom filter or hash set of deleted vector IDs) checked at query time to filter tombstoned results from the candidate list; (2) schedule periodic index rebuilds (e.g., nightly) from the authoritative document store to eliminate tombstones and incorporate bulk edits. For edits, treat them as delete + re-insert: assign a new vector ID to the updated embedding and add the old ID to the delete log. For real-time freshness requirements, maintain a small in-memory 'delta index' (flat exact search over recent inserts) merged with ANN results at query time; the delta index is rebuilt into the main index during the next scheduled compaction."
      }
    }
  ]
}

