Question 1

How does an HNSW index work, and why is it preferred for approximate nearest neighbor search?

Accepted Answer

Hierarchical Navigable Small World (HNSW) builds a multi-layer graph in which each node represents a vector. Every vector appears in the bottom layer, and each higher layer holds an exponentially smaller random subset, so the top layers are sparse graphs with long-range links while the lower layers are progressively denser. During search, the algorithm enters at the top layer, greedily navigates toward the query vector, then descends to the next layer using the current best node as the entry point. At the bottom layer it widens to a best-first search that keeps a candidate list of size ef_search, from which the k nearest results are returned. Insert and query costs both scale approximately as O(log n) in practice. HNSW is often preferred over IVF-based indexes because it does not require a training phase, supports incremental inserts without full index rebuilds, and tends to achieve better recall-latency tradeoffs at high recall targets (>95%). The tradeoff is higher memory usage: HNSW keeps the graph structure in RAM, typically 50–100 bytes per vector beyond the raw vector data, depending on the number of links per node (the M parameter).
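The layered descent described above can be sketched in a few lines of pure Python. This is a minimal illustration of the search path only, not a production index: the hand-built toy graph, the function names (search_layer, hnsw_search), and the use of Euclidean distance are all assumptions made for the example, and graph construction is omitted entirely.

```python
import heapq
import math


def search_layer(query, entry_points, ef, neighbors, vectors):
    """Best-first search inside one layer; returns up to ef (distance, id) pairs, nearest first."""
    visited = set(entry_points)
    # candidates: min-heap of nodes still to expand; results: max-heap (negated distance) capped at ef
    candidates = [(math.dist(query, vectors[e]), e) for e in entry_points]
    heapq.heapify(candidates)
    results = [(-d, e) for d, e in candidates]
    heapq.heapify(results)
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break  # nearest unexpanded candidate is already worse than the worst kept result
        for nb in neighbors.get(node, ()):
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = math.dist(query, vectors[nb])
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return sorted((-neg_d, n) for neg_d, n in results)


def hnsw_search(query, vectors, layers, entry_point, k, ef_search):
    """layers[0] is the dense bottom layer; layers[-1] is the sparse top layer."""
    entry = [entry_point]
    for layer in reversed(layers[1:]):
        # Upper layers: greedy descent with ef=1, keeping only the single best node.
        entry = [search_layer(query, entry, 1, layer, vectors)[0][1]]
    found = search_layer(query, entry, max(ef_search, k), layers[0], vectors)
    return [node for _, node in found[:k]]


# Toy index (assumed for illustration): 8 points on a line, three layers.
vectors = {i: (float(i), 0.0) for i in range(8)}
layers = [
    {i: [j for j in (i - 2, i - 1, i + 1, i + 2) if 0 <= j < 8] for i in range(8)},
    {0: [3], 3: [0, 6], 6: [3]},
    {0: [6], 6: [0]},
]

print(hnsw_search((5.2, 0.0), vectors, layers, entry_point=0, k=3, ef_search=4))
# prints [5, 6, 4]
```

In this toy run the top layer jumps from node 0 straight to node 6 over a long-range link, and only the bottom layer does the wider ef_search exploration, which is the behaviour the answer attributes to the layer hierarchy.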

Question 2

How do you design the storage and serving layer for a billion-scale embedding index?

Accepted Answer

Partition the corpus into shards, each holding ~50–100M vectors, sized so that the shard's HNSW graph and compressed vectors fit in a single machine's RAM. Route each query to all shards in parallel (scatter-gather), merge the per-shard top-k results by score, and return the global top-k. Compress the raw float32 vectors (128 dims × 4 bytes = 512 bytes) for the in-memory index: product quantization (PQ) reduces each vector to ~32–64 bytes, while scalar quantization (SQ8) stores 1 byte per dimension (128 bytes). Keep the full-precision vectors on SSD for re-ranking. Version the index: build new index snapshots offline, upload them to object storage, and hot-swap them on serving nodes with a blue-green deploy. A metadata store (e.g., Postgres) maps vector IDs to document IDs and enables filtering by structured attributes before or after ANN retrieval.
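A rough sketch of the sizing arithmetic and the scatter-gather merge follows. The per-shard search here is a stand-in (exact dot-product scoring over an in-memory dict) rather than a real ANN call, and the names shard_ram_bytes, search_shard, and scatter_gather are hypothetical. The 100 bytes of graph overhead per vector is taken from the upper end of the range quoted in Question 1.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def shard_ram_bytes(n_vectors, code_bytes, graph_bytes):
    """Approximate RAM for one shard: compressed vector plus graph links, per vector."""
    return n_vectors * (code_bytes + graph_bytes)


def search_shard(shard, query, k):
    """Stand-in for a per-shard ANN query: exact dot-product top-k as (score, id) pairs."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), vec_id)
        for vec_id, vec in shard.items()
    )
    return heapq.nlargest(k, scored)


def scatter_gather(shards, query, k):
    """Query every shard in parallel, then merge the partial top-k lists by score."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), shards))
    return heapq.nlargest(k, (hit for part in partials for hit in part))


# 100M vectors at 64 bytes (PQ code) + 100 bytes (graph) is about 16.4 GB per shard.
print(shard_ram_bytes(100_000_000, 64, 100) / 1e9)

shards = [
    {"a": (1.0, 0.0), "b": (0.0, 1.0)},
    {"c": (0.9, 0.1), "d": (-1.0, 0.0)},
]
print(scatter_gather(shards, (1.0, 0.0), k=2))
```

Each shard must return its own full top-k, because the global top-k could in the worst case come entirely from one shard.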

Question 3

How does hybrid search combine dense vector retrieval with sparse keyword (BM25) retrieval?

Accepted Answer

Run both retrievals in parallel: the ANN index returns the top-k dense candidates with cosine similarity scores; an inverted index (e.g., Elasticsearch or a custom BM25 engine) returns the top-k sparse candidates with BM25 scores. Merge them using Reciprocal Rank Fusion (RRF): for each document, compute RRF_score = Σ 1/(k + rank_i), where the sum runs over the result lists that contain the document, rank_i is the document's 1-based rank in list i, and k≈60 is a smoothing constant (unrelated to the top-k cutoff). RRF depends only on ranks, so it is agnostic to score scale and needs no normalization. Alternatively, train a linear or learned combiner on a relevance-labeled dataset to weight dense vs. sparse scores per query type. For latency, cap each retrieval at the top 100 candidates before merging, then re-rank the merged set with a cross-encoder model if the query latency budget allows. Hybrid search typically outperforms either modality alone, especially on keyword-heavy or out-of-distribution queries.
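The RRF formula above translates almost directly into code. The sketch below assumes each retriever hands back a list of document IDs ordered best-first; the function name rrf_fuse and the sample IDs are made up for illustration.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, top_n=10):
    """Fuse several best-first rankings with Reciprocal Rank Fusion.

    A document missing from a list simply contributes nothing for that list.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


dense_hits = ["d1", "d2", "d3"]   # from the ANN index
sparse_hits = ["d3", "d1", "d4"]  # from the BM25 engine

print(rrf_fuse([dense_hits, sparse_hits]))
# prints ['d1', 'd3', 'd2', 'd4']
```

Here d1 (ranks 1 and 2) edges out d3 (ranks 3 and 1), and both beat documents found by only one retriever, without either score scale being consulted.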

Question 4

How do you handle embedding index updates when source documents are edited or deleted?

Accepted Answer

HNSW does not support cheap true deletes: removed nodes are usually only marked as deleted, and these tombstones degrade recall over time. Handle this with two complementary mechanisms: (1) maintain a delete log of deleted vector IDs, checked at query time to filter tombstoned results out of the candidate list (a hash set is exact; a Bloom filter saves memory but its false positives will occasionally drop a live result); (2) schedule periodic index rebuilds (e.g., nightly) from the authoritative document store to eliminate tombstones and incorporate bulk edits. Treat an edit as a delete plus a re-insert: assign a new vector ID to the updated embedding and add the old ID to the delete log. For real-time freshness requirements, maintain a small in-memory 'delta index' (flat exact search over recent inserts) whose results are merged with the ANN results at query time; the delta index is folded into the main index during the next scheduled compaction.
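One way to picture the delete log and delta index working together at query time is the wrapper below. It is a sketch under several assumptions: the class name FreshIndex is hypothetical, the main index is represented by any callable returning a list of (distance, vector_id) pairs, distances are Euclidean, and the over-fetch of k plus the delete-log size is a simple worst-case choice rather than a tuned value.

```python
import heapq
import math


class FreshIndex:
    """Wraps a main ANN index with a delete log and a flat delta index."""

    def __init__(self, ann_search):
        self.ann_search = ann_search  # callable(query, k) -> list of (distance, vector_id)
        self.deleted = set()          # delete log: exact hash set of tombstoned IDs
        self.delta = {}               # delta index: vector_id -> vector for recent inserts

    def delete(self, vector_id):
        self.delta.pop(vector_id, None)
        self.deleted.add(vector_id)

    def upsert(self, old_id, new_id, vector):
        """An edit is a delete of the old ID plus an insert under a new ID."""
        if old_id is not None:
            self.delete(old_id)
        self.delta[new_id] = vector

    def search(self, query, k):
        # Over-fetch so that k live results remain after tombstones are filtered out.
        main_hits = self.ann_search(query, k + len(self.deleted))
        # The delta index is small, so exact (flat) search over it is affordable.
        delta_hits = [(math.dist(query, v), vid) for vid, v in self.delta.items()]
        live = [(d, vid) for d, vid in main_hits + delta_hits if vid not in self.deleted]
        return heapq.nsmallest(k, live)


# Stand-in for the main index: exact search over three stored vectors.
main_vectors = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (5.0, 0.0)}


def main_search(query, k):
    return heapq.nsmallest(k, ((math.dist(query, v), vid) for vid, v in main_vectors.items()))


index = FreshIndex(main_search)
index.upsert(old_id=2, new_id=4, vector=(0.5, 0.0))  # document behind vector 2 was edited
index.delete(1)                                      # document behind vector 1 was removed
print([vid for _, vid in index.search((0.0, 0.0), k=2)])
# prints [4, 3]
```

After a compaction, the delete log and delta index would both be cleared, since the rebuilt main index no longer contains the tombstoned vectors and already includes the recent inserts.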

Vector Search Service Low-Level Design: Embedding Index, ANN Algorithms, and Hybrid Search

Use Cases

Embedding Generation

ANN Algorithms

HNSW Deep Dive

Quantization

Hybrid Search

Metadata Filtering

Index Updates

Distributed Index