Vector Search Service Low-Level Design: Embedding Index, ANN Algorithms, and Hybrid Search

Use Cases

Vector search powers applications where semantic similarity matters more than keyword overlap: semantic search over documents, recommendation retrieval (find items similar to user history), image similarity (find visually similar products), and deduplication (detect near-duplicate records). All reduce to the same problem: given a query vector, find the K most similar vectors in a large corpus.

Embedding Generation

Inputs must be encoded into dense vectors before indexing:

  • Text: sentence-transformers (all-MiniLM-L6-v2: 384 dims, fast), OpenAI text-embedding-3-large (3072 dims, highest quality), or custom fine-tuned encoders
  • Images: CLIP encodes both images and text into the same embedding space — enabling cross-modal search
  • Multimodal: concatenate or project modality-specific vectors into a shared space

Typical embedding dimensionality: 384–1536 float32 values. A corpus of 100M vectors at 768 dims requires ~300GB of RAM for exact search — making approximate methods mandatory at scale.

ANN Algorithms

Approximate Nearest Neighbor (ANN) algorithms trade a small recall loss for dramatic speed improvements:

  • Brute force: compute cosine similarity to every vector — O(n*d). Exact but infeasible beyond a few million vectors.
  • HNSW (Hierarchical Navigable Small Worlds): graph-based index. O(log n) search. High recall (95%+ at reasonable ef settings). Best for high-recall requirements.
  • IVF (Inverted File Index): partition space into K clusters using k-means. At query time, search only the nprobe nearest clusters. Faster than HNSW for very large datasets with slightly lower recall.

HNSW Deep Dive

HNSW builds a multi-layer proximity graph:

  • Upper layers: sparse graph with long-range connections — fast coarse navigation
  • Lower layers: dense graph with short-range connections — precise fine search
  • Construction: each new vector inserted by finding neighbors via greedy search from the entry point, then bidirectional edge insertion. O(n log n) build time.
  • Search: start at entry point on the top layer, greedily traverse to the query's neighborhood, descend to the next layer, repeat until layer 0. Candidate set size controlled by ef parameter — higher ef = higher recall, higher latency.

Quantization

Product Quantization (PQ) compresses vectors to reduce memory and speed up distance computation:

  • Split a 768-dim vector into M subvectors (e.g., 8 subvectors of 96 dims each)
  • Each subvector quantized to one of 256 centroids — encoded as a single byte
  • 768 float32 (3072 bytes) → 8 uint8 (8 bytes): 384x compression
  • Distance computed in compressed space using lookup tables — much faster than float32 arithmetic
  • Recall loss: 2–5% vs exact search, acceptable for most applications

Pure vector search can miss lexically exact matches. Hybrid search combines:

  • Vector similarity score: cosine or dot product from ANN search
  • BM25 keyword score: term frequency / inverse document frequency from an inverted index (Elasticsearch/OpenSearch)

Fusion strategies: Reciprocal Rank Fusion (RRF) — combine rankings without score normalization: score = 1/(k + rank_vector) + 1/(k + rank_bm25). Alternatively, train a linear combination weight using click data.

Metadata Filtering

Real applications require filtering: “find similar products in category=shoes under price=100”. Two approaches:

  • Pre-filtering: apply filter to restrict the candidate set, then run ANN on the filtered subset. Accurate but requires filtered index or brute-force scan if filtered set is small.
  • Post-filtering: run ANN to get top-K results, then apply filter. Fast but may return fewer than K results if many candidates are filtered out.

Pinecone and Weaviate use ACORN (Approximate Closest match with filtering On Random Neighbors) to handle this efficiently.

Index Updates

HNSW does not support efficient deletes — removing a node breaks graph connectivity:

  • Soft delete: maintain a deletion bitmap, filter deleted IDs from results at query time
  • Periodic rebuild: rebuild index nightly from the current dataset, swap atomically
  • Incremental merge: maintain a small mutable index for new vectors, periodically merge into the main index

Faiss (Facebook AI Similarity Search) is the production-grade library implementing HNSW, IVF, and PQ — supports GPU acceleration for index building and search.

Distributed Index

For corpora too large for a single machine:

  • Shard vectors by ID range across multiple nodes — each node holds a partition of the index
  • Query fans out to all shards in parallel
  • Each shard returns its local top-K results
  • Coordinator merges results and returns global top-K

Replication (typically 2 replicas) provides fault tolerance and read throughput scaling. Milvus and Qdrant implement this distributed architecture out of the box.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture

Scroll to Top