Vector databases power semantic search, recommendation systems, and RAG pipelines by storing and querying high-dimensional embedding vectors. Understanding how vector search works internally — from embedding generation to approximate nearest neighbor algorithms — is essential for ML engineering interviews. This guide covers the algorithms, tools, and architecture of vector search at scale.
Embeddings and Similarity
An embedding is a dense vector representation of data (text, images, audio) in a continuous vector space where semantically similar items are close together. Text embeddings: models like OpenAI text-embedding-3-small, Cohere embed-v3, or open-source E5/BGE encode text into 384- to 1536-dimensional vectors. “How do I reset my password?” and “I forgot my login credentials” have high cosine similarity (~0.9) despite sharing no words. Image embeddings: CLIP, ResNet, or ViT encode images into vectors where visually similar images are close. Similarity metrics: (1) Cosine similarity — the cosine of the angle between vectors. Range [-1, 1]. Most common for text. (2) Euclidean distance (L2) — straight-line distance. Lower = more similar. Common for image embeddings. (3) Inner product (dot product) — fast to compute. Equivalent to cosine similarity when vectors are normalized. The choice of metric depends on how the embedding model was trained: most text models use cosine similarity, so check the model documentation.
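As a quick illustration, here is cosine similarity and its equivalence to the inner product on normalized vectors, sketched in NumPy (the toy 4-dimensional vectors stand in for real model embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; range [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for real model output.
q = np.array([0.9, 0.1, 0.0, 0.4])
d = np.array([0.8, 0.2, 0.1, 0.5])

sim = cosine_similarity(q, d)

# For unit-normalized vectors, the inner product equals cosine similarity,
# which is why normalized indexes can use the cheaper dot product.
qn, dn = q / np.linalg.norm(q), d / np.linalg.norm(d)
assert abs(float(np.dot(qn, dn)) - sim) < 1e-12
```

This is why many vector databases ask you to normalize embeddings at insert time: the index can then use the dot product internally while behaving exactly like cosine similarity.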
Exact vs Approximate Nearest Neighbor
Given a query vector, find the K most similar vectors in the database. Exact nearest neighbor (brute force): compute similarity between the query and every vector in the database. Return the top-K. Time: O(N * D), where N is the number of vectors and D the dimensionality. For 10 million vectors at 768 dimensions: 10M * 768 ≈ 7.7 billion multiply-accumulates per query. With SIMD optimization: ~100ms on a single CPU core. Acceptable for small datasets but too slow for billions of vectors. Approximate nearest neighbor (ANN): trade accuracy for speed. Return vectors that are approximately the K nearest (may miss some true nearest neighbors). Recall@K: the fraction of the true top-K that the ANN algorithm finds. Typical target: 95-99% recall. ANN algorithms achieve 10-100x speedup over brute force with 95%+ recall. This tradeoff is acceptable for: search (showing 9 of the true top 10 results is fine), recommendations (missing one good recommendation among 10 is unnoticeable), and RAG (retrieving 4 of the 5 most relevant chunks is sufficient). For applications requiring exact results (deduplication, legal discovery): use brute force or exact methods.
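Brute-force top-K search and the Recall@K measure can both be sketched in a few lines of NumPy (toy random vectors, not real embeddings):

```python
import numpy as np

def exact_top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Brute-force nearest neighbors by inner product: O(N * D)."""
    scores = vectors @ query              # one score per database vector
    return np.argsort(-scores)[:k]        # indices of the k highest scores

def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the true top-K that an approximate method found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 64))        # 10k vectors, 64 dims (toy scale)
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = rng.normal(size=64)
q /= np.linalg.norm(q)

truth = exact_top_k(q, db, k=10)
# An ANN index that returned 9 of the true top 10 would score 0.9 here.
print(recall_at_k(truth, truth))  # 1.0 (exact search against itself)
```

The same `recall_at_k` comparison against brute force is how you would periodically validate an ANN index in production.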
ANN Algorithms: HNSW and IVF
HNSW (Hierarchical Navigable Small World): builds a multi-layer graph where each node is a vector. Upper layers have long-range connections (coarse navigation). Lower layers have short-range connections (fine-grained search). Query: start at the top layer, greedily navigate to the nearest node, drop to the next layer, repeat until the bottom layer. At the bottom: explore neighbors to find the K nearest. Time: O(log N) per query. Memory: O(N * M) where M is the average number of connections per node (typically 16-64). HNSW excels at: high recall (99%+), fast query (<1ms for 1M vectors), and dynamic inserts (add vectors without rebuilding). Downside: high memory usage (the graph is in memory). IVF (Inverted File Index): partition vectors into C clusters using k-means. Each cluster has a centroid. Query: compute distance to all centroids, select the top nprobe clusters (typically 10-50), search only vectors within those clusters. Time: O((C + nprobe * N/C) * D). Much faster than brute force when nprobe << C. Memory: lower than HNSW (no graph). Downside: lower recall for vectors near cluster boundaries. IVF-PQ (Product Quantization): compress vectors using PQ (split each vector into subvectors, quantize each to a codebook entry). Reduces memory by 10-50x. Used by FAISS for billion-scale search. Choice: HNSW for highest recall and fastest queries (if memory allows). IVF-PQ for billion-scale with memory constraints.
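The IVF idea can be sketched in plain NumPy (a toy index, not a substitute for FAISS; the cluster counts and data sizes are chosen only for illustration):

```python
import numpy as np

def build_ivf(vectors, n_clusters, iters=5, seed=0):
    """Toy IVF index: k-means centroids plus inverted lists (cluster -> vector ids)."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
    for _ in range(iters):  # a few Lloyd iterations of k-means
        assign = ((vectors[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = ((vectors[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    inverted_lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}
    return centroids, inverted_lists

def ivf_search(query, vectors, centroids, inverted_lists, k, nprobe):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    nearest_clusters = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in nearest_clusters])
    dists = ((vectors[candidates] - query) ** 2).sum(-1)
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(0)
db = rng.normal(size=(2_000, 16))     # 2k vectors, 16 dims (toy scale)
centroids, inverted_lists = build_ivf(db, n_clusters=8)
q = rng.normal(size=16)
ids = ivf_search(q, db, centroids, inverted_lists, k=5, nprobe=3)
```

With nprobe=3 of 8 clusters, only about 3/8 of the vectors are scanned, which is where the speedup comes from; vectors near cluster boundaries can land in an unprobed cluster, which is exactly the recall loss described above.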
Vector Database Options
Purpose-built vector databases: (1) Pinecone — fully managed. Upload vectors, query. Handles scaling, replication, and index optimization automatically. Supports metadata filtering (filter by category THEN vector search). Best for: teams wanting zero infrastructure management. (2) Weaviate — open-source, supports hybrid search (vector + BM25 keyword). Built-in vectorization (auto-embed text on insert). GraphQL API. (3) Qdrant — open-source, Rust-based (high performance). Supports filtering, multi-tenancy, and quantization. REST and gRPC APIs. (4) Milvus — open-source, designed for billion-scale. Supports multiple index types (HNSW, IVF, DiskANN). Distributed architecture with separate storage and compute. Vector search in existing databases: (5) pgvector — PostgreSQL extension. Add a vector column, create an HNSW or IVF index, query with ORDER BY embedding <=> query_embedding LIMIT K (<=> is cosine distance; <-> is L2). Best for: teams already using PostgreSQL who want to avoid a separate system. Handles millions of vectors well. (6) Elasticsearch kNN — dense vector fields with HNSW index. Combine vector search with traditional text search in one query. Best for: existing Elasticsearch users needing hybrid search. (7) Redis Vector Similarity — Redis Stack adds vector indexing (HNSW or FLAT). Best for: existing Redis users with moderate vector counts. For RAG with < 1M documents: pgvector is simplest (no new infrastructure). For 1M-100M: Pinecone or Qdrant. For 100M+: Milvus or custom FAISS deployment.
Production Vector Search Architecture
A production RAG/search system: (1) Embedding pipeline — documents are chunked, embedded (batch processing with the embedding model), and upserted to the vector database. For initial load: batch embed all documents. For updates: embed new/modified documents and upsert (the vector DB handles index updates). (2) Query path — user query is embedded (single inference call, ~10ms), then ANN search in the vector database (~5ms for HNSW on 1M vectors), metadata filters applied (category, date range, access control), and top-K results returned with similarity scores. (3) Re-ranking — the initial ANN retrieval uses the embedding model (bi-encoder: independent encoding of query and document). A cross-encoder re-ranker (processes query and document together) produces more accurate relevance scores but is slower (50ms per document). Apply re-ranking to only the top-K candidates (K=20-50), return the top-N (N=5-10). This two-stage retrieval (fast ANN + accurate re-rank) balances speed and quality. Monitoring: track recall (compare ANN results with brute force periodically), latency (P50, P99 of query time), and index health (vector count, segment count, memory usage). Alert on: recall drop (index may need rebuilding), latency spike (index too large for memory, spilling to disk), and embedding model drift (new documents produce different distributions than training data).
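The two-stage retrieval pattern looks roughly like this in Python; bi_encoder_scores and cross_encoder_score are hypothetical stand-ins for real models, and in stage 1 a production system would query an ANN index rather than scoring every document:

```python
import numpy as np

def bi_encoder_scores(query_emb, doc_embs):
    # Stage-1 scoring: cheap, one dot product per precomputed doc embedding.
    return doc_embs @ query_emb

def cross_encoder_score(query_emb, doc_emb):
    # Stand-in for a real cross-encoder (e.g. a BERT pair classifier);
    # a deterministic toy function so the sketch runs end to end.
    return float(np.dot(query_emb, doc_emb) - 0.1 * np.abs(query_emb - doc_emb).sum())

def two_stage_retrieve(query_emb, doc_embs, k_ann=20, n_final=5):
    # Stage 1: fast retrieval over all docs (ANN index in production).
    top_k = np.argsort(-bi_encoder_scores(query_emb, doc_embs))[:k_ann]
    # Stage 2: expensive re-ranking applied only to the k_ann candidates.
    reranked = sorted(top_k, key=lambda i: -cross_encoder_score(query_emb, doc_embs[i]))
    return reranked[:n_final]

rng = np.random.default_rng(1)
docs = rng.normal(size=(1_000, 32))   # toy corpus of 1k document embeddings
query = rng.normal(size=32)
final_ids = two_stage_retrieve(query, docs)   # ids of the final top-5 docs
```

The design point is the cost asymmetry: the cross-encoder runs k_ann times per query regardless of corpus size, so its latency stays fixed while the bi-encoder/ANN stage absorbs the scale.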