A vector database stores high-dimensional vector embeddings and enables fast approximate nearest neighbor (ANN) search — finding the K vectors most similar to a query vector. This is the core infrastructure for semantic search, recommendation systems, RAG (retrieval-augmented generation) for LLMs, image similarity, and anomaly detection. Systems like Pinecone, Weaviate, Qdrant, Milvus, and pgvector (PostgreSQL extension) implement vector search at scale.
Vector Embeddings and Similarity Metrics
An embedding is a dense float vector (e.g., 768 or 1536 dimensions) produced by a neural network that encodes semantic meaning. Two similar items (related documents, visually similar images) produce vectors that lie close together under the index's similarity metric. Three metrics are common:
- Cosine similarity: measures the angle between vectors, range [-1, 1]. Preferred for text embeddings.
- Dot product: equals the product of the two magnitudes times cosine similarity. Used when magnitude encodes relevance (e.g., ColBERT).
- Euclidean distance (L2): straight-line geometric distance. Common for image embeddings.
Normalize vectors to unit length to make cosine similarity and dot product equivalent (and L2 rank-equivalent to both, since for unit vectors ||a-b||^2 = 2 - 2cos(a,b)).
HNSW: Hierarchical Navigable Small World
HNSW is the dominant ANN index algorithm. It builds a multi-layer graph where each node connects to its nearest neighbors. Higher layers have fewer nodes and longer-range connections (like motorways); lower layers are dense with short-range connections. Search starts at the top layer, greedily moves to neighbors closer to the query, drops to the next layer down, and repeats until reaching layer 0, where the final k nearest neighbors are collected. A well-tuned HNSW index achieves recall above 99% at speeds orders of magnitude faster than a brute-force linear scan, with roughly O(log n) search complexity and O(log n) amortized insertion cost per node.
// HNSW query pseudocode
func search(query []float32, k int) []Result {
	// Start from the entry point at the top layer
	ep := hnsw.entryPoint
	for layer := hnsw.maxLayer; layer > 0; layer-- {
		// Greedy search: find the closest node at this layer,
		// then use it as the entry point for the layer below
		ep = greedySearch(query, ep, layer)
	}
	// At layer 0: beam search with efSearch candidates to find the k nearest
	candidates := beamSearch(query, ep, 0, efSearch)
	return topK(candidates, k)
}
IVF: Inverted File Index for Sharded Search
IVF (Inverted File Index) first clusters vectors into C centroids using k-means, then assigns each vector to its nearest centroid's inverted list. At query time, find the nprobe centroids closest to the query vector and search only their lists, not the full dataset. With 1M vectors and C=1000 clusters, probing nprobe=10 clusters scans ~10,000 vectors instead of 1M — a ~100x speedup, at some cost in recall, since the true nearest neighbor may sit in an unprobed cluster. IVF is often combined with product quantization (PQ) for compression: split each vector into M sub-vectors and quantize each to a 1-byte codebook entry, shrinking a 768-dimension float32 vector from 3,072 bytes (768×4) to M bytes.
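The probe-then-scan logic is compact. A toy sketch in Go, assuming the k-means step has already produced the centroids and inverted lists (type and method names are illustrative, and a full-precision scan stands in for PQ-compressed distance computation):

```go
package main

import (
	"fmt"
	"sort"
)

// l2sq returns squared Euclidean distance (ordering-equivalent to L2).
func l2sq(a, b []float32) float64 {
	var s float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		s += d * d
	}
	return s
}

// IVFIndex holds k-means centroids and, per centroid, the vectors
// assigned to it (the "inverted lists").
type IVFIndex struct {
	Centroids [][]float32
	Lists     [][][]float32 // Lists[c] = vectors assigned to centroid c
}

// Search probes the nprobe closest clusters and scans only their lists.
func (idx *IVFIndex) Search(query []float32, k, nprobe int) [][]float32 {
	// Rank centroids by distance to the query.
	order := make([]int, len(idx.Centroids))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(i, j int) bool {
		return l2sq(query, idx.Centroids[order[i]]) < l2sq(query, idx.Centroids[order[j]])
	})
	if nprobe > len(order) {
		nprobe = len(order)
	}
	// Gather candidates from only the nprobe nearest clusters.
	var cands [][]float32
	for _, c := range order[:nprobe] {
		cands = append(cands, idx.Lists[c]...)
	}
	sort.Slice(cands, func(i, j int) bool {
		return l2sq(query, cands[i]) < l2sq(query, cands[j])
	})
	if len(cands) > k {
		cands = cands[:k]
	}
	return cands
}

func main() {
	idx := &IVFIndex{
		Centroids: [][]float32{{0, 0}, {10, 10}},
		Lists: [][][]float32{
			{{0, 1}, {1, 0}},
			{{9, 10}, {10, 9}},
		},
	}
	// The query lands near centroid 1, so only that list is scanned.
	fmt.Println(idx.Search([]float32{9.5, 9.8}, 1, 1)) // → [[9 10]]
}
```

Raising nprobe trades speed for recall, exactly mirroring efSearch in HNSW.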
Metadata Filtering
Production vector search requires filtered queries (e.g., find the 10 most similar product images that are in stock and priced under $100). Two basic approaches:
- Pre-filter: apply the metadata filter first to get a candidate set, then run ANN (or exact search) over that subset. Accurate, but slow if the filter matches a large fraction of the data.
- Post-filter: run ANN to get the top-K, then apply the metadata filter. Fast, but may return fewer than K results if many hits are filtered out.
Best practice: maintain a bitmap index on metadata fields and use it to restrict the graph traversal during HNSW search (graph-filtered search), so only allowed vectors enter the result set.
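The pre- vs post-filter trade-off is easy to demonstrate. A toy sketch in Go, where an exact linear scan stands in for the ANN index and all names are our own:

```go
package main

import (
	"fmt"
	"sort"
)

// Item is a vector plus filterable metadata.
type Item struct {
	ID      int
	Vec     []float32
	InStock bool
	Price   float64
}

func l2sq(a, b []float32) float64 {
	var s float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		s += d * d
	}
	return s
}

// topK is a stand-in for the ANN index: exact k-nearest over items.
func topK(items []Item, q []float32, k int) []Item {
	out := append([]Item(nil), items...)
	sort.Slice(out, func(i, j int) bool { return l2sq(q, out[i].Vec) < l2sq(q, out[j].Vec) })
	if len(out) > k {
		out = out[:k]
	}
	return out
}

// preFilterSearch restricts the candidate set first, then searches it:
// always k results if enough items match the predicate.
func preFilterSearch(items []Item, q []float32, k int, pred func(Item) bool) []Item {
	var matching []Item
	for _, it := range items {
		if pred(it) {
			matching = append(matching, it)
		}
	}
	return topK(matching, q, k)
}

// postFilterSearch searches first, filters after: may come up short of k.
func postFilterSearch(items []Item, q []float32, k int, pred func(Item) bool) []Item {
	var out []Item
	for _, it := range topK(items, q, k) {
		if pred(it) {
			out = append(out, it)
		}
	}
	return out
}

func pred(it Item) bool { return it.InStock && it.Price < 100 }

func main() {
	items := []Item{
		{1, []float32{0, 0}, true, 50},
		{2, []float32{0, 1}, false, 50}, // closest, but out of stock
		{3, []float32{1, 0}, true, 150}, // close, but too expensive
		{4, []float32{5, 5}, true, 80},
	}
	q := []float32{0, 1}
	pre := preFilterSearch(items, q, 2, pred)
	post := postFilterSearch(items, q, 2, pred)
	fmt.Println(len(pre), len(post)) // → 2 1
}
```

The post-filter pass returns only one result for k=2 because its nearest hit failed the predicate after the search, which is exactly the shortfall graph-filtered search avoids.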
Key Interview Discussion Points
- Recall vs. latency: HNSW efSearch parameter trades recall for speed; higher efSearch = more accurate but slower
- Index build time: HNSW build is O(n log n) but memory-intensive for large datasets; IVF build requires k-means clustering first
- Sharding: partition vectors across shards by ID; query fans out to all shards, each returns local top-K, merge globally to get final top-K
- Embedding drift: if the embedding model changes (new version), all stored vectors must be re-embedded — plan for this with versioned embedding namespaces
- RAG pipeline: query → embed query → vector search (top-K chunks) → LLM prompt with retrieved context → generated response
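The scatter-gather merge from the sharding point above can be sketched in a few lines of Go (types and names are illustrative; a real system would fan the query out to shards concurrently):

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is one scored result returned by a shard.
type Hit struct {
	ID   int
	Dist float64
}

// mergeTopK merges per-shard local top-k lists into the global top-k.
// Concat + sort is fine for small k × shard counts; a heap-based k-way
// merge scales better.
func mergeTopK(perShard [][]Hit, k int) []Hit {
	var all []Hit
	for _, hits := range perShard {
		all = append(all, hits...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Dist < all[j].Dist })
	if len(all) > k {
		all = all[:k]
	}
	return all
}

func main() {
	// Each shard has already returned its local top-2.
	shard0 := []Hit{{ID: 1, Dist: 0.1}, {ID: 2, Dist: 0.4}}
	shard1 := []Hit{{ID: 7, Dist: 0.2}, {ID: 9, Dist: 0.3}}
	for _, h := range mergeTopK([][]Hit{shard0, shard1}, 2) {
		fmt.Println(h.ID, h.Dist) // global top-2: IDs 1 and 7
	}
}
```

Each shard must return its full local top-K (not top-K/shards), because in the worst case all K global winners live on one shard.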