What is a Vector Database?
A vector database stores high-dimensional numerical vectors (embeddings) and supports efficient similarity search: “find the K vectors most similar to this query vector.” Embeddings are produced by ML models — a sentence embedding might be 1536 dimensions (OpenAI text-embedding-3-small), an image embedding 2048 dimensions (ResNet). Vector databases power semantic search (find documents by meaning, not keywords), recommendation systems (find items similar to what a user liked), RAG (Retrieval-Augmented Generation for LLM applications), and duplicate detection. Systems: Pinecone, Weaviate, Qdrant, Milvus, pgvector (PostgreSQL extension). This is increasingly asked at AI-focused companies (OpenAI, Cohere, Anthropic, Databricks) and any company building AI features.
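The similarity measure underlying all of this can be shown concretely. Below is a minimal pure-Python sketch of cosine similarity between toy "embeddings" (real embeddings have hundreds to thousands of dimensions; the vectors and variable names here are illustrative, not from any real model):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity = dot(a, b) / (|a| * |b|); 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim "embeddings" standing in for model output.
query = [0.1, 0.8, 0.3]
doc_about_same_topic = [0.2, 0.7, 0.4]
doc_off_topic = [0.9, -0.1, 0.1]

# The semantically closer document scores higher.
print(cosine_similarity(query, doc_about_same_topic) >
      cosine_similarity(query, doc_off_topic))  # True
```

A "find the K most similar vectors" query is just this computation repeated and ranked, which is exactly what the next section analyzes at scale.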
Approximate Nearest Neighbor (ANN) Search
Exact nearest neighbor (brute force): compute cosine similarity or L2 distance between the query vector and every stored vector. O(N * D) where N = number of vectors, D = dimensions. For 100M vectors at 1536 dimensions: 100M * 1536 = 153 billion multiplications per query. At ~1ns per operation: 153 seconds per query. Unacceptable. ANN algorithms trade a small accuracy loss for orders-of-magnitude speedup. Three main approaches:
- HNSW (Hierarchical Navigable Small World graphs): builds a multi-layer graph where each node connects to nearby nodes. Search traverses from a coarse upper layer to a fine lower layer, following edges greedily toward the query. O(log N) average query time, 99%+ recall at typical settings. The gold standard for production ANN.
- IVF (Inverted File Index): partition vectors into K clusters (k-means). At query time, search only the C closest clusters (C << K), reducing comparisons from N to roughly (N/K) * C. Less accurate than HNSW but simpler and more memory-efficient.
- LSH (Locality-Sensitive Hashing): hash similar vectors to the same bucket with high probability. Very fast but lower recall than HNSW/IVF; rarely used in production today.
HNSW performance: 100M vectors, 1536 dimensions, ef_search=100: ~5ms per query at 98% recall on a single machine. Index build time is O(N log N): hours for 100M vectors.
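The brute-force baseline that ANN algorithms improve on can be sketched in a few lines. This is a toy exact k-NN scan (random vectors, small N and D; the names `exact_knn` and `l2` are illustrative), showing the N distance computations of D multiplies each that make the approach unworkable at scale:

```python
import heapq
import math
import random

random.seed(0)
D = 8                                   # dimensions (1536 in the text's example)
N = 1000                                # vectors (100M in the text's example)
vectors = {i: [random.random() for _ in range(D)] for i in range(N)}

def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, k):
    """Scan every stored vector: O(N * D) work per query."""
    return heapq.nsmallest(k, vectors, key=lambda i: l2(query, vectors[i]))

q = vectors[42]                          # query with a known exact match
print(exact_knn(q, k=3)[0])              # 42: a vector is its own nearest neighbor
```

An HNSW or IVF index answers the same query while touching only a small fraction of the N vectors, at the cost of occasionally missing a true neighbor (the recall tradeoff above).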
Architecture
Data model: each vector has an id, a vector (float32 array), and optional metadata ({title, author, date, category}). Metadata is used for filtering (e.g., search only within category='finance').
Write path: the client sends (id, vector, metadata) to the API, which validates the dimension (it must match the index configuration). The vector is written to the HNSW index (updating the graph) and to the metadata store (PostgreSQL or DynamoDB for filter queries). HNSW inserts are O(log N) but require updating graph edges, so they are slower than a simple append-only store. Typical throughput: 1K-10K inserts/second on a single node.
Read path: the client sends a query vector + k + an optional metadata filter. Two filtering strategies:
- Pre-filtering: retrieve the set of IDs matching the filter from the metadata store, then run the ANN search only within that subset.
- Post-filtering: run the ANN search across all vectors, then filter the top-k results by metadata.
Pre-filtering is more accurate (every returned result passes the filter) but slower, since the ANN index works poorly over a small, sparse candidate set; post-filtering is faster but may return fewer than k results if many candidates are filtered out.
Sharding: partition vectors across shards (by id range or consistent hashing). Each shard maintains its own HNSW index. A query is broadcast to all shards and the results are merged (top-k from each shard → global top-k).
Replication: each shard has multiple replicas. Reads are load-balanced across replicas; writes go to all replicas.
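The pre-filter vs. post-filter tradeoff is easy to demonstrate. This sketch uses a brute-force scan as a stand-in for the ANN index (the `search`, `pre_filter_search`, and `post_filter_search` names and the toy records are illustrative, not any particular product's API):

```python
import heapq
import math
import random

random.seed(1)
D = 4
db = [
    {"id": i,
     "vector": [random.random() for _ in range(D)],
     "category": random.choice(["finance", "sports"])}
    for i in range(200)
]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query, k, rows):
    """Stand-in for the ANN index: top-k rows by distance."""
    return heapq.nsmallest(k, rows, key=lambda r: dist(query, r["vector"]))

def pre_filter_search(query, k, category):
    # Restrict candidates first, then search the (smaller) subset.
    subset = [r for r in db if r["category"] == category]
    return search(query, k, subset)

def post_filter_search(query, k, category):
    # Search everything, then drop non-matching hits: may return < k rows.
    hits = search(query, k, db)
    return [r for r in hits if r["category"] == category]

q = [0.5] * D
pre = pre_filter_search(q, k=5, category="finance")
post = post_filter_search(q, k=5, category="finance")
print(len(pre), len(post))  # pre fills all 5 slots; post may come up short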
Quantization and Compression
100M vectors * 1536 dimensions * 4 bytes (float32) = 614GB of raw vector data: too large for RAM on a single machine. Quantization reduces memory:
- Product Quantization (PQ): divide each vector into M sub-vectors and quantize each to one of 256 centroids (1 byte each). A 1536-dim float32 vector (6144 bytes) with M=192 becomes 192 bytes, a 32x compression, with modest accuracy loss.
- Scalar Quantization (SQ): quantize each float32 to int8. 4x compression (1536 * 1 byte = 1.5KB vs 6KB), with fast hardware support (int8 multiply-accumulate).
- Binary quantization: each float → 1 bit. 32x compression with significant accuracy loss; used with re-ranking.
Re-ranking: first pass with compressed vectors (fast, approximate), then re-score the top-2K candidates with the original float32 vectors (slower, accurate). Returns accurate top-K results at a 10-20x lower memory footprint for the in-RAM index.
IVF + PQ (IVFPQ): combine IVF partitioning with PQ quantization. The most memory-efficient production configuration for very large indexes.
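Scalar quantization is the simplest of these to show end to end. Below is a toy int8 SQ sketch with a fixed [-1, 1] calibration range (an assumption for illustration; real systems calibrate per dimension from the data, and PQ uses trained codebooks instead):

```python
import random

random.seed(2)
vec = [random.uniform(-1.0, 1.0) for _ in range(16)]   # toy float32 vector

LO, HI = -1.0, 1.0          # assumed calibration range for this sketch

def quantize(v):
    """Map each float in [LO, HI] to an int8 code in [-128, 127]: 4 bytes -> 1."""
    return [round((x - LO) / (HI - LO) * 255) - 128 for x in v]

def dequantize(codes):
    """Recover an approximate float from each int8 code."""
    return [(c + 128) / 255 * (HI - LO) + LO for c in codes]

codes = quantize(vec)
approx = dequantize(codes)
max_err = max(abs(a - b) for a, b in zip(vec, approx))
print(max_err < 0.01)  # True: worst-case error is half a quantization step
```

The per-coordinate error is bounded by half the step size, (HI - LO) / 2 / 255 ≈ 0.004 here, which is why SQ loses so little accuracy relative to its 4x memory savings; the same encode/decode idea, applied per sub-vector against learned centroids, is PQ.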
Interview Tips
- HNSW is the standard algorithm to know — explain multi-layer graph, O(log N) search, and recall vs. latency tradeoff.
- Distinguish exact nearest neighbor (impossible at scale) from ANN (practical with 98%+ recall).
- Quantization (PQ, SQ) is how you fit 100M vectors in RAM — mention it when scale is discussed.
- Pre-filter vs post-filter is a key design tradeoff for metadata-filtered search.
FAQ
How does HNSW (Hierarchical Navigable Small World) enable fast ANN search?
HNSW builds a multi-layer graph. The top layers are sparse long-range connections (coarse navigation); the lower layers are dense short-range connections (fine-grained search). To search: start at the top layer, greedily traverse toward the query vector, descend to the next layer when no closer neighbor is found in the current layer, and repeat until the bottom layer. This greedy traversal runs in O(log N) on average, compared to O(N) for brute force.
What is product quantization and how does it compress vectors?
Product quantization divides each high-dimensional vector into M equal sub-vectors. Each sub-vector is quantized to one of 256 centroids (fit by k-means on the training data), represented by 1 byte. A 1536-dimensional float32 vector (6144 bytes) split into 192 sub-vectors of 8 dimensions each becomes 192 bytes, a 32x compression. Distance computations use precomputed lookup tables for sub-vector distances, maintaining fast approximate search with modest accuracy loss at typical recall targets.
What is the difference between pre-filtering and post-filtering in vector search?
Pre-filtering applies metadata filters before the ANN search: retrieve matching IDs from the metadata store, then search only within that subset. It is more accurate (all returned results pass the filter) but slower (ANN over a smaller, potentially sparse set). Post-filtering runs ANN across all vectors, then filters the top-K results. It is faster (the full ANN index is used) but may return fewer than K results if many are filtered out. Most production systems use pre-filtering for small filtered sets and post-filtering for large ones.
How do you scale a vector database beyond a single machine?
Shard the vector index across multiple machines (by ID range or consistent hashing). Each shard maintains its own HNSW index and serves queries for its partition. On query: broadcast to all shards, each returns its top-K candidates, and the coordinator merges them into the global top-K. Replication: each shard has multiple replicas for fault tolerance and read throughput. This horizontal scaling supports 100M+ vectors while keeping per-shard memory manageable (e.g., 10M vectors per shard).
What are vector databases used for in LLM applications (RAG)?
In Retrieval-Augmented Generation (RAG), a knowledge base of documents is pre-embedded with an embedding model and stored in a vector database. At inference time, the user's query is embedded with the same model, and the K most semantically similar documents are retrieved. These documents are injected into the LLM's context window as grounding information, letting the LLM answer questions about private or up-to-date knowledge it was not trained on.