Vector Database Low-Level Design: Embeddings Storage, ANN Search, and Hybrid Filtering

A vector database stores high-dimensional embedding vectors and enables fast approximate nearest neighbor (ANN) search — finding the top-K vectors most similar to a query vector. It is the backbone of semantic search, RAG (retrieval-augmented generation), recommendation systems, and multimodal search.

Embedding Storage

Embeddings are fixed-dimension float32 arrays. Common dimensions: 384 (MiniLM), 768 (BERT), 1536 (text-embedding-3-small), 3072 (text-embedding-3-large). A single float32 vector of 1536 dimensions occupies 6 KB of raw storage.

Each vector record stores:

  • vector: raw float32 bytes (or quantized int8 for compression).
  • entity_id: foreign key linking back to the original document, image, or product.
  • namespace: logical partition — separate index per use case (documents, images, users).
  • model_version: which embedding model produced this vector.
  • metadata: filterable fields (category, language, user_id, date) stored as JSONB alongside the vector.

Scalar quantization: converting float32 to int8 reduces storage by 4x with a small recall penalty (~1–2%). Binary quantization (1 bit per dimension) reduces by 32x at larger recall cost, suitable for first-pass retrieval followed by float32 re-ranking.

HNSW Index: Structure and Parameters

HNSW builds a layered proximity graph:

  • Layer 0: contains all nodes with short-range edges.
  • Layers 1..L: exponentially fewer nodes; each node has long-range connections for fast global navigation.

Key parameters:

  • M: number of bidirectional connections per node per layer. Higher M → better recall, more memory (memory ≈ M * 2 * num_vectors * 8 bytes for pointers). Typical M = 16–64.
  • ef_construction: search breadth during index build. Higher ef_construction → better graph quality → better recall at query time, but slower build. Typical ef_construction = 200–400.
  • ef_search (ef at query time): candidate list size during search. Higher ef_search → better recall, slower query. Tune ef_search to hit a target recall (e.g., 0.95) at acceptable latency.

Search: enter at the top layer at a fixed entry point, greedily move to the neighbor closest to the query, repeat per layer, descend to layer 0, collect ef_search candidates, return top-K by distance.

IVF Index: Cluster Centroids and Posting Lists

IVF runs k-means clustering on the training set to produce K cluster centroids. Each vector is assigned to its nearest centroid. At query time:

  1. Compute distance from query vector to all K centroids (fast, K is small — typically 1024–65536).
  2. Select the nprobe nearest centroids.
  3. Scan all vectors in those nprobe posting lists, compute distances, return top-K.

IVF excels for large static datasets (billions of vectors) where the training data is representative. Recall depends on nprobe: higher nprobe approaches brute-force recall but increases latency. IVF+PQ (product quantization) further compresses vectors within each posting list for memory efficiency.

Hybrid Search: Pre-filter vs Post-filter

Pre-filter strategy: apply metadata filter first (e.g., WHERE category = 'electronics' AND language = 'en'), then run ANN on the resulting entity_id set. For small result sets, brute-force L2 scan is faster than traversing the HNSW graph on a sparse subset. Guarantees recall within the filter but is slow for large filtered sets.

Post-filter strategy: run ANN on the full index, retrieve top-K * multiplier candidates, then apply metadata filter. Fast but may yield fewer than K results. Mitigated by over-fetching (retrieve 10x K and filter down).

Hybrid approach: use the filter's selectivity to decide strategy at query time. If estimated filtered set size is under a threshold (e.g., 10,000 vectors), use brute-force pre-filter. Otherwise, use post-filter with over-fetching.

Distance Metrics

  • Cosine similarity: angle between vectors; magnitude-invariant. Standard for L2-normalized text embeddings.
  • Dot product: equivalent to cosine when vectors are unit-normalized; encodes magnitude when not. Used in recommendation where magnitude encodes popularity.
  • L2 (Euclidean) distance: absolute spatial distance. Used when embedding magnitude is semantically meaningful.

Most embedding APIs (OpenAI, Cohere) return L2-normalized vectors. Using cosine or dot product on these is equivalent, but dot product is computationally cheaper (no normalization step at query time).

SQL DDL

-- Vector records with metadata for hybrid search
CREATE TABLE VectorRecord (
    id            BIGSERIAL PRIMARY KEY,
    namespace     VARCHAR(128)  NOT NULL,
    entity_id     BIGINT        NOT NULL,
    vector        BYTEA         NOT NULL,   -- float32 array serialized
    metadata      JSONB         NOT NULL DEFAULT '{}',
    model_version VARCHAR(64)   NOT NULL,
    created_at    TIMESTAMPTZ   NOT NULL DEFAULT now(),
    UNIQUE (namespace, entity_id, model_version)
);

CREATE INDEX idx_vector_namespace ON VectorRecord (namespace, model_version);
CREATE INDEX idx_vector_metadata ON VectorRecord USING GIN (metadata);

-- HNSW index configuration metadata
CREATE TABLE HNSWIndex (
    id               BIGSERIAL PRIMARY KEY,
    namespace        VARCHAR(128)  NOT NULL,
    m                INTEGER       NOT NULL,
    ef_construction  INTEGER       NOT NULL,
    dimension        INTEGER       NOT NULL,
    created_at       TIMESTAMPTZ   NOT NULL DEFAULT now(),
    UNIQUE (namespace)
);

Python: Core Operations

import numpy as np
import struct
from typing import Any

try:
    import hnswlib
    _HAS_HNSW = True
except ImportError:
    _HAS_HNSW = False

# In-memory index registry: namespace -> hnswlib.Index
_indexes: dict[str, Any] = {}

def _get_or_create_index(namespace: str, dim: int, max_elements: int = 1_000_000) -> Any:
    if namespace not in _indexes and _HAS_HNSW:
        idx = hnswlib.Index(space='cosine', dim=dim)
        idx.init_index(max_elements=max_elements, ef_construction=200, M=16)
        idx.set_ef(50)
        _indexes[namespace] = idx
    return _indexes.get(namespace)

def upsert_vector(namespace: str, entity_id: int, vector: list[float], metadata: dict) -> None:
    """Insert or update a vector in the given namespace."""
    vec = np.array(vector, dtype=np.float32)
    vec = vec / (np.linalg.norm(vec) + 1e-9)  # L2 normalize
    idx = _get_or_create_index(namespace, len(vector))
    if idx:
        idx.add_items(vec.reshape(1, -1), [entity_id])

def search(namespace: str, query_vector: list[float], top_k: int,
           filter_fn=None) -> list[dict]:
    """ANN search with optional post-filter function."""
    q = np.array(query_vector, dtype=np.float32)
    q = q / (np.linalg.norm(q) + 1e-9)
    idx = _indexes.get(namespace)
    if not idx:
        return []
    fetch_k = top_k * 10 if filter_fn else top_k
    labels, distances = idx.knn_query(q.reshape(1, -1), k=min(fetch_k, idx.element_count))
    results = []
    for label, dist in zip(labels[0], distances[0]):
        record = {'entity_id': int(label), 'score': float(1 - dist)}
        if filter_fn is None or filter_fn(record):
            results.append(record)
            if len(results) >= top_k:
                break
    return results

def rebuild_index(namespace: str) -> None:
    """Drop and rebuild the HNSW index for a namespace from stored vectors."""
    # In production: load all VectorRecord rows for namespace, re-add to fresh index
    _indexes.pop(namespace, None)
    print(f"Index {namespace} dropped; re-build from VectorRecord table required.")

Design Considerations Summary

  • HNSW vs IVF: HNSW for dynamic datasets and low-latency queries; IVF for large static datasets with memory constraints.
  • Pre-filter vs post-filter: choose based on filter selectivity; brute-force pre-filter for highly selective filters.
  • Quantization: int8 scalar quantization gives 4x storage reduction with minimal recall loss — use for cost reduction at scale.
  • Namespaces: isolate indexes per use case to avoid cross-contamination and enable independent model versioning.
  • Re-indexing: model upgrades require full re-embed and re-index; run dual indexes during transition to avoid downtime.

See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Anthropic Interview Guide 2026: Process, Questions, and AI Safety

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture

Scroll to Top