Vector Database Low-Level Design: Embeddings Storage, ANN Search, and Hybrid Filtering

{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "What is the difference between HNSW and IVF indexes for approximate nearest neighbor search?",
"acceptedAnswer": {
"@type": "Answer",
"text": "HNSW (Hierarchical Navigable Small World) is a graph-based index that builds a layered proximity graph. Search starts at the top layer (few nodes, long-range connections) and greedily descends to lower layers for precision, achieving O(log N) query time with high recall. It requires no training data and handles dynamic inserts well, but consumes significant memory (each node stores M neighbor pointers per layer). IVF (Inverted File) partitions the vector space into K clusters using k-means, stores posting lists per cluster, and at query time probes the nprobe nearest cluster centroids. IVF is more memory-efficient and faster for very large static datasets, but requires training (k-means) and its recall degrades if the query vector's true neighbors are spread across many clusters."
}
},
{
"@type": "Question",
"name": "What is the difference between pre-filter and post-filter hybrid search?",
"acceptedAnswer": {
"@type": "Answer",
"text": "In hybrid search (vector ANN + metadata filter), pre-filter applies the metadata filter first and then runs ANN only on the filtered subset. This guarantees recall within the filtered set but degrades ANN performance if the subset is small, because the ANN index was built on the full dataset and the small subset may not be well-represented in the graph structure. Post-filter runs ANN on the full index first (retrieves top-K candidates), then filters by metadata. Post-filter is fast but may return fewer than K results if many candidates fail the filter. Most production systems use pre-filter for high-selectivity filters (less than 1% of data) via brute-force on the subset, and post-filter with over-fetching (retrieve top-K * multiplier) for low-selectivity filters."
}
},
{
"@type": "Question",
"name": "When should you use cosine similarity versus dot product versus L2 distance?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Cosine similarity measures the angle between vectors, ignoring magnitude — use it when embedding magnitude is not meaningful, which is the case for most text embedding models (OpenAI, Cohere) where embeddings are L2-normalized before storage. Dot product is equivalent to cosine similarity when vectors are unit-normalized, but when vectors are NOT normalized it rewards both direction alignment and magnitude. Recommendation models sometimes use un-normalized dot product to encode item popularity in magnitude. L2 (Euclidean) distance measures absolute spatial distance — appropriate for embeddings where magnitude encodes quantity (e.g., some image embeddings or embeddings from models that do not normalize output). When in doubt, normalize embeddings and use cosine or dot product."
}
},
{
"@type": "Question",
"name": "How do you handle re-indexing when the embedding model changes?",
"acceptedAnswer": {
"@type": "Answer",
"text": "When an embedding model changes, all existing vectors are incompatible with new query vectors because the embedding space is different. The correct approach is: (1) create a new namespace or index for the new model version; (2) re-embed all entities using the new model in batches (this is the expensive step); (3) write new vectors into the new namespace; (4) once re-embedding is complete and verified, update the query routing to use the new namespace; (5) delete the old namespace. Model version metadata should be stored alongside each vector record so you can identify which records need re-embedding. For large corpora, run both indexes in parallel during the transition and route queries to the old index for entities not yet re-embedded."
}
}
]
}

A vector database stores high-dimensional embedding vectors and enables fast approximate nearest neighbor (ANN) search — finding the top-K vectors most similar to a query vector. It is the backbone of semantic search, RAG (retrieval-augmented generation), recommendation systems, and multimodal search.

Embedding Storage

Embeddings are fixed-dimension float32 arrays. Common dimensions: 384 (MiniLM), 768 (BERT), 1536 (text-embedding-3-small), 3072 (text-embedding-3-large). A single float32 vector of 1536 dimensions occupies 6 KB of raw storage.

Each vector record stores:

  • vector: raw float32 bytes (or quantized int8 for compression).
  • entity_id: foreign key linking back to the original document, image, or product.
  • namespace: logical partition — separate index per use case (documents, images, users).
  • model_version: which embedding model produced this vector.
  • metadata: filterable fields (category, language, user_id, date) stored as JSONB alongside the vector.

Scalar quantization converts float32 to int8, reducing storage 4x with a small recall penalty (~1–2%). Binary quantization (1 bit per dimension) reduces storage 32x at a larger recall cost; it suits first-pass retrieval followed by float32 re-ranking of the survivors.
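As a concrete illustration, here is a minimal NumPy sketch of symmetric int8 scalar quantization. The single per-batch scale is a simplification; production systems typically use per-vector or per-dimension scales:

```python
import numpy as np

def quantize_int8(vectors: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric scalar quantization: one shared scale maps the largest
    # magnitude to 127. Storage drops from 4 bytes to 1 byte per dimension.
    scale = float(np.abs(vectors).max()) / 127.0
    q = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
vecs = rng.standard_normal((4, 1536)).astype(np.float32)
q, scale = quantize_int8(vecs)
restored = dequantize_int8(q, scale)
max_err = float(np.abs(vecs - restored).max())  # bounded by scale / 2
```

The per-dimension reconstruction error is bounded by half a quantization step (scale / 2), which is where the small recall penalty comes from.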

HNSW Index: Structure and Parameters

HNSW builds a layered proximity graph:

  • Layer 0: contains all nodes with short-range edges.
  • Layers 1..L: exponentially fewer nodes; each node has long-range connections for fast global navigation.

Key parameters:

  • M: number of bidirectional connections per node per layer. Higher M → better recall, more memory (memory ≈ M * 2 * num_vectors * 8 bytes for pointers). Typical M = 16–64.
  • ef_construction: search breadth during index build. Higher ef_construction → better graph quality → better recall at query time, but slower build. Typical ef_construction = 200–400.
  • ef_search (ef at query time): candidate list size during search. Higher ef_search → better recall, slower query. Tune ef_search to hit a target recall (e.g., 0.95) at acceptable latency.

Search: enter at the top layer at a fixed entry point, greedily move to the neighbor closest to the query, repeat per layer, descend to layer 0, collect ef_search candidates, return top-K by distance.
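The layer-0 stage of that search can be sketched as a beam search over a proximity graph. This toy version builds the graph by brute force and uses L2 distance; real HNSW constructs the graph incrementally with heuristic neighbor selection, so treat it as illustrative only:

```python
import heapq
import numpy as np

def beam_search(graph: dict[int, list[int]], vecs: np.ndarray,
                query: np.ndarray, entry: int, ef: int, k: int) -> list[int]:
    # Greedy beam search over one graph layer: a min-heap of unexplored
    # candidates plus a bounded max-heap of the best `ef` results; stop when
    # the closest candidate is farther than the worst retained result.
    def dist(i: int) -> float:
        return float(np.linalg.norm(vecs[i] - query))

    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap by distance
    results = [(-dist(entry), entry)]     # max-heap via negated distance
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            nd = dist(nb)
            if len(results) < ef or nd < -results[0][0]:
                heapq.heappush(candidates, (nd, nb))
                heapq.heappush(results, (-nd, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop current worst result
    ordered = sorted((-neg, n) for neg, n in results)
    return [int(n) for _, n in ordered[:k]]

rng = np.random.default_rng(1)
vecs = rng.standard_normal((200, 32)).astype(np.float32)
M = 8  # out-degree: connect each node to its M nearest neighbors (brute force)
d2 = ((vecs[:, None, :] - vecs[None, :, :]) ** 2).sum(-1)
graph = {i: [int(j) for j in np.argsort(d2[i])[1:M + 1]] for i in range(len(vecs))}
query = rng.standard_normal(32).astype(np.float32)
found = beam_search(graph, vecs, query, entry=0, ef=50, k=5)
```

Raising ef widens the beam, trading latency for recall; that is exactly the ef_search knob described above.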

IVF Index: Cluster Centroids and Posting Lists

IVF runs k-means clustering on the training set to produce K cluster centroids. Each vector is assigned to its nearest centroid. At query time:

  1. Compute distance from query vector to all K centroids (fast, K is small — typically 1024–65536).
  2. Select the nprobe nearest centroids.
  3. Scan all vectors in those nprobe posting lists, compute distances, return top-K.

IVF excels for large static datasets (billions of vectors) where the training data is representative. Recall depends on nprobe: higher nprobe approaches brute-force recall but increases latency. IVF+PQ (product quantization) further compresses vectors within each posting list for memory efficiency.
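A toy NumPy sketch of the IVF build and probe steps above (Lloyd's k-means with an exhaustive centroid scan; the cluster count and parameters are illustrative, and real systems such as FAISS layer product quantization on top):

```python
import numpy as np

def build_ivf(vecs: np.ndarray, k: int, iters: int = 10, seed: int = 0):
    # Lloyd's k-means, then one posting list (array of vector ids) per centroid.
    rng = np.random.default_rng(seed)
    centroids = vecs[rng.choice(len(vecs), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((vecs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        for c in range(k):
            members = vecs[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Final assignment against the final centroids defines the posting lists.
    assign = np.argmin(((vecs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
    lists = {c: np.flatnonzero(assign == c) for c in range(k)}
    return centroids, lists

def ivf_search(query: np.ndarray, centroids: np.ndarray, lists: dict,
               vecs: np.ndarray, nprobe: int, top_k: int) -> np.ndarray:
    # Steps 1-2: rank all centroids, keep the nprobe nearest.
    nearest = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    # Step 3: exhaustively scan only the selected posting lists.
    cand = np.concatenate([lists[int(c)] for c in nearest])
    d = ((vecs[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(d)[:top_k]]

rng = np.random.default_rng(2)
vecs = rng.standard_normal((1000, 64)).astype(np.float32)
centroids, lists = build_ivf(vecs, k=16)
q = rng.standard_normal(64).astype(np.float32)
hits = ivf_search(q, centroids, lists, vecs, nprobe=4, top_k=10)
```

Setting nprobe = k (probe every cluster) degenerates to brute force, which is the recall ceiling the text refers to.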

Hybrid Search: Pre-filter vs Post-filter

Pre-filter strategy: apply the metadata filter first (e.g., WHERE category = 'electronics' AND language = 'en'), then run ANN on the resulting entity_id set. For small result sets, a brute-force exact distance scan is faster than traversing the HNSW graph restricted to a sparse subset. This guarantees recall within the filter but is slow for large filtered sets.

Post-filter strategy: run ANN on the full index, retrieve top-K * multiplier candidates, then apply metadata filter. Fast but may yield fewer than K results. Mitigated by over-fetching (retrieve 10x K and filter down).

Hybrid approach: use the filter's selectivity to decide strategy at query time. If estimated filtered set size is under a threshold (e.g., 10,000 vectors), use brute-force pre-filter. Otherwise, use post-filter with over-fetching.
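A minimal sketch of that query-time decision, assuming a selectivity estimate is available (the function names, the 10,000 threshold, and the multiplier cap are illustrative):

```python
import math

def choose_strategy(estimated_filtered_count: int,
                    brute_force_threshold: int = 10_000) -> str:
    # Small filtered set: an exact scan of the subset is cheap and has
    # perfect recall, so pre-filter wins.
    if estimated_filtered_count <= brute_force_threshold:
        return "pre_filter_brute_force"
    return "post_filter_overfetch"

def overfetch_k(top_k: int, selectivity: float, max_multiplier: int = 10) -> int:
    # If a fraction `selectivity` of ANN candidates passes the filter, fetching
    # roughly top_k / selectivity candidates yields ~top_k survivors on average.
    # The multiplier is capped to bound worst-case query latency.
    return top_k * min(max_multiplier, math.ceil(1.0 / max(selectivity, 1e-9)))
```

For example, choose_strategy(500) routes to brute force, while a 20%-selective filter gives overfetch_k(10, 0.2) = 50 candidates to fetch.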

Distance Metrics

  • Cosine similarity: angle between vectors; magnitude-invariant. Standard for L2-normalized text embeddings.
  • Dot product: equivalent to cosine when vectors are unit-normalized; encodes magnitude when not. Used in recommendation where magnitude encodes popularity.
  • L2 (Euclidean) distance: absolute spatial distance. Used when embedding magnitude is semantically meaningful.

Most embedding APIs (OpenAI, Cohere) return L2-normalized vectors. Using cosine or dot product on these is equivalent, but dot product is computationally cheaper (no normalization step at query time).
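The equivalence is easy to verify numerically: for unit vectors, dot product equals cosine similarity and squared L2 distance is an affine function of it (||a - q||^2 = 2 - 2*cos), so all three metrics produce the same ranking:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal((100, 64))
a /= np.linalg.norm(a, axis=1, keepdims=True)   # L2-normalize each row
q = rng.standard_normal(64)
q /= np.linalg.norm(q)

cos = a @ q                                     # dot == cosine for unit vectors
l2_sq = ((a - q) ** 2).sum(axis=1)
assert np.allclose(l2_sq, 2 - 2 * cos)          # ||a - q||^2 = 2 - 2*cos(a, q)
assert np.array_equal(np.argsort(-cos), np.argsort(l2_sq))  # identical ranking
```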

SQL DDL

-- Vector records with metadata for hybrid search
CREATE TABLE VectorRecord (
    id            BIGSERIAL PRIMARY KEY,
    namespace     VARCHAR(128)  NOT NULL,
    entity_id     BIGINT        NOT NULL,
    vector        BYTEA         NOT NULL,   -- float32 array serialized
    metadata      JSONB         NOT NULL DEFAULT '{}',
    model_version VARCHAR(64)   NOT NULL,
    created_at    TIMESTAMPTZ   NOT NULL DEFAULT now(),
    UNIQUE (namespace, entity_id, model_version)
);

CREATE INDEX idx_vector_namespace ON VectorRecord (namespace, model_version);
CREATE INDEX idx_vector_metadata ON VectorRecord USING GIN (metadata);

-- HNSW index configuration metadata
CREATE TABLE HNSWIndex (
    id               BIGSERIAL PRIMARY KEY,
    namespace        VARCHAR(128)  NOT NULL,
    m                INTEGER       NOT NULL,
    ef_construction  INTEGER       NOT NULL,
    dimension        INTEGER       NOT NULL,
    created_at       TIMESTAMPTZ   NOT NULL DEFAULT now(),
    UNIQUE (namespace)
);

Python: Core Operations

import numpy as np
from typing import Any, Callable

try:
    import hnswlib
    _HAS_HNSW = True
except ImportError:
    _HAS_HNSW = False

# In-memory index registry: namespace -> hnswlib.Index
_indexes: dict[str, Any] = {}

def _get_or_create_index(namespace: str, dim: int, max_elements: int = 1_000_000) -> Any:
    if namespace not in _indexes and _HAS_HNSW:
        idx = hnswlib.Index(space='cosine', dim=dim)
        idx.init_index(max_elements=max_elements, ef_construction=200, M=16)
        idx.set_ef(50)
        _indexes[namespace] = idx
    return _indexes.get(namespace)

def upsert_vector(namespace: str, entity_id: int, vector: list[float], metadata: dict) -> None:
    """Insert or update a vector in the given namespace."""
    # In production: persist (vector, metadata) to the VectorRecord table first.
    vec = np.array(vector, dtype=np.float32)
    vec = vec / (np.linalg.norm(vec) + 1e-9)  # L2-normalize so cosine == dot
    idx = _get_or_create_index(namespace, len(vector))
    if idx is not None:
        idx.add_items(vec.reshape(1, -1), [entity_id])  # re-adding an id updates it

def search(namespace: str, query_vector: list[float], top_k: int,
           filter_fn: Callable[[dict], bool] | None = None) -> list[dict]:
    """ANN search with optional post-filter function."""
    q = np.array(query_vector, dtype=np.float32)
    q = q / (np.linalg.norm(q) + 1e-9)
    idx = _indexes.get(namespace)
    if idx is None or idx.element_count == 0:
        return []
    fetch_k = top_k * 10 if filter_fn else top_k  # over-fetch when post-filtering
    labels, distances = idx.knn_query(q.reshape(1, -1), k=min(fetch_k, idx.element_count))
    results = []
    for label, dist in zip(labels[0], distances[0]):
        record = {'entity_id': int(label), 'score': float(1 - dist)}
        if filter_fn is None or filter_fn(record):
            results.append(record)
            if len(results) >= top_k:
                break
    return results

def rebuild_index(namespace: str) -> None:
    """Drop and rebuild the HNSW index for a namespace from stored vectors."""
    # In production: load all VectorRecord rows for namespace, re-add to fresh index
    _indexes.pop(namespace, None)
    print(f"Index {namespace} dropped; re-build from VectorRecord table required.")

Design Considerations Summary

  • HNSW vs IVF: HNSW for dynamic datasets and low-latency queries; IVF for large static datasets with memory constraints.
  • Pre-filter vs post-filter: choose based on filter selectivity; brute-force pre-filter for highly selective filters.
  • Quantization: int8 scalar quantization gives 4x storage reduction with minimal recall loss — use for cost reduction at scale.
  • Namespaces: isolate indexes per use case to avoid cross-contamination and enable independent model versioning.
  • Re-indexing: model upgrades require full re-embed and re-index; run dual indexes during transition to avoid downtime.
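As a sketch of that re-indexing cutover, a small state record can route traffic to the old namespace until the new one is fully populated and verified (field and method names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Migration:
    # Hypothetical cutover state for an embedding-model upgrade.
    old_ns: str
    new_ns: str
    total: int            # number of entities to re-embed
    reembedded: int = 0
    verified: bool = False

    def serving_namespace(self) -> str:
        # Old and new embedding spaces are incompatible, so queries stay on
        # the old namespace until re-embedding completes and is verified.
        if self.verified and self.reembedded >= self.total:
            return self.new_ns
        return self.old_ns

m = Migration(old_ns="docs_v1", new_ns="docs_v2", total=3)
```

Only after serving_namespace() flips to the new namespace is it safe to delete the old one.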

