What Is an Embedding Service?
An embedding service converts raw inputs — text passages, images, or structured records — into dense fixed-dimensional floating-point vectors that capture semantic meaning. These vectors are stored in a vector index and queried via approximate nearest-neighbor (ANN) search to power semantic search, recommendation, deduplication, and retrieval-augmented generation pipelines. Designing this service at the low level involves batched model inference, efficient vector storage, ANN index construction and querying, and a caching layer to avoid redundant inference on repeated inputs.
Requirements
Functional Requirements
- Accept text or other inputs and return embedding vectors using a configurable model (e.g., sentence-transformers, OpenAI text-embedding-ada-002, or a fine-tuned internal model).
- Store embeddings in a vector index keyed by a document or entity ID.
- Serve approximate nearest-neighbor queries returning the top-K most similar vectors to a query input.
- Cache embeddings for repeated inputs to avoid redundant inference calls.
- Support batch ingestion of millions of documents during initial indexing.
Non-Functional Requirements
- Embedding inference latency under 20 ms per item when processed in batches.
- ANN query p99 latency under 30 ms for a 100-million-vector index.
- Throughput of 10,000 embedding requests per second at steady state.
- Index build for 100 million vectors completes within 6 hours.
Data Model
The EmbeddingRecord stores document ID, model ID and version, input content hash (SHA-256, used as cache key), embedding vector (float32 array, dimension d), and creation timestamp. Vectors are stored in a vector database (Pinecone, Weaviate, or self-hosted Faiss/HNSWlib) with document ID as the primary key and metadata fields for filtered search. A ModelRegistry table stores model ID, model type, serving endpoint or artifact path, input modality, output dimension, and normalization configuration. An EmbeddingCache in Redis maps content-hash to serialized vector bytes with a configurable TTL (default 7 days), avoiding model inference for duplicate or recently seen inputs.
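A minimal sketch of the EmbeddingRecord shape described above; the exact field names and types are assumptions for illustration, not a schema from any particular vector database.

```python
import time
from dataclasses import dataclass, field

@dataclass
class EmbeddingRecord:
    # Hypothetical field names mirroring the record described in the text.
    doc_id: str
    model_id: str
    model_version: str
    content_hash: str        # SHA-256 of the normalized input; doubles as the cache key
    vector: list[float]      # stored as a float32 array of dimension d in a real store
    created_at: float = field(default_factory=time.time)
```

In a production store the vector would be a packed float32 buffer rather than a Python list, and metadata fields for filtered search would live alongside it.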
Core Algorithms
Batched Model Inference
Individual embedding requests are assembled into micro-batches before invoking the model. An inference gateway accumulates requests in a queue and dispatches a batch when either the batch size reaches a maximum (e.g., 64 items) or a maximum wait time elapses (e.g., 5 ms), whichever comes first. This dynamic batching strategy amortizes GPU kernel launch overhead and increases GPU utilization from under 10% (for single-item requests) to over 80%. For GPU-resident models the inference server (Triton Inference Server or TorchServe) handles dynamic batching internally; the gateway controls concurrency and timeout budgets.
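The accumulate-then-dispatch logic can be sketched with a simple threaded micro-batcher. This is a hypothetical in-process illustration of the "max batch size or max wait, whichever comes first" rule, not the internal mechanism of Triton or TorchServe.

```python
import queue
import threading
import time

class MicroBatcher:
    """Collects requests and flushes a batch when max_batch is reached
    or max_wait_ms elapses, whichever comes first."""

    def __init__(self, infer_fn, max_batch=64, max_wait_ms=5.0):
        self.infer_fn = infer_fn            # runs one forward pass over a list of inputs
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000.0
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, text):
        """Enqueue one input; returns a slot whose Event fires when its vector is ready."""
        slot = {"done": threading.Event(), "input": text, "result": None}
        self.q.put(slot)
        return slot

    def _loop(self):
        while True:
            batch = [self.q.get()]                       # block for the first item
            deadline = time.monotonic() + self.max_wait
            while len(batch) < self.max_batch:           # fill until size or time limit
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.q.get(timeout=remaining))
                except queue.Empty:
                    break
            vectors = self.infer_fn([s["input"] for s in batch])  # one batched call
            for slot, vec in zip(batch, vectors):
                slot["result"] = vec
                slot["done"].set()
```

A production gateway would add backpressure, per-request deadlines, and error propagation; the flush condition is the part that matters here.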
Cache Layer
Before invoking the model, the service computes a deterministic cache key as SHA-256(model_id + normalized_input_text). It performs a Redis GET; on hit, it deserializes and returns the cached vector in under 1 ms. On miss, it proceeds to model inference, then asynchronously writes the result back to Redis using a pipeline SET with TTL. Cache hit rates of 40-70% are typical for content platforms where the same article or product description is embedded repeatedly as metadata changes trigger re-embedding checks.
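The key derivation and read-through flow can be sketched as follows. A plain dict stands in for Redis so the example is self-contained, and the whitespace-collapsing normalization step is an assumption; the text specifies only SHA-256 over model ID plus normalized input.

```python
import hashlib

def cache_key(model_id: str, text: str) -> str:
    """Deterministic key: SHA-256 over model ID plus normalized input text."""
    normalized = " ".join(text.split())   # assumed normalization: collapse whitespace
    return hashlib.sha256(f"{model_id}:{normalized}".encode("utf-8")).hexdigest()

def embed_with_cache(cache: dict, model_id: str, text: str, infer_fn):
    """Read-through cache: check before inference, write back on miss.
    A real deployment would use Redis GET and a pipelined SET with TTL."""
    key = cache_key(model_id, text)
    vec = cache.get(key)
    if vec is None:
        vec = infer_fn(text)   # cache miss: run the model
        cache[key] = vec       # write-back (asynchronous in production)
    return vec
```

Because the key covers the model ID, switching models never serves a stale vector from a different embedding space.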
ANN Index Construction and Querying
The vector index uses HNSW (Hierarchical Navigable Small World) graphs due to their favorable recall-speed tradeoff. Index construction parameters M=16 (edges per node) and ef_construction=200 (search width during build) produce a high-quality graph. At query time ef_search=100 achieves 97%+ recall at under 5 ms per query for 100-million-vector indexes. Vectors are L2-normalized before insertion so that inner product search is equivalent to cosine similarity, enabling use of SIMD-optimized inner product kernels (AVX-512 on x86, NEON on ARM).
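The normalization claim can be checked directly: after L2 normalization, the inner product of two vectors equals their cosine similarity, so an inner-product index (e.g., hnswlib with `space="ip"`) returns cosine rankings. A minimal pure-Python sketch:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

For unit vectors, dot(a, b) and cosine(a, b) coincide, which is why normalizing at insertion time lets the index use the cheaper SIMD inner-product kernels.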
API Design
The embedding API exposes POST /v1/embed accepting a JSON body with model_id, inputs (array of strings, max 256 per request), and optional normalize flag. The response returns an array of vectors in the same order as inputs. The similarity search API at POST /v1/search accepts a query input or pre-computed vector, top_k, optional metadata filters, and optional namespace to scope the search to a subset of the index. A POST /v1/upsert endpoint accepts document ID, input text, and optional metadata, computing or reusing a cached embedding and upserting it into the vector index. A DELETE /v1/vectors/{id} endpoint removes a vector and its cache entry.
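A request-validation sketch for the POST /v1/embed body. The 256-input cap and required fields come from the description above; the error-message strings and function shape are assumptions.

```python
def validate_embed_request(body: dict) -> list[str]:
    """Return a list of validation errors for a /v1/embed body (empty if valid)."""
    errors = []
    if not body.get("model_id"):
        errors.append("model_id is required")
    inputs = body.get("inputs")
    if not isinstance(inputs, list) or not inputs:
        errors.append("inputs must be a non-empty array of strings")
    elif len(inputs) > 256:
        errors.append("inputs exceeds the 256-item limit")
    return errors
```

Rejecting oversized batches at the edge keeps one client from monopolizing the inference gateway's batch slots.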
Scalability and Infrastructure
The inference layer scales horizontally with GPU node autoscaling triggered by queue depth. Each GPU node runs multiple model replicas to maximize GPU memory utilization. The vector index is sharded across multiple index servers by document ID hash; fan-out queries go to all shards in parallel and results are merged with a k-way heap merge in the query coordinator. Index shards are replicated 2x for read availability. Bulk ingestion runs as an offline Spark job: input documents are read from object storage, embeddings are computed using the same model via a Spark ML transformer wrapper, and vectors are bulk-loaded into the index using the vector database batch upsert API, which bypasses the online serving path and writes directly to index files for 10x higher throughput.
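The coordinator's k-way merge can be sketched with the standard library. Each shard returns its local top-K sorted by descending similarity, and the coordinator lazily merges the sorted streams, which is the heap-based merge the text describes (the (score, doc_id) tuple shape is an assumption).

```python
import heapq
import itertools

def merge_shard_results(shard_results, top_k):
    """k-way merge of per-shard hit lists, each already sorted by
    descending similarity score, into a single global top-k."""
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(itertools.islice(merged, top_k))
```

Because heapq.merge is lazy, the coordinator only pulls as many hits from each shard's list as the global top-k actually needs.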
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does batched model inference improve embedding service throughput?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Batched inference groups multiple embedding requests into a single forward pass through the model, amortizing GPU kernel launch overhead and maximizing hardware utilization. A dynamic batching layer collects requests within a short time window (e.g., 5-10 ms) and pads sequences to a uniform length before dispatching, yielding significantly higher QPS than single-request inference."
      }
    },
    {
      "@type": "Question",
      "name": "What is an HNSW vector index and why is it used for ANN search?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Hierarchical Navigable Small World (HNSW) is a graph-based data structure that organizes vectors in a multi-layer proximity graph. During approximate nearest neighbor (ANN) search it greedily traverses the graph from a coarse entry point down through finer layers, achieving sub-linear query time with high recall. It's preferred over exact k-NN because million-scale indexes remain searchable in single-digit milliseconds."
      }
    },
    {
      "@type": "Question",
      "name": "How do you implement ANN search in a low-latency embedding service?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "ANN search is typically served by an in-process or sidecar vector store (FAISS, hnswlib, Milvus). The query embedding is computed first, then passed to the index's search API with a desired k and an ef (exploration factor) tuned for the recall/latency trade-off. Results are re-ranked or filtered by metadata before returning to the caller."
      }
    },
    {
      "@type": "Question",
      "name": "How do you handle embedding cache staleness?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Embeddings are keyed by a hash of the input text and the model version. When the model is retrained or fine-tuned, a new version tag invalidates all existing cache entries lazily. TTL-based eviction handles content that changes over time, while a background job can proactively re-embed high-traffic items after a model update so the cache is warm before the old entries expire."
      }
    }
  ]
}
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture