What Is a Multimodal Search Service?
A multimodal search service allows users to search using multiple input modalities simultaneously or interchangeably, most commonly image and text. For example, a user might upload a photo and refine results with a text query, or describe something in words and retrieve visually similar items. Designing this system requires aligning embeddings across modalities, building efficient indices, and carefully tuning retrieval and reranking stages.
Requirements
Functional Requirements
- Accept text queries, image queries, or combined text+image queries
- Return semantically relevant results regardless of input modality
- Support indexing of documents that have text, images, or both
- Allow hybrid search combining vector similarity with keyword matching
- Support reranking of candidate results
Non-Functional Requirements
- Low latency: p99 query latency under 200ms
- High recall: top-10 results include the relevant item in >90% of queries
- Scalable index: supports billions of documents
- Model updates without full re-indexing (or with efficient incremental re-indexing)
Core Concept: Shared Embedding Space
The key to multimodal search is projecting all modalities into a single shared vector space. CLIP (Contrastive Language-Image Pretraining) is the canonical example: it trains a text encoder and an image encoder jointly so that matching text-image pairs have high cosine similarity in the same vector space.
At query time, the input (text, image, or both) is encoded into this shared space. At index time, each document's content is encoded and stored. Retrieval becomes a nearest-neighbor search in this shared embedding space.
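The shared-space contract can be sketched in a few lines. The encoders below are hypothetical stand-ins (random projections keyed on the input) for a real CLIP-style text and image encoder; what matters is the interface: every modality maps to an L2-normalized vector of the same dimension, and a combined text+image query is commonly formed by averaging the modality embeddings and renormalizing.

```python
import numpy as np

def l2_normalize(v):
    # L2-normalize so a dot product equals cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical stand-ins for the CLIP text and image encoders: any pair of
# models trained to map into the same d-dimensional space fits this interface.
def encode_text(text, d=512):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return l2_normalize(rng.standard_normal(d))

def encode_image(pixels, d=512):
    rng = np.random.default_rng(int(abs(pixels.sum() * 1000)) % (2**32))
    return l2_normalize(rng.standard_normal(d))

def encode_query(text=None, image=None):
    # Combined text+image query: average the modality embeddings, renormalize
    parts = []
    if text is not None:
        parts.append(encode_text(text))
    if image is not None:
        parts.append(encode_image(image))
    return l2_normalize(np.mean(parts, axis=0))
```

With real encoders, `encode_query` output is compared against stored document embeddings by cosine similarity, which reduces to a dot product because everything is normalized.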
High-Level Architecture
- Embedding Service: encodes text and images into shared-space vectors using a CLIP-style model
- Vector Index: stores document embeddings, supports approximate nearest neighbor (ANN) search (e.g., HNSW via Faiss, Weaviate, Qdrant, Pinecone)
- Keyword Index: inverted index for BM25/lexical search (e.g., Elasticsearch, OpenSearch)
- Hybrid Fusion Layer: merges vector and keyword results (Reciprocal Rank Fusion or learned fusion)
- Reranker: cross-encoder model that scores query-document pairs for final ordering
- Query Gateway: routes queries, handles modality detection, calls embedding and retrieval layers
Embedding Alignment
Alignment is the hard part. A CLIP model is pretrained on large image-caption datasets, but fine-tuning on domain-specific data dramatically improves retrieval quality. Key considerations:
- Contrastive loss: pull matching pairs together, push non-matching pairs apart (InfoNCE loss)
- Hard negative mining: easy (random) negatives teach the model little; mine near-miss negatives from the current index
- Temperature scaling: softmax temperature controls how sharp the similarity distribution is
- Dimensionality: common embedding sizes are 512, 768, or 1024 dims; larger = higher quality but more memory and compute
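The contrastive objective above can be written out concretely. This is a minimal NumPy sketch of the symmetric InfoNCE loss used by CLIP, assuming a batch where row i of the text embeddings matches row i of the image embeddings:

```python
import numpy as np

def info_nce_loss(text_emb, img_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of L2-normalized embeddings where
    # row i of text_emb matches row i of img_emb: the positives are the
    # diagonal, and every other pair in the batch acts as a negative.
    logits = (text_emb @ img_emb.T) / temperature

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text->image and image->text directions, as in CLIP
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Lowering the temperature sharpens the softmax, penalizing near-misses more heavily; 0.07 is a common starting point (CLIP learns it as a parameter).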
Indexing Pipeline
Documents are ingested through an offline pipeline:
- Extract text and images from each document
- Run each through the appropriate encoder (text encoder or image encoder) to get embeddings
- Store embeddings in the vector index with document ID
- Also index text fields in the keyword index
- Store metadata (document ID, title, URL, thumbnail) in a key-value store for serving
For large corpora, indexing is batched and parallelized. HNSW indices can be built incrementally but may require periodic compaction for optimal graph quality.
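The ingestion steps above can be sketched as follows. The brute-force `VectorIndex` is a stand-in for an ANN index (at billion scale this would be HNSW via Faiss, Qdrant, etc.), and the `doc` schema and encoder callables are assumptions for illustration:

```python
import numpy as np

class VectorIndex:
    # Exact brute-force index as a stand-in; at billion scale this would be
    # an ANN structure such as HNSW (via Faiss, Qdrant, or similar).
    def __init__(self):
        self.doc_ids, self.vectors = [], []

    def add(self, doc_id, vector):
        self.doc_ids.append(doc_id)
        self.vectors.append(vector / np.linalg.norm(vector))

    def search(self, query, k=10):
        scores = np.stack(self.vectors) @ (query / np.linalg.norm(query))
        top = np.argsort(-scores)[:k]
        return [(self.doc_ids[i], float(scores[i])) for i in top]

def ingest(doc, encode_text, encode_image, vector_index, metadata_store):
    # doc: {"id": ..., "text": ..., "images": [...], "title": ..., "url": ...}
    # (the parallel write to the BM25 keyword index is omitted here)
    if doc.get("text"):
        vector_index.add(doc["id"], encode_text(doc["text"]))
    for img in doc.get("images", []):
        vector_index.add(doc["id"], encode_image(img))
    metadata_store[doc["id"]] = {"title": doc.get("title"), "url": doc.get("url")}
```

Note that a document with both text and images gets multiple vectors under the same ID; an alternative is fusing them into one document embedding at index time.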
Query Processing
At query time:
- Detect input modality (text, image upload, or both)
- Encode input into the shared embedding space
- Run ANN search against the vector index to get top-K candidates (e.g., K=100)
- Run keyword search in parallel if text is present
- Merge results via Reciprocal Rank Fusion (RRF): score(doc) = sum_i 1 / (k + rank_i(doc)), with k = 60 by convention
- Run reranker on top-N merged candidates (e.g., N=50) to produce final top-10
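The RRF merge step is short enough to write out in full. Each retriever contributes 1/(k + rank) per document it returns, so documents that appear high in several rankings float to the top without any score calibration between retrievers:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # ranked_lists: one ranked list of doc ids per retriever (best first).
    # Each appearance contributes 1 / (k + rank); k = 60 is the
    # conventional constant and rarely needs tuning.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a document ranked 2nd by vector search and 1st by keyword search outranks the vector-only top hit, since it earns credit from both lists.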
Hybrid Keyword + Vector Search
Pure vector search has poor recall for exact-match queries (product IDs, proper nouns, rare terms). Hybrid search addresses this:
- BM25 handles exact lexical matches well
- Vector search handles semantic similarity and cross-modal retrieval
- RRF fusion is parameter-free and robust; learned fusion weights can be tuned offline
- The keyword index must also store image alt text, captions, and OCR output from images
Reranking
First-stage retrieval optimizes for recall. Reranking optimizes for precision:
- Cross-encoder: jointly encodes query and document, attends across both; much more accurate than bi-encoder but 10-100x slower
- Run only on a small candidate set (top 50) to keep latency acceptable
- Can incorporate additional signals: popularity, freshness, user engagement, diversity
- For multimodal reranking, the model must accept image+text pairs
Scalability Considerations
- Index sharding: shard the vector index by document ID; query each shard and merge results
- Embedding caching: cache embeddings for frequent queries; cache text embeddings aggressively (deterministic), image embeddings less so
- Model serving: run embedding models on GPU with batching; use TorchServe, Triton, or similar
- Async indexing: new documents are queued and indexed asynchronously; slight index lag is acceptable
- Quantization: FP16 halves embedding memory and INT8 quarters it relative to FP32, with minimal recall loss
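As a concrete illustration of the quantization point, here is a minimal symmetric per-vector INT8 scheme: each vector is scaled so its largest component maps to 127, stored as int8, and dequantized on read:

```python
import numpy as np

def quantize_int8(embs):
    # Symmetric per-vector INT8 quantization: 4x smaller than FP32 storage.
    # Each row's max absolute value maps to 127.
    scale = np.abs(embs).max(axis=-1, keepdims=True) / 127.0
    return np.round(embs / scale).astype(np.int8), scale.astype(np.float32)

def dequantize(quantized, scale):
    return quantized.astype(np.float32) * scale
```

The reconstruction error per component is at most half a quantization step (scale/2), which is typically well below the noise floor of ANN retrieval; Faiss-style product quantization compresses further at some recall cost.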
Model Updates and Re-indexing
When the embedding model is updated, all existing embeddings are stale. Strategies:
- Full re-index: most accurate but expensive; acceptable for small corpora or infrequent updates
- Dual index: run old and new model in parallel, gradually shift traffic, then cut over
- Distillation alignment: train the new model to align with the old model's embedding space to allow reuse of stored embeddings
Failure Modes and Mitigations
- Embedding model timeout: return cached or degraded results; fallback to keyword-only search
- Vector index unavailability: fallback to keyword index
- Query modality misclassification: allow user to explicitly specify modality; default to text if ambiguous
- Cross-modal mismatch: low similarity scores across modalities often indicate out-of-domain input; surface confidence scores to the caller
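The first two fallbacks reduce to a small wrapper at the gateway. This is a hypothetical sketch: the two search callables and the error types are assumptions, and a real implementation would also enforce a timeout and emit metrics:

```python
def search_with_fallback(query, vector_search, keyword_search):
    # If the embedding service or vector index fails (timeout,
    # unavailability), degrade to keyword-only results instead of erroring.
    try:
        return vector_search(query)
    except Exception:
        return keyword_search(query)
```

Keyword-only results are worse for cross-modal queries but keep the service answering, which is usually the right trade for a user-facing search path.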
Interview Tips
- Start by clarifying what modalities are in scope and what the latency SLA is
- Explain CLIP-style alignment before jumping into infrastructure
- Mention hybrid search early; interviewers often probe whether you know pure vector search has lexical recall gaps
- Distinguish bi-encoder (fast, used for retrieval) from cross-encoder (accurate, used for reranking)
- Discuss model versioning and re-indexing strategy; it is a common follow-up question
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering