What Is a Multimodal Search Service?
A multimodal search service allows users to search using multiple input modalities simultaneously or interchangeably, most commonly image and text. For example, a user might upload a photo and refine results with a text query, or describe something in words and retrieve visually similar items. Designing this system requires aligning embeddings across modalities, building efficient indices, and carefully tuning retrieval and reranking stages.
Requirements
Functional Requirements
- Accept text queries, image queries, or combined text+image queries
- Return semantically relevant results regardless of input modality
- Support indexing of documents that have text, images, or both
- Allow hybrid search combining vector similarity with keyword matching
- Support reranking of candidate results
Non-Functional Requirements
- Low latency: p99 query latency under 200ms
- High recall: top-10 results include the relevant item in >90% of queries
- Scalable index: supports billions of documents
- Model updates without full re-indexing (or with efficient incremental re-indexing)
Core Concept: Shared Embedding Space
The key to multimodal search is projecting all modalities into a single shared vector space. CLIP (Contrastive Language-Image Pretraining) is the canonical example: it trains a text encoder and an image encoder jointly so that matching text-image pairs have high cosine similarity in the same vector space.
At query time, the input (text, image, or both) is encoded into this shared space. At index time, each document's content is encoded and stored. Retrieval becomes a nearest-neighbor search in this shared embedding space.
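The shared-space contract can be sketched in a few lines. The encoders below are hypothetical stand-ins (random projections keyed on the input) for a real CLIP-style text and image encoder; what matters is the interface: every modality maps to an L2-normalized vector of the same dimension, and a combined text+image query is commonly formed by averaging the modality embeddings and renormalizing.

```python
import numpy as np

def l2_normalize(v):
    # L2-normalize so a dot product equals cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical stand-ins for the CLIP text and image encoders: any pair of
# models trained to map into the same d-dimensional space fits this interface.
def encode_text(text, d=512):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return l2_normalize(rng.standard_normal(d))

def encode_image(pixels, d=512):
    rng = np.random.default_rng(int(abs(pixels.sum() * 1000)) % (2**32))
    return l2_normalize(rng.standard_normal(d))

def encode_query(text=None, image=None):
    # Combined text+image query: average the modality embeddings, renormalize
    parts = []
    if text is not None:
        parts.append(encode_text(text))
    if image is not None:
        parts.append(encode_image(image))
    return l2_normalize(np.mean(parts, axis=0))
```

With real encoders, `encode_query` output is compared against stored document embeddings by cosine similarity, which reduces to a dot product because everything is normalized.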
High-Level Architecture
- Embedding Service: encodes text and images into shared-space vectors using a CLIP-style model
- Vector Index: stores document embeddings, supports approximate nearest neighbor (ANN) search (e.g., HNSW via Faiss, Weaviate, Qdrant, Pinecone)
- Keyword Index: inverted index for BM25/lexical search (e.g., Elasticsearch, OpenSearch)
- Hybrid Fusion Layer: merges vector and keyword results (Reciprocal Rank Fusion or learned fusion)
- Reranker: cross-encoder model that scores query-document pairs for final ordering
- Query Gateway: routes queries, handles modality detection, calls embedding and retrieval layers
Embedding Alignment
Alignment is the hard part. A CLIP model is pretrained on large image-caption datasets, but fine-tuning on domain-specific data dramatically improves retrieval quality. Key considerations:
- Contrastive loss: pull matching pairs together, push non-matching pairs apart (InfoNCE loss)
- Hard negative mining: easy (random) negatives teach the model little; mine near-miss negatives from the current index
- Temperature scaling: softmax temperature controls how sharp the similarity distribution is
- Dimensionality: common embedding sizes are 512, 768, or 1024 dims; larger = higher quality but more memory and compute
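The contrastive objective above can be written out concretely. This is a minimal NumPy sketch of the symmetric InfoNCE loss used by CLIP, assuming a batch where row i of the text embeddings matches row i of the image embeddings:

```python
import numpy as np

def info_nce_loss(text_emb, img_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of L2-normalized embeddings where
    # row i of text_emb matches row i of img_emb: the positives are the
    # diagonal, and every other pair in the batch acts as a negative.
    logits = (text_emb @ img_emb.T) / temperature

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text->image and image->text directions, as in CLIP
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Lowering the temperature sharpens the softmax, penalizing near-misses more heavily; 0.07 is a common starting point (CLIP learns it as a parameter).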
Indexing Pipeline
Documents are ingested through an offline pipeline:
- Extract text and images from each document
- Run each through the appropriate encoder (text encoder or image encoder) to get embeddings
- Store embeddings in the vector index with document ID
- Also index text fields in the keyword index
- Store metadata (document ID, title, URL, thumbnail) in a key-value store for serving
For large corpora, indexing is batched and parallelized. HNSW indices can be built incrementally but may require periodic compaction for optimal graph quality.
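The ingestion steps above can be sketched as follows. The brute-force `VectorIndex` is a stand-in for an ANN index (at billion scale this would be HNSW via Faiss, Qdrant, etc.), and the `doc` schema and encoder callables are assumptions for illustration:

```python
import numpy as np

class VectorIndex:
    # Exact brute-force index as a stand-in; at billion scale this would be
    # an ANN structure such as HNSW (via Faiss, Qdrant, or similar).
    def __init__(self):
        self.doc_ids, self.vectors = [], []

    def add(self, doc_id, vector):
        self.doc_ids.append(doc_id)
        self.vectors.append(vector / np.linalg.norm(vector))

    def search(self, query, k=10):
        scores = np.stack(self.vectors) @ (query / np.linalg.norm(query))
        top = np.argsort(-scores)[:k]
        return [(self.doc_ids[i], float(scores[i])) for i in top]

def ingest(doc, encode_text, encode_image, vector_index, metadata_store):
    # doc: {"id": ..., "text": ..., "images": [...], "title": ..., "url": ...}
    # (the parallel write to the BM25 keyword index is omitted here)
    if doc.get("text"):
        vector_index.add(doc["id"], encode_text(doc["text"]))
    for img in doc.get("images", []):
        vector_index.add(doc["id"], encode_image(img))
    metadata_store[doc["id"]] = {"title": doc.get("title"), "url": doc.get("url")}
```

Note that a document with both text and images gets multiple vectors under the same ID; an alternative is fusing them into one document embedding at index time.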
Query Processing
At query time:
- Detect input modality (text, image upload, or both)
- Encode input into the shared embedding space
- Run ANN search against the vector index to get top-K candidates (e.g., K=100)
- Run keyword search in parallel if text is present
- Merge results via Reciprocal Rank Fusion (RRF): score(doc) = sum_i 1 / (k + rank_i(doc)), with k = 60 by convention
- Run reranker on top-N merged candidates (e.g., N=50) to produce final top-10
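The RRF merge step is short enough to write out in full. Each retriever contributes 1/(k + rank) per document it returns, so documents that appear high in several rankings float to the top without any score calibration between retrievers:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # ranked_lists: one ranked list of doc ids per retriever (best first).
    # Each appearance contributes 1 / (k + rank); k = 60 is the
    # conventional constant and rarely needs tuning.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a document ranked 2nd by vector search and 1st by keyword search outranks the vector-only top hit, since it earns credit from both lists.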
Hybrid Keyword + Vector Search
Pure vector search has poor recall for exact-match queries (product IDs, proper nouns, rare terms). Hybrid search addresses this:
- BM25 handles exact lexical matches well
- Vector search handles semantic similarity and cross-modal retrieval
- RRF fusion is parameter-free and robust; learned fusion weights can be tuned offline
- The keyword index must also store image alt text, captions, and OCR output from images
Reranking
First-stage retrieval optimizes for recall. Reranking optimizes for precision:
- Cross-encoder: jointly encodes query and document, attends across both; much more accurate than bi-encoder but 10-100x slower
- Run only on a small candidate set (top 50) to keep latency acceptable
- Can incorporate additional signals: popularity, freshness, user engagement, diversity
- For multimodal reranking, the model must accept image+text pairs
Scalability Considerations
- Index sharding: shard the vector index by document ID; query each shard and merge results
- Embedding caching: cache embeddings for frequent queries; cache text embeddings aggressively (deterministic), image embeddings less so
- Model serving: run embedding models on GPU with batching; use TorchServe, Triton, or similar
- Async indexing: new documents are queued and indexed asynchronously; slight index lag is acceptable
- Quantization: FP16 halves embedding memory and INT8 quarters it relative to FP32, with minimal recall loss
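As a concrete illustration of the quantization point, here is a minimal symmetric per-vector INT8 scheme: each vector is scaled so its largest component maps to 127, stored as int8, and dequantized on read:

```python
import numpy as np

def quantize_int8(embs):
    # Symmetric per-vector INT8 quantization: 4x smaller than FP32 storage.
    # Each row's max absolute value maps to 127.
    scale = np.abs(embs).max(axis=-1, keepdims=True) / 127.0
    return np.round(embs / scale).astype(np.int8), scale.astype(np.float32)

def dequantize(quantized, scale):
    return quantized.astype(np.float32) * scale
```

The reconstruction error per component is at most half a quantization step (scale/2), which is typically well below the noise floor of ANN retrieval; Faiss-style product quantization compresses further at some recall cost.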
Model Updates and Re-indexing
When the embedding model is updated, all existing embeddings are stale. Strategies:
- Full re-index: most accurate but expensive; acceptable for small corpora or infrequent updates
- Dual index: run old and new model in parallel, gradually shift traffic, then cut over
- Distillation alignment: train the new model to align with the old model's embedding space to allow reuse of stored embeddings
Failure Modes and Mitigations
- Embedding model timeout: return cached or degraded results; fallback to keyword-only search
- Vector index unavailability: fallback to keyword index
- Query modality misclassification: allow user to explicitly specify modality; default to text if ambiguous
- Cross-modal mismatch: low similarity scores across modalities often indicate out-of-domain input; surface confidence scores to the caller
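The first two fallbacks reduce to a small wrapper at the gateway. This is a hypothetical sketch: the two search callables and the error types are assumptions, and a real implementation would also enforce a timeout and emit metrics:

```python
def search_with_fallback(query, vector_search, keyword_search):
    # If the embedding service or vector index fails (timeout,
    # unavailability), degrade to keyword-only results instead of erroring.
    try:
        return vector_search(query)
    except Exception:
        return keyword_search(query)
```

Keyword-only results are worse for cross-modal queries but keep the service answering, which is usually the right trade for a user-facing search path.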
Interview Tips
- Start by clarifying what modalities are in scope and what the latency SLA is
- Explain CLIP-style alignment before jumping into infrastructure
- Mention hybrid search early; interviewers often probe whether you know pure vector search has lexical recall gaps
- Distinguish bi-encoder (fast, used for retrieval) from cross-encoder (accurate, used for reranking)
- Discuss model versioning and re-indexing strategy; it is a common follow-up question
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering