Embeddings and Vector Databases Explained

Embeddings are the lingua franca of modern AI applications. They power semantic search, RAG, recommendation systems, duplicate detection, and anomaly detection. If you’re working in ML in 2026, understanding embeddings and how to store and search them at scale is foundational knowledge.

What the Interviewer Is Testing

Do you understand what an embedding is geometrically, not just what it does? Can you choose the right embedding model for a use case, implement similarity search, and explain the trade-offs between exact and approximate nearest neighbor search? Can you design a system that uses embeddings at scale?

What Are Embeddings?

An embedding is a dense, fixed-size vector of floating point numbers that represents the semantic content of an input — text, image, audio, or a user’s behavior. Similar inputs produce similar (close-together) vectors. Dissimilar inputs produce distant vectors.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-en-v1.5')

sentences = [
    "How do I reset my password?",
    "I forgot my login credentials",
    "What is the capital of France?",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 1024) — three 1024-dimensional vectors

# Semantic similarity: sentences[0] and sentences[1] are close
# sentences[2] is far from both

The key property: the vector space encodes meaning. “king” – “man” + “woman” ≈ “queen” — the famous word2vec demonstration. Modern sentence embeddings capture this across phrases, sentences, and documents.
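A quick way to see "close together" concretely is a pairwise cosine-similarity matrix. The 3-dimensional vectors below are hypothetical stand-ins for the 1024-dimensional sentence embeddings above; the first two point in nearly the same direction, the third does not.

```python
import numpy as np

def pairwise_cosine(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize rows
    return X @ X.T

# Toy stand-ins for the three sentence embeddings above.
E = np.array([
    [0.9, 0.1, 0.0],   # "How do I reset my password?"
    [0.8, 0.2, 0.1],   # "I forgot my login credentials"
    [0.0, 0.1, 0.9],   # "What is the capital of France?"
])
sims = pairwise_cosine(E)
print(np.round(sims, 2))  # sims[0,1] is near 1; sims[0,2] and sims[1,2] are near 0
```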

How Embeddings Are Generated

Text embeddings: A transformer encoder processes the input text and produces a contextual representation for each token. The final embedding is typically the [CLS] token representation or the mean of all token representations. Fine-tuning on sentence similarity tasks (contrastive learning with positive/negative pairs) produces embeddings where semantically similar sentences cluster together.
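Mean pooling can be sketched in a few lines of numpy. The token vectors below are made up; the point is that padding positions are masked out before averaging, which is how sentence-transformers-style mean pooling behaves.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions.

    token_embeddings: (seq_len, dim); attention_mask: (seq_len,) of 0/1.
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum over real tokens only
    count = mask.sum()                              # number of real tokens
    return summed / np.maximum(count, 1e-9)         # avoid divide-by-zero

# 4 token vectors of dim 3; the last position is padding.
tokens = np.array([[1., 0., 0.],
                   [0., 1., 0.],
                   [0., 0., 1.],
                   [9., 9., 9.]])  # padding vector must not leak into the average
mask = np.array([1, 1, 1, 0])
print(mean_pool(tokens, mask))  # averages only the three real tokens: [1/3, 1/3, 1/3]
```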

Word2Vec / GloVe (legacy): Predict surrounding words (skip-gram) or predict center word from context (CBOW). Fast, lightweight, but no context — “bank” has the same vector whether it’s a river bank or a financial bank. Rarely used as the primary embedding in new systems.
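A minimal sketch of how skip-gram builds its training data: each center word predicts its neighbors within a window. The model learns one vector per word type, so "bank" ends up with a single vector blended across all of its senses.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for skip-gram word2vec."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "bank", "of", "the", "river"], window=1)
print(pairs)  # 8 pairs, e.g. ('bank', 'the'), ('bank', 'of'), ...
```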

Image embeddings: A CNN or Vision Transformer encodes the image. The penultimate layer before the classification head produces a semantic embedding — similar images (dogs) cluster together regardless of lighting, angle, or size.

User/item embeddings (recommendation): Matrix factorization or neural collaborative filtering learns embeddings for users and items from interaction history. The dot product of user embedding and item embedding predicts affinity.
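Scoring with learned embeddings is a single matrix multiply. The random embeddings below stand in for ones learned from interaction history; only the scoring and ranking steps are the point here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 4, 6, 8

# In a real system these are learned from interaction history;
# random values here just illustrate the scoring step.
user_emb = rng.normal(size=(n_users, dim))
item_emb = rng.normal(size=(n_items, dim))

# Predicted affinity of every user for every item: one matrix multiply.
scores = user_emb @ item_emb.T  # (n_users, n_items)

# Top-3 recommendations for user 0, highest predicted affinity first.
top3 = np.argsort(-scores[0])[:3]
print(top3, scores[0][top3])
```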

Similarity Metrics

Three common similarity metrics — the choice depends on whether embeddings are normalized:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    return np.dot(a, b)

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

# If embeddings are L2-normalized (unit length), cosine similarity == dot product.
# Many modern embedding models output (or recommend) L2-normalized vectors.
a = np.random.rand(1024)
b = np.random.rand(1024)
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)
assert abs(cosine_similarity(a_norm, b_norm) - dot_product(a_norm, b_norm)) < 1e-6

Cosine similarity: Measures angle between vectors, ignores magnitude. Ranges from -1 (opposite) to 1 (identical). Best for text and semantic similarity — you care about direction, not length.

Dot product: Faster than cosine (no normalization). Equals cosine when vectors are normalized. Used in recommendation systems where magnitude can encode popularity.

Euclidean distance: Straight-line distance. Sensitive to magnitude. Preferred for image embeddings and when absolute position in the space matters.
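These metrics are more closely related than they look. For unit vectors, squared Euclidean distance equals 2 − 2 × cosine similarity, so all three metrics induce the same nearest-neighbor ranking on L2-normalized embeddings. A quick numpy check:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=256); a /= np.linalg.norm(a)  # unit vector
b = rng.normal(size=256); b /= np.linalg.norm(b)  # unit vector

cos = float(a @ b)
euclid = float(np.linalg.norm(a - b))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b),
# so cosine, dot product, and Euclidean distance rank neighbors identically.
assert abs(euclid**2 - (2 - 2 * cos)) < 1e-9
```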

Approximate Nearest Neighbor (ANN) Search

Exact nearest neighbor search (brute force: compute distance to every vector) is O(N × d) — at 100M vectors of dimension 1024, this is 100 billion multiply-adds per query. Too slow for real-time use.
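At smaller N, brute force is perfectly fine and worth knowing. A vectorized exact search (sketched with random data) is one matrix-vector product plus a partial sort:

```python
import numpy as np

def exact_knn(query, corpus, k=5):
    """Brute-force top-k by cosine similarity."""
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    sims = corpus @ query                     # O(N * d) multiply-adds
    top = np.argpartition(-sims, k)[:k]       # O(N) partial selection
    return top[np.argsort(-sims[top])], sims  # k ids, best first

rng = np.random.default_rng(2)
corpus = rng.normal(size=(10_000, 64))
# A slightly noisy copy of vector 42 should find vector 42 first.
ids, _ = exact_knn(corpus[42] + 0.01 * rng.normal(size=64), corpus, k=5)
print(ids[0])  # 42
```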

ANN algorithms trade a small recall penalty for orders-of-magnitude faster search:

HNSW (Hierarchical Navigable Small World): A multi-layer graph where each layer is a smaller, coarser version of the full dataset. Search starts at the top layer (few nodes, coarse graph) and drills down to the bottom layer (all nodes, fine graph). Provides >99% recall at <1ms for millions of vectors.

import hnswlib

# Build index
index = hnswlib.Index(space='cosine', dim=1024)
index.init_index(max_elements=1_000_000, ef_construction=200, M=16)
index.add_items(embeddings, ids=list(range(len(embeddings))))

# Query
index.set_ef(50)  # higher ef = better recall, slower
labels, distances = index.knn_query(query_embedding, k=10)

IVF (Inverted File Index): Cluster vectors into n_cells clusters with k-means. At query time, search only the n_probe nearest clusters rather than all vectors. Faster to build and more memory-friendly than HNSW, but typically lower recall at comparable latency. IVF variants are among the most widely used index types in FAISS.
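The mechanics can be sketched in plain numpy. A real IVF index trains its centroids with k-means and stores contiguous inverted lists; here, random data vectors serve as crude centroids just to show the probe step:

```python
import numpy as np

rng = np.random.default_rng(3)
vectors = rng.normal(size=(5_000, 32)).astype(np.float32)

# "Training": pick centroids. A real IVF index runs k-means; randomly
# sampled data vectors are a crude stand-in for this sketch.
n_cells = 50
centroids = vectors[rng.choice(len(vectors), n_cells, replace=False)]

# Indexing: assign every vector to its nearest centroid (the inverted lists).
assign = np.argmin(((vectors[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)

def ivf_search(query, n_probe=5, k=3):
    """Probe only the n_probe nearest cells instead of scanning all vectors."""
    cell_d = ((centroids - query) ** 2).sum(-1)
    probe = np.argsort(cell_d)[:n_probe]          # nearest cells
    cand = np.where(np.isin(assign, probe))[0]    # candidates in probed cells
    d = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(d)[:k]]

print(ivf_search(vectors[7]))  # vector 7 lands in a probed cell, so it is found first
```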

Product Quantization (PQ): Compress each vector from 1024 floats (4096 bytes) to a small integer code (64 bytes). Enables storing billions of embeddings in RAM. Combined with IVF in FAISS as IVF+PQ for billion-scale search.
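The encode/decode step can be sketched with numpy, using random codebooks in place of k-means-trained ones. Splitting a 1024-dim vector into 64 subvectors of 16 dims, each quantized to one of 256 centroids, yields exactly the 4096-byte to 64-byte compression described above:

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_sub, n_codes = 1024, 64, 256  # 64 subvectors, 256 centroids each
sub_dim = dim // n_sub               # 16 dims per subvector

# Codebooks would come from k-means on training data; random here.
codebooks = rng.normal(size=(n_sub, n_codes, sub_dim)).astype(np.float32)

def pq_encode(v):
    """Compress a 1024-float vector to 64 one-byte codes."""
    codes = np.empty(n_sub, dtype=np.uint8)
    for s in range(n_sub):
        sub = v[s * sub_dim:(s + 1) * sub_dim]
        codes[s] = np.argmin(((codebooks[s] - sub) ** 2).sum(-1))  # nearest centroid
    return codes

def pq_decode(codes):
    """Reconstruct an approximate vector by concatenating centroids."""
    return np.concatenate([codebooks[s, c] for s, c in enumerate(codes)])

v = rng.normal(size=dim).astype(np.float32)
codes = pq_encode(v)
print(codes.nbytes, "bytes instead of", v.nbytes)  # 64 bytes instead of 4096
```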

Vector Databases in Practice

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from sentence_transformers import SentenceTransformer

client = QdrantClient(host="localhost", port=6333)
encoder = SentenceTransformer('BAAI/bge-large-en-v1.5')

# Create collection
client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Index documents. Sample corpus here; in practice these come from your document store.
docs = [
    ("To cancel your subscription, open Billing and select Cancel.", "billing_docs"),
    ("Reset your password from the account login page.", "account_docs"),
]
documents = [{"id": i, "text": text, "source": src} for i, (text, src) in enumerate(docs)]
embeddings = encoder.encode([d["text"] for d in documents])

client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=d["id"], vector=emb.tolist(), payload={"source": d["source"], "text": d["text"]})
        for d, emb in zip(documents, embeddings)
    ]
)

# Query with metadata filter
from qdrant_client.models import Filter, FieldCondition, MatchValue

query_embedding = encoder.encode(["How do I cancel my subscription?"])[0]
results = client.search(
    collection_name="docs",
    query_vector=query_embedding.tolist(),
    query_filter=Filter(must=[FieldCondition(key="source", match=MatchValue(value="billing_docs"))]),
    limit=5,
    with_payload=True,
)
for r in results:
    print(f"Score: {r.score:.3f} | {r.payload['text'][:100]}")

Embedding Dimensions and Tradeoffs

Dimensions | Storage per vector (float32) | Quality             | Use case
256        | 1 KB                         | Low                 | High-scale, coarse similarity (product categories)
768        | 3 KB                         | Good                | Standard NLP tasks, BERT-base
1024       | 4 KB                         | Very good           | Production RAG, semantic search
3072       | 12 KB                        | Best (OpenAI large) | Highest quality; can truncate to 1536 or 256

OpenAI’s text-embedding-3 models support “Matryoshka” embeddings — you can truncate to any dimension and the model degrades gracefully. Store at 3072, search with 256 for fast ANN, re-rank top-50 at full 3072 dimensions. This gives near-full-quality results at fast-ANN speed.
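The truncate-then-rerank pattern sketched with numpy. Random vectors are not Matryoshka-trained, so the leading dimensions carry no special information here; the sketch only demonstrates the two-stage mechanics.

```python
import numpy as np

rng = np.random.default_rng(5)
corpus = rng.normal(size=(5_000, 3072)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit vectors

def truncate_then_rerank(q, short_dim=256, shortlist=50, k=5):
    """Stage 1: cheap scoring on truncated vectors. Stage 2: exact re-rank."""
    q = q / np.linalg.norm(q)
    # Stage 1: similarity on the first short_dim dims only (12x fewer multiply-adds).
    coarse = corpus[:, :short_dim] @ q[:short_dim]
    cand = np.argpartition(-coarse, shortlist)[:shortlist]
    # Stage 2: exact similarity at full dimension, but only on the shortlist.
    fine = corpus[cand] @ q
    return cand[np.argsort(-fine)[:k]]

hits = truncate_then_rerank(corpus[123])
print(hits[0])  # the query vector itself survives both stages
```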

Production Use Cases

Semantic search: User searches “noise canceling headphones under $200” — keyword search misses “ANC earbuds budget.” Embedding similarity catches it. Used in: Spotify (track search), LinkedIn (job search), GitHub Copilot (code context retrieval).

RAG: Retrieve relevant document chunks to augment LLM context. See the RAG vs Fine-tuning guide for full details.

Recommendation: User embedding × item embedding → predicted affinity. At inference: find items closest to the user’s embedding using ANN search. YouTube, Netflix, Amazon all use this pattern at varying levels of sophistication.

Duplicate detection: Two support tickets with cosine similarity >0.95 are likely duplicates. Route to same agent, don’t open duplicate incident. Cheaper than LLM-based comparison at scale.

Anomaly detection: Compute embedding of network log line. Find its K nearest neighbors in the historical embedding index. If all neighbors are far away, it’s an anomaly.
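The anomaly score above can be sketched as a mean distance to the K nearest historical embeddings; the synthetic "log line" embeddings below are made up to illustrate the thresholding idea.

```python
import numpy as np

rng = np.random.default_rng(6)
# Historical log-line embeddings: a tight cluster of normal behavior.
history = rng.normal(loc=0.0, scale=0.1, size=(2_000, 64))

def knn_anomaly_score(x, index, k=10):
    """Mean distance to the k nearest historical embeddings. Higher = more anomalous."""
    d = np.linalg.norm(index - x, axis=1)
    return float(np.sort(d)[:k].mean())

normal = rng.normal(scale=0.1, size=64)           # looks like history
weird = rng.normal(loc=3.0, scale=0.1, size=64)   # far from everything seen

# The anomalous point scores much higher than the normal one.
print(knn_anomaly_score(normal, history) < knn_anomaly_score(weird, history))  # True
```

In production you would pick the threshold empirically, e.g. from the score distribution of known-good traffic.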

Common Interview Mistakes

  • Confusing embedding dimensions with the number of embeddings — a database of 100M documents needs 100M vectors, each of dimension 1024
  • Not mentioning that the same model must be used for indexing and querying
  • Choosing exact NN search for millions of vectors — brute force is too slow
  • Ignoring metadata filtering — retrieved results must be permissioned and filtered before showing to users
  • Treating cosine similarity as a probability — 0.85 cosine doesn’t mean “85% likely to be the right answer”

Related ML Topics

See also: NLP Interview Questions — contextual embeddings vs. static word2vec/GloVe; how BERT’s hidden states produce the embeddings stored in vector databases.

See also: ML System Design: Build a Search Ranking System — two-tower models embed queries and items for ANN retrieval; the same HNSW index powers Stage 1 candidate generation.
