How are offline interest embeddings built for search personalization?

Train a two-tower model on historical (user, item) interaction pairs using implicit feedback signals. The user tower ingests interaction history and demographic features to produce a dense user embedding. Embeddings are pre-computed nightly via a batch job and stored in a key-value store keyed by user_id. At query time the stored embedding is retrieved in under 1 ms for use in retrieval and ranking.

How do you blend session context with long-term interest embeddings in search personalization?

Represent the current session as an average or attention-weighted embedding of items clicked or dwelled on since session start. Blend it with the long-term user embedding using a learned or heuristic interpolation: final = alpha * longterm_embedding + (1 - alpha) * session_embedding, where alpha is the weight on the long-term embedding, matching the convention in Step 3 of the re-ranking pipeline. Alpha decreases as session depth grows, progressively shifting weight toward fresh in-session intent. Early in a session, long-term signals dominate.
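The depth-dependent blend can be sketched as follows. This is a minimal illustration, not a prescribed schedule: the function names, the saturating formula, and the constants max_weight and half_depth are all assumptions. The code names the session-side weight explicitly so the direction of the shift is unambiguous.

```python
import numpy as np


def session_weight(depth: int, max_weight: float = 0.8, half_depth: float = 3.0) -> float:
    """Hypothetical heuristic: weight given to the session embedding.

    depth is the number of in-session interactions so far. The weight is 0
    at depth 0 and rises toward max_weight as the session gets deeper.
    """
    return max_weight * depth / (depth + half_depth)


def blend(session_emb: np.ndarray, longterm_emb: np.ndarray, depth: int) -> np.ndarray:
    """Interpolate between session and long-term embeddings."""
    w = session_weight(depth)
    return w * session_emb + (1.0 - w) * longterm_emb


if __name__ == "__main__":
    session = np.array([1.0, 0.0])
    longterm = np.array([0.0, 1.0])
    for depth in (0, 1, 3, 10):
        print(depth, blend(session, longterm, depth))
```

A learned variant would replace session_weight with a small model that predicts the weight from session features.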

How does bi-encoder query encoding work in personalized search?

A bi-encoder encodes queries and documents independently into a shared embedding space. At query time, the query text is encoded by the query encoder into a dense vector. Approximate nearest-neighbor search (e.g., HNSW via Faiss or Weaviate) retrieves top-K candidate documents by cosine similarity. The query embedding can be conditioned on the user embedding (for example, by adding a weighted user vector, or by concatenating the two and projecting back to the document embedding dimension) to shift retrieval toward personalized results without re-indexing documents.
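To make the retrieval step concrete, here is a small sketch that uses exact brute-force cosine search in NumPy as a stand-in for an ANN index. The additive conditioning in personalize_query and the user_weight parameter are assumptions chosen for illustration; the encoders themselves are out of scope, so the vectors are taken as given.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def personalize_query(query_vec: np.ndarray, user_vec: np.ndarray,
                      user_weight: float = 0.2) -> np.ndarray:
    """Assumed conditioning: nudge the query toward the user embedding.

    Document vectors are untouched, so no re-indexing is needed.
    """
    return l2_normalize(l2_normalize(query_vec) + user_weight * l2_normalize(user_vec))


def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int):
    """Exact top-k by cosine similarity (an ANN index would approximate this)."""
    sims = l2_normalize(doc_matrix) @ l2_normalize(query_vec)
    k = min(k, sims.shape[0])
    order = np.argsort(-sims)[:k]
    return order, sims[order]


if __name__ == "__main__":
    docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    q = np.array([1.0, 0.0])
    u = np.array([0.0, 1.0])
    print(top_k_cosine(q, docs, 2))
    print(top_k_cosine(personalize_query(q, u, user_weight=2.0), docs, 2))
```

In production the brute-force matrix product would be replaced by a query against the ANN index, with the personalized query vector passed in unchanged.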

How does Maximal Marginal Relevance (MMR) provide diversity in search re-ranking?

MMR iteratively selects results by maximizing a trade-off: score(d) = lambda * relevance(d, query) - (1-lambda) * max_similarity(d, already_selected). At each step the document with the highest MMR score is added to the result list. Lambda controls the relevance-diversity trade-off (lambda=1 is pure relevance, lambda=0 is pure diversity). This prevents the top results from being near-duplicates while preserving overall relevance.
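The greedy selection loop can be sketched directly from the formula above. This version assumes relevance scores are precomputed and document vectors are L2-normalized so that a dot product equals cosine similarity; the function name and the default lam value are illustrative.

```python
import numpy as np


def mmr(relevance, doc_vecs, k: int, lam: float = 0.7):
    """Greedy Maximal Marginal Relevance.

    relevance: shape (n,), relevance of each document to the query.
    doc_vecs:  shape (n, d), assumed L2-normalized.
    Returns the indices of the selected documents in selection order.
    """
    relevance = np.asarray(relevance, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    sim = doc_vecs @ doc_vecs.T  # pairwise document similarity
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            score = lam * relevance[i] - (1.0 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # docs 0 and 1 are duplicates
    rel = [0.9, 0.85, 0.5]
    print(mmr(rel, vecs, k=2, lam=0.5))
    print(mmr(rel, vecs, k=2, lam=1.0))
```

With lam=0.5 the duplicate of the first pick is penalized and the dissimilar document is chosen second; with lam=1.0 the ranking reduces to pure relevance order.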

Search Personalization Service Low-Level Design: User History, Session Context, and Re-ranking


What Is Search Personalization?

Search personalization adjusts ranking and result selection based on an individual user's context rather than returning the same ranked list to every user. A query for “python” from a data scientist should surface different results than the same query from a web developer. Personalization is achieved by blending general relevance signals with user history embeddings and real-time session context, then re-ranking results at query time.

Requirements

Functional Requirements

Retrieve a user history embedding that captures long-term interest patterns from past clicks, dwells, and explicit actions.
Build a session context vector from the current session query sequence and recent interactions.
Re-rank search results by combining base relevance scores with personalization scores.
Support cold-start: new users or anonymous sessions fall back to population-level trending signals.
Allow users to reset personalization history.

Non-Functional Requirements

Personalization scoring must add less than 20 ms to total query latency.
User embeddings must reflect history from the past 90 days; older signals decay.
The re-ranking model must be updatable without downtime.

Data Model

User Interest Profile

user_id — primary key.
interest_vector — dense float array (128-512 dims), updated by an offline embedding job nightly.
top_categories — sparse list of (category_id, weight) pairs for interpretability and fast filtering.
last_updated_at — used to detect stale profiles.

Session Context

session_id, user_id.
query_sequence — ordered list of query strings in this session.
clicked_item_ids — items interacted with in this session.
session_vector — running average embedding of session queries, updated per query.

Search Result Candidate

item_id, base_score — from the core retrieval engine (BM25 + semantic similarity).
personalization_score — dot product of item embedding and blended user+session vector.
final_score — weighted combination.
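The three records above could be expressed as plain dataclasses, as in the sketch below. Field names follow the data model; the Python types, the optional user_id for anonymous sessions, and the two-day staleness threshold in is_stale are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


@dataclass
class UserInterestProfile:
    user_id: str
    interest_vector: List[float]             # dense, 128-512 dims
    top_categories: List[Tuple[str, float]]  # (category_id, weight)
    last_updated_at: datetime

    def is_stale(self, now: datetime, max_age: timedelta = timedelta(days=2)) -> bool:
        """Assumed rule: stale if the nightly job missed more than one run."""
        return now - self.last_updated_at > max_age


@dataclass
class SessionContext:
    session_id: str
    user_id: Optional[str]                   # None for anonymous sessions (assumption)
    query_sequence: List[str] = field(default_factory=list)
    clicked_item_ids: List[str] = field(default_factory=list)
    session_vector: Optional[List[float]] = None


@dataclass
class SearchResultCandidate:
    item_id: str
    base_score: float
    personalization_score: float = 0.0
    final_score: float = 0.0


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    profile = UserInterestProfile("u1", [0.1, 0.2], [("books", 0.7)], now - timedelta(days=3))
    print(profile.is_stale(now))
```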

Core Algorithm: Re-ranking Pipeline

Step 1 — Profile Retrieval

At query time, fetch the user interest profile from a Redis cache (key: profile:{user_id}, TTL 1 hour). On cache miss, fall back to the feature store. For anonymous sessions, use a zero vector or a population average vector for cold-start.
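The lookup order in this step (cache, then feature store, then cold-start vector) can be sketched as below. To keep the example runnable without a server, TTLCache is an in-memory stand-in for Redis and the feature store is a plain dict; both, along with the function name fetch_profile_vector, are assumptions rather than the service's actual interfaces.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class TTLCache:
    """In-memory stand-in for the Redis profile cache."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[float, List[float]]] = {}

    def get(self, key: str) -> Optional[List[float]]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired, behave like a miss
            return None
        return value

    def set(self, key: str, value: List[float]) -> None:
        self._data[key] = (self._clock() + self._ttl, value)


def fetch_profile_vector(user_id: Optional[str], cache: TTLCache,
                         feature_store: Dict[str, List[float]],
                         population_vector: List[float]) -> List[float]:
    """Return the interest vector, falling back to the cold-start vector."""
    if user_id is None:                # anonymous session
        return population_vector
    key = f"profile:{user_id}"
    vec = cache.get(key)
    if vec is not None:                # cache hit
        return vec
    vec = feature_store.get(user_id)   # cache miss: go to the feature store
    if vec is None:                    # new user with no profile yet
        return population_vector
    cache.set(key, vec)                # repopulate with a 1-hour TTL
    return vec


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=3600)
    store = {"u1": [1.0, 0.0]}
    print(fetch_profile_vector("u1", cache, store, [0.5, 0.5]))
    print(fetch_profile_vector(None, cache, store, [0.5, 0.5]))
```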

Step 2 — Session Context Update

Fetch the current session context. Encode the new query using a lightweight bi-encoder (quantized to INT8 for speed). Update the session vector as a recency-weighted running average: session_v = 0.7 * session_v + 0.3 * query_v. Store the updated session context in Redis with a session TTL.
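The session update is an exponential moving average, sketched below with the 0.7 / 0.3 weights from the formula. How the very first query of a session is handled is not specified above; this sketch assumes the session vector is simply initialized to the first query embedding.

```python
from typing import Optional

import numpy as np


def update_session_vector(session_v: Optional[np.ndarray], query_v: np.ndarray,
                          decay: float = 0.7) -> np.ndarray:
    """Recency-weighted running average: session_v = decay * session_v + (1 - decay) * query_v."""
    query_v = np.asarray(query_v, dtype=float)
    if session_v is None:
        # Assumption: the first query in a session seeds the session vector.
        return query_v.copy()
    return decay * np.asarray(session_v, dtype=float) + (1.0 - decay) * query_v


if __name__ == "__main__":
    v = None
    for q in ([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]):
        v = update_session_vector(v, np.array(q))
        print(v)
```

Each older query's contribution shrinks by a factor of 0.7 per subsequent query, which is what makes the average recency-weighted.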

Step 3 — Blended User Vector

Blend the long-term profile and short-term session signals: blended_v = alpha * profile_v + (1 - alpha) * session_v, where alpha is the weight on the long-term profile. Alpha defaults to 0.6 but is tunable per query type: navigational queries use a lower alpha to weight session context higher, while exploratory queries use a higher alpha to weight historical interest more.
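A per-query-type lookup is one simple way to make alpha tunable, as sketched here. The 0.6 default comes from the text; the values for navigational and exploratory queries, the query_type labels, and the fallback when no session vector exists yet are hypothetical.

```python
from typing import Optional

import numpy as np

DEFAULT_ALPHA = 0.6
# Hypothetical per-type overrides; alpha is the weight on the long-term profile.
ALPHA_BY_QUERY_TYPE = {"navigational": 0.4, "exploratory": 0.8}


def blended_vector(profile_v: np.ndarray, session_v: Optional[np.ndarray],
                   query_type: Optional[str] = None) -> np.ndarray:
    """blended_v = alpha * profile_v + (1 - alpha) * session_v."""
    profile_v = np.asarray(profile_v, dtype=float)
    if session_v is None:
        # Assumption: with no session signal yet, use the profile alone.
        return profile_v
    alpha = ALPHA_BY_QUERY_TYPE.get(query_type, DEFAULT_ALPHA)
    return alpha * profile_v + (1.0 - alpha) * np.asarray(session_v, dtype=float)


if __name__ == "__main__":
    profile = np.array([1.0, 0.0])
    session = np.array([0.0, 1.0])
    print(blended_vector(profile, session))
    print(blended_vector(profile, session, "navigational"))
```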

Step 4 — Personalization Scoring

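For each candidate in the top-K retrieval set (typically 100-200 items), compute the dot product of the item embedding and blended_v. Items without embeddings receive a score of 0 (neutral re-ranking). Combine: final_score = (1 - beta) * base_score + beta * personalization_score, where beta in [0, 1] sets how strongly personalization can override base relevance. Because a raw dot product and base_score can sit on different scales, bring both to a comparable range (for example, min-max normalization over the candidate set) before combining. Return the top-N by final_score.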
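The scoring loop can be sketched as below. It assumes base scores and dot products are already on a comparable scale; the function name rerank, the tuple-based candidate format, and the default beta of 0.3 are illustrative choices, not values taken from the design.

```python
from typing import Dict, List, Tuple

import numpy as np


def rerank(candidates: List[Tuple[str, float]],
           item_embeddings: Dict[str, np.ndarray],
           blended_v: np.ndarray,
           beta: float = 0.3,
           top_n: int = 20) -> List[Tuple[str, float]]:
    """Score (item_id, base_score) pairs and return the top_n by final score.

    final_score = (1 - beta) * base_score + beta * personalization_score
    """
    blended_v = np.asarray(blended_v, dtype=float)
    scored = []
    for item_id, base_score in candidates:
        emb = item_embeddings.get(item_id)
        # Items without an embedding get a neutral personalization score of 0.
        p_score = float(np.dot(emb, blended_v)) if emb is not None else 0.0
        scored.append((item_id, (1.0 - beta) * base_score + beta * p_score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    cands = [("a", 0.9), ("b", 0.8), ("c", 0.7)]
    embs = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
    print(rerank(cands, embs, np.array([0.0, 1.0]), beta=0.5, top_n=2))
```

For 100-200 candidates, stacking the embeddings into one matrix and computing a single matrix-vector product would be faster than the per-item loop shown here.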

Step 5 — Diversity Injection

Apply Maximal Marginal Relevance (MMR) to the re-ranked list to prevent the result set from collapsing to a single sub-topic. MMR balances relevance against pairwise similarity of already-selected results.

API Design

GET /search?q=python&uid=X&sid=Y&limit=20 — returns personalized ranked results with item_id, title, and score breakdown.
GET /profile/{user_id} — returns top_categories and embedding freshness for debugging.
DELETE /profile/{user_id} — resets the interest profile; next query uses cold-start path.
POST /feedback — ingest click or skip signal to update session context in real time.

Scalability Considerations

Store item embeddings in a vector database (Weaviate, Pinecone, pgvector) for ANN retrieval when the candidate set must itself be personalized (not just re-ranked). Cache user profiles in Redis with write-through on nightly embedding updates. Serve the re-ranking model via ONNX Runtime for CPU inference within the 20 ms budget. A/B test personalization by routing a fraction of traffic to the baseline ranker and comparing click-through and dwell metrics. Version model artifacts in object storage and load new versions atomically using a shadow deployment pattern.

Summary

Search personalization requires combining offline user interest embeddings with online session context vectors, then blending both with base relevance scores at query time. The critical path — profile fetch, session update, score computation — must fit within strict latency budgets, making Redis caching and quantized model inference essential. Cold-start handling, diversity injection, and model versioning round out a production-ready design.