What Is Search Personalization?
Search personalization adjusts ranking and result selection based on an individual user's context rather than returning the same ranked list to every user. A query for “python” from a data scientist should surface different results than the same query from a web developer. Personalization is achieved by blending general relevance signals with user history embeddings and real-time session context, then re-ranking results at query time.
Requirements
Functional Requirements
- Retrieve a user history embedding that captures long-term interest patterns from past clicks, dwells, and explicit actions.
- Build a session context vector from the current session query sequence and recent interactions.
- Re-rank search results by combining base relevance scores with personalization scores.
- Support cold-start: new users or anonymous sessions fall back to population-level trending signals.
- Allow users to reset personalization history.
Non-Functional Requirements
- Personalization scoring must add less than 20 ms to total query latency.
- User embeddings must reflect history from the past 90 days; older signals decay.
- The re-ranking model must be updatable without downtime.
Data Model
User Interest Profile
- user_id — primary key.
- interest_vector — dense float array (128-512 dims), updated nightly by an offline embedding job.
- top_categories — sparse list of (category_id, weight) pairs for interpretability and fast filtering.
- last_updated_at — used to detect stale profiles.
Session Context
- session_id, user_id.
- query_sequence — ordered list of query strings in this session.
- clicked_item_ids — items interacted with in this session.
- session_vector — running average embedding of session queries, updated per query.
Search Result Candidate
- item_id, base_score — from the core retrieval engine (BM25 + semantic similarity).
- personalization_score — dot product of item embedding and blended user+session vector.
- final_score — weighted combination.
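A minimal sketch of these three records as Python dataclasses. Field names come from this section; the types, defaults, and comments are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class UserInterestProfile:
    user_id: str
    interest_vector: list          # dense floats, e.g. 128-512 dims
    top_categories: list           # (category_id, weight) pairs
    last_updated_at: float = 0.0   # epoch seconds, for staleness checks

@dataclass
class SessionContext:
    session_id: str
    user_id: str = ""              # empty for anonymous sessions
    query_sequence: list = field(default_factory=list)
    clicked_item_ids: list = field(default_factory=list)
    session_vector: list = field(default_factory=list)

@dataclass
class ResultCandidate:
    item_id: str
    base_score: float                   # BM25 + semantic similarity
    personalization_score: float = 0.0  # dot(item embedding, blended vector)
    final_score: float = 0.0            # weighted combination
```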
Core Algorithm: Re-ranking Pipeline
Step 1 — Profile Retrieval
At query time, fetch the user interest profile from a Redis cache (key: profile:{user_id}, TTL 1 hour). On cache miss, fall back to the feature store. For anonymous sessions, use a zero vector or a population average vector for cold-start.
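Step 1 can be sketched as a cache-aside lookup. The Redis client and feature store are abstracted behind simple get/setex-style objects here; the wiring, DIM, and the zero-vector fallback are assumptions for illustration:

```python
DIM = 128
POPULATION_AVG = [0.0] * DIM   # cold-start fallback; could be a population-average vector

def fetch_profile_vector(user_id, cache, feature_store, ttl_s=3600):
    """Cache-aside profile lookup: Redis first, feature store on miss.

    cache is any client exposing redis-style get/setex; feature_store is any
    object with a get(user_id) method (both hypothetical wiring).
    """
    if user_id is None:
        return POPULATION_AVG                 # anonymous session: cold start
    key = f"profile:{user_id}"
    vec = cache.get(key)
    if vec is not None:
        return vec                            # cache hit
    vec = feature_store.get(user_id)          # slower authoritative store
    if vec is None:
        return POPULATION_AVG                 # brand-new user: cold start
    cache.setex(key, ttl_s, vec)              # warm the cache (TTL 1 hour)
    return vec
```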
Step 2 — Session Context Update
Fetch the current session context. Encode the new query using a lightweight bi-encoder (quantized to INT8 for speed). Update the session vector as a recency-weighted running average: session_v = 0.7 * session_v + 0.3 * query_v. Store the updated session context in Redis with a session TTL.
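The running-average update itself is one line of arithmetic; a sketch with plain lists (the encoder producing query_v is assumed to run upstream):

```python
def update_session_vector(session_v, query_v, w_new=0.3):
    """Recency-weighted running average from Step 2.

    session_v is None for the first query of a session; query_v is the
    output of the (assumed) quantized bi-encoder for the new query.
    """
    if session_v is None:
        return list(query_v)              # first query seeds the session
    return [(1 - w_new) * s + w_new * q for s, q in zip(session_v, query_v)]
```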
Step 3 — Blended User Vector
Blend the long-term profile and short-term session signals: blended_v = alpha * profile_v + (1 - alpha) * session_v. Alpha defaults to 0.6 but is tunable per query type; navigational queries weight session context higher, exploratory queries weight historical interest more.
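A sketch of the blend with a per-query-type alpha table; the default of 0.6 is from the text, while the navigational and exploratory values are illustrative placeholders consistent with the stated direction of adjustment:

```python
ALPHA_BY_QUERY_TYPE = {
    "default": 0.6,
    "navigational": 0.4,    # illustrative: weight session context higher
    "exploratory": 0.75,    # illustrative: weight historical interest higher
}

def blend_vectors(profile_v, session_v, query_type="default"):
    """Step 3: blended_v = alpha * profile_v + (1 - alpha) * session_v."""
    alpha = ALPHA_BY_QUERY_TYPE.get(query_type, ALPHA_BY_QUERY_TYPE["default"])
    return [alpha * p + (1 - alpha) * s for p, s in zip(profile_v, session_v)]
```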
Step 4 — Personalization Scoring
For each candidate in the top-K retrieval set (typically 100-200 items), compute the dot product of the item embedding and blended_v. Items without embeddings receive a score of 0 (neutral re-ranking). Combine: final_score = (1 - beta) * base_score + beta * personalization_score. Return the top-N by final_score.
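The scoring loop can be sketched as follows; beta = 0.3 is an assumed tuning weight, not a value from the text:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rerank(candidates, item_embeddings, blended_v, beta=0.3, top_n=20):
    """Step 4: final_score = (1 - beta) * base_score + beta * personalization.

    candidates: list of (item_id, base_score) from the retrieval engine;
    item_embeddings: dict of vectors. Items without embeddings get a
    personalization score of 0, i.e. neutral re-ranking.
    """
    scored = []
    for item_id, base_score in candidates:
        emb = item_embeddings.get(item_id)
        p_score = dot(emb, blended_v) if emb is not None else 0.0
        scored.append((item_id, (1 - beta) * base_score + beta * p_score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

In practice, base_score and the dot product live on different scales, so both should be normalized to comparable ranges before blending.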
Step 5 — Diversity Injection
Apply Maximal Marginal Relevance (MMR) to the re-ranked list to prevent the result set from collapsing to a single sub-topic. MMR balances relevance against pairwise similarity of already-selected results.
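A greedy MMR pass over the re-ranked list might look like this; lam = 0.7 is an illustrative default, and embeddings are assumed unit-norm so the dot product equals cosine similarity:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mmr_select(ranked, embeddings, lam=0.7, k=10):
    """Step 5: greedy Maximal Marginal Relevance.

    ranked: list of (item_id, relevance) sorted by final_score;
    embeddings: dict of unit-norm item vectors.
    lam = 1.0 is pure relevance, lam = 0.0 is pure diversity.
    """
    pool = dict(ranked)
    selected = []
    while pool and len(selected) < k:
        def mmr_score(item_id):
            # Penalize similarity to anything already selected.
            max_sim = max((dot(embeddings[item_id], embeddings[s])
                           for s in selected), default=0.0)
            return lam * pool[item_id] - (1 - lam) * max_sim
        best = max(pool, key=mmr_score)
        selected.append(best)
        del pool[best]
    return selected
```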
API Design
- GET /search?q=python&uid=X&sid=Y&limit=20 — returns personalized ranked results with item_id, title, and score breakdown.
- GET /profile/{user_id} — returns top_categories and embedding freshness for debugging.
- DELETE /profile/{user_id} — resets the interest profile; next query uses cold-start path.
- POST /feedback — ingest click or skip signal to update session context in real time.
Scalability Considerations
- Store item embeddings in a vector database (Weaviate, Pinecone, pgvector) for ANN retrieval when the candidate set must itself be personalized, not just re-ranked.
- Cache user profiles in Redis with write-through on nightly embedding updates.
- Serve the re-ranking model via ONNX Runtime for CPU inference within the 20 ms budget.
- A/B test personalization by routing a fraction of traffic to the baseline ranker and comparing click-through and dwell metrics.
- Version model artifacts in object storage and load new versions atomically using a shadow deployment pattern.
Summary
Search personalization requires combining offline user interest embeddings with online session context vectors, then blending both with base relevance scores at query time. The critical path — profile fetch, session update, score computation — must fit within strict latency budgets, making Redis caching and quantized model inference essential. Cold-start handling, diversity injection, and model versioning round out a production-ready design.
FAQ
How are offline interest embeddings built for search personalization?
Train a two-tower model on historical (user, item) interaction pairs using implicit feedback signals. The user tower ingests interaction history and demographic features to produce a dense user embedding. Embeddings are pre-computed nightly via a batch job and stored in a key-value store keyed by user_id. At query time the stored embedding is retrieved in under 1 ms for use in retrieval and ranking.
How do you blend session context with long-term interest embeddings?
Represent the current session as an average or attention-weighted embedding of the queries and items clicked or dwelled on since session start. Blend with the long-term user embedding using a learned or heuristic interpolation: blended = alpha * longterm_embedding + (1 - alpha) * session_embedding. Alpha decreases as session depth grows, progressively shifting weight toward fresh in-session intent; early in a session, long-term signals dominate.
How does bi-encoder query encoding work in personalized search?
A bi-encoder encodes queries and documents independently into a shared embedding space. At query time, the query text is encoded by the query encoder into a dense vector, and approximate nearest-neighbor search (e.g., HNSW via Faiss or Weaviate) retrieves the top-K candidate documents by cosine similarity. The query embedding can be conditioned on or concatenated with the user embedding to shift retrieval toward personalized results without re-indexing documents.
How does Maximal Marginal Relevance (MMR) provide diversity in search re-ranking?
MMR iteratively selects results by maximizing a trade-off: score(d) = lambda * relevance(d, query) - (1 - lambda) * max_similarity(d, already_selected). At each step the document with the highest MMR score is added to the result list. Lambda controls the relevance-diversity trade-off (lambda = 1 is pure relevance, lambda = 0 is pure diversity). This prevents the top results from being near-duplicates while preserving overall relevance.