What Is Content-Based Filtering?
Content-based filtering recommends items similar to those a user has already engaged with, based on the attributes of the items themselves rather than on other users' behavior. If a user reads three articles about distributed systems, the engine finds other articles with similar topics, keywords, and structure. It does not require data from other users, which makes it cold-start friendly for new users, but it can over-specialize, creating a filter bubble.
Data Model
CREATE TABLE items (
    item_id      BIGINT PRIMARY KEY,
    title        VARCHAR(255),
    body         TEXT,
    category     VARCHAR(64),
    tags         JSON,        -- e.g. ["distributed-systems", "databases"]
    published_at TIMESTAMP
);
CREATE TABLE item_features (
    item_id      BIGINT PRIMARY KEY,
    tfidf_vector BLOB,        -- sparse TF-IDF vector
    dense_vector BLOB,        -- dense embedding, e.g. 384-dim from a sentence-transformer
    updated_at   TIMESTAMP
);
CREATE TABLE user_profiles (
    user_id         BIGINT PRIMARY KEY,
    interest_vector BLOB,     -- aggregated embedding of engaged items
    updated_at      TIMESTAMP
);
CREATE TABLE interactions (
    user_id    BIGINT,
    item_id    BIGINT,
    weight     FLOAT,         -- view=1.0, like=2.0, share=3.0, skip=-0.5
    created_at TIMESTAMP,
    INDEX (user_id, created_at)
);
Feature Extraction
Two main approaches for representing item content:
- TF-IDF (Term Frequency-Inverse Document Frequency): Classic sparse representation. For each item, tokenize and stem the text, then compute TF-IDF weights across the corpus. Similarity is cosine similarity over the sparse vectors. Fast to compute, interpretable, and works well for keyword-heavy domains.
- Dense embeddings: Use a pretrained transformer (e.g., sentence-transformers, OpenAI Ada, or a fine-tuned BERT variant) to encode item text into a fixed-size dense vector. Captures semantic similarity: two articles about the same concept in different words will be close in embedding space. Requires more compute but typically outperforms TF-IDF on semantic-matching tasks.
In practice, a hybrid approach works best: use dense embeddings for semantic retrieval, augment with TF-IDF or tag-overlap signals as re-ranking features.
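The TF-IDF route can be sketched with no dependencies at all (the toy corpus and whitespace tokenization are illustrative; a real pipeline would add stemming, stop-word removal, and IDF smoothing):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors (term -> weight) for a small corpus.

    TF = raw count / doc length; IDF = log(N / df). A sketch of the
    classic scheme, without the smoothing production systems use.
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        length = len(toks)
        vectors.append({t: (c / length) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "raft consensus in distributed systems",
    "paxos and raft consensus protocols",
    "sourdough bread baking basics",
]
vecs = tfidf_vectors(docs)
# The two consensus articles score higher against each other
# than either does against the baking article.
```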
User Profile Construction
The user interest vector is a weighted average of the item vectors the user has engaged with:
user_vector = sum(weight_i * item_vector_i) / sum(weight_i)
for all (item_i, weight_i) in user interactions
Recent interactions are up-weighted using an exponential decay: weight_i *= exp(-lambda * days_since_interaction). This ensures that a user who pivoted from sports to tech content gets tech recommendations, not sports.
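The weighted average with the recency decay folded in might look like this (toy 2-dimensional embeddings; `lam` is the decay rate lambda, here an illustrative value):

```python
import math
import numpy as np

def user_vector(interactions, item_vectors, lam=0.05):
    """Aggregate engaged-item embeddings into a user interest vector.

    interactions: list of (item_id, base_weight, days_since_interaction).
    Each base weight is decayed exponentially so recent engagement
    dominates, then item vectors are averaged with the decayed weights.
    """
    num = np.zeros_like(next(iter(item_vectors.values())), dtype=float)
    denom = 0.0
    for item_id, weight, days in interactions:
        w = weight * math.exp(-lam * days)
        num += w * item_vectors[item_id]
        denom += w
    return num / denom if denom else num

item_vectors = {
    1: np.array([1.0, 0.0]),  # toy "sports" direction
    2: np.array([0.0, 1.0]),  # toy "tech" direction
}
# A sports view from 60 days ago vs. a tech like from yesterday:
# the profile should lean heavily toward tech.
profile = user_vector([(1, 1.0, 60), (2, 2.0, 1)], item_vectors)
```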
Recommendation Workflow
- Fetch user_vector from user_profiles (recomputed on write or in batch).
- Run ANN search (FAISS, Weaviate, or Qdrant) over item_features to retrieve top-K candidates by cosine similarity.
- Filter out already-seen items and apply business rules (recency cap, category diversity).
- Re-rank using a lightweight model incorporating tag overlap, freshness, and engagement rate signals.
- Return final list; cache per user with TTL ~15 minutes.
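The retrieval and filtering steps above can be sketched with a brute-force scan standing in for the ANN index (item ids and vectors are illustrative; a real system swaps the full scan for FAISS or similar):

```python
import numpy as np

def recommend(user_vec, item_ids, item_matrix, seen, k=3):
    """Score all items by cosine similarity against the user vector,
    drop already-seen items, and return the top-k item ids."""
    norms = np.linalg.norm(item_matrix, axis=1) * np.linalg.norm(user_vec)
    scores = item_matrix @ user_vec / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-scores)  # indices from best to worst score
    return [item_ids[i] for i in order if item_ids[i] not in seen][:k]

ids = [10, 11, 12, 13]
vecs = np.array([[1.0, 0.0],
                 [0.9, 0.1],
                 [0.0, 1.0],
                 [0.5, 0.5]])
# Item 10 is already seen, so the next-best matches come back.
recs = recommend(np.array([1.0, 0.2]), ids, vecs, seen={10}, k=2)
```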
Cold Start Problem
Content-based filtering is significantly more resilient to cold start than collaborative filtering:
- New users: Even without interaction history, onboarding questions (preferred categories, topics) can populate an initial interest vector. One explicit preference is enough to start serving relevant content.
- New items: As soon as item metadata is ingested and the embedding computed, the item is retrievable. No interaction history needed. This is a major advantage over collaborative filtering for publishers with fast content cycles.
- Limitation: Without cross-user signal, content-based systems cannot surface serendipitous discoveries. A user who has only read Python tutorials will never see a relevant Rust article unless categories overlap.
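One way to sketch the new-user bootstrap from onboarding answers (the category centroids here are hypothetical toy vectors; in practice each would be the mean embedding of that category's items):

```python
import numpy as np

# Hypothetical per-category centroid embeddings (toy 2-d vectors).
CATEGORY_CENTROIDS = {
    "databases": np.array([0.8, 0.1]),
    "ml": np.array([0.1, 0.9]),
    "frontend": np.array([0.5, 0.5]),
}

def bootstrap_profile(chosen_categories):
    """Initial interest vector for a brand-new user: the mean of the
    centroids of the categories picked during onboarding."""
    picks = [CATEGORY_CENTROIDS[c] for c in chosen_categories]
    return np.mean(picks, axis=0)

vec = bootstrap_profile(["databases", "ml"])
```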
Scalability Considerations
- Embedding index size: At 384 floats (float32) per item, 10 million items require ~15 GB. Use product quantization (PQ) in FAISS to compress vectors 8-16x with minimal recall loss.
- Embedding freshness: Run embedding inference in a streaming pipeline (Kafka + workers) as new items are published. Target <5 minute lag between publish and availability in the ANN index.
- User profile updates: Write-through on each interaction event. For high-traffic users, batch updates every N events to avoid hot keys in Redis.
- Multi-language support: Use multilingual embedding models (e.g., multilingual-e5) and maintain separate per-language ANN indexes, or a single shared index with language as a filter dimension.
- Diversity: Pure similarity retrieval produces redundant clusters. Apply Maximal Marginal Relevance (MMR) at re-ranking time to balance relevance with diversity in the final result set.
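The MMR step can be sketched as a greedy loop, assuming all vectors are L2-normalized so a dot product equals cosine similarity (candidate vectors and the lambda value are illustrative):

```python
import numpy as np

def mmr(user_vec, item_matrix, k=3, lam=0.7):
    """Maximal Marginal Relevance re-ranking: greedily pick items that
    are relevant to the user but dissimilar to items already chosen.
    lam=1.0 is pure relevance; lower values trade relevance for diversity.
    """
    rel = item_matrix @ user_vec        # relevance of each candidate
    sim = item_matrix @ item_matrix.T   # pairwise candidate similarity
    selected, remaining = [], list(range(len(rel)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max(sim[i][j] for j in selected) if selected else 0.0
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

user = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0],
                       [1.0, 0.0],   # exact duplicate of the top item
                       [0.8, 0.6]])  # related but distinct
# With lam=0.4, the duplicate is penalized and the distinct item wins.
picked = mmr(user, candidates, k=2, lam=0.4)
```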
Summary
Content-based filtering builds on item feature vectors and aggregated user interest profiles. Dense embeddings from transformer models give it strong semantic retrieval quality, while TF-IDF and tag signals provide interpretable re-ranking features. Its cold-start resilience makes it an ideal complement to collaborative filtering: serve content-based results for new users and new items, blend in collaborative signals as interaction data accumulates, and apply diversity constraints to prevent over-specialization.