What Is Content-Based Filtering?
Content-based filtering recommends items similar to those a user has already engaged with, based on the attributes of the items themselves rather than on other users' behavior. If a user reads three articles about distributed systems, the engine finds other articles with similar topics, keywords, and structure. It does not require data from other users, which makes it cold-start friendly for new users, but it can over-specialize, creating a filter bubble.
Data Model
CREATE TABLE items (
    item_id      BIGINT PRIMARY KEY,
    title        VARCHAR(255),
    body         TEXT,
    category     VARCHAR(64),
    tags         JSON,        -- e.g. ["distributed-systems", "databases"]
    published_at TIMESTAMP
);
CREATE TABLE item_features (
    item_id      BIGINT PRIMARY KEY,
    tfidf_vector BLOB,        -- sparse TF-IDF vector
    dense_vector BLOB,        -- dense embedding, e.g. 384-dim from a sentence-transformer
    updated_at   TIMESTAMP
);
CREATE TABLE user_profiles (
    user_id         BIGINT PRIMARY KEY,
    interest_vector BLOB,     -- aggregated embedding of engaged items
    updated_at      TIMESTAMP
);
CREATE TABLE interactions (
    user_id    BIGINT,
    item_id    BIGINT,
    weight     FLOAT,         -- view=1.0, like=2.0, share=3.0, skip=-0.5
    created_at TIMESTAMP,
    INDEX (user_id, created_at)
);
Feature Extraction
Two main approaches for representing item content:
- TF-IDF (Term Frequency-Inverse Document Frequency): Classic sparse representation. For each item, tokenize and stem the text, then compute TF-IDF weights across the corpus. Similarity is cosine similarity over the sparse vectors. Fast to compute, interpretable, and works well for keyword-heavy domains.
- Dense embeddings: Use a pretrained transformer (e.g., sentence-transformers, OpenAI Ada, or a fine-tuned BERT variant) to encode item text into a fixed-size dense vector. Captures semantic similarity: two articles about the same concept in different words will be close in embedding space. Requires more compute but typically outperforms TF-IDF on semantic-matching tasks.
In practice, a hybrid approach works best: use dense embeddings for semantic retrieval, augment with TF-IDF or tag-overlap signals as re-ranking features.
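The TF-IDF route can be sketched with no dependencies at all (the toy corpus and whitespace tokenization are illustrative; a real pipeline would add stemming, stop-word removal, and IDF smoothing):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors (term -> weight) for a small corpus.

    TF = raw count / doc length; IDF = log(N / df). A sketch of the
    classic scheme, without the smoothing production systems use.
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        length = len(toks)
        vectors.append({t: (c / length) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "raft consensus in distributed systems",
    "paxos and raft consensus protocols",
    "sourdough bread baking basics",
]
vecs = tfidf_vectors(docs)
# The two consensus articles score higher against each other
# than either does against the baking article.
```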
User Profile Construction
The user interest vector is a weighted average of the item vectors the user has engaged with:
user_vector = sum(weight_i * item_vector_i) / sum(weight_i)
for all (item_i, weight_i) in user interactions
Recent interactions are up-weighted using an exponential decay: weight_i *= exp(-lambda * days_since_interaction). This ensures that a user who pivoted from sports to tech content gets tech recommendations, not sports.
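The weighted average with the recency decay folded in might look like this (toy 2-dimensional embeddings; `lam` is the decay rate lambda, here an illustrative value):

```python
import math
import numpy as np

def user_vector(interactions, item_vectors, lam=0.05):
    """Aggregate engaged-item embeddings into a user interest vector.

    interactions: list of (item_id, base_weight, days_since_interaction).
    Each base weight is decayed exponentially so recent engagement
    dominates, then item vectors are averaged with the decayed weights.
    """
    num = np.zeros_like(next(iter(item_vectors.values())), dtype=float)
    denom = 0.0
    for item_id, weight, days in interactions:
        w = weight * math.exp(-lam * days)
        num += w * item_vectors[item_id]
        denom += w
    return num / denom if denom else num

item_vectors = {
    1: np.array([1.0, 0.0]),  # toy "sports" direction
    2: np.array([0.0, 1.0]),  # toy "tech" direction
}
# A sports view from 60 days ago vs. a tech like from yesterday:
# the profile should lean heavily toward tech.
profile = user_vector([(1, 1.0, 60), (2, 2.0, 1)], item_vectors)
```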
Recommendation Workflow
- Fetch user_vector from user_profiles (recomputed on write or in batch).
- Run ANN search (FAISS, Weaviate, or Qdrant) over item_features to retrieve top-K candidates by cosine similarity.
- Filter out already-seen items and apply business rules (recency cap, category diversity).
- Re-rank using a lightweight model incorporating tag overlap, freshness, and engagement rate signals.
- Return final list; cache per user with TTL ~15 minutes.
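The retrieval and filtering steps above can be sketched with a brute-force scan standing in for the ANN index (item ids and vectors are illustrative; a real system swaps the full scan for FAISS or similar):

```python
import numpy as np

def recommend(user_vec, item_ids, item_matrix, seen, k=3):
    """Score all items by cosine similarity against the user vector,
    drop already-seen items, and return the top-k item ids."""
    norms = np.linalg.norm(item_matrix, axis=1) * np.linalg.norm(user_vec)
    scores = item_matrix @ user_vec / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-scores)  # indices from best to worst score
    return [item_ids[i] for i in order if item_ids[i] not in seen][:k]

ids = [10, 11, 12, 13]
vecs = np.array([[1.0, 0.0],
                 [0.9, 0.1],
                 [0.0, 1.0],
                 [0.5, 0.5]])
# Item 10 is already seen, so the next-best matches come back.
recs = recommend(np.array([1.0, 0.2]), ids, vecs, seen={10}, k=2)
```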
Cold Start Problem
Content-based filtering is significantly more resilient to cold start than collaborative filtering:
- New users: Even without interaction history, onboarding questions (preferred categories, topics) can populate an initial interest vector. One explicit preference is enough to start serving relevant content.
- New items: As soon as item metadata is ingested and the embedding computed, the item is retrievable. No interaction history needed. This is a major advantage over collaborative filtering for publishers with fast content cycles.
- Limitation: Without cross-user signal, content-based systems cannot surface serendipitous discoveries. A user who has only read Python tutorials will never see a relevant Rust article unless categories overlap.
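One way to sketch the new-user bootstrap from onboarding answers (the category centroids here are hypothetical toy vectors; in practice each would be the mean embedding of that category's items):

```python
import numpy as np

# Hypothetical per-category centroid embeddings (toy 2-d vectors).
CATEGORY_CENTROIDS = {
    "databases": np.array([0.8, 0.1]),
    "ml": np.array([0.1, 0.9]),
    "frontend": np.array([0.5, 0.5]),
}

def bootstrap_profile(chosen_categories):
    """Initial interest vector for a brand-new user: the mean of the
    centroids of the categories picked during onboarding."""
    picks = [CATEGORY_CENTROIDS[c] for c in chosen_categories]
    return np.mean(picks, axis=0)

vec = bootstrap_profile(["databases", "ml"])
```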
Scalability Considerations
- Embedding index size: At 384 floats (float32) per item, 10 million items require ~15 GB. Use product quantization (PQ) in FAISS to compress vectors 8-16x with minimal recall loss.
- Embedding freshness: Run embedding inference in a streaming pipeline (Kafka + workers) as new items are published. Target <5 minute lag between publish and availability in the ANN index.
- User profile updates: Write-through on each interaction event. For high-traffic users, batch updates every N events to avoid hot keys in Redis.
- Multi-language support: Use multilingual embedding models (e.g., multilingual-e5) and maintain separate per-language ANN indexes, or a single shared index with language as a filter dimension.
- Diversity: Pure similarity retrieval produces redundant clusters. Apply Maximal Marginal Relevance (MMR) at re-ranking time to balance relevance with diversity in the final result set.
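The MMR step can be sketched as a greedy loop, assuming all vectors are L2-normalized so a dot product equals cosine similarity (candidate vectors and the lambda value are illustrative):

```python
import numpy as np

def mmr(user_vec, item_matrix, k=3, lam=0.7):
    """Maximal Marginal Relevance re-ranking: greedily pick items that
    are relevant to the user but dissimilar to items already chosen.
    lam=1.0 is pure relevance; lower values trade relevance for diversity.
    """
    rel = item_matrix @ user_vec        # relevance of each candidate
    sim = item_matrix @ item_matrix.T   # pairwise candidate similarity
    selected, remaining = [], list(range(len(rel)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max(sim[i][j] for j in selected) if selected else 0.0
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

user = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0],
                       [1.0, 0.0],   # exact duplicate of the top item
                       [0.8, 0.6]])  # related but distinct
# With lam=0.4, the duplicate is penalized and the distinct item wins.
picked = mmr(user, candidates, k=2, lam=0.4)
```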
Summary
Content-based filtering builds on item feature vectors and aggregated user interest profiles. Dense embeddings from transformer models give it strong semantic retrieval quality, while TF-IDF and tag signals provide interpretable re-ranking features. Its cold-start resilience makes it an ideal complement to collaborative filtering: serve content-based results for new users and new items, blend in collaborative signals as interaction data accumulates, and apply diversity constraints to prevent over-specialization.