What Is a Collaborative Filtering Service?
A collaborative filtering service generates personalized recommendations by identifying patterns in how users interact with items — ratings, clicks, purchases, or views — without relying on content attributes. It models the user-item interaction space as a sparse matrix, factors that matrix offline using Alternating Least Squares (ALS) to produce compact user and item embedding vectors, and then serves real-time recommendations at low latency by performing nearest-neighbor lookups against a pre-built vector index.
Requirements
Functional Requirements
- Collect implicit and explicit user-item interaction signals (views, purchases, ratings, skips).
- Train user and item embedding vectors offline via ALS matrix factorization on a regular schedule.
- Serve top-K item recommendations for a given user within 50 ms.
- Support cold-start handling for new users with no interaction history.
- Expose item-to-item similarity queries for “more like this” use cases.
Non-Functional Requirements
- Support 100 million users and 10 million items.
- Recommendation API p99 latency under 50 ms.
- ALS training job completes within 4 hours on a weekly cadence.
- Model freshness: new items indexed for similarity within 1 hour of publication.
Data Model
The Interaction event record stores user ID, item ID, interaction type, strength weight (e.g., purchase=5, view=1, skip=-1), and timestamp. Events are written to an append-only Kafka topic and landed in a columnar store (Parquet on S3) for training. The UserEmbedding and ItemEmbedding records each store entity ID, embedding vector (float32 array of dimension d=128, i.e. 512 bytes per vector), model version, and training timestamp. Embeddings are persisted in a vector store (a Faiss index on disk, served by a dedicated vector service), and user embeddings for active users are also cached in Redis as serialized byte arrays, which gives the fastest serving path.
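The records above can be sketched in a few lines. This is a minimal illustration rather than a fixed schema: the field names, the millisecond timestamp, and the little-endian byte layout are assumptions, and only the example weights and d=128 come from the text.

```python
import struct
from dataclasses import dataclass

# Strength weights from the examples in the text; other types are not modeled here.
INTERACTION_WEIGHTS = {"purchase": 5.0, "view": 1.0, "skip": -1.0}
EMBEDDING_DIM = 128


@dataclass(frozen=True)
class Interaction:
    user_id: int
    item_id: int
    interaction_type: str
    timestamp_ms: int  # assumed unit; the text only says "timestamp"

    @property
    def weight(self) -> float:
        return INTERACTION_WEIGHTS[self.interaction_type]


def pack_embedding(vector):
    """Serialize a d=128 vector as little-endian float32 bytes (512 bytes)."""
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(vector)}")
    return struct.pack(f"<{EMBEDDING_DIM}f", *vector)


def unpack_embedding(blob):
    """Inverse of pack_embedding: bytes back to a list of floats."""
    return list(struct.unpack(f"<{EMBEDDING_DIM}f", blob))
```

A byte string produced by pack_embedding is the kind of value that could be stored under a per-user key in Redis; storing the model version alongside it lets the serving path detect stale vectors.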
Core Algorithms
ALS Matrix Factorization
ALS decomposes the sparse user-item interaction matrix R (m users x n items) into two dense factor matrices: user matrix U (m x d) and item matrix V (n x d), minimizing the reconstruction loss for observed entries plus L2 regularization. In each iteration, U is solved with V fixed (a closed-form regularized least-squares solve per user row), then V is solved with U fixed. With one side fixed, every row can be solved independently, so each step is a parallelizable batch of small linear-algebra problems. The training job runs on a Spark cluster: interactions are loaded from Parquet, the Spark MLlib ALS implementation runs for 15 iterations with rank=128 (the embedding dimension d) and lambda=0.1, and the resulting factor matrices are written back to S3 and pushed to the serving layer.
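The alternation can be illustrated on a toy matrix with NumPy. This is a single-machine sketch of the explicit-feedback objective described above, not the Spark MLlib implementation; production implementations differ in details such as distribution, regularization scaling, and implicit-feedback weighting. The function names and the tiny rank are illustrative assumptions.

```python
import numpy as np


def als_step(R, mask, fixed, lam):
    """Solve one factor matrix while the other (fixed) is held constant.

    R: (rows x cols) interactions, mask: 1 where an entry is observed,
    fixed: (cols x d) factors. Returns the (rows x d) solution.
    """
    d = fixed.shape[1]
    solved = np.zeros((R.shape[0], d))
    for i in range(R.shape[0]):
        obs = mask[i] > 0
        F = fixed[obs]                      # factors of the observed columns
        A = F.T @ F + lam * np.eye(d)       # regularized normal equations
        b = F.T @ R[i, obs]
        solved[i] = np.linalg.solve(A, b)   # independent d x d solve per row
    return solved


def als(R, mask, d=2, lam=0.1, iterations=15, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(R.shape[0], d))
    V = rng.normal(scale=0.1, size=(R.shape[1], d))
    for _ in range(iterations):
        U = als_step(R, mask, V, lam)       # update users, items fixed
        V = als_step(R.T, mask.T, U, lam)   # update items, users fixed
    return U, V


def loss(R, mask, U, V, lam):
    """Squared error on observed entries plus L2 penalty on both factors."""
    err = mask * (R - U @ V.T)
    return float((err ** 2).sum() + lam * ((U ** 2).sum() + (V ** 2).sum()))
```

Because each half-step exactly minimizes the objective for one side, the loss cannot increase from one iteration to the next, which is a useful sanity check when tuning rank and lambda.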
Online Serving via ANN
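At serving time, the user embedding vector is fetched from Redis (or from the vector database on a cache miss) and queried against the item embedding index using approximate nearest-neighbor search (an HNSW graph in Faiss). Because ALS scores a user-item pair by the dot product of their vectors, the index should rank by inner product rather than Euclidean distance. For a 10-million-item index with d=128, HNSW at ef_search=50 typically answers a query in about a millisecond or less at 95%+ recall, although the exact figures depend on graph parameters and hardware and should be benchmarked. The ANN query requests more than K candidates, because post-filtering removes already-consumed items (looked up from the user interaction cache). The survivors are re-ranked by a lightweight business logic layer (e.g., boost in-stock items, apply diversity constraints) and truncated to K before the final result list is returned.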
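The filter-and-re-rank flow can be sketched as follows. To stay self-contained, the sketch uses an exact brute-force dot-product scan where the real service would call the HNSW index; the overfetch factor and the additive boost map are hypothetical stand-ins for the business logic layer.

```python
import numpy as np


def recommend(user_vec, item_vecs, item_ids, consumed, k, overfetch=3, boost=None):
    """Score by inner product, drop consumed items, apply boosts, return top k."""
    scores = item_vecs @ user_vec
    # Fetch more than k candidates so that filtering still leaves k results.
    n_candidates = min(len(item_ids), k * overfetch)
    # Exact top-n; an HNSW index would return an approximation of this set.
    top = np.argsort(-scores)[:n_candidates]
    boost = boost or {}
    ranked = [
        (item_ids[i], float(scores[i]) + boost.get(item_ids[i], 0.0))
        for i in top
        if item_ids[i] not in consumed
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

The overfetch factor trades a slightly larger ANN query for a lower chance of returning fewer than K items to heavy users whose top candidates are mostly already consumed.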
Cold-Start Handling
New users with fewer than 5 interactions receive popularity-based recommendations from a pre-computed global top-K list segmented by category, refreshed hourly. After 5 interactions, the user is assigned an initial embedding via item-based averaging: the mean of the embedding vectors of their interacted items serves as a proxy user vector until the next ALS training run produces a proper user embedding.
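The decision logic above can be written as a small function. This is a sketch under assumptions: the function names and the returned strategy labels are illustrative, and only the 5-interaction threshold and the averaging rule come from the text.

```python
MIN_INTERACTIONS = 5  # threshold from the text


def proxy_user_vector(item_vectors):
    """Element-wise mean of the interacted items' embedding vectors."""
    n = len(item_vectors)
    return [sum(component) / n for component in zip(*item_vectors)]


def resolve_user_vector(trained_vector, interacted_item_ids, item_embeddings):
    """Return (strategy, vector); vector is None for the popularity fallback."""
    if trained_vector is not None:
        return "als", trained_vector
    # Items published after the last index build may not have an embedding yet.
    known = [item_embeddings[i] for i in interacted_item_ids if i in item_embeddings]
    if len(interacted_item_ids) < MIN_INTERACTIONS or not known:
        return "popularity", None
    return "item_average", proxy_user_vector(known)
```

For example, a user with five interactions and no trained embedding gets the "item_average" strategy, and the resulting vector can be passed to the same ANN query path as a trained user embedding.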
API Design
The recommendation API exposes GET /v1/recommend?user_id={id}&limit={k}&context={home|pdp|email}. The context parameter selects different post-filtering rules (e.g., email context excludes recently viewed items). An item-to-item endpoint at GET /v1/similar?item_id={id}&limit={k} queries item embeddings directly without a user vector. A POST /v1/interactions endpoint accepts batches of interaction events, writing to Kafka for async training ingestion and optionally updating the user interaction cache synchronously for immediate filtering. All endpoints return item IDs with scores; item metadata is fetched by the client from the item catalog service to keep the recommendation service decoupled from catalog data.
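A request to the recommend endpoint might be validated as sketched below. The default limit, the maximum limit, the default context, and the response field names are assumptions made for illustration; the text specifies only the path, the three parameters, and that responses carry item IDs with scores.

```python
from urllib.parse import parse_qs, urlparse

ALLOWED_CONTEXTS = {"home", "pdp", "email"}  # values listed in the text
DEFAULT_LIMIT = 10   # hypothetical default
MAX_LIMIT = 100      # hypothetical cap


def parse_recommend_request(url):
    """Validate a GET /v1/recommend URL and return its parameters as a dict."""
    parsed = urlparse(url)
    if parsed.path != "/v1/recommend":
        raise ValueError(f"unknown path: {parsed.path}")
    params = parse_qs(parsed.query)
    user_id = params.get("user_id", [None])[0]
    if not user_id:
        raise ValueError("user_id is required")
    limit = int(params.get("limit", [str(DEFAULT_LIMIT)])[0])
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    context = params.get("context", ["home"])[0]
    if context not in ALLOWED_CONTEXTS:
        raise ValueError(f"unsupported context: {context}")
    return {"user_id": user_id, "limit": limit, "context": context}


def build_response(scored_items):
    """Item IDs with scores only; metadata is left to the catalog service."""
    return {"items": [{"item_id": i, "score": s} for i, s in scored_items]}
```

Rejecting unknown contexts early keeps the post-filtering rules explicit: each accepted context maps to exactly one rule set.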
Scalability and Infrastructure
The serving layer is split into two tiers. The embedding fetch tier reads from Redis (user embeddings for active users, with a 24-hour TTL) and falls back to the vector database for cold users. The ANN query tier runs Faiss HNSW indexes in-process on serving nodes, each node holding the full item index in RAM. The raw vectors alone take 128 dimensions x 10M items x 4 bytes = ~5.1 GB, and the HNSW graph links add to that, so nodes need headroom above 5 GB. Index updates for new items run hourly: a batch job computes embeddings for newly published items from their early interactions (for example, with one least-squares fold-in step against the current user factors, since a new item has no row in the trained item matrix yet), appends them to the Faiss index with an online add operation, and publishes the updated index to all serving nodes via a versioned object store path with a rolling reload. User embedding refreshes happen after each weekly ALS run via a pipeline that serializes all user vectors and streams them into Redis with pipeline batching.
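The sizing figures can be checked with a few lines of arithmetic. The link estimate is a rough approximation that assumes an HNSW connectivity parameter of M=32 and 4-byte neighbor IDs, counts only the base layer (up to 2M links per node), and ignores upper layers and allocator overhead; none of those values come from the text.

```python
def raw_vectors_gb(n_vectors, dim=128, bytes_per_float=4):
    """Memory for the float32 vectors alone, in decimal gigabytes."""
    return n_vectors * dim * bytes_per_float / 1e9


def hnsw_links_gb(n_vectors, m=32, bytes_per_link=4):
    """Approximate base-layer graph memory: up to 2*M neighbor IDs per node."""
    return n_vectors * 2 * m * bytes_per_link / 1e9


item_vectors_gb = raw_vectors_gb(10_000_000)     # 5.12
item_links_gb = hnsw_links_gb(10_000_000)        # 2.56 under the M=32 assumption
all_users_gb = raw_vectors_gb(100_000_000)       # 51.2 if every user were cached
```

Under these assumptions the item index needs closer to 8 GB than 5 GB per serving node, and caching every one of the 100 million user vectors would take about 51 GB before Redis overhead, which is why the Redis tier holds only active users.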