How is the user-item matrix structured in a collaborative filtering system?

The user-item matrix is an M×N sparse matrix where M is the number of users and N the number of items. Each cell holds an explicit rating or an implicit signal (e.g., click count, watch duration normalized to [0,1]). Because most users interact with a tiny fraction of items, the matrix is extremely sparse: in large systems more than 99.9% of cells are empty. It is typically stored either in compressed sparse row (CSR) format or as a list of (user_id, item_id, value) triples in columnar storage such as Parquet for efficient batch access.
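To make the CSR layout concrete, here is a minimal pure-Python sketch that converts triples into the three CSR arrays. The function name triples_to_csr is hypothetical, user IDs are assumed to be dense integers 0..M-1, and repeated (user, item) pairs are assumed to be summed; a production system would more likely rely on a sparse-matrix library.

```python
from collections import defaultdict


def triples_to_csr(triples, num_users):
    """Convert (user_id, item_id, value) triples to CSR arrays.

    Returns (indptr, indices, data): row u's entries live in
    indices[indptr[u]:indptr[u + 1]] and data[indptr[u]:indptr[u + 1]].
    """
    rows = defaultdict(dict)
    for user, item, value in triples:
        # Assumption: repeated (user, item) pairs are summed, e.g. several clicks.
        rows[user][item] = rows[user].get(item, 0.0) + value

    indptr, indices, data = [0], [], []
    for user in range(num_users):
        for item in sorted(rows.get(user, {})):
            indices.append(item)
            data.append(rows[user][item])
        indptr.append(len(indices))  # row boundary, even for empty rows
    return indptr, indices, data


triples = [(0, 2, 1.0), (2, 0, 5.0), (0, 2, 1.0), (2, 3, 0.5)]
indptr, indices, data = triples_to_csr(triples, num_users=3)
# indptr  -> [0, 1, 1, 3]   (user 1 has no interactions)
# indices -> [2, 0, 3]
# data    -> [2.0, 5.0, 0.5]
```

Only the non-empty cells are stored, so memory grows with the number of interactions rather than with M×N.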

How does ALS (Alternating Least Squares) training work for collaborative filtering?

ALS decomposes the user-item matrix into two low-rank factor matrices: U (users × k) and V (items × k). It alternates between holding V fixed and solving for each user vector in closed form, then holding U fixed and solving for each item vector. Each sub-problem is an independent least-squares solve, making the algorithm embarrassingly parallel. Distributed ALS (as in Spark MLlib) partitions user and item factor updates across workers, enabling training on matrices with hundreds of millions of entries.
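For reference, the explicit-feedback objective and the closed-form per-user update can be written as below. The notation is introduced here and is not from the text above: Ω is the set of observed (user, item) pairs, Ω_i the items observed for user i, V_{Ω_i} the rows of V for those items, r_{Ω_i} the matching observed values, and λ the regularization weight. Implicit-feedback variants add per-entry confidence weights, which this sketch leaves out.

```latex
\min_{U,V} \sum_{(i,j)\in\Omega} \left( r_{ij} - u_i^\top v_j \right)^2
  + \lambda \left( \sum_i \lVert u_i \rVert^2 + \sum_j \lVert v_j \rVert^2 \right)

% With V held fixed, each user vector has an independent closed-form solution:
u_i = \left( V_{\Omega_i}^\top V_{\Omega_i} + \lambda I_k \right)^{-1} V_{\Omega_i}^\top r_{\Omega_i}
```

The item update is symmetric: swap the roles of U and V and sum over the users who interacted with item j.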

How do you serve collaborative filtering recommendations with low latency using approximate lookup?

After training, item factor vectors are indexed in an Approximate Nearest Neighbor (ANN) structure, such as an HNSW graph in Faiss or a quantization-based index in ScaNN. At query time, the user's factor vector is retrieved from a key-value store (Redis or DynamoDB) and used as the query vector. The ANN index returns the top-k nearest item vectors in milliseconds without exhaustively scanning all items. The index is rebuilt periodically from the latest trained factors and hot-swapped atomically, so queries never see a partially built index and serving continues without downtime.
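One way to picture the atomic hot-swap is a holder object whose reference to the live index is replaced under a lock. This is a sketch under assumptions: the IndexHolder class and its methods are hypothetical, and a plain dict stands in for a real ANN index object.

```python
import threading


class IndexHolder:
    """Holds the live index and swaps in a new one atomically."""

    def __init__(self, index, version):
        self._lock = threading.Lock()
        self._current = (index, version)

    def get(self):
        # Readers take a snapshot; a later swap never mutates the tuple they hold.
        with self._lock:
            return self._current

    def swap(self, new_index, new_version):
        # Assumption: new_index is fully built and warmed before swap() is called.
        with self._lock:
            old = self._current
            self._current = (new_index, new_version)
        return old  # caller may release it once in-flight queries finish


holder = IndexHolder(index={"item_a": [0.1, 0.9]}, version="v1")
index, version = holder.get()
old_index, old_version = holder.swap({"item_a": [0.2, 0.8]}, "v2")
```

A query that called get() before the swap keeps using the old index until it completes, so no request observes a half-loaded structure.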

How do you handle cold-start in a collaborative filtering system?

Cold-start for new users is addressed by falling back to popularity-based or content-based recommendations until enough interactions accumulate (typically 5-10 events). A session-based model (e.g., a recurrent or transformer architecture over the current session's item sequence) can generate immediate personalization without historical data. For new items, their factor vector is estimated by averaging the factors of content-similar items using metadata embeddings, then refined as real interaction data arrives.
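The new-item estimate can be sketched as follows: rank existing items by cosine similarity of their metadata embeddings, then average the collaborative-filtering factors of the closest ones. The function names, the n_neighbors parameter, and the toy vectors are all illustrative assumptions, and the sketch assumes a non-empty catalog.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def estimate_item_factor(new_meta, catalog_meta, catalog_factors, n_neighbors=2):
    """Average the CF factors of the n most content-similar existing items."""
    ranked = sorted(
        catalog_meta,
        key=lambda item_id: cosine(new_meta, catalog_meta[item_id]),
        reverse=True,
    )
    neighbors = ranked[:n_neighbors]
    dim = len(catalog_factors[neighbors[0]])
    return [
        sum(catalog_factors[item_id][d] for item_id in neighbors) / len(neighbors)
        for d in range(dim)
    ]


# Toy data: metadata embeddings and trained factors for three existing items.
catalog_meta = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
catalog_factors = {"a": [0.2, 0.4], "b": [0.4, 0.8], "c": [-1.0, 0.0]}
estimate = estimate_item_factor([1.0, 0.05], catalog_meta, catalog_factors)
# Items "a" and "b" are the closest, so the estimate is roughly [0.3, 0.6].
```

The estimate is only a starting point; the next training run that includes real interactions for the item replaces it.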

Collaborative Filtering Service Low-Level Design: User-Item Matrix, ALS Training, and Online Serving


What Is a Collaborative Filtering Service?

A collaborative filtering service generates personalized recommendations by identifying patterns in how users interact with items — ratings, clicks, purchases, or views — without relying on content attributes. It models the user-item interaction space as a sparse matrix, factors that matrix offline using Alternating Least Squares (ALS) to produce compact user and item embedding vectors, and then serves real-time recommendations at low latency by performing nearest-neighbor lookups against a pre-built vector index.

Requirements

Functional Requirements

Collect implicit and explicit user-item interaction signals (views, purchases, ratings, skips).
Train user and item embedding vectors offline via ALS matrix factorization on a regular schedule.
Serve top-K item recommendations for a given user within 50 ms.
Support cold-start handling for new users with no interaction history.
Expose item-to-item similarity queries for “more like this” use cases.

Non-Functional Requirements

Support 100 million users and 10 million items.
Recommendation API p99 latency under 50 ms.
ALS training job completes within 4 hours on a weekly cadence.
Model freshness: new items indexed for similarity within 1 hour of publication.

Data Model

The Interaction event record stores user ID, item ID, interaction type, strength weight (e.g., purchase=5, view=1, skip=-1), and timestamp. These are written to an append-only Kafka topic and landed in a columnar store (Parquet on S3) for training. The UserEmbedding and ItemEmbedding records each store entity ID, embedding vector (float32 array of dimension d=128), model version, and training timestamp. Embeddings are kept in two places: a vector store (a Faiss index persisted on disk and served via a dedicated vector service) and Redis, where they are held as serialized byte arrays to give active users the fastest serving path.
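The records above might be modeled roughly as follows, together with one possible float32 serialization for the Redis copy. Field names such as timestamp_ms and trained_at_ms, the helper functions, and the example values are assumptions for illustration; the byte layout uses the machine's native byte order, so writers and readers would need to agree on it.

```python
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    user_id: int
    item_id: int
    interaction_type: str  # e.g. "purchase", "view", "skip"
    weight: float          # strength weight, e.g. purchase=5, view=1, skip=-1
    timestamp_ms: int


@dataclass(frozen=True)
class Embedding:
    entity_id: int
    vector: tuple          # d float values (d=128 in the text)
    model_version: str
    trained_at_ms: int


def encode_vector(vector):
    """Pack floats as float32 bytes: 4 bytes per dimension."""
    return array("f", vector).tobytes()


def decode_vector(blob):
    out = array("f")
    out.frombytes(blob)
    return list(out)


emb = Embedding(entity_id=42, vector=(0.5, -1.25, 2.0),
                model_version="example-v1", trained_at_ms=0)
blob = encode_vector(emb.vector)   # 3 dimensions -> 12 bytes
restored = decode_vector(blob)     # [0.5, -1.25, 2.0]
```

At d=128 each serialized vector occupies 512 bytes, which is what makes a per-user Redis value cheap to fetch on the hot path.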

Core Algorithms

ALS Matrix Factorization

ALS decomposes the sparse user-item interaction matrix R (m users × n items) into two dense factor matrices: user matrix U (m × d) and item matrix V (n × d), minimizing the reconstruction loss over observed entries plus an L2 regularization term. In each iteration, U is solved with V fixed (closed-form least squares per user row), then V is solved with U fixed. This alternation makes each step a parallelizable batch of independent linear-algebra problems. The training job runs on a Spark cluster: interactions are loaded from Parquet, the Spark MLlib ALS implementation runs for 15 iterations with rank=128 and regularization parameter lambda=0.1 (regParam in MLlib), and the resulting factor matrices are written back to S3 and pushed to the serving layer.
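The alternation can be illustrated with a small single-machine NumPy sketch. This is not the Spark MLlib implementation: it covers only the explicit-feedback case, uses a tiny rank, and the function and parameter names (als, reg, solve_side) are chosen here for illustration.

```python
import numpy as np


def als(ratings, num_users, num_items, rank=2, reg=0.1, iterations=15, seed=0):
    """Explicit-feedback ALS over (user, item, value) triples."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(num_users, rank))
    V = rng.normal(scale=0.1, size=(num_items, rank))

    by_user = [[] for _ in range(num_users)]
    by_item = [[] for _ in range(num_items)]
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    def solve_side(target, fixed, observed):
        # Each row is an independent regularized least-squares problem.
        for row, entries in enumerate(observed):
            if not entries:
                continue  # no observations: keep the current vector
            idx = [j for j, _ in entries]
            vals = np.array([r for _, r in entries])
            F = fixed[idx]  # factors of the observed counterparts
            A = F.T @ F + reg * np.eye(rank)
            target[row] = np.linalg.solve(A, F.T @ vals)

    for _ in range(iterations):
        solve_side(U, V, by_user)  # V fixed, solve every user row
        solve_side(V, U, by_item)  # U fixed, solve every item row
    return U, V


ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
U, V = als(ratings, num_users=3, num_items=3)
rmse = np.sqrt(np.mean([(r - U[u] @ V[i]) ** 2 for u, i, r in ratings]))
```

Because every row solve inside solve_side touches only its own observations and the fixed matrix, the loop body is the unit of work a distributed implementation can spread across workers.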

Online Serving via ANN

At serving time, the user embedding vector is fetched from Redis (or from the vector database on a cache miss) and queried against the item embedding index using approximate nearest-neighbor search (HNSW graph in Faiss). Because ALS scores a user-item pair by the dot product of the two vectors, the index should rank by inner product rather than Euclidean distance. HNSW can provide sub-millisecond query time at 95%+ recall for a 10-million-item index with d=128 at ef_search=50, although actual figures depend on hardware and index build parameters. More than K candidates are retrieved so that enough remain after post-filtering, which removes already-consumed items (looked up from the user interaction cache). The survivors are re-ranked by a lightweight business logic layer (e.g., boost in-stock items, apply diversity constraints) before the final top-K list is returned.
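The retrieve-then-filter flow can be sketched with an exact inner-product scan standing in for the ANN index. The recommend function, the overfetch factor, and the toy vectors are assumptions for illustration; in the design above the scan would be replaced by an HNSW query.

```python
import heapq


def recommend(user_vec, item_vecs, consumed, k, overfetch=3):
    """Exact inner-product top-K with post-filtering of consumed items."""
    def score(item_id):
        return sum(u * v for u, v in zip(user_vec, item_vecs[item_id]))

    # Over-fetch so that dropping consumed items can still leave k results.
    candidates = heapq.nlargest(k * overfetch, item_vecs, key=score)
    fresh = [(item_id, score(item_id))
             for item_id in candidates if item_id not in consumed]
    return fresh[:k]


item_vecs = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
result = recommend([2.0, 1.0], item_vecs, consumed={"a"}, k=2)
# result -> [("b", 1.5), ("c", 1.0)]; "a" scores highest but was already consumed
```

A fixed over-fetch factor is the simplest choice; users with long histories may need a larger factor or a second query when too few candidates survive.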

Cold-Start Handling

New users with fewer than 5 interactions receive popularity-based recommendations from a pre-computed global top-K list segmented by category, refreshed hourly. After 5 interactions, the user is assigned an initial embedding via item-based averaging: the mean of the embedding vectors of their interacted items serves as a proxy user vector until the next ALS training run produces a proper user embedding.
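A minimal sketch of this decision, assuming a hypothetical cold_start_vector helper that returns None when the caller should fall back to the popularity list:

```python
MIN_INTERACTIONS = 5  # threshold taken from the text


def cold_start_vector(interacted_item_ids, item_vecs):
    """Return a proxy user vector, or None to signal the popularity fallback."""
    known = [item_vecs[i] for i in interacted_item_ids if i in item_vecs]
    if len(interacted_item_ids) < MIN_INTERACTIONS or not known:
        return None
    dim = len(known[0])
    # Mean of the embeddings of the items the user interacted with.
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


item_vecs = {i: [float(i), 1.0] for i in range(6)}
few = cold_start_vector([0, 1], item_vecs)             # None -> popularity list
proxy = cold_start_vector([0, 1, 2, 3, 4], item_vecs)  # [2.0, 1.0]
```

The proxy vector lives in the same space as the item embeddings, so it can be sent to the same ANN index as a trained user vector.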

API Design

The recommendation API exposes GET /v1/recommend?user_id={id}&limit={k}&context={home|pdp|email}. The context parameter selects different post-filtering rules (e.g., email context excludes recently viewed items). An item-to-item endpoint at GET /v1/similar?item_id={id}&limit={k} queries item embeddings directly without a user vector. A POST /v1/interactions endpoint accepts batches of interaction events, writing to Kafka for async training ingestion and optionally updating the user interaction cache synchronously for immediate filtering. All endpoints return item IDs with scores; item metadata is fetched by the client from the item catalog service to keep the recommendation service decoupled from catalog data.
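Request validation for the recommend endpoint could look roughly like this. The default limit, the maximum limit, the default context, and the function name are assumptions not specified in the text; only the path, parameter names, and the three context values come from it.

```python
from urllib.parse import urlparse, parse_qs

ALLOWED_CONTEXTS = {"home", "pdp", "email"}


def parse_recommend_request(url, default_limit=10, max_limit=100):
    """Validate a GET /v1/recommend URL and return its parameters as a dict."""
    parsed = urlparse(url)
    if parsed.path != "/v1/recommend":
        raise ValueError("unknown path")
    query = parse_qs(parsed.query)
    if "user_id" not in query:
        raise ValueError("user_id is required")
    limit = int(query.get("limit", [default_limit])[0])
    if not 1 <= limit <= max_limit:
        raise ValueError("limit out of range")
    context = query.get("context", ["home"])[0]  # assumed default
    if context not in ALLOWED_CONTEXTS:
        raise ValueError("unsupported context")
    return {"user_id": query["user_id"][0], "limit": limit, "context": context}


params = parse_recommend_request("/v1/recommend?user_id=u123&limit=5&context=email")
# params -> {"user_id": "u123", "limit": 5, "context": "email"}
```

Rejecting unknown context values early keeps the post-filtering rules a closed set that the service can test exhaustively.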

Scalability and Infrastructure

The serving layer is split into two tiers. The embedding fetch tier reads from Redis (user embeddings for active users, with a 24-hour TTL) and falls back to the vector database for cold users. The ANN query tier runs Faiss HNSW indexes in-process on serving nodes, each node holding the full item index in RAM (128 dimensions × 10M items × 4 bytes ≈ 5.1 GB of raw vectors, plus the HNSW graph links on top). Index updates for new items run hourly: a batch job estimates embeddings for newly published items from the current item factor matrix (for example, by averaging the factors of content-similar items), appends them to the Faiss index with an online add operation, and publishes the updated index to all serving nodes via a versioned object store path with a rolling reload. User embeddings are refreshed after each weekly ALS run by a job that serializes all user vectors and writes them to Redis in pipelined batches.
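The memory figures follow from simple arithmetic, worked through below using the sizes stated in the requirements (10 million items, 100 million users, d=128, float32). The helper name is hypothetical, and the numbers cover raw vectors only, not HNSW graph links or Redis per-key overhead.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(count, dim):
    """Bytes for `count` float32 vectors of dimension `dim`, excluding index overhead."""
    return count * dim * BYTES_PER_FLOAT32


item_gb = raw_vector_bytes(10_000_000, 128) / 1e9   # item vectors held on each serving node
user_gb = raw_vector_bytes(100_000_000, 128) / 1e9  # all user vectors, if every user were cached
# item_gb is about 5.12, user_gb about 51.2
```

The tenfold gap is one reason the design keeps the full item index on every node while caching only active users in Redis.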