What is a Feature Store?
A feature store is a centralized repository for ML features. It solves the “feature engineering duplication” problem: without a feature store, each ML team reimplements the same feature computations (e.g., “user’s average purchase value in the last 30 days”), leading to inconsistency between training and serving, duplicated engineering effort, and bugs when the computation differs between environments. A feature store provides: one canonical feature definition, consistent values at training time and serving time, and reuse across models.
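To make "one canonical feature definition" concrete, here is a minimal sketch of the example feature above, written once and evaluated at both training and serving time. The function name, the (timestamp, amount) input shape, and the 0.0 default for an empty window are assumptions made for illustration, not part of any particular feature store's API.

```python
from datetime import datetime, timedelta

def avg_purchase_value_30d(purchases, as_of):
    """Average purchase value in the 30 days up to and including `as_of`.

    `purchases` is an iterable of (timestamp, amount) pairs.
    Returns 0.0 when the window is empty (an assumed default).
    """
    window_start = as_of - timedelta(days=30)
    amounts = [amt for ts, amt in purchases if window_start < ts <= as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0

purchases = [
    (datetime(2024, 1, 5), 20.0),
    (datetime(2024, 1, 20), 40.0),
    (datetime(2024, 3, 1), 100.0),
]

# Training: evaluate the definition as of a historical label timestamp.
train_value = avg_purchase_value_30d(purchases, datetime(2024, 1, 31))
# Serving: evaluate the same definition as of request time.
serve_value = avg_purchase_value_30d(purchases, datetime(2024, 3, 2))
```

Because both paths call the same function, the two environments cannot drift apart in how the window or the average is computed.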
Requirements
- Store pre-computed features for millions of entities (users, items, sessions)
- Serve features for online inference with <10ms latency
- Provide historical features for offline model training (point-in-time correct)
- Support batch feature computation and real-time (streaming) feature updates
- Feature versioning and lineage tracking
Architecture: Online + Offline Store
Offline Store (training):
  Raw data (S3/DWH) → Feature Pipeline (Spark/dbt) → Parquet on S3
                                                   → BigQuery / Snowflake
  Training: point-in-time join of features at label timestamp
Online Store (serving):
  Feature Pipeline → Redis / DynamoDB (key=entity_id, value=feature_vector)
  Inference: fetch all features for entity_id in <10ms
Streaming Features:
  Events → Kafka → Flink (real-time aggregations) → Online Store
                                                  → Offline Store (event log for training)
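The "real-time aggregations" step can be illustrated with a small in-process sliding-window counter. This is only a stand-in for what a stream processor such as Flink would do at scale; the class name, the `events_last_hour` feature name, and the dict used as an online store are hypothetical, and the sketch assumes events arrive in timestamp order per entity.

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events per entity over the trailing `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = {}  # entity_id -> deque of event timestamps (seconds)

    def add(self, entity_id, ts):
        """Record an event at `ts` and return the count in the window (ts - window, ts]."""
        q = self.events.setdefault(entity_id, deque())
        q.append(ts)
        # Evict timestamps that have fallen out of the window.
        while q and q[0] <= ts - self.window_seconds:
            q.popleft()
        return len(q)

online_store = {}  # stand-in for Redis / DynamoDB
counter = SlidingWindowCounter(window_seconds=3600)

for entity_id, ts in [("u1", 0), ("u1", 1800), ("u2", 2000), ("u1", 4000)]:
    # Each event updates the aggregate and overwrites the online value.
    online_store[(entity_id, "events_last_hour")] = counter.add(entity_id, ts)
```

A production pipeline would also append each raw event to the offline store so that the same aggregate can be recomputed for training.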
Data Model
FeatureGroup(group_id, name, entity_type ENUM(USER,ITEM,SESSION),
             description, owner, version, created_at)
Feature(feature_id, group_id, name, dtype ENUM(FLOAT,INT,STRING,EMBEDDING),
        description, default_value, created_at)
FeatureValue(entity_id VARCHAR, feature_id UUID, value JSONB,
             event_timestamp TIMESTAMP, created_timestamp TIMESTAMP)
-- Online: latest value per entity
-- Offline: full history (partitioned by date)
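The difference between the two storage modes can be sketched with plain in-memory structures. The `FeatureStoreSketch` class and its method names are hypothetical; a dict stands in for the online store and date-keyed lists stand in for date-partitioned offline files. The rule that a late-arriving event must not overwrite a newer online value is an assumption of this sketch.

```python
from collections import defaultdict
from datetime import datetime

class FeatureStoreSketch:
    def __init__(self):
        self.online = {}                  # (entity_id, feature) -> (event_ts, value)
        self.offline = defaultdict(list)  # date partition -> list of rows

    def write(self, entity_id, feature, value, event_ts):
        # Offline: always append, partitioned by event date (full history).
        self.offline[event_ts.date()].append((entity_id, feature, value, event_ts))
        # Online: keep only the value with the latest event timestamp.
        key = (entity_id, feature)
        current = self.online.get(key)
        if current is None or event_ts >= current[0]:
            self.online[key] = (event_ts, value)

    def get_online(self, entity_id, feature):
        entry = self.online.get((entity_id, feature))
        return None if entry is None else entry[1]

store = FeatureStoreSketch()
store.write("u1", "avg_purchase_30d", 25.0, datetime(2024, 1, 2, 9, 0))
store.write("u1", "avg_purchase_30d", 30.0, datetime(2024, 1, 3, 9, 0))
# A late-arriving event is kept in history but does not replace the newer online value.
store.write("u1", "avg_purchase_30d", 20.0, datetime(2024, 1, 1, 9, 0))
```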
Point-in-Time Correct Joins (Training)
The critical requirement for training: when creating a training row for a label event at time T, the features must be the values that were available at time T — not future values (data leakage). Point-in-time join: for each (entity_id, label_timestamp) pair, find the most recent feature value with event_timestamp <= label_timestamp.
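-- Point-in-time join (SQL):
-- Join each label to every feature row at or before its timestamp,
-- then keep only the most recent row per (entity_id, label_timestamp).
-- If FeatureValue holds several features, also filter or partition by feature_id.
SELECT entity_id, label, value
FROM (
  SELECT labels.entity_id, labels.label, features.value,
         ROW_NUMBER() OVER (
           PARTITION BY labels.entity_id, labels.label_timestamp
           ORDER BY features.event_timestamp DESC
         ) AS rn
  FROM labels
  JOIN FeatureValue AS features
    ON features.entity_id = labels.entity_id
   AND features.event_timestamp <= labels.label_timestamp
) ranked
WHERE rn = 1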
Feast (open source feature store) and Tecton handle point-in-time joins automatically.
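For intuition, the same point-in-time lookup can be sketched in plain Python with a per-entity sorted history and a binary search. This is an illustrative stand-in, not how Feast or Tecton implement it; the function name and tuple layouts are assumptions, and timestamps are plain integers to keep the example short.

```python
from bisect import bisect_right
from collections import defaultdict

def point_in_time_join(labels, feature_rows):
    """Attach to each label the latest feature value with event_ts <= label_ts.

    labels:       list of (entity_id, label_ts, label)
    feature_rows: list of (entity_id, event_ts, value)
    Returns a list of (entity_id, label_ts, label, value_or_None).
    """
    history = defaultdict(list)
    for entity_id, event_ts, value in feature_rows:
        history[entity_id].append((event_ts, value))

    # Sort each entity's history once and keep timestamps in a parallel list.
    timestamps, values = {}, {}
    for entity_id, rows in history.items():
        rows.sort(key=lambda r: r[0])
        timestamps[entity_id] = [ts for ts, _ in rows]
        values[entity_id] = [v for _, v in rows]

    joined = []
    for entity_id, label_ts, label in labels:
        ts_list = timestamps.get(entity_id, [])
        # Number of feature rows with event_ts <= label_ts.
        idx = bisect_right(ts_list, label_ts)
        value = values[entity_id][idx - 1] if idx > 0 else None
        joined.append((entity_id, label_ts, label, value))
    return joined

features = [("u1", 10, 1.0), ("u1", 30, 3.0), ("u1", 20, 2.0)]
labels = [("u1", 25, 1), ("u1", 5, 0), ("u2", 25, 1)]
training_rows = point_in_time_join(labels, features)
```

Unlike the inner join in the SQL version, this sketch keeps labels that have no earlier feature value and returns None for them, so the caller can substitute the feature's default_value.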
Online Feature Serving
The inference latency budget is 50ms in total, of which feature lookup should take under 10ms. Store each entity's feature vector as a Redis hash with one field per feature, so that HGETALL user:{user_id} returns all of a user's features in one round trip. Store high-dimensional embedding features as binary values (for example, MessagePack-serialized). For batch lookups, such as ranking 100 candidates, fetch all 100 hashes at once with a Redis pipeline. For frequently requested entities, maintain a warm cache of pre-computed features (TTL=1h for static features, TTL=1min for real-time features).
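A batched lookup along these lines might look like the sketch below. It assumes a client exposing the redis-py pipeline interface (`client.pipeline()`, `pipe.hgetall(key)`, `pipe.execute()`); the function name, the `user:` key prefix default, and the `feature_defaults` argument are hypothetical, and the sketch handles only scalar numeric features, not binary embeddings.

```python
def _to_str(x):
    # redis-py returns bytes unless the client was created with decode_responses=True.
    return x.decode("utf-8") if isinstance(x, bytes) else x

def fetch_feature_vectors(client, entity_ids, feature_defaults, key_prefix="user:"):
    """Fetch one feature hash per entity in a single pipelined round trip.

    feature_defaults maps feature name -> default used when the field is missing.
    Returns {entity_id: {feature_name: float_value}}.
    """
    pipe = client.pipeline()
    for entity_id in entity_ids:
        pipe.hgetall(f"{key_prefix}{entity_id}")  # queued, not sent yet
    raw_results = pipe.execute()  # one dict per queued HGETALL, in order

    vectors = {}
    for entity_id, raw in zip(entity_ids, raw_results):
        decoded = {_to_str(k): _to_str(v) for k, v in raw.items()}
        vectors[entity_id] = {
            name: float(decoded[name]) if name in decoded else default
            for name, default in feature_defaults.items()
        }
    return vectors
```

With redis-py this would be called as `fetch_feature_vectors(redis.Redis(), candidate_ids, defaults)`. Filling missing fields from defaults means an entity with no hash still yields a complete vector for the model.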
Feature Freshness
Not all features need the same freshness. Tiered freshness strategy:
- Batch features (daily): total lifetime purchases, account age, demographic data. Computed by daily Spark job, written to online store once per day.
- Near-real-time features (hourly): user activity in the last hour, trending items. Computed by hourly aggregation job.
- Streaming features (seconds): user’s last action, real-time session activity. Computed by Flink from event stream.
Monitor feature freshness: alert if a feature hasn’t been updated within 2x its expected refresh interval.
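The 2x rule can be expressed as a small check over per-feature update times. This is a sketch under assumed inputs: the function name, the feature names, and the dict shapes are hypothetical, and a real monitor would read the last-update times from the store's metadata.

```python
from datetime import datetime, timedelta

def stale_features(last_updated, expected_interval, now, tolerance=2.0):
    """Return the sorted names of features not updated within tolerance * interval.

    last_updated:      feature name -> datetime of the most recent write
    expected_interval: feature name -> timedelta between expected refreshes
    A feature with no recorded update at all is treated as stale.
    """
    stale = []
    for name, interval in expected_interval.items():
        updated_at = last_updated.get(name)
        if updated_at is None or now - updated_at > interval * tolerance:
            stale.append(name)
    return sorted(stale)

now = datetime(2024, 1, 10, 12, 0)
expected = {
    "lifetime_purchases": timedelta(days=1),     # batch tier
    "activity_last_hour": timedelta(hours=1),    # near-real-time tier
    "last_action": timedelta(seconds=30),        # streaming tier
}
last_updated = {
    "lifetime_purchases": now - timedelta(hours=30),  # within 2 x 24h
    "activity_last_hour": now - timedelta(hours=3),   # exceeds 2 x 1h
    "last_action": now - timedelta(seconds=10),       # within 2 x 30s
}
alerts = stale_features(last_updated, expected, now)
```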
Key Design Decisions
- Redis for online store: HGETALL returns an entity's whole feature hash in one round trip (O(N) in the number of fields), typically <1ms for in-memory data
- S3/Parquet for offline store: columnar format, efficient for training data reads
- Point-in-time joins enforced in training pipeline: prevents data leakage
- Tiered freshness: batch for static features, streaming for real-time signals
- Centralized feature definitions: guards against training/serving skew