What is a Feature Store?
A feature store is a centralized repository for ML features. It solves the “feature engineering duplication” problem: without a feature store, each ML team reimplements the same feature computations (e.g., “user’s average purchase value in the last 30 days”), leading to inconsistency between training and serving, duplicated engineering effort, and bugs when the computation differs between environments. A feature store provides: one canonical feature definition, consistent values at training time and serving time, and reuse across models.
Requirements
- Store pre-computed features for millions of entities (users, items, sessions)
- Serve features for online inference with <10ms latency
- Provide historical features for offline model training (point-in-time correct)
- Support batch feature computation and real-time (streaming) feature updates
- Feature versioning and lineage tracking
Architecture: Online + Offline Store
Offline Store (training):
Raw data (S3/DWH) → Feature Pipeline (Spark/dbt) → Parquet on S3
→ BigQuery / Snowflake
Training: point-in-time join of features at label timestamp
Online Store (serving):
Feature Pipeline → Redis / DynamoDB (key=entity_id, value=feature_vector)
Inference: fetch all features for entity_id in <10ms
Streaming Features:
Events → Kafka → Flink (real-time aggregations) → Online Store
→ Offline Store (event log for training)
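The Flink aggregation step above can be sketched as a per-entity sliding-window sum. This is a toy, in-memory stand-in for Flink's keyed window state (the class and its API are illustrative, not Flink's):

```python
from collections import deque

class SlidingWindowSum:
    """Toy stand-in for a Flink keyed sliding-window aggregation.

    Keeps per-entity events from the last `window_s` seconds and exposes
    the current sum -- the "real-time feature" that would be written to
    the online store on every update.
    """
    def __init__(self, window_s: float):
        self.window_s = window_s
        self.events: dict[str, deque] = {}   # entity_id -> deque[(ts, value)]
        self.sums: dict[str, float] = {}

    def update(self, entity_id: str, ts: float, value: float) -> float:
        q = self.events.setdefault(entity_id, deque())
        q.append((ts, value))
        self.sums[entity_id] = self.sums.get(entity_id, 0.0) + value
        # Evict events that fell out of the window [ts - window_s, ts].
        while q and q[0][0] < ts - self.window_s:
            _, old = q.popleft()
            self.sums[entity_id] -= old
        return self.sums[entity_id]

agg = SlidingWindowSum(window_s=300)          # 5-minute window
agg.update("user1", 0, 10.0)
agg.update("user1", 100, 5.0)
feature = agg.update("user1", 400, 2.0)       # the event at t=0 is evicted
```

In Flink this state would live in RocksDB and survive restarts via checkpoints; the eviction-on-update logic is the same idea.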
Data Model
FeatureGroup(group_id, name, entity_type ENUM(USER,ITEM,SESSION),
description, owner, version, created_at)
Feature(feature_id, group_id, name, dtype ENUM(FLOAT,INT,STRING,EMBEDDING),
description, default_value, created_at)
FeatureValue(entity_id VARCHAR, feature_id UUID, value JSONB,
event_timestamp TIMESTAMP, created_timestamp TIMESTAMP)
-- Online: latest value per entity
-- Offline: full history (partitioned by date)
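Rendered as runnable DDL, the data model becomes the following sketch (SQLite dialect for portability, so the ENUMs above become CHECK constraints and JSONB becomes JSON text; an assumption for the sketch, not the production schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature_group (
    group_id    TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    entity_type TEXT NOT NULL CHECK (entity_type IN ('USER','ITEM','SESSION')),
    description TEXT,
    owner       TEXT,
    version     INTEGER NOT NULL DEFAULT 1,
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE feature (
    feature_id    TEXT PRIMARY KEY,
    group_id      TEXT NOT NULL REFERENCES feature_group(group_id),
    name          TEXT NOT NULL,
    dtype         TEXT NOT NULL CHECK (dtype IN ('FLOAT','INT','STRING','EMBEDDING')),
    description   TEXT,
    default_value TEXT,
    created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Offline store: full history, one row per (entity, feature, event time).
CREATE TABLE feature_value (
    entity_id         TEXT NOT NULL,
    feature_id        TEXT NOT NULL REFERENCES feature(feature_id),
    value             TEXT,          -- JSONB in Postgres; JSON text here
    event_timestamp   TIMESTAMP NOT NULL,
    created_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
""")
```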
Point-in-Time Correct Joins (Training)
The critical requirement for training: when creating a training row for a label event at time T, the features must be the values that were available at time T — not future values (data leakage). Point-in-time join: for each (entity_id, label_timestamp) pair, find the most recent feature value with event_timestamp <= label_timestamp.
-- Point-in-time join (SQL): for each label row, take the most recent
-- feature value observed at or before the label timestamp.
SELECT entity_id, label, value
FROM (
  SELECT l.entity_id, l.label, f.value,
         ROW_NUMBER() OVER (
           PARTITION BY l.entity_id, l.label_timestamp
           ORDER BY f.event_timestamp DESC
         ) AS rn
  FROM labels l
  JOIN FeatureValue f
    ON f.entity_id = l.entity_id
   AND f.event_timestamp <= l.label_timestamp
) ranked
WHERE rn = 1;
Feast (open source feature store) and Tecton handle point-in-time joins automatically.
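The join can be demonstrated end-to-end on a toy dataset (SQLite here; the table shapes follow the data model above, simplified to the columns the join needs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE labels (entity_id TEXT, label INTEGER, label_timestamp INTEGER);
CREATE TABLE feature_value (entity_id TEXT, value REAL, event_timestamp INTEGER);
-- u1's feature was 10.0 from t=1, then 20.0 from t=5 onward.
INSERT INTO feature_value VALUES ('u1', 10.0, 1), ('u1', 20.0, 5);
-- Label events at t=3 (should see 10.0) and t=7 (should see 20.0).
INSERT INTO labels VALUES ('u1', 0, 3), ('u1', 1, 7);
""")

rows = conn.execute("""
SELECT entity_id, label, value FROM (
  SELECT l.entity_id, l.label, f.value,
         ROW_NUMBER() OVER (
           PARTITION BY l.entity_id, l.label_timestamp
           ORDER BY f.event_timestamp DESC
         ) AS rn
  FROM labels l
  JOIN feature_value f
    ON f.entity_id = l.entity_id
   AND f.event_timestamp <= l.label_timestamp
) ranked
WHERE rn = 1
ORDER BY label
""").fetchall()
# Each label row receives the feature value as of its own timestamp:
# the t=3 label sees 10.0, the t=7 label sees 20.0 -- no future leakage.
```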
Online Feature Serving
Inference latency budget: ~50ms total, of which feature lookup gets <10ms.
- Redis hashes: store each entity's feature vector as a hash (one field per feature); HGETALL user:{user_id} returns all features in a single round trip.
- Embeddings: store high-dimensional embedding features as binary (msgpack-serialized) rather than as individual fields.
- Batch lookup: when ranking 100 candidates, fetch all 100 feature vectors at once via a Redis pipeline.
- Warm cache: for frequently requested entities, pre-compute and cache (TTL = 1h for static features, TTL = 1min for real-time features).
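The lookup pattern can be sketched as follows. The `FakeRedisHashes` class is an in-memory stand-in so the example is self-contained; in production the same calls would go through redis-py against a real Redis, with the loop wrapped in a pipeline:

```python
class FakeRedisHashes:
    """In-memory stand-in for the two Redis hash commands used below.
    Real serving would use redis-py (HSET / HGETALL) against Redis."""
    def __init__(self):
        self.data: dict[str, dict[str, str]] = {}

    def hset(self, key: str, mapping: dict[str, str]) -> None:
        self.data.setdefault(key, {}).update(mapping)

    def hgetall(self, key: str) -> dict[str, str]:
        return dict(self.data.get(key, {}))

def fetch_feature_vectors(store, entity_ids):
    # With real Redis, this loop would run inside a pipeline so that all
    # candidate lookups (e.g., 100 items to rank) share one round trip.
    return {eid: store.hgetall(f"user:{eid}") for eid in entity_ids}

store = FakeRedisHashes()
store.hset("user:42", mapping={"avg_purchase_30d": "57.3", "account_age_d": "812"})
vectors = fetch_feature_vectors(store, ["42", "999"])  # "999" is a cold entity
```

A cold entity comes back as an empty hash; the serving layer would then fall back to each feature's default_value from the catalog.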
Feature Freshness
Not all features need the same freshness. Tiered freshness strategy:
- Batch features (daily): total lifetime purchases, account age, demographic data. Computed by daily Spark job, written to online store once per day.
- Near-real-time features (hourly): user activity in the last hour, trending items. Computed by hourly aggregation job.
- Streaming features (seconds): user’s last action, real-time session activity. Computed by Flink from event stream.
Monitor feature freshness: alert if a feature hasn’t been updated within 2x its expected refresh interval.
Key Design Decisions
- Redis for online store — a single HGETALL returns the full feature vector (O(N) in the number of hash fields, effectively constant for a bounded feature set) in <1ms for cached features
- S3/Parquet for offline store — columnar format, efficient for training data reads
- Point-in-time joins enforced in training pipeline — prevents data leakage
- Tiered freshness — batch for static features, streaming for real-time signals
- Centralized feature definitions — eliminates training/serving skew
FAQ

What is training-serving skew, and how does a feature store prevent it?
Training-serving skew occurs when the features used during model training differ from those used during inference. Example: during training, you compute "user's average purchase in the last 30 days" with a Spark job; during serving, you recompute the same feature with a slightly different SQL query. The resulting values differ, so the model sees inputs during serving that it never saw during training, degrading accuracy. A feature store prevents this by (1) storing a single canonical feature definition and computation, (2) having both training and serving read from the same store (training from the offline store's historical values, serving from the online store's latest values), so the computation code runs once, and (3) point-in-time correct joins for training, so features are joined at the exact timestamp of the label event rather than the current timestamp. This eliminates both data leakage and skew.

What is a point-in-time correct join?
When creating a training dataset, each label event (e.g., "user clicked ad at time T") needs the feature values as they were at time T, not the current values, which would leak future information. The join: for each (entity_id, label_timestamp) row, find the most recent feature value with event_timestamp <= label_timestamp. This is typically implemented as an "as-of join" or "time-travel join"; feature stores like Feast and Tecton automate it. Without point-in-time correctness: if a user churned and you join their current (post-churn) features to predict whether they will churn, the model learns from the future, overfits to churn signals, and fails in production.

What is the difference between the online store and the offline store?
Online store: a low-latency key-value store (Redis, DynamoDB) holding the latest feature values per entity. Used during inference: the prediction service fetches the current feature vector for a user or item within the latency budget (<10ms). Stores only the most recent value per entity, with no history. Offline store: high-throughput columnar storage (S3/Parquet, BigQuery, Snowflake) holding the full history of feature values. Used for (1) training data generation, retrieving features at historical timestamps via point-in-time joins, (2) backfills, computing features for historical events, and (3) feature debugging and analysis. Both stores are populated by the same feature pipelines: the online store is updated on every new value, while the offline store appends new rows. Keeping them in sync is the main operational challenge.

How do you handle real-time (streaming) features?
Some features must be fresh within seconds: "user's actions in the last 5 minutes," "number of failed login attempts in the last hour," "current session page-view count." Batch pipelines cannot provide this freshness. The streaming pipeline: events → Kafka → Flink, which computes rolling aggregations and writes to the online store (Redis) immediately, and to the offline store (Kafka → S3 archive) for training. Flink maintains the aggregation state (e.g., a sliding-window sum) in RocksDB; on each new event the state is updated and the new feature value is written to Redis. Freshness: seconds to minutes. The challenge: the offline store for streaming features is the event log, so computing streaming features for a historical training point T means replaying the event stream up to T, which is expensive. Tools like Tecton and Chronon handle streaming feature backfills automatically.

How does feature versioning work?
Feature definitions change over time: a computation is corrected, a new signal is added, or the data source changes. Version 1 of user_avg_purchase_30d might use the orders table; version 2 includes cancellations. Models trained on v1 features must be served with v1 features, and models trained on v2 with v2. Implementation: (1) FeatureGroup versioning: each version is a separate entry in the catalog, and the computation code is tagged with its version. (2) Model registry link: the model artifact records which feature versions it was trained on, and the serving system fetches those versions. (3) Immutable historical features: once a value is written to the offline store with a timestamp it is never overwritten; the offline store is append-only. (4) Deprecation workflow: a version is marked DEPRECATED, new model training must use newer versions, and serving of the old version continues until every model using it is retrained.