What is a Feature Store?
A feature store is a centralized repository for ML features. It solves the “feature engineering duplication” problem: without a feature store, each ML team reimplements the same feature computations (e.g., “user’s average purchase value in the last 30 days”), leading to inconsistency between training and serving, duplicated engineering effort, and bugs when the computation differs between environments. A feature store provides: one canonical feature definition, consistent values at training time and serving time, and reuse across models.
Requirements
- Store pre-computed features for millions of entities (users, items, sessions)
- Serve features for online inference with <10ms latency
- Provide historical features for offline model training (point-in-time correct)
- Support batch feature computation and real-time (streaming) feature updates
- Feature versioning and lineage tracking
Architecture: Online + Offline Store
Offline Store (training):
Raw data (S3/DWH) → Feature Pipeline (Spark/dbt) → Parquet on S3
→ BigQuery / Snowflake
Training: point-in-time join of features at label timestamp
Online Store (serving):
Feature Pipeline → Redis / DynamoDB (key=entity_id, value=feature_vector)
Inference: fetch all features for entity_id in <10ms
Streaming Features:
Events → Kafka → Flink (real-time aggregations) → Online Store
→ Offline Store (event log for training)
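The Flink aggregation step above can be sketched as a per-entity sliding-window sum. This is a toy, in-memory stand-in for Flink's keyed window state (the class and its API are illustrative, not Flink's):

```python
from collections import deque

class SlidingWindowSum:
    """Toy stand-in for a Flink keyed sliding-window aggregation.

    Keeps per-entity events from the last `window_s` seconds and exposes
    the current sum -- the "real-time feature" that would be written to
    the online store on every update.
    """
    def __init__(self, window_s: float):
        self.window_s = window_s
        self.events: dict[str, deque] = {}   # entity_id -> deque[(ts, value)]
        self.sums: dict[str, float] = {}

    def update(self, entity_id: str, ts: float, value: float) -> float:
        q = self.events.setdefault(entity_id, deque())
        q.append((ts, value))
        self.sums[entity_id] = self.sums.get(entity_id, 0.0) + value
        # Evict events that fell out of the window [ts - window_s, ts].
        while q and q[0][0] < ts - self.window_s:
            _, old = q.popleft()
            self.sums[entity_id] -= old
        return self.sums[entity_id]

agg = SlidingWindowSum(window_s=300)          # 5-minute window
agg.update("user1", 0, 10.0)
agg.update("user1", 100, 5.0)
feature = agg.update("user1", 400, 2.0)       # the event at t=0 is evicted
```

In Flink this state would live in RocksDB and survive restarts via checkpoints; the eviction-on-update logic is the same idea.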
Data Model
FeatureGroup(group_id, name, entity_type ENUM(USER,ITEM,SESSION),
description, owner, version, created_at)
Feature(feature_id, group_id, name, dtype ENUM(FLOAT,INT,STRING,EMBEDDING),
description, default_value, created_at)
FeatureValue(entity_id VARCHAR, feature_id UUID, value JSONB,
event_timestamp TIMESTAMP, created_timestamp TIMESTAMP)
-- Online: latest value per entity
-- Offline: full history (partitioned by date)
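Rendered as runnable DDL, the data model becomes the following sketch (SQLite dialect for portability, so the ENUMs above become CHECK constraints and JSONB becomes JSON text; an assumption for the sketch, not the production schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature_group (
    group_id    TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    entity_type TEXT NOT NULL CHECK (entity_type IN ('USER','ITEM','SESSION')),
    description TEXT,
    owner       TEXT,
    version     INTEGER NOT NULL DEFAULT 1,
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE feature (
    feature_id    TEXT PRIMARY KEY,
    group_id      TEXT NOT NULL REFERENCES feature_group(group_id),
    name          TEXT NOT NULL,
    dtype         TEXT NOT NULL CHECK (dtype IN ('FLOAT','INT','STRING','EMBEDDING')),
    description   TEXT,
    default_value TEXT,
    created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Offline store: full history, one row per (entity, feature, event time).
CREATE TABLE feature_value (
    entity_id         TEXT NOT NULL,
    feature_id        TEXT NOT NULL REFERENCES feature(feature_id),
    value             TEXT,          -- JSONB in Postgres; JSON text here
    event_timestamp   TIMESTAMP NOT NULL,
    created_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
""")
```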
Point-in-Time Correct Joins (Training)
The critical requirement for training: when creating a training row for a label event at time T, the features must be the values that were available at time T — not future values (data leakage). Point-in-time join: for each (entity_id, label_timestamp) pair, find the most recent feature value with event_timestamp <= label_timestamp.
-- Point-in-time join (SQL): for each label row, take the most recent
-- feature value observed at or before the label timestamp.
SELECT entity_id, label, value
FROM (
  SELECT l.entity_id, l.label, f.value,
         ROW_NUMBER() OVER (
           PARTITION BY l.entity_id, l.label_timestamp
           ORDER BY f.event_timestamp DESC
         ) AS rn
  FROM labels l
  JOIN FeatureValue f
    ON f.entity_id = l.entity_id
   AND f.event_timestamp <= l.label_timestamp
) ranked
WHERE rn = 1;
Feast (open source feature store) and Tecton handle point-in-time joins automatically.
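The join can be demonstrated end-to-end on a toy dataset (SQLite here; the table shapes follow the data model above, simplified to the columns the join needs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE labels (entity_id TEXT, label INTEGER, label_timestamp INTEGER);
CREATE TABLE feature_value (entity_id TEXT, value REAL, event_timestamp INTEGER);
-- u1's feature was 10.0 from t=1, then 20.0 from t=5 onward.
INSERT INTO feature_value VALUES ('u1', 10.0, 1), ('u1', 20.0, 5);
-- Label events at t=3 (should see 10.0) and t=7 (should see 20.0).
INSERT INTO labels VALUES ('u1', 0, 3), ('u1', 1, 7);
""")

rows = conn.execute("""
SELECT entity_id, label, value FROM (
  SELECT l.entity_id, l.label, f.value,
         ROW_NUMBER() OVER (
           PARTITION BY l.entity_id, l.label_timestamp
           ORDER BY f.event_timestamp DESC
         ) AS rn
  FROM labels l
  JOIN feature_value f
    ON f.entity_id = l.entity_id
   AND f.event_timestamp <= l.label_timestamp
) ranked
WHERE rn = 1
ORDER BY label
""").fetchall()
# Each label row receives the feature value as of its own timestamp:
# the t=3 label sees 10.0, the t=7 label sees 20.0 -- no future leakage.
```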
Online Feature Serving
Inference latency budget: ~50ms total, of which feature lookup gets <10ms.
- Redis hashes: store each entity's feature vector as a hash (one field per feature); HGETALL user:{user_id} returns all features in a single round trip.
- Embeddings: store high-dimensional embedding features as binary (msgpack-serialized) rather than as individual fields.
- Batch lookup: when ranking 100 candidates, fetch all 100 feature vectors at once via a Redis pipeline.
- Warm cache: for frequently requested entities, pre-compute and cache (TTL = 1h for static features, TTL = 1min for real-time features).
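The lookup pattern can be sketched as follows. The `FakeRedisHashes` class is an in-memory stand-in so the example is self-contained; in production the same calls would go through redis-py against a real Redis, with the loop wrapped in a pipeline:

```python
class FakeRedisHashes:
    """In-memory stand-in for the two Redis hash commands used below.
    Real serving would use redis-py (HSET / HGETALL) against Redis."""
    def __init__(self):
        self.data: dict[str, dict[str, str]] = {}

    def hset(self, key: str, mapping: dict[str, str]) -> None:
        self.data.setdefault(key, {}).update(mapping)

    def hgetall(self, key: str) -> dict[str, str]:
        return dict(self.data.get(key, {}))

def fetch_feature_vectors(store, entity_ids):
    # With real Redis, this loop would run inside a pipeline so that all
    # candidate lookups (e.g., 100 items to rank) share one round trip.
    return {eid: store.hgetall(f"user:{eid}") for eid in entity_ids}

store = FakeRedisHashes()
store.hset("user:42", mapping={"avg_purchase_30d": "57.3", "account_age_d": "812"})
vectors = fetch_feature_vectors(store, ["42", "999"])  # "999" is a cold entity
```

A cold entity comes back as an empty hash; the serving layer would then fall back to each feature's default_value from the catalog.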
Feature Freshness
Not all features need the same freshness. Tiered freshness strategy:
- Batch features (daily): total lifetime purchases, account age, demographic data. Computed by daily Spark job, written to online store once per day.
- Near-real-time features (hourly): user activity in the last hour, trending items. Computed by hourly aggregation job.
- Streaming features (seconds): user’s last action, real-time session activity. Computed by Flink from event stream.
Monitor feature freshness: alert if a feature hasn’t been updated within 2x its expected refresh interval.
Key Design Decisions
- Redis for online store — a single HGETALL returns the full feature vector (O(N) in the number of hash fields, effectively constant for a bounded feature set) in <1ms for cached features
- S3/Parquet for offline store — columnar format, efficient for training data reads
- Point-in-time joins enforced in training pipeline — prevents data leakage
- Tiered freshness — batch for static features, streaming for real-time signals
- Centralized feature definitions — eliminates training/serving skew
FAQ

What is training-serving skew, and how does a feature store prevent it?
Training-serving skew occurs when the features used during model training differ from those used during inference. Example: during training, you compute "user's average purchase in the last 30 days" with a Spark job; during serving, you recompute the same feature with a slightly different SQL query. The resulting values differ, so the model sees inputs during serving that it never saw during training, degrading accuracy. A feature store prevents this by (1) storing a single canonical feature definition and computation, (2) having both training and serving read from the same store (training from the offline store's historical values, serving from the online store's latest values), so the computation code runs once, and (3) point-in-time correct joins for training, so features are joined at the exact timestamp of the label event rather than the current timestamp. This eliminates both data leakage and skew.

What is a point-in-time correct join?
When creating a training dataset, each label event (e.g., "user clicked ad at time T") needs the feature values as they were at time T, not the current values, which would leak future information. The join: for each (entity_id, label_timestamp) row, find the most recent feature value with event_timestamp <= label_timestamp. This is typically implemented as an "as-of join" or "time-travel join"; feature stores like Feast and Tecton automate it. Without point-in-time correctness: if a user churned and you join their current (post-churn) features to predict whether they will churn, the model learns from the future, overfits to churn signals, and fails in production.

What is the difference between the online store and the offline store?
Online store: a low-latency key-value store (Redis, DynamoDB) holding the latest feature values per entity. Used during inference: the prediction service fetches the current feature vector for a user or item within the latency budget (<10ms). Stores only the most recent value per entity, with no history. Offline store: high-throughput columnar storage (S3/Parquet, BigQuery, Snowflake) holding the full history of feature values. Used for (1) training data generation, retrieving features at historical timestamps via point-in-time joins, (2) backfills, computing features for historical events, and (3) feature debugging and analysis. Both stores are populated by the same feature pipelines: the online store is updated on every new value, while the offline store appends new rows. Keeping them in sync is the main operational challenge.

How do you handle real-time (streaming) features?
Some features must be fresh within seconds: "user's actions in the last 5 minutes," "number of failed login attempts in the last hour," "current session page-view count." Batch pipelines cannot provide this freshness. The streaming pipeline: events → Kafka → Flink, which computes rolling aggregations and writes to the online store (Redis) immediately, and to the offline store (Kafka → S3 archive) for training. Flink maintains the aggregation state (e.g., a sliding-window sum) in RocksDB; on each new event the state is updated and the new feature value is written to Redis. Freshness: seconds to minutes. The challenge: the offline store for streaming features is the event log, so computing streaming features for a historical training point T means replaying the event stream up to T, which is expensive. Tools like Tecton and Chronon handle streaming feature backfills automatically.

How does feature versioning work?
Feature definitions change over time: a computation is corrected, a new signal is added, or the data source changes. Version 1 of user_avg_purchase_30d might use the orders table; version 2 includes cancellations. Models trained on v1 features must be served with v1 features, and models trained on v2 with v2. Implementation: (1) FeatureGroup versioning: each version is a separate entry in the catalog, and the computation code is tagged with its version. (2) Model registry link: the model artifact records which feature versions it was trained on, and the serving system fetches those versions. (3) Immutable historical features: once a value is written to the offline store with a timestamp it is never overwritten; the offline store is append-only. (4) Deprecation workflow: a version is marked DEPRECATED, new model training must use newer versions, and serving of the old version continues until every model using it is retrained.