What Is an ML Feature Store?
An ML feature store is a centralized system for computing, storing, versioning, and serving features, the transformed input signals fed to machine learning models. It solves two problems at once: low-latency feature lookups during online inference, and historically accurate, point-in-time correct feature snapshots for model training. Without a feature store, teams tend to reimplement the same transformation logic in offline training pipelines and in online serving code, which invites training-serving skew, and they build training sets with ad-hoc joins that invite data leakage.
Dual Store Architecture
The canonical feature store design separates storage into two tiers:
- Online store (Redis): Stores the current value of each feature per entity. Keyed as feature:{feature_name}:{entity_id}. Serves inference requests in under 5 ms. Each key carries a TTL (online_ttl_seconds) so stale features expire automatically.
- Offline store (Parquet on S3): Stores the full history of feature values with event timestamps. Partitioned by date. Used to construct training datasets with point-in-time correct joins. Queried via Spark or DuckDB.
The two stores are written from shared pipelines — not separate codebases — so transformation logic is defined once and applied consistently to both.
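A minimal sketch of the "define once, write twice" idea, assuming in-memory stand-ins for both tiers (a dict for Redis, a list for the Parquet history). The feature order_value_usd, the field amount_cents, and the helper names are hypothetical and exist only for illustration.

```python
import json
from datetime import datetime, timezone


def transform(raw: dict) -> dict:
    """Single definition of the feature logic, shared by both write paths."""
    # Hypothetical feature: order value in dollars derived from a cents field.
    return {"order_value_usd": round(raw["amount_cents"] / 100, 2)}


def write_both(raw: dict, online: dict, offline: list) -> None:
    """Apply transform() once, then fan the result out to both tiers."""
    features = transform(raw)
    for name, value in features.items():
        # Online tier: only the current value, using the key format from the text.
        online[f"feature:{name}:{raw['entity_id']}"] = json.dumps(value)
        # Offline tier: append-only history that keeps the event timestamp.
        offline.append({
            "feature": name,
            "entity_id": raw["entity_id"],
            "value": value,
            "event_timestamp": raw["event_timestamp"],
        })


online_store, offline_store = {}, []
write_both(
    {"entity_id": "u1", "amount_cents": 1250,
     "event_timestamp": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    online_store,
    offline_store,
)
```

Because both writes consume the output of the same transform() call, the two tiers cannot drift apart through duplicated logic.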
Feature Ingestion Pipelines
Batch Ingestion
A scheduled Spark job reads raw source data (data warehouse, data lake), applies transformations defined in the feature registry, writes output rows as Parquet partitions to S3, and backfills Redis with the latest value per entity. Runs daily or hourly depending on feature freshness requirements.
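The two outputs of the batch job can be sketched in plain Python, assuming rows are already transformed. A real job would do this in Spark; the function names and the orders_30d feature are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def partition_by_date(rows: list[dict]) -> dict[str, list[dict]]:
    """Group rows by event date, mirroring date-partitioned Parquet output."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["event_timestamp"].date().isoformat()].append(row)
    return dict(partitions)


def latest_per_entity(rows: list[dict]) -> dict[str, dict]:
    """Pick the newest row per entity; this is what gets backfilled online."""
    latest = {}
    for row in rows:
        current = latest.get(row["entity_id"])
        if current is None or row["event_timestamp"] > current["event_timestamp"]:
            latest[row["entity_id"]] = row
    return latest


rows = [
    {"entity_id": "u1", "orders_30d": 3, "event_timestamp": datetime(2024, 1, 5, 9)},
    {"entity_id": "u1", "orders_30d": 7, "event_timestamp": datetime(2024, 1, 6, 9)},
    {"entity_id": "u2", "orders_30d": 1, "event_timestamp": datetime(2024, 1, 5, 12)},
]
partitions = partition_by_date(rows)   # full history, split by date
backfill = latest_per_entity(rows)     # one current value per entity
```

The offline tier receives every row, while the online tier receives only the newest row per entity.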
Streaming Ingestion
A Kafka consumer reads events in real time (user clicks, transactions, sensor readings), applies the same transformation logic, and writes updated values to Redis immediately. It also appends events to a Kafka topic consumed by a micro-batch job that writes to the offline store, keeping both tiers consistent within minutes.
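A sketch of the per-event handler, with a dict and a list standing in for Redis and the downstream Kafka topic. The out-of-order guard is an assumption added here, not something the design above specifies: without it, a late event would overwrite a newer online value. The feature name clicks_1h and the integer timestamps are hypothetical.

```python
def handle_event(event: dict, online: dict, offline_log: list) -> bool:
    """Record the event in history and update the online value if it is newer.

    Returns True when the online store was updated.
    """
    # History keeps every event, including late arrivals.
    offline_log.append(event)

    key = f"feature:{event['feature']}:{event['entity_id']}"
    current = online.get(key)
    # Assumed guard: never replace a newer online value with an older event.
    if current is not None and current["event_timestamp"] >= event["event_timestamp"]:
        return False
    online[key] = {"value": event["value"], "event_timestamp": event["event_timestamp"]}
    return True


online, offline_log = {}, []
events = [
    {"feature": "clicks_1h", "entity_id": "u1", "value": 4, "event_timestamp": 100},
    {"feature": "clicks_1h", "entity_id": "u1", "value": 2, "event_timestamp": 90},  # late arrival
    {"feature": "clicks_1h", "entity_id": "u1", "value": 5, "event_timestamp": 110},
]
applied = [handle_event(e, online, offline_log) for e in events]
```

The late event is still appended to the history, so point-in-time joins see it at its true event time even though it never became the online value.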
Point-in-Time Correct Training Joins
Data leakage is one of the most damaging bugs in training pipeline design because it inflates offline metrics without raising any visible error. Consider a churn prediction model: if you join user features using today's values but your labels are from six months ago, the model sees features computed after the label event, features that would not have been available at prediction time.
Point-in-time correct joins solve this with AS OF semantics:
- Start with a label DataFrame containing (entity_id, label_timestamp, label_value).
- For each row, query the offline store for the latest feature value where event_timestamp <= label_timestamp.
- Use a range join in Spark or a window function in SQL to efficiently fetch the correct historical snapshot.
This guarantee is the primary value proposition of a dedicated feature store over ad-hoc data lake queries.
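On a single machine, the AS OF steps above can be sketched with pandas.merge_asof, which performs a backward-looking join per entity. This is an illustration on toy data, not the Spark implementation; the feature column orders_30d and the function name are hypothetical.

```python
import pandas as pd

labels = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2"],
    "label_timestamp": pd.to_datetime(["2024-01-10", "2024-03-01", "2024-02-01"]),
    "label_value": [0, 1, 0],
})
features = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_timestamp": pd.to_datetime(
        ["2024-01-05", "2024-02-20", "2024-01-15", "2024-02-10"]
    ),
    "orders_30d": [3, 7, 1, 4],
})


def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """For each label row, attach the latest feature row with
    event_timestamp <= label_timestamp for the same entity."""
    return pd.merge_asof(
        labels.sort_values("label_timestamp"),    # merge_asof requires sorted keys
        features.sort_values("event_timestamp"),
        left_on="label_timestamp",
        right_on="event_timestamp",
        by="entity_id",
        direction="backward",                     # never look into the future
    )


training = point_in_time_join(labels, features)
```

In this toy data, u2's label on 2024-02-01 receives the value from 2024-01-15 rather than the newer value from 2024-02-10, which did not yet exist at label time.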
Feature Registry
The registry is the metadata catalog for all features. Each feature record stores: name, entity type (user, item, session), dtype, description, owner, current version, and online_ttl_seconds. Feature groups bundle related features under a single name for easy retrieval. The registry is the source of truth referenced by both ingestion pipelines and serving code.
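An in-memory sketch of the registry's core operations, assuming the fields listed above. The class and method names (FeatureDef, Registry, resolve_group) are hypothetical; a real registry would be backed by a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDef:
    name: str
    entity_type: str          # e.g. user, item, session
    dtype: str
    version: int = 1
    online_ttl_seconds: int = 86400
    description: str = ""
    owner: str = ""


class Registry:
    def __init__(self) -> None:
        self._features: dict[tuple[str, int], FeatureDef] = {}
        self._groups: dict[str, list[str]] = {}

    def register(self, feature: FeatureDef) -> None:
        key = (feature.name, feature.version)
        if key in self._features:
            raise ValueError(f"{feature.name} v{feature.version} already registered")
        self._features[key] = feature

    def latest(self, name: str) -> FeatureDef:
        versions = [f for (n, _), f in self._features.items() if n == name]
        if not versions:
            raise KeyError(name)
        return max(versions, key=lambda f: f.version)

    def define_group(self, group: str, feature_names: list[str]) -> None:
        for name in feature_names:
            self.latest(name)  # raises KeyError for unknown features
        self._groups[group] = list(feature_names)

    def resolve_group(self, group: str) -> list[FeatureDef]:
        """Return the current definition of every feature in the group."""
        return [self.latest(name) for name in self._groups[group]]


registry = Registry()
registry.register(FeatureDef("orders_30d", "user", "int"))
registry.register(FeatureDef("orders_30d", "user", "float", version=2))
registry.register(FeatureDef("avg_basket_usd", "user", "float", online_ttl_seconds=3600))
registry.define_group("user_purchase", ["orders_30d", "avg_basket_usd"])
resolved = registry.resolve_group("user_purchase")
```

Ingestion and serving code would both call something like resolve_group, so they agree on names, dtypes, and TTLs.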
SQL Schema
-- Core feature metadata
CREATE TABLE Feature (
    id                 BIGSERIAL PRIMARY KEY,
    name               VARCHAR(255) NOT NULL,
    entity_type        VARCHAR(100) NOT NULL,
    dtype              VARCHAR(50) NOT NULL,  -- float, int, string, embedding
    version            INT NOT NULL DEFAULT 1,
    description        TEXT,
    online_ttl_seconds INT NOT NULL DEFAULT 86400,
    created_at         TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (name, version)
);

-- Historical feature values (offline store metadata / append log)
CREATE TABLE FeatureValue (
    feature_id      BIGINT NOT NULL REFERENCES Feature(id),
    entity_id       VARCHAR(255) NOT NULL,
    value           JSONB NOT NULL,
    event_timestamp TIMESTAMPTZ NOT NULL,
    ingested_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (feature_id, entity_id, event_timestamp)
);

CREATE INDEX ON FeatureValue (feature_id, entity_id, event_timestamp DESC);

-- Feature groups: named bundles of features
CREATE TABLE FeatureGroup (
    name       VARCHAR(255) PRIMARY KEY,
    features   JSONB NOT NULL,  -- list of feature names
    owner      VARCHAR(255) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Monitoring snapshots
CREATE TABLE FeatureMonitor (
    feature_id  BIGINT NOT NULL REFERENCES Feature(id),
    mean        FLOAT,
    stddev      FLOAT,
    null_rate   FLOAT,
    computed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (feature_id, computed_at)
);
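To show how the FeatureValue table supports an AS OF lookup, here is a sketch that uses Python's built-in sqlite3 as a stand-in for PostgreSQL. The table is simplified (TEXT instead of JSONB and TIMESTAMPTZ, no foreign key), timestamps are ISO-8601 strings so they sort correctly as text, and value_as_of is a hypothetical helper.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE FeatureValue (
        feature_id      INTEGER NOT NULL,
        entity_id       TEXT NOT NULL,
        value           TEXT NOT NULL,      -- JSON-encoded
        event_timestamp TEXT NOT NULL,      -- ISO-8601, UTC
        PRIMARY KEY (feature_id, entity_id, event_timestamp)
    )
    """
)
conn.executemany(
    "INSERT INTO FeatureValue VALUES (?, ?, ?, ?)",
    [
        (1, "u1", json.dumps(3), "2024-01-05T00:00:00"),
        (1, "u1", json.dumps(7), "2024-02-20T00:00:00"),
    ],
)


def value_as_of(conn: sqlite3.Connection, feature_id: int, entity_id: str, as_of: str):
    """Return the latest value with event_timestamp <= as_of, or None."""
    row = conn.execute(
        """
        SELECT value FROM FeatureValue
        WHERE feature_id = ? AND entity_id = ? AND event_timestamp <= ?
        ORDER BY event_timestamp DESC
        LIMIT 1
        """,
        (feature_id, entity_id, as_of),
    ).fetchone()
    return json.loads(row[0]) if row else None


before_first = value_as_of(conn, 1, "u1", "2024-01-01T00:00:00")
mid_window = value_as_of(conn, 1, "u1", "2024-02-01T00:00:00")
exact_match = value_as_of(conn, 1, "u1", "2024-02-20T00:00:00")
```

The ORDER BY ... DESC LIMIT 1 pattern is the access path that the (feature_id, entity_id, event_timestamp) key is designed to serve.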
Python Interface
import json
from datetime import datetime

import pandas as pd
import redis

r = redis.Redis(host="redis-feature-store", port=6379, decode_responses=True)


def get_online_features(entity_id: str, feature_names: list[str]) -> dict:
    """Fetch current feature values from Redis for online inference."""
    # One pipelined round trip for all requested features.
    pipe = r.pipeline()
    for name in feature_names:
        pipe.get(f"feature:{name}:{entity_id}")
    values = pipe.execute()
    # A missing or expired key comes back as None.
    return {name: json.loads(v) if v is not None else None for name, v in zip(feature_names, values)}

def get_offline_features(
    entity_ids: list[str],
    feature_names: list[str],
    as_of_timestamps: list[datetime],
) -> pd.DataFrame:
    """
    Point-in-time correct fetch from offline store.
    Returns one row per (entity_id, as_of_timestamp) with feature columns.
    This reference version issues one lookup per row; at scale it is
    implemented via Spark range join or DuckDB query over Parquet partitions.
    """
    label_df = pd.DataFrame({
        "entity_id": entity_ids,
        "as_of_timestamp": as_of_timestamps,
    })
    results = []
    for _, row in label_df.iterrows():
        features = _query_offline_store(row["entity_id"], feature_names, row["as_of_timestamp"])
        # Keep as_of_timestamp so each output row can be matched back to its label.
        results.append({
            "entity_id": row["entity_id"],
            "as_of_timestamp": row["as_of_timestamp"],
            **features,
        })
    return pd.DataFrame(results)
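
def _query_offline_store(entity_id, feature_names, as_of_ts):
    # Executes: SELECT value FROM FeatureValue
    #           WHERE entity_id = %s AND feature_id IN (...) AND event_timestamp <= %s
    # keeping the newest matching row per feature (the latest value at or before as_of_ts).
    # The query execution itself is backend-specific and is not shown here.
    raise NotImplementedError("offline store query is not shown")


# NOTE: the function name and signature below are assumed; only the docstring and body are given.
def write_features(df: pd.DataFrame) -> None:
    """
    Write a batch of feature values to both stores.
    df must have columns: entity_id, event_timestamp, plus one column per feature.
    """
    pipe = r.pipeline()
    # Oldest rows first, so the last write per key is the newest value.
    for _, row in df.sort_values("event_timestamp").iterrows():
        for col in df.columns:
            if col in ("entity_id", "event_timestamp"):
                continue
            key = f"feature:{col}:{row['entity_id']}"
            value = row[col]
            # Convert NumPy scalars and arrays to plain Python values for json.dumps.
            value = value.tolist() if hasattr(value, "tolist") else value
            # 86400 matches the registry default; use the feature's online_ttl_seconds in practice.
            pipe.set(key, json.dumps(value), ex=86400)
    # Also write to offline store (Parquet append / DB insert) -- omitted here
    pipe.execute()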

def detect_drift(feature_id: int, current_sample: list[float], baseline_mean: float, baseline_std: float) -> bool:
    """
    Simple z-score drift check: flag if sample mean deviates more than 3 sigma from baseline.
    Production systems use PSI or KL divergence over full distributions.
    """
    import statistics
    if not current_sample:
        return False
    sample_mean = statistics.mean(current_sample)
    if baseline_std == 0:
        return sample_mean != baseline_mean
    z = abs(sample_mean - baseline_mean) / baseline_std
    return z > 3.0
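The docstring above mentions PSI as the production alternative to a z-score. A minimal sketch of the Population Stability Index, assuming equal-width bins derived from the baseline sample and a small floor (eps) to avoid log(0) on empty bins; the function name and defaults are illustrative.

```python
import math


def psi(baseline: list[float], current: list[float], bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum over bins of (q - p) * ln(q / p), where p and q are the
    baseline and current bin proportions."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            # Values outside the baseline range fall into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(sample), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline_sample = [i / 100 for i in range(100)]        # spread over [0, 1)
shifted_sample = [0.5 + i / 200 for i in range(100)]   # concentrated in [0.5, 1)
psi_same = psi(baseline_sample, baseline_sample)
psi_shifted = psi(baseline_sample, shifted_sample)
```

A common rule of thumb treats PSI above roughly 0.25 as a significant shift, but the threshold should be tuned per feature.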
Feature Versioning Strategy
When a feature changes in a backward-incompatible way (dtype change, entity key rename, semantic redefinition), increment the version and register a new Feature row. Old models continue reading the v1 keys from Redis while new models are trained against v2 data. This requires the version to be part of the online key, for example feature:{feature_name}:v1:{entity_id}, because the base key format feature:{feature_name}:{entity_id} carries no version. Both versions run in parallel during the cutover window. Once all models have migrated, the old version's ingestion pipeline is stopped and its Redis keys are left to expire through their TTL.
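One way to keep two versions side by side is to pin each model to the feature versions it was trained on. The sketch below assumes a key layout of feature:{name}:v{version}:{entity_id}, which is an illustration rather than a layout the design above prescribes; a dict stands in for Redis, and the helper names are hypothetical.

```python
def versioned_key(feature_name: str, version: int, entity_id: str) -> str:
    """Build an online key that carries the feature version (assumed layout)."""
    return f"feature:{feature_name}:v{version}:{entity_id}"


def read_for_model(store: dict, pinned_versions: dict[str, int], entity_id: str) -> dict:
    """Read each feature at the version the model was trained against."""
    return {
        name: store.get(versioned_key(name, version, entity_id))
        for name, version in pinned_versions.items()
    }


# Both versions are written in parallel during the cutover window.
store = {
    versioned_key("orders_30d", 1, "u1"): 7,      # v1: integer count
    versioned_key("orders_30d", 2, "u1"): 7.5,    # v2: hypothetical redefinition
}
old_model_features = read_for_model(store, {"orders_30d": 1}, "u1")
new_model_features = read_for_model(store, {"orders_30d": 2}, "u1")
```

Each model keeps reading a stable definition until it is retrained, so the cutover never changes inputs underneath a deployed model.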
Monitoring and Alerting
A nightly monitoring job samples online feature values, computes mean, stddev, and null rate, stores them in FeatureMonitor, and compares against the baseline captured at training time. Alerts fire when null rate increases sharply (upstream data outage), mean shifts beyond threshold (pipeline bug or data source change), or a feature stops updating (ingestion pipeline failure). Monitoring is the operational backbone of a production feature store.
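A sketch of the nightly snapshot and the three alert conditions described above. The thresholds (null_rate_jump, z_threshold, max_age_seconds) and function names are hypothetical defaults chosen for illustration.

```python
import statistics


def snapshot(values: list) -> dict:
    """Compute mean, stddev, and null rate for one feature sample."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present) if present else None,
        "stddev": statistics.pstdev(present) if present else None,
        "null_rate": 1 - len(present) / len(values) if values else 0.0,
    }


def check_alerts(
    current: dict,
    baseline: dict,
    seconds_since_update: float,
    null_rate_jump: float = 0.05,
    z_threshold: float = 3.0,
    max_age_seconds: float = 86400,
) -> list[str]:
    """Return the names of the alert conditions that fired."""
    fired = []
    # Upstream data outage: null rate rises sharply over the baseline.
    if current["null_rate"] - baseline["null_rate"] > null_rate_jump:
        fired.append("null_rate")
    # Pipeline bug or source change: mean shifts beyond the threshold.
    if current["mean"] is not None and baseline["stddev"]:
        z = abs(current["mean"] - baseline["mean"]) / baseline["stddev"]
        if z > z_threshold:
            fired.append("mean_shift")
    # Ingestion failure: the feature has stopped updating.
    if seconds_since_update > max_age_seconds:
        fired.append("stale")
    return fired


current = snapshot([1.0, None, 3.0, None])
baseline = {"mean": 2.0, "stddev": 1.0, "null_rate": 0.0}
fired = check_alerts(current, baseline, seconds_since_update=100)
```

The snapshot dict maps directly onto a FeatureMonitor row, so each alert can be traced back to stored statistics.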
Design Trade-offs
- Online store freshness vs. cost: Shorter TTL and higher streaming throughput improve freshness but increase Redis memory and Kafka throughput costs.
- Point-in-time join cost: Range joins over large Parquet datasets are expensive. Pre-materializing training snapshots at common timestamps amortizes repeated training runs.
- Registry complexity: A rich registry with lineage, SLAs, and tagging adds operational overhead but pays dividends when debugging model regressions caused by upstream feature changes.