ML Feature Store Low-Level Design: Feature Ingestion, Online/Offline Serving, and Point-in-Time Correctness

What Is an ML Feature Store?

An ML feature store is a centralized system for computing, storing, versioning, and serving features — the transformed input signals fed to machine learning models. It solves two problems simultaneously: providing low-latency feature lookups during online inference and providing historically accurate, point-in-time correct feature snapshots during model training. Without a feature store, teams rewrite the same transformation logic in both offline training pipelines and online serving code, creating consistency bugs and data leakage.

Dual Store Architecture

The canonical feature store design separates storage into two tiers:

  • Online store (Redis): Stores the current value of each feature per entity. Keyed as feature:{feature_name}:{entity_id}. Serves inference requests in under 5 ms. Each key carries a TTL (online_ttl_seconds) so stale features expire automatically.
  • Offline store (Parquet on S3): Stores the full history of feature values with event timestamps. Partitioned by date. Used to construct training datasets with point-in-time correct joins. Queried via Spark or DuckDB.

The two stores are written from shared pipelines — not separate codebases — so transformation logic is defined once and applied consistently to both.

Feature Ingestion Pipelines

Batch Ingestion

A scheduled Spark job reads raw source data (data warehouse, data lake), applies transformations defined in the feature registry, writes output rows as Parquet partitions to S3, and backfills Redis with the latest value per entity. Runs daily or hourly depending on feature freshness requirements.
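
The "latest value per entity" reduction used for the Redis backfill can be sketched in pandas (a minimal illustration; the real job would run in Spark over the full Parquet history):

```python
import pandas as pd

def latest_per_entity(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the newest row per entity_id: the snapshot the batch job
    pushes into Redis after writing the full history to Parquet."""
    return df.sort_values("event_timestamp").groupby("entity_id").tail(1)
```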

Streaming Ingestion

A Kafka consumer reads events in real time (user clicks, transactions, sensor readings), applies the same transformation logic, and writes updated values to Redis immediately. It also appends events to a Kafka topic consumed by a micro-batch job that writes to the offline store, keeping both tiers consistent within minutes.
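
The consumer's transformation step stays testable if it is isolated as a pure function that maps one raw event to the Redis writes to perform (the event shape and helper name here are illustrative assumptions, not from the source):

```python
import json

def plan_online_writes(event: dict, ttl_seconds: int = 86400) -> list[tuple[str, str, int]]:
    """Map one stream event to (redis_key, serialized_value, ttl_seconds) tuples.

    Assumes events shaped like:
        {"entity_id": "u1", "features": {"txn_count_30d": 4, ...}}
    The Kafka consumer would apply these writes with a Redis pipeline.
    """
    entity_id = event["entity_id"]
    return [
        (f"feature:{name}:{entity_id}", json.dumps(value), ttl_seconds)
        for name, value in event["features"].items()
    ]
```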

Point-in-Time Correct Training Joins

Data leakage is the most dangerous bug in training pipeline design. Consider a churn prediction model: if you join user features using today's values but your labels are from six months ago, your model sees features computed after the label event — features that would not have been available at prediction time.

Point-in-time correct joins solve this with AS OF semantics:

  1. Start with a label DataFrame containing (entity_id, label_timestamp, label_value).
  2. For each row, query the offline store for the latest feature value where event_timestamp <= label_timestamp.
  3. Use a range join in Spark or a window function in SQL to efficiently fetch the correct historical snapshot.
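
The three steps above can be sketched with pandas' merge_asof, which implements exactly this backward AS OF lookup (a minimal illustration with made-up entities and a hypothetical txn_count_30d feature; at scale the same join runs in Spark or DuckDB):

```python
import pandas as pd

# Label rows: (entity_id, label_timestamp, label_value)
labels = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2"],
    "label_timestamp": pd.to_datetime(["2024-03-01", "2024-06-01", "2024-05-01"]),
    "label_value": [0, 1, 0],
})

# Historical feature rows from the offline store
features = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2"],
    "event_timestamp": pd.to_datetime(["2024-02-15", "2024-05-20", "2024-04-30"]),
    "txn_count_30d": [3, 9, 5],
})

# merge_asof requires both sides sorted on the time key
labels = labels.sort_values("label_timestamp")
features = features.sort_values("event_timestamp")

# direction="backward": latest feature row with event_timestamp <= label_timestamp
joined = pd.merge_asof(
    labels, features,
    left_on="label_timestamp", right_on="event_timestamp",
    by="entity_id", direction="backward",
)
```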

This guarantee is the primary value proposition of a dedicated feature store over ad-hoc data lake queries.

Feature Registry

The registry is the metadata catalog for all features. Each feature record stores: name, entity type (user, item, session), dtype, description, owner, current version, and online_ttl_seconds. Feature groups bundle related features under a single name for easy retrieval. The registry is the source of truth referenced by both ingestion pipelines and serving code.

SQL Schema

-- Core feature metadata
CREATE TABLE Feature (
    id              BIGSERIAL PRIMARY KEY,
    name            VARCHAR(255) NOT NULL,
    entity_type     VARCHAR(100) NOT NULL,
    dtype           VARCHAR(50)  NOT NULL,   -- float, int, string, embedding
    version         INT          NOT NULL DEFAULT 1,
    description     TEXT,
    online_ttl_seconds INT       NOT NULL DEFAULT 86400,
    created_at      TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    UNIQUE (name, version)
);

-- Historical feature values (offline store metadata / append log)
CREATE TABLE FeatureValue (
    feature_id      BIGINT       NOT NULL REFERENCES Feature(id),
    entity_id       VARCHAR(255) NOT NULL,
    value           JSONB        NOT NULL,
    event_timestamp TIMESTAMPTZ  NOT NULL,
    ingested_at     TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    PRIMARY KEY (feature_id, entity_id, event_timestamp)
);
CREATE INDEX ON FeatureValue (feature_id, entity_id, event_timestamp DESC);

-- Feature groups: named bundles of features
CREATE TABLE FeatureGroup (
    name        VARCHAR(255) PRIMARY KEY,
    features    JSONB        NOT NULL,  -- list of feature names
    owner       VARCHAR(255) NOT NULL,
    created_at  TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

-- Monitoring snapshots
CREATE TABLE FeatureMonitor (
    feature_id    BIGINT      NOT NULL REFERENCES Feature(id),
    mean          FLOAT,
    stddev        FLOAT,
    null_rate     FLOAT,
    computed_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (feature_id, computed_at)
);

Python Interface

import redis
import json
from datetime import datetime
import pandas as pd

r = redis.Redis(host="redis-feature-store", port=6379, decode_responses=True)

def get_online_features(entity_id: str, feature_names: list[str]) -> dict:
    """Fetch current feature values from Redis for online inference."""
    pipe = r.pipeline()
    for name in feature_names:
        pipe.get(f"feature:{name}:{entity_id}")
    values = pipe.execute()
    return {name: json.loads(v) if v else None for name, v in zip(feature_names, values)}

def get_offline_features(
    entity_ids: list[str],
    feature_names: list[str],
    as_of_timestamps: list[datetime],
) -> pd.DataFrame:
    """
    Point-in-time correct fetch from offline store.
    Returns one row per (entity_id, as_of_timestamp) with feature columns.
    Implemented via Spark range join or DuckDB query over Parquet partitions.
    """
    label_df = pd.DataFrame({
        "entity_id": entity_ids,
        "as_of_timestamp": as_of_timestamps,
    })
    results = []
    for _, row in label_df.iterrows():
        features = _query_offline_store(row["entity_id"], feature_names, row["as_of_timestamp"])
        results.append({"entity_id": row["entity_id"], **features})
    return pd.DataFrame(results)

def _query_offline_store(entity_id, feature_names, as_of_ts):
    # Executes: SELECT value FROM FeatureValue
    # WHERE entity_id = %s AND feature_id IN (...) AND event_timestamp <= %s
    # ORDER BY event_timestamp DESC LIMIT 1 (per feature),
    # via Spark or DuckDB over the Parquet partitions.
    raise NotImplementedError

def write_features(df: pd.DataFrame) -> None:
    """
    Write a batch of feature values to both stores.
    df must have columns: entity_id, event_timestamp, plus one column per feature.
    """
    pipe = r.pipeline()
    for _, row in df.iterrows():
        for col in df.columns:
            if col in ("entity_id", "event_timestamp"):
                continue
            key = f"feature:{col}:{row['entity_id']}"
            pipe.set(key, json.dumps(row[col]), ex=86400)  # fixed 1-day TTL; ideally the feature's online_ttl_seconds from the registry
        # Also write to offline store (Parquet append / DB insert) -- omitted here
    pipe.execute()

def detect_drift(feature_id: int, current_sample: list[float], baseline_mean: float, baseline_std: float) -> bool:
    """
    Simple z-score drift check: flag if sample mean deviates more than 3 sigma from baseline.
    Production systems use PSI or KL divergence over full distributions.
    """
    import statistics
    if not current_sample:
        return False
    sample_mean = statistics.mean(current_sample)
    if baseline_std == 0:
        return sample_mean != baseline_mean
    z = abs(sample_mean - baseline_mean) / baseline_std
    return z > 3.0

Feature Versioning Strategy

When a feature schema changes in a backward-incompatible way — dtype change, entity key rename, semantic redefinition — increment the version and register a new Feature row. Old models continue reading feature_name:v1 keys from Redis. New models are trained against feature_name:v2 data. Both versions run in parallel during the cutover window. Once all models have migrated, the old version's ingestion pipeline is stopped and its Redis keys are allowed to expire.
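
The pinning scheme can be sketched as a key-resolution helper (the v{n} key segment is an assumed layout extending the document's feature:{name}:{entity_id} scheme, not something the registry spells out):

```python
def resolve_versioned_keys(pinned_versions: dict[str, int], entity_id: str) -> list[str]:
    """Build Redis keys for the exact feature versions a model was trained
    against. Example: {"txn_count_30d": 2} with entity "u1" yields
    ["feature:txn_count_30d:v2:u1"]. v1 and v2 keys coexist during cutover.
    """
    return [
        f"feature:{name}:v{version}:{entity_id}"
        for name, version in pinned_versions.items()
    ]
```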

Monitoring and Alerting

A nightly monitoring job samples online feature values, computes mean, stddev, and null rate, stores them in FeatureMonitor, and compares against the baseline captured at training time. Alerts fire when null rate increases sharply (upstream data outage), mean shifts beyond threshold (pipeline bug or data source change), or a feature stops updating (ingestion pipeline failure). Monitoring is the operational backbone of a production feature store.
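
The PSI comparison mentioned above can be sketched in a few lines (a simplified implementation; the thresholds in the docstring are a common industry rule of thumb, not from this document):

```python
import numpy as np

def population_stability_index(expected, actual, bins: int = 10) -> float:
    """PSI between a baseline (training-time) sample and a recent serving sample.

    Bin edges come from the baseline's quantiles; a small epsilon keeps empty
    bins from producing log(0). Rule of thumb: < 0.1 stable, 0.1-0.25 moderate
    shift, > 0.25 major shift.
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    # Clip serving values into the baseline range so every value lands in a bin
    actual = np.clip(actual, edges[0], edges[-1])
    eps = 1e-6
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```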

Design Trade-offs

  • Online store freshness vs. cost: Shorter TTL and higher streaming throughput improve freshness but increase Redis memory and Kafka throughput costs.
  • Point-in-time join cost: Range joins over large Parquet datasets are expensive. Pre-materializing training snapshots at common timestamps amortizes repeated training runs.
  • Registry complexity: A rich registry with lineage, SLAs, and tagging adds operational overhead but pays dividends when debugging model regressions caused by upstream feature changes.

