ML Training Pipeline Low-Level Design: Data Preprocessing, Experiment Tracking, and Model Registry

Pipeline Stages Overview

An ML training pipeline moves data through a sequence of well-defined stages: data ingestion, feature engineering, model training, evaluation, registration, and deployment. Each stage produces versioned artifacts consumed by the next, making the entire workflow reproducible and auditable.

Data Versioning

Raw datasets must be versioned alongside model code. Two common approaches:

  • DVC (Data Version Control): Git-like versioning for large files stored in S3 or GCS. A .dvc pointer file is committed to git, while the actual data lives in remote storage. Any training run can check out the exact dataset that produced a given model.
  • Delta Lake: ACID-compliant table format on top of Parquet. Time-travel queries let you reproduce any historical snapshot of a dataset with SELECT * FROM table VERSION AS OF 42.

Without dataset versioning, debugging a model regression is nearly impossible — you cannot know whether the change came from code, data, or hyperparameters.
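
The pointer-file idea behind DVC can be illustrated with a minimal sketch. This is not DVC's actual file format or API: the `.pointer.json` suffix, the helper names, and the use of SHA-256 are assumptions made for illustration. The point is only that a small, git-committable file can pin the exact bytes of a dataset.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small JSON pointer next to the data (hypothetical format)."""
    pointer = {
        "path": data_path.name,
        "sha256": file_sha256(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.parent / (data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def verify(pointer_path: Path) -> bool:
    """Return True if the data on disk still matches the committed pointer."""
    pointer = json.loads(pointer_path.read_text())
    data_path = pointer_path.parent / pointer["path"]
    return data_path.exists() and file_sha256(data_path) == pointer["sha256"]


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_text("x,y\n1,0\n2,1\n")
    ptr = write_pointer(data)            # this small file is what git would track
    unchanged = verify(ptr)              # data still matches the pointer
    data.write_text("x,y\n1,0\n2,0\n")   # a label is silently edited
    changed_detected = not verify(ptr)   # the mismatch is caught
```

A training job could run a check like `verify` before starting, so a run never proceeds against data that differs from what the commit claims.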

Feature Engineering

Preprocessing steps applied during training must be identical during serving. Common transformations:

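  • Normalization: fit StandardScaler on the training set only and apply the same parameters at inference; never refit on validation or test data.
  • Categorical encoding: ordinal or one-hot encoding with a fixed vocabulary built from training data, with a defined fallback for categories unseen during training.
  • Train/val/test split: split before fitting any transformer so no statistics leak from held-out data; use a stratified split to preserve class distribution, and hold out the test set until final evaluation.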
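
The fit-on-train, apply-everywhere rule can be sketched without any ML library. The two classes below are hypothetical stand-ins written for illustration (they are not scikit-learn's implementations), and the choice to encode unseen categories as an all-zero row is an assumption.

```python
import statistics
from typing import Dict, List, Sequence


class SimpleStandardScaler:
    """Learns mean and standard deviation from the training set only."""

    def fit(self, values: Sequence[float]) -> "SimpleStandardScaler":
        self.mean_ = statistics.fmean(values)
        # Population standard deviation; fall back to 1.0 for constant features.
        self.std_ = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        # Reuses the stored training statistics; nothing is refit here.
        return [(v - self.mean_) / self.std_ for v in values]


class FixedVocabOneHot:
    """One-hot encoder whose vocabulary is frozen at fit time."""

    def fit(self, categories: Sequence[str]) -> "FixedVocabOneHot":
        self.index_: Dict[str, int] = {
            c: i for i, c in enumerate(sorted(set(categories)))
        }
        return self

    def transform(self, categories: Sequence[str]) -> List[List[int]]:
        rows = []
        for c in categories:
            row = [0] * len(self.index_)
            if c in self.index_:          # unseen categories stay all-zero
                row[self.index_[c]] = 1
            rows.append(row)
        return rows


train_x = [1.0, 2.0, 3.0]
test_x = [2.0, 5.0]
scaler = SimpleStandardScaler().fit(train_x)   # fit on training data only
scaled_test = scaler.transform(test_x)         # same parameters at inference

encoder = FixedVocabOneHot().fit(["red", "blue", "red"])
encoded = encoder.transform(["red", "green"])  # "green" was never seen in training
```

Serializing the fitted `scaler` and `encoder` alongside the model is one way to keep training and serving on the same parameters.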

The training-serving skew problem arises when preprocessing logic diverges between training and production. A feature store (Feast, Tecton) mitigates this by centralizing feature computation and serving precomputed features to both training jobs and inference services from the same source.

Distributed Training

Large models and datasets often exceed what a single GPU can handle. Two parallelism strategies:

  • Data parallelism (PyTorch DDP): Each GPU holds a full model copy and processes its own shard of the global batch, so the model must still fit on one device. During the backward pass, gradients are averaged across all GPUs via an all-reduce operation. Throughput scales close to linearly with GPU count until communication overhead dominates.
  • Model parallelism: Layers or tensor shards are split across GPUs. Required when the model does not fit in one GPU's memory; tensor parallelism is needed when even a single layer's weights exceed it. Pipeline parallelism (GPipe) splits each mini-batch into micro-batches that flow through the model stages in a staggered fashion to reduce GPU idle time.
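
The gradient-averaging step in data parallelism can be checked with plain arithmetic. The sketch below simulates two workers in a single process; it does not use PyTorch, and `all_reduce_mean` is a hypothetical stand-in for the real collective operation. It assumes equal-sized shards and a mean-reduced loss, under which the averaged shard gradients match the full-batch gradient.

```python
from typing import List, Tuple

Batch = List[Tuple[float, float]]  # (x, y) pairs


def grad_mse(w: float, batch: Batch) -> float:
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for an all-reduce: every worker ends up with the mean."""
    return sum(values) / len(values)


batch: Batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5
num_workers = 2
shard_size = len(batch) // num_workers

# Each simulated worker sees only its own shard of the batch.
shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
local_grads = [grad_mse(w, shard) for shard in shards]

synced_grad = all_reduce_mean(local_grads)  # what every replica applies
full_grad = grad_mse(w, batch)              # single-device reference
```

Because every replica applies the same `synced_grad`, the model copies stay identical after each optimizer step.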

Experiment Tracking

MLflow is one of the most widely used open-source experiment trackers:

  • Log hyperparameters: mlflow.log_param("lr", 1e-3)
  • Log metrics per epoch: mlflow.log_metric("val_loss", val_loss, step=epoch)
  • Log artifacts: model checkpoints, confusion matrices, feature importance plots
  • The MLflow UI displays metric plots and allows side-by-side run comparison

Every training run receives a unique run_id that links to its parameters, metrics, and output artifacts — essential for reproducing any prior result.
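
To make the run_id linkage concrete, here is a toy in-memory tracker. It is not MLflow and does not reproduce MLflow's storage or API; the `Run` and `Tracker` classes are hypothetical and only mirror the shape of the logging calls listed above.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Run:
    """One training run: everything hangs off a unique run_id."""
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    params: Dict[str, str] = field(default_factory=dict)
    metrics: Dict[str, List[Tuple[int, float]]] = field(default_factory=dict)
    artifacts: List[str] = field(default_factory=list)

    def log_param(self, key: str, value: object) -> None:
        self.params[key] = str(value)  # stored as a string in this sketch

    def log_metric(self, key: str, value: float, step: int) -> None:
        self.metrics.setdefault(key, []).append((step, value))

    def log_artifact(self, path: str) -> None:
        self.artifacts.append(path)


class Tracker:
    """Keeps runs addressable by run_id."""

    def __init__(self) -> None:
        self._runs: Dict[str, Run] = {}

    def start_run(self) -> Run:
        run = Run()
        self._runs[run.run_id] = run
        return run

    def get_run(self, run_id: str) -> Run:
        return self._runs[run_id]


tracker = Tracker()
run = tracker.start_run()
run.log_param("lr", 1e-3)
for epoch, val_loss in enumerate([0.9, 0.6, 0.7]):
    run.log_metric("val_loss", val_loss, step=epoch)
run.log_artifact("checkpoints/epoch_1.pt")  # hypothetical path

# Later: recover everything about the run from its id alone.
fetched = tracker.get_run(run.run_id)
best_epoch = min(fetched.metrics["val_loss"], key=lambda p: p[1])[0]
```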

Hyperparameter Tuning

Manual grid search is impractical for large search spaces. Automated tuning frameworks:

  • Optuna: Bayesian optimization with the TPE sampler by default. Define a search space and an objective function; Optuna samples configurations, runs trials, and stops unpromising trials early through a configurable pruner (median pruning by default, with Hyperband and successive halving also available).
  • Ray Tune: Distributed tuning on a Ray cluster. Trials run as parallel actors; integrates with Optuna/HyperOpt samplers. Supports population-based training (PBT) for adaptive schedules.
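
Early stopping of weak trials can be sketched with successive halving, the building block that Hyperband repeats across several budgets. This is a simplified stand-in, not Optuna's or Ray Tune's implementation: it uses random sampling rather than TPE, and `toy_val_loss` is a made-up objective used only so the example runs.

```python
import math
import random
from typing import Callable, Dict, List

Config = Dict[str, float]


def sample_config(rng: random.Random) -> Config:
    """Sample a learning rate log-uniformly from [1e-4, 1e-1]."""
    return {"lr": 10 ** rng.uniform(-4, -1)}


def toy_val_loss(config: Config, epochs: int) -> float:
    """Hypothetical objective: best near lr = 1e-2, improves with budget."""
    return (math.log10(config["lr"]) + 2.0) ** 2 + 1.0 / epochs


def successive_halving(
    configs: List[Config],
    evaluate: Callable[[Config, int], float],
    min_epochs: int = 1,
    eta: int = 2,
) -> Config:
    """Keep the best 1/eta of trials each round, giving survivors more epochs."""
    survivors = list(configs)
    epochs = min_epochs
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, epochs))
        survivors = ranked[: max(1, len(ranked) // eta)]
        epochs *= eta  # survivors earn a larger training budget
    return survivors[0]


rng = random.Random(0)
candidates = [sample_config(rng) for _ in range(16)]
best = successive_halving(candidates, toy_val_loss)
```

With 16 candidates and `eta=2`, the rounds run at 1, 2, 4, and 8 epochs, so most of the compute goes to configurations that already looked promising on a small budget.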

Evaluation Metrics

Offline metrics measure model quality on held-out data:

  • Classification: accuracy, F1, AUC-ROC, precision-recall curve
  • Ranking/recommendation: precision@K, NDCG, MRR
  • Regression: RMSE, MAE, R²
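
As an illustration of the ranking metrics, the sketch below computes precision@K, reciprocal rank, and NDCG@K for a single query with binary relevance. The function names and sample data are assumptions for this example; MRR is the mean of the reciprocal rank over many queries.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none appears."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d"]   # model output, best first
relevant = {"b", "d"}           # ground-truth relevant items

p_at_2 = precision_at_k(ranked, relevant, 2)  # one hit in the top 2
rr = reciprocal_rank(ranked, relevant)        # first hit is at rank 2
ndcg = ndcg_at_k(ranked, relevant, 4)
```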

Offline metrics must correlate with business metrics — revenue lift, CTR, user retention. A model with higher AUC but lower CTR in A/B testing indicates offline-online misalignment. Validate this alignment before any promotion to production.

Model Registry

The model registry is a versioned catalog of trained model artifacts. Each registered version carries:

  • The MLflow run_id that produced it
  • Evaluation metrics on the validation and test sets
  • Dataset version reference
  • A lifecycle stage: None → Staging → Production → Archived

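MLflow Model Registry supports this workflow natively, though recent MLflow releases deprecate fixed stages in favor of model version aliases. Promotion between stages should be gated by evaluation criteria and require explicit approval, typically enforced by the surrounding CI pipeline or review process.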
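
A registry entry and its lifecycle can be modeled as a small state machine. The sketch below is not the MLflow Model Registry API: `ModelVersion`, the transition table, and the approval rule are hypothetical, and allowing archival from any stage is an assumption added for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Set

# Allowed stage transitions (archival from any stage is an assumption).
ALLOWED: Dict[str, Set[str]] = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


@dataclass
class ModelVersion:
    """One registered version and the lineage needed to reproduce it."""
    name: str
    version: int
    run_id: str
    dataset_version: str
    metrics: Dict[str, float]
    stage: str = "None"

    def transition(self, new_stage: str, approved_by: str = "") -> None:
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(f"cannot move from {self.stage} to {new_stage}")
        if new_stage == "Production" and not approved_by:
            raise PermissionError("promotion to Production needs an approver")
        self.stage = new_stage


mv = ModelVersion(
    name="churn-model",        # hypothetical model name
    version=3,
    run_id="abc123",           # hypothetical run id
    dataset_version="v42",
    metrics={"auc": 0.91},
)
mv.transition("Staging")
try:
    mv.transition("Production")            # no approver supplied
    blocked = False
except PermissionError:
    blocked = True
mv.transition("Production", approved_by="reviewer@example.com")
```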

Promotion Workflow and Serving Deployment

The promotion path from training to production:

  1. Training job completes, model logged to MLflow
  2. Evaluation pipeline runs on test set — must meet metric thresholds
  3. Model promoted to Staging for shadow evaluation or offline comparison against the current Production model
  4. A/B test against current Production model on live traffic
  5. Model promoted to Production if online metrics improve
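
Steps 2 and 5 above amount to two checks, which can be sketched as follows. The function names, metric names, and thresholds are hypothetical, and the sketch assumes every metric is higher-is-better. A real online comparison would also need a statistical significance test, which is omitted here.

```python
from typing import Dict, List, Tuple


def passes_offline_gate(
    metrics: Dict[str, float], thresholds: Dict[str, float]
) -> Tuple[bool, List[str]]:
    """Check each test-set metric against its minimum; report any failures."""
    failures = [
        f"{name}: {metrics.get(name)} < {minimum}"
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    ]
    return (not failures, failures)


def beats_production(
    candidate_online: float, production_online: float, min_lift: float = 0.0
) -> bool:
    """True if the candidate's online metric exceeds production by min_lift."""
    return candidate_online - production_online > min_lift


# Step 2: offline gate on the test set.
ok, failures = passes_offline_gate(
    metrics={"auc": 0.91, "f1": 0.78},
    thresholds={"auc": 0.90, "f1": 0.80},
)

# Step 5: online comparison, e.g. CTR from an A/B test (made-up numbers).
promote = beats_production(candidate_online=0.052, production_online=0.050,
                           min_lift=0.001)
```

Returning the list of failed checks, rather than a bare boolean, makes it easier to record in the registry why a version was rejected.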

Serving export options: ONNX for cross-framework portability; TorchServe or TensorFlow Serving for native framework deployment; a feature store + scoring microservice for online prediction with precomputed features.

Drift Detection

Production models degrade as input distributions shift over time. Monitor feature distributions in production vs. training using:

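  • PSI (Population Stability Index): computed over binned feature values; by a common rule of thumb, PSI > 0.2 indicates significant shift and 0.1 to 0.2 a moderate one
  • KL divergence: measures how much one distribution diverges from a reference distribution; it is asymmetric, so it is not a true distance metric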

Alert on drift crossing thresholds and trigger retraining. Label drift (output distribution shift without input shift) signals concept drift — the relationship between features and labels has changed.
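
Both drift measures can be computed from binned feature values. The sketch below is a minimal version under stated assumptions: the bin edges are fixed by hand, out-of-range values are clipped into the outer bins, and empty bins are floored at a small `eps` to avoid taking the log of zero (which makes PSI sensitive to the choice of `eps`). The sample data is made up.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def bin_fractions(
    values: Sequence[float], edges: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Fraction of values per bin; empty bins are floored at eps."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        idx = min(max(bisect_right(edges, v) - 1, 0), n_bins - 1)
        counts[idx] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float]) -> float:
    """Population Stability Index between two binned distributions."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q); note that KL(q || p) is generally different."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
train = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
prod_shifted = [0.7, 0.8, 0.8, 0.9, 0.9, 0.9, 0.95, 0.95]

train_frac = bin_fractions(train, edges)
prod_frac = bin_fractions(prod_shifted, edges)

psi_same = psi(train_frac, train_frac)     # identical distributions
psi_shifted = psi(train_frac, prod_frac)   # production has drifted upward
should_alert = psi_shifted > 0.2
```

PSI written this way equals the sum of the KL divergence in both directions, which is why it is symmetric while KL divergence alone is not.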
