ML Training Pipeline Low-Level Design: Data Preprocessing, Experiment Tracking, and Model Registry

Pipeline Stages Overview

An ML training pipeline moves data through a sequence of well-defined stages: data ingestion, feature engineering, model training, evaluation, registration, and deployment. Each stage produces versioned artifacts consumed by the next, making the entire workflow reproducible and auditable.

Data Versioning

Raw datasets must be versioned alongside model code. Two common approaches:

  • DVC (Data Version Control): Git-like versioning for large files stored in S3 or GCS. A .dvc pointer file is committed to git, while the actual data lives in remote storage. Any training run can check out the exact dataset that produced a given model.
  • Delta Lake: ACID-compliant table format on top of Parquet. Time-travel queries let you reproduce any historical snapshot of a dataset with SELECT * FROM table VERSION AS OF 42.

Without dataset versioning, debugging a model regression is nearly impossible — you cannot know whether the change came from code, data, or hyperparameters.
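The pointer-file mechanism behind DVC can be sketched in plain Python: hash the dataset's contents and commit only a small pointer, while the data itself lives in remote storage. The file names and the JSON pointer format below are illustrative, not DVC's actual .dvc schema:

```python
import hashlib
import json
from pathlib import Path

def write_pointer(data_path: str, pointer_path: str) -> str:
    """Hash a dataset file and write a small, git-committable pointer to it."""
    digest = hashlib.md5(Path(data_path).read_bytes()).hexdigest()
    Path(pointer_path).write_text(json.dumps({"path": data_path, "md5": digest}))
    return digest

# The pointer file is committed to git; the data file itself goes to S3/GCS.
Path("train.csv").write_text("id,label\n1,0\n2,1\n")
digest = write_pointer("train.csv", "train.csv.dvc")
```

Any later run can compare the stored digest against the file it fetched from remote storage to confirm it is training on the exact dataset that produced a given model.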

Feature Engineering

Preprocessing steps applied during training must be identical during serving. Common transformations:

  • Normalization: fit StandardScaler on training set, apply same parameters at inference — never refit on test data.
  • Categorical encoding: ordinal or one-hot encoding with a fixed vocabulary built from training data.
  • Train/val/test split: stratified split to preserve class distribution; test set is held out until final evaluation.
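The fit-on-train, apply-everywhere rule for normalization can be sketched without any library (a hand-rolled stand-in for sklearn's StandardScaler, for illustration only):

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn mean and std from the training set only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against zero variance
    return mu, sigma

def apply_scaler(values, mu, sigma):
    """Apply the *training* parameters at validation, test, and inference time."""
    return [(v - mu) / sigma for v in values]

mu, sigma = fit_scaler([1.0, 2.0, 3.0, 4.0])   # fit on train only
scaled_test = apply_scaler([5.0], mu, sigma)   # never refit on test data
```

Persisting (mu, sigma) alongside the model artifact ensures the serving path applies exactly the parameters learned at training time.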

The training-serving skew problem arises when preprocessing logic diverges between training and production. A feature store (Feast, Tecton) solves this by centralizing feature computation and serving precomputed features to both training jobs and inference services from the same source.

Distributed Training

Large models often exceed a single GPU's memory, and even models that fit train too slowly on one device. Two parallelism strategies:

  • Data parallelism (PyTorch DDP): Each GPU holds a full model copy and processes a shard of the mini-batch. After the backward pass, gradients are averaged across all GPUs via an all-reduce operation. Throughput scales near-linearly with GPU count until communication overhead dominates.
  • Model parallelism: Layers or tensor shards are split across GPUs. Required when a single layer's weights exceed GPU memory. Pipeline parallelism (GPipe) staggers micro-batches across model stages to reduce GPU idle time.
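The gradient-averaging step in data parallelism can be illustrated with a toy all-reduce over per-worker gradient lists (a simulation of what DDP does after backward(), not the NCCL implementation):

```python
def all_reduce_mean(worker_grads):
    """Average each parameter's gradient across workers, as DDP does after backward()."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n_workers for i in range(n_params)]

# Two workers, each holding gradients for three parameters
# computed on its own shard of the mini-batch.
grads = all_reduce_mean([[0.2, -0.4, 1.0],
                         [0.4, -0.2, 0.0]])
# Every worker then applies the same averaged gradient, keeping replicas in sync.
```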

Experiment Tracking

MLflow is among the most widely used open-source experiment trackers:

  • Log hyperparameters: mlflow.log_param("lr", 1e-3)
  • Log metrics per epoch: mlflow.log_metric("val_loss", val_loss, step=epoch)
  • Log artifacts: model checkpoints, confusion matrices, feature importance plots
  • The MLflow UI displays metric plots and allows side-by-side run comparison

Every training run receives a unique run_id that links to its parameters, metrics, and output artifacts — essential for reproducing any prior result.
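The run_id linkage can be illustrated with a minimal in-memory tracker. This is a sketch of the pattern, not MLflow's actual implementation:

```python
import uuid

class RunTracker:
    """Tiny stand-in for an experiment tracker: one record per run_id."""
    def __init__(self):
        self.runs = {}

    def start_run(self):
        run_id = uuid.uuid4().hex
        self.runs[run_id] = {"params": {}, "metrics": [], "artifacts": []}
        return run_id

    def log_param(self, run_id, key, value):
        self.runs[run_id]["params"][key] = value

    def log_metric(self, run_id, name, value, step):
        self.runs[run_id]["metrics"].append((name, value, step))

tracker = RunTracker()
run_id = tracker.start_run()
tracker.log_param(run_id, "lr", 1e-3)
tracker.log_metric(run_id, "val_loss", 0.42, step=1)
```

Everything logged hangs off the run_id, so reproducing a result starts with a single lookup.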

Hyperparameter Tuning

Manual grid search is impractical for large search spaces. Automated tuning frameworks:

  • Optuna: Bayesian optimization with the TPE sampler. Define a search space and an objective function; Optuna samples configurations, runs trials, and prunes unpromising trials early (e.g., with its median or Hyperband pruners).
  • Ray Tune: Distributed tuning on a Ray cluster. Trials run as parallel actors; integrates with Optuna/HyperOpt samplers. Supports population-based training (PBT) for adaptive schedules.
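The trial loop these frameworks automate can be sketched as plain random search over a search space (Optuna's TPE sampler and pruning replace the uniform sampling and fixed trial budget here; the objective function is hypothetical):

```python
import random

def objective(lr, batch_size):
    """Hypothetical validation loss as a function of hyperparameters."""
    return (lr - 0.01) ** 2 + 0.001 * abs(batch_size - 64)

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_loss, best_params = float("inf"), None
    for _ in range(n_trials):
        params = {"lr": 10 ** rng.uniform(-4, -1),        # log-uniform over [1e-4, 1e-1]
                  "batch_size": rng.choice([16, 32, 64, 128])}
        loss = objective(**params)
        if loss < best_loss:
            best_loss, best_params = loss, params
    return best_loss, best_params

best_loss, best_params = random_search(n_trials=50)
```

A Bayesian sampler improves on this by fitting a model of loss as a function of the parameters and sampling where that model expects improvement.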

Evaluation Metrics

Offline metrics measure model quality on held-out data:

  • Classification: accuracy, F1, AUC-ROC, precision-recall curve
  • Ranking/recommendation: precision@K, NDCG, MRR
  • Regression: RMSE, MAE, R²
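The regression metrics above are a few lines each. They are hand-rolled here for clarity; sklearn.metrics provides the same functions:

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors quadratically."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: robust to outliers relative to RMSE."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true, y_pred = [3.0, 5.0, 7.0], [2.5, 5.0, 7.5]
```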

Offline metrics must correlate with business metrics — revenue lift, CTR, user retention. A model with higher AUC but lower CTR in A/B testing indicates offline-online misalignment. Validate this alignment before any promotion to production.

Model Registry

The model registry is a versioned catalog of trained model artifacts. Each registered version carries:

  • The MLflow run_id that produced it
  • Evaluation metrics on the validation and test sets
  • Dataset version reference
  • A lifecycle stage: None → Staging → Production → Archived

MLflow Model Registry supports this workflow natively. Promotion between stages is gated by evaluation criteria and requires explicit approval.
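The one-Production-version-at-a-time invariant can be sketched as a small state machine. This is a simplified model of the registry workflow, not MLflow's implementation, and the model name is hypothetical:

```python
STAGES = {"None", "Staging", "Production", "Archived"}

class ModelRegistry:
    """Versioned model catalog; at most one Production version per model name."""
    def __init__(self):
        self.versions = {}  # (name, version) -> stage

    def register(self, name, version):
        self.versions[(name, version)] = "None"

    def transition(self, name, version, stage):
        assert stage in STAGES
        if stage == "Production":
            # Archive whichever version currently holds Production for this name.
            for (n, v), s in self.versions.items():
                if n == name and s == "Production":
                    self.versions[(n, v)] = "Archived"
        self.versions[(name, version)] = stage

registry = ModelRegistry()
registry.register("churn-model", 1)
registry.transition("churn-model", 1, "Production")
registry.register("churn-model", 2)
registry.transition("churn-model", 2, "Production")  # v1 is auto-archived
```

Rollback is just another transition: move the archived version back to Production, with no artifact ever deleted.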

Promotion Workflow and Serving Deployment

The promotion path from training to production:

  1. Training job completes, model logged to MLflow
  2. Evaluation pipeline runs on test set — must meet metric thresholds
  3. Model promoted to Staging for shadow evaluation or offline A/B comparison
  4. A/B test against current Production model on live traffic
  5. Model promoted to Production if online metrics improve
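The metric gate in step 2 can be sketched as a check against both an absolute floor and the current production model. The AUC threshold and metric name here are illustrative:

```python
def passes_gate(challenger, production, min_auc=0.75, max_regression=0.0):
    """Promote only if the challenger clears the floor and does not regress."""
    if challenger["auc"] < min_auc:
        return False  # fails the absolute quality threshold
    return challenger["auc"] >= production["auc"] - max_regression

ok = passes_gate({"auc": 0.83}, {"auc": 0.81})
```

A production-grade gate would also compare per-slice metrics so a model cannot win on the aggregate while regressing on an important subpopulation.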

Serving export options: ONNX for cross-framework portability; TorchServe or TensorFlow Serving for native framework deployment; a feature store + scoring microservice for online prediction with precomputed features.

Drift Detection

Production models degrade as input distributions shift over time. Monitor feature distributions in production vs. training using:

  • PSI (Population Stability Index): PSI > 0.2 indicates significant shift
  • KL divergence: measures information distance between distributions

Alert on drift crossing thresholds and trigger retraining. Label drift (output distribution shift without input shift) signals concept drift — the relationship between features and labels has changed.

Frequently Asked Questions

Q: How do you design the data preprocessing stage of an ML pipeline to be both reproducible and scalable?

A: Represent each preprocessing step as a versioned, stateless transformation function that reads from an immutable input dataset (identified by a content hash or versioned URI in object storage) and writes to a new immutable output dataset. Orchestrate steps as a DAG (e.g., via Apache Airflow or Kubeflow Pipelines) so reruns are idempotent. Store transformation code alongside a requirements manifest in a container image tagged with a digest. Feature stores (e.g., Feast) serve as the boundary between preprocessing and training: computed features are written once and served consistently to both training and online inference, eliminating training-serving skew. Log input/output dataset URIs and code image digests to the experiment tracker on every run.

Q: What data does an experiment tracker need to capture, and how should it be queried?

A: An experiment tracker must capture: (1) hyperparameters as a JSON map, (2) metric time series (step, metric_name, value) for loss curves and eval scores, (3) dataset version URIs and the code commit SHA, (4) hardware config (GPU type, count), and (5) artifact references (model checkpoint paths, confusion matrix images). Schema: runs (run_id PK, experiment_id FK, status, created_at, git_commit, params JSONB) and metrics (run_id, step, name, value, logged_at). Index on (experiment_id, name, step) for efficient comparison queries. Expose a comparison API that pivots the top-N runs by a target metric. Systems like MLflow or Weights & Biases implement this pattern; in a custom design, back the metrics table with a columnar store (e.g., ClickHouse) for fast aggregation across thousands of runs.

Q: How do you design a model registry that supports promotion gates and rollback?

A: A model registry stores versioned model artifacts and their lifecycle state. Schema: registered_models (name PK), model_versions (model_name, version INT, run_id FK, artifact_uri, stage ENUM('Staging', 'Production', 'Archived'), created_at, promoted_by). Only one version per model name may occupy 'Production' at a time, enforced by a partial unique index on (model_name, stage) WHERE stage = 'Production'. Promotion requires passing automated gates: an evaluation job runs on a held-out dataset and compares metrics against the current production version; the gate passes only if the challenger meets a minimum threshold and does not regress on key slices. Rollback is a stage transition: set the previous production version back to 'Production' and archive the bad version — no artifact deletion, preserving auditability.

Q: How do you handle pipeline failures and partial reruns without reprocessing the entire dataset?

A: Use checkpointing at DAG task boundaries: each task writes its output to a deterministic URI (e.g., s3://bucket/pipeline-id/step-name/dataset-hash/output/) and records completion in a task_runs table (pipeline_run_id, task_id, input_hash, output_uri, status). On rerun, the orchestrator computes the input hash for each task; if a completed record exists with a matching hash, the task is skipped and its cached output URI is passed downstream. This is content-addressed caching — identical inputs always produce the same output URI, so reruns after transient failures resume from the last successful checkpoint. For tasks with non-deterministic outputs (e.g., random sampling), store the random seed as part of the input hash. Alert on task SLA breaches using a heartbeat timeout rather than polling.

