Question 1

How do you design the data preprocessing stage of an ML pipeline to be both reproducible and scalable?

Accepted Answer

Represent each preprocessing step as a versioned, stateless transformation function that reads from an immutable input dataset (identified by a content hash or versioned URI in object storage) and writes to a new immutable output dataset. Orchestrate the steps as a DAG (e.g., via Apache Airflow or Kubeflow Pipelines); because no step mutates its input, rerunning a step with the same input and code yields the same output, which is what makes reruns idempotent. Package the transformation code and its requirements manifest in a container image, and reference that image by digest rather than by a mutable tag. A feature store (e.g., Feast) serves as the boundary between preprocessing and training: features are computed once and served consistently to both training and online inference, which guards against training-serving skew. Log the input and output dataset URIs and the code image digest to the experiment tracker on every run.
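As a rough illustration of the idea above, the sketch below uses the local filesystem as a stand-in for object storage. The names `content_hash`, `run_step`, and `lowercase`, the output layout, and the example digest strings are all hypothetical; a real pipeline would hash and write through its storage client instead.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_step(name, transform, input_path: Path, out_root: Path, image_digest: str) -> dict:
    """Apply a stateless transform to an immutable input and write a new output.

    The output location is derived from the input hash and the code image
    digest, so a rerun with the same input and code lands on the same path.
    """
    in_hash = content_hash(input_path)
    key = hashlib.sha256(f"{name}:{in_hash}:{image_digest}".encode()).hexdigest()[:16]
    out_path = out_root / name / key / "output.txt"
    if not out_path.exists():  # existing outputs are never overwritten
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(transform(input_path.read_text()))
    # Lineage record to send to the experiment tracker.
    return {
        "step": name,
        "input_uri": str(input_path),
        "input_hash": in_hash,
        "output_uri": str(out_path),
        "image_digest": image_digest,
    }


def lowercase(text: str) -> str:
    """Example stateless transformation (hypothetical)."""
    return text.lower()
```

Calling `run_step` twice with the same file and digest returns the same record and does no extra work, while changing the digest produces a new output location and leaves the old one in place.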

Question 2

What data does an experiment tracker need to capture, and how should it be queried?

Accepted Answer

An experiment tracker must capture: (1) hyperparameters as a JSON map, (2) metrics time-series (step, metric_name, value) for loss curves and eval scores, (3) dataset version URIs and code commit SHA, (4) hardware config (GPU type, count), and (5) artifact references (model checkpoint paths, confusion matrix images). Schema: `runs` (run_id PK, experiment_id FK, status, created_at, git_commit, params JSONB) and `metrics` (run_id, step, name, value, logged_at). Index `runs` on (experiment_id) and `metrics` on (name, run_id, step) for efficient comparison queries; because `metrics` has no experiment_id column, an index on (experiment_id, name, step) is only possible if experiment_id is denormalized into that table. Expose a comparison API that ranks runs by a target metric and returns the top N side by side with their parameters. Systems like MLflow or Weights & Biases implement this pattern; in a custom design, back the metrics table with a columnar store (e.g., ClickHouse) for fast aggregation across thousands of runs.
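The two tables and a top-N comparison query can be sketched with Python's built-in `sqlite3` module. This is an illustration only: SQLite stores `params` as JSON text rather than JSONB, the `metrics` table as listed has no `experiment_id` column so the sketch indexes `runs` and `metrics` separately, and the index names and the `top_runs` helper are hypothetical.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE runs (
    run_id        TEXT PRIMARY KEY,
    experiment_id TEXT NOT NULL,
    status        TEXT NOT NULL,
    created_at    TEXT DEFAULT CURRENT_TIMESTAMP,
    git_commit    TEXT,
    params        TEXT  -- JSON text here; JSONB in PostgreSQL
);
CREATE TABLE metrics (
    run_id    TEXT NOT NULL REFERENCES runs(run_id),
    step      INTEGER NOT NULL,
    name      TEXT NOT NULL,
    value     REAL NOT NULL,
    logged_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_runs_experiment ON runs (experiment_id);
CREATE INDEX idx_metrics_lookup ON metrics (name, run_id, step);
"""


def top_runs(conn, experiment_id, metric, n=3, higher_is_better=True):
    """Return the top-n runs of an experiment ranked by their best metric value."""
    # Both fragments come from fixed literals, never from user input.
    agg = "MAX" if higher_is_better else "MIN"
    order = "DESC" if higher_is_better else "ASC"
    sql = f"""
        SELECT r.run_id, r.params, {agg}(m.value) AS best
        FROM runs r
        JOIN metrics m ON m.run_id = r.run_id
        WHERE r.experiment_id = ? AND m.name = ?
        GROUP BY r.run_id, r.params
        ORDER BY best {order}
        LIMIT ?
    """
    rows = conn.execute(sql, (experiment_id, metric, n))
    return [(run_id, json.loads(params), best) for run_id, params, best in rows]
```

Ranking by each run's best value is one choice among several; ranking by the value at the final step is equally reasonable and only changes the aggregate.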

Question 3

How do you design a model registry that supports promotion gates and rollback?

Accepted Answer

A model registry stores versioned model artifacts and their lifecycle state. Schema: `registered_models` (name PK), `model_versions` (model_name, version INT, run_id FK, artifact_uri, stage ENUM('Staging','Production','Archived'), created_at, promoted_by). Only one version per model name may occupy 'Production' at a time; this is enforced by a partial unique index on (model_name, stage) WHERE stage='Production'. Promotion requires passing automated gates: an evaluation job runs on a held-out dataset and compares the challenger's metrics against the current production version, and the gate passes only if the challenger meets a minimum threshold and does not regress on key slices. Rollback is a stage transition, not a deletion: within a single transaction, archive the bad version and then set the previous production version back to 'Production'. Archiving first keeps the unique index satisfied at every statement, and because no artifact is deleted, the audit trail is preserved.
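The constraint, the gate, and the stage transition can be sketched with `sqlite3`, which also supports partial unique indexes. Several details are assumptions made for illustration: SQLite has no ENUM, so a CHECK constraint stands in for it; the index covers `model_name` alone, which is equivalent under the WHERE clause; and the `gate_passes` and `promote` helpers, the metric dictionary shape, and the `tolerance` parameter are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE registered_models (name TEXT PRIMARY KEY);
CREATE TABLE model_versions (
    model_name   TEXT NOT NULL REFERENCES registered_models(name),
    version      INTEGER NOT NULL,
    run_id       TEXT,
    artifact_uri TEXT NOT NULL,
    stage        TEXT NOT NULL DEFAULT 'Staging'
                 CHECK (stage IN ('Staging', 'Production', 'Archived')),
    created_at   TEXT DEFAULT CURRENT_TIMESTAMP,
    promoted_by  TEXT,
    PRIMARY KEY (model_name, version)
);
-- At most one Production row per model.
CREATE UNIQUE INDEX one_production_per_model
    ON model_versions (model_name) WHERE stage = 'Production';
"""


def gate_passes(challenger, champion, min_overall, tolerance=0.0):
    """Challenger must clear the threshold and not regress on any champion slice."""
    if challenger["overall"] < min_overall:
        return False
    return all(
        challenger["slices"].get(name, float("-inf")) >= score - tolerance
        for name, score in champion["slices"].items()
    )


def promote(conn, model_name, version, user):
    """Move one version to Production, archiving the current one first.

    Rollback is the same call with the previous version number.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE model_versions SET stage = 'Archived' "
            "WHERE model_name = ? AND stage = 'Production'",
            (model_name,),
        )
        cur = conn.execute(
            "UPDATE model_versions SET stage = 'Production', promoted_by = ? "
            "WHERE model_name = ? AND version = ?",
            (user, model_name, version),
        )
        if cur.rowcount != 1:
            raise LookupError(f"{model_name} v{version} is not registered")
```

If the target version does not exist, the exception rolls the transaction back, so the current production version is not left archived by a failed promotion.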

Question 4

How do you handle pipeline failures and partial reruns without reprocessing the entire dataset?

Accepted Answer

Use checkpointing at DAG task boundaries: each task writes its output to a deterministic URI (e.g., s3://bucket/pipeline-id/step-name/dataset-hash/output/) and records completion in a `task_runs` table (pipeline_run_id, task_id, input_hash, output_uri, status). On rerun, the orchestrator computes the input hash for each task; if a completed record exists with a matching hash, the task is skipped and its cached output URI is passed downstream. This is content-addressed caching: identical inputs always map to the same output URI, so reruns after transient failures resume from the last successful checkpoint. The input hash should cover everything that determines the output, including upstream dataset hashes, task parameters, and the code version, so that a code change invalidates stale cache entries. For tasks with non-deterministic outputs (e.g., random sampling), fix the random seed and include it in the input hash. To catch hung tasks and SLA breaches, have each task emit a heartbeat and alert when the heartbeat times out, rather than polling task status.
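A minimal sketch of the skip-on-match logic follows, with an in-memory dict standing in for the `task_runs` table. The function names, the record layout, the choice of which fields feed the hash, and the `s3://bucket/...` path are assumptions for illustration, and `fn` is any callable that writes the task's output to the given URI.

```python
import hashlib
import json


def compute_input_hash(task_id, upstream_uris, params, code_version):
    """Hash everything assumed to determine the task's output."""
    payload = json.dumps(
        {
            "task": task_id,
            "upstream": list(upstream_uris),
            "params": params,  # include the random seed here if the task samples
            "code": code_version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def run_task(task_runs, pipeline_id, task_id, fn, upstream_uris, params, code_version):
    """Run a task unless a succeeded record with the same input hash exists.

    Returns (output_uri, skipped).
    """
    input_hash = compute_input_hash(task_id, upstream_uris, params, code_version)
    key = (task_id, input_hash)
    record = task_runs.get(key)
    if record and record["status"] == "succeeded":
        return record["output_uri"], True  # cache hit: pass the URI downstream

    output_uri = f"s3://bucket/{pipeline_id}/{task_id}/{input_hash}/output/"
    task_runs[key] = {"output_uri": output_uri, "status": "running"}
    try:
        fn(output_uri, params)
    except Exception:
        task_runs[key]["status"] = "failed"  # a failed record never counts as a hit
        raise
    task_runs[key]["status"] = "succeeded"
    return output_uri, False
```

On a rerun after a transient failure, tasks that already succeeded return their cached URI immediately, and only the failed task and anything downstream of it execute again.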

ML Training Pipeline Low-Level Design: Data Preprocessing, Experiment Tracking, and Model Registry

Pipeline Stages Overview

Data Versioning

Feature Engineering

Distributed Training

Experiment Tracking

Hyperparameter Tuning

Evaluation Metrics

Model Registry

Promotion Workflow and Serving Deployment

Drift Detection