Low Level Design: Machine Learning Platform

Training Job Orchestration

A training job submission API accepts the full specification of a run: the training script path, the ML framework (PyTorch, TensorFlow, JAX), compute requirements (GPU type such as A100 or H100, GPU count, CPU and memory limits), the dataset reference, and a hyperparameter map. The scheduler receives the submitted job, inspects current cluster utilization, and assigns the job to available nodes that satisfy the resource constraints. Kubernetes is the standard execution layer, extended with custom operators (Kubeflow Training Operator, Volcano) that understand distributed training semantics: they create the correct number of worker and parameter-server pods, configure collective communication endpoints, and restart individual pods on transient failure without aborting the whole job. Job state transitions (PENDING → RUNNING → SUCCEEDED / FAILED) are persisted in a relational store and surfaced via a status API so the user or an automated pipeline can poll or subscribe to completion events.
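The submission spec and lifecycle above can be sketched as follows. This is a minimal illustration, not the platform's actual API: the field names and the `TrainingJob` / `JobState` types are assumptions chosen to mirror the text, and a real scheduler would persist each transition atomically before acknowledging it.

```python
from dataclasses import dataclass
from enum import Enum

class JobState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"

# Legal transitions for the job lifecycle described above.
TRANSITIONS = {
    JobState.PENDING: {JobState.RUNNING, JobState.FAILED},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED},
    JobState.SUCCEEDED: set(),   # terminal
    JobState.FAILED: set(),      # terminal
}

@dataclass
class TrainingJob:
    script_path: str
    framework: str        # "pytorch" | "tensorflow" | "jax"
    gpu_type: str         # e.g. "A100"
    gpu_count: int
    dataset_ref: str
    hyperparameters: dict
    state: JobState = JobState.PENDING

    def transition(self, new_state: JobState) -> None:
        """Validate and apply a state change; reject illegal transitions."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state  # a real system persists this before acking
```

Encoding the transition table explicitly makes illegal updates (e.g. resurrecting a SUCCEEDED job) fail loudly instead of silently corrupting job history.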

Distributed Training

Three parallelism strategies cover the space of distributed training workloads. Data parallelism replicates the full model on every GPU and splits each mini-batch across the replicas; after the forward and backward pass, gradients are synchronized with an AllReduce collective (NCCL over NVLink or InfiniBand for GPU-to-GPU bandwidth). Each replica applies the averaged gradient, keeping weights identical. This scales well when the model fits on a single GPU. Model parallelism partitions the model’s layers across GPUs when the parameter count exceeds single-device memory; each device holds a shard of the model and activations are passed between devices at layer boundaries. Tensor parallelism (Megatron-style), a finer-grained form of model parallelism, shards individual weight matrices across devices. Pipeline parallelism splits the layer sequence into stages assigned to different devices and fills the pipeline with micro-batches to hide the inter-stage communication latency; the 1F1B schedule minimizes in-flight activation memory. Production systems combine all three (3D parallelism) for models with hundreds of billions of parameters.
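The data-parallel invariant, that every replica applies the same averaged gradient and so stays weight-identical, can be demonstrated with a toy NumPy simulation. This stands in for what NCCL AllReduce (plus division by world size) computes on real hardware; the array values are arbitrary.

```python
import numpy as np

def allreduce_mean(per_replica_grads):
    """Average per-replica gradients: the effect of an AllReduce sum
    followed by division by the world size."""
    return sum(per_replica_grads) / len(per_replica_grads)

# Two replicas start with identical weights but see different
# mini-batch shards, hence different local gradients.
w = np.array([1.0, -2.0])
grads = [np.array([0.2, 0.4]),   # replica 0's shard
         np.array([0.6, 0.0])]   # replica 1's shard

g = allreduce_mean(grads)        # -> [0.4, 0.2] on every replica
lr = 0.1
new_weights = [w - lr * g for _ in grads]  # each replica's local update

# The synchronized update keeps all replicas bit-identical.
assert np.allclose(new_weights[0], new_weights[1])
```

Because every replica sees the same averaged gradient, no weight broadcast is needed after the first step; divergence can only come from non-deterministic kernels or mismatched RNG state.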

Feature Store

The feature store provides a single definition layer for feature computation that feeds both offline training and online inference, eliminating training-serving skew. The offline store materializes features as Parquet files partitioned by date in a data lake (S3, GCS); training jobs read point-in-time correct feature snapshots via a time-travel query that joins entity keys with the feature table using an "as-of" timestamp to avoid label leakage. The online store (Redis or DynamoDB) holds the latest feature value per entity key and is updated by a streaming pipeline (Kafka + Flink) that applies the same feature transformation logic used offline. The feature registry stores the transformation code, data source references, and schema, ensuring both paths execute identical logic. Feature versioning allows backward-compatible schema evolution; incompatible changes require a new feature version. Serving latency for online lookup must stay under 5 ms p99 to fit within inference SLA budgets.
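The point-in-time correct ("as-of") join described above can be sketched with `pandas.merge_asof`, which for each label row takes the latest feature value at or before the label's timestamp, never a future value. The table contents and column names here are invented for illustration.

```python
import pandas as pd

# Feature value history per entity (merge_asof requires sorting by the key).
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-02"]),
    "purchase_count_7d": [3, 7, 1],
}).sort_values("ts")

# Training labels with their observation timestamps.
labels = pd.DataFrame({
    "user_id": [1, 2],
    "ts": pd.to_datetime(["2024-01-04", "2024-01-03"]),
    "label": [1, 0],
}).sort_values("ts")

# As-of join: each label row gets the most recent feature value whose
# timestamp is <= the label timestamp -- avoiding label leakage.
training_set = pd.merge_asof(labels, features, on="ts", by="user_id")
```

Note that user 1's label at 2024-01-04 is joined with the feature value from 2024-01-01 (3), not the future value from 2024-01-05 (7); a naive latest-value join would leak it.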

Model Registry

The model registry is the source of truth for all trained artifacts. Model weights and associated files (tokenizer, preprocessor, config) are written to object storage (S3) with a content-addressed path; the registry metadata database stores the logical record: model_id (UUID), name, version (integer or semver), framework, input and output schema, evaluation metrics (accuracy, F1, AUC), the training_job_id that produced it, current status (STAGING / PRODUCTION / ARCHIVED), and free-form tags for search. A promotion workflow gates the STAGING→PRODUCTION transition: a reviewer must approve, automated metric thresholds must pass (e.g., accuracy must not regress more than 0.5% versus current production), and an integration test suite must succeed. The registry exposes a REST API consumed by serving infrastructure to resolve "production model for service X" to a specific artifact path at deploy time. Rollback is an atomic status update reverting the production pointer to the previous version.
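The automated metric gate in the promotion workflow can be sketched as a pure function over the candidate's and production model's evaluation metrics. This is an assumption-laden sketch: the metric dictionaries and the choice of a relative threshold (0.005 = 0.5%, matching the example in the text) are illustrative, and a real gate would also run the integration test suite and require reviewer approval.

```python
def passes_promotion_gates(candidate_metrics: dict,
                           production_metrics: dict,
                           max_regression: float = 0.005) -> bool:
    """Gate the STAGING -> PRODUCTION transition: block promotion if the
    candidate regresses any gating metric by more than max_regression
    (relative) versus the current production model."""
    for metric, prod_value in production_metrics.items():
        cand_value = candidate_metrics.get(metric)
        if cand_value is None:
            return False  # candidate was not evaluated on a gating metric
        if cand_value < prod_value * (1 - max_regression):
            return False  # regression beyond the allowed threshold
    return True
```

Iterating over the production model's metrics (not the candidate's) ensures a candidate cannot pass by simply omitting a metric it regresses on.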

Experiment Tracking

Every training run creates an experiment record capturing the full context needed for reproducibility and comparison: the hyperparameter map, a reference to the exact dataset version (content hash or Delta Lake commit), the code commit hash, per-epoch metrics (loss, validation accuracy, learning rate), system metrics (GPU utilization, throughput in samples/second), and links to output artifacts (checkpoints, final model, evaluation plots). MLflow is the common open-source implementation; custom services expose the same logging API. The experiment comparison UI renders metric curves across runs on the same axes, allowing engineers to spot overfitting, compare regularization strategies, or diagnose training instability. Reproducibility is enforced by re-running from the stored parameter set: the platform fetches the exact code commit, dataset version, and hyperparameters and resubmits the training job. Experiment metadata is indexed for search: "show all runs with learning_rate between 1e-4 and 1e-3 sorted by validation_loss."
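A minimal pure-Python sketch of the run record and the search query from the text follows. The record fields mirror the context listed above; the `run_record` / `search_runs` names and the derivation of `run_id` are assumptions, not MLflow's API (MLflow's own logging calls would replace this in practice).

```python
import hashlib
import json

def run_record(code_commit: str, dataset_hash: str, hyperparams: dict) -> dict:
    """Minimal experiment record: the context needed to reproduce a run."""
    payload = json.dumps([code_commit, dataset_hash, hyperparams], sort_keys=True)
    return {
        "run_id": hashlib.sha256(payload.encode()).hexdigest()[:12],
        "code_commit": code_commit,
        "dataset_hash": dataset_hash,
        "params": hyperparams,
        "epoch_metrics": [],   # appended per epoch: {"epoch": n, "loss": ...}
        "final_metrics": {},   # filled at run completion
        "artifacts": [],       # checkpoint / plot URIs
    }

def search_runs(runs, lr_min, lr_max, sort_metric="validation_loss"):
    """The indexed query from the text: runs with learning_rate in
    [lr_min, lr_max], sorted by validation_loss."""
    hits = [r for r in runs
            if lr_min <= r["params"]["learning_rate"] <= lr_max]
    return sorted(hits, key=lambda r: r["final_metrics"][sort_metric])
```

Because the record carries the code commit, dataset hash, and full parameter map, resubmitting the training job from it reproduces the run exactly.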

Hyperparameter Tuning

Automated hyperparameter search runs multiple trials in parallel on the cluster, each trial being a full training job with a sampled hyperparameter configuration. Grid search exhaustively covers a discrete parameter grid and is practical only for small spaces. Random search samples uniformly from each parameter range and outperforms grid search in high-dimensional spaces with concentrated optima. Bayesian optimization (Tree-structured Parzen Estimator algorithm in Optuna or Hyperopt) builds a probabilistic model of the objective surface, balancing exploration of uncertain regions with exploitation of known good regions; it converges to good configurations in fewer trials than random search. Hyperband applies successive halving as an early stopping strategy: it allocates a small budget to many configurations, promotes only the top fraction, and repeats with larger budgets, eliminating poor configurations early. The tuning service tracks trial results, updates the search model after each completion, and selects the next configuration to try. The best trial’s hyperparameters are promoted for a final full-budget training run.
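The successive-halving core of Hyperband can be sketched in a few lines. The `train` function here is a synthetic stand-in objective (a real trial would submit a training job with the given epoch budget and report validation loss); the configuration space, `eta=3`, and the optimum near lr=1e-3 are all invented for illustration.

```python
import math
import random

def train(config: dict, budget: int) -> float:
    """Stand-in objective: pretend validation loss after `budget` epochs.
    A real implementation runs a training job and reads its metrics."""
    return (math.log10(config["lr"]) + 3) ** 2 + 1.0 / budget

def successive_halving(n_configs=27, min_budget=1, eta=3, seed=0) -> dict:
    """Allocate a small budget to many configs, keep the top 1/eta,
    repeat with eta-times the budget until one config remains."""
    rng = random.Random(seed)
    configs = [{"lr": 10 ** rng.uniform(-5, -1)} for _ in range(n_configs)]
    budget = min_budget
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: train(c, budget))
        configs = scored[: max(1, len(configs) // eta)]  # promote top fraction
        budget *= eta                                    # larger budget next rung
    return configs[0]

best = successive_halving()  # 27 -> 9 -> 3 -> 1 configs over rising budgets
```

Poor configurations are eliminated after one cheap epoch, so most of the compute budget is spent on the few promising candidates, which is the entire point of the strategy.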

Data Versioning

Training dataset versioning is essential for reproducibility and audit. Each dataset snapshot is identified by a content hash computed over the constituent files; the hash is stored in the training job record. DVC (Data Version Control) layers Git-like versioning semantics over blob storage, allowing dataset versions to be checked out by hash. Delta Lake and Apache Iceberg provide ACID transactions and time-travel queries directly on data lake tables: a training job specifies a snapshot timestamp or version number, and the table format returns the exact file set that existed at that point. Dataset lineage tracking records which raw sources, transformation scripts, and pipeline runs produced each dataset version. Train/validation/test splits are generated deterministically from a fixed random seed stored in the job record; given the same dataset version and seed, the split is always identical, preventing inadvertent data leakage across re-runs.
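The two mechanisms above, a content hash identifying a snapshot and a seed-deterministic split, can be sketched directly. Function names and fraction defaults are illustrative assumptions.

```python
import hashlib
import random

def dataset_content_hash(file_blobs) -> str:
    """Identify a dataset snapshot: hash each constituent file, then hash
    the sorted digests so file enumeration order does not matter."""
    digests = sorted(hashlib.sha256(b).digest() for b in file_blobs)
    h = hashlib.sha256()
    for d in digests:
        h.update(d)
    return h.hexdigest()

def deterministic_split(example_ids, seed, val_frac=0.1, test_frac=0.1):
    """Same dataset version + same seed => identical split on every
    re-run, preventing leakage across retrains."""
    ids = sorted(example_ids)            # canonical order before shuffling
    random.Random(seed).shuffle(ids)     # seeded, isolated RNG instance
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test
```

Sorting before shuffling is the load-bearing detail: it makes the split a pure function of (dataset contents, seed), independent of how files happened to be enumerated on a given run.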

Model Validation

Before any model transitions to production, a validation pipeline executes a battery of checks. Holdout set evaluation computes primary business metrics on a held-out dataset that was not used during training or hyperparameter search. Fairness evaluation measures performance across demographic or categorical slices defined by the business (e.g., by age group, geography, device type) to detect disparate impact. A performance regression check compares the candidate model against the current production model on the same evaluation set; promotion is blocked if the candidate underperforms by more than a configured threshold on any gating metric. For high-stakes replacements, a challenger-vs-champion A/B test routes a small fraction of live traffic to the new model and monitors online metrics (click-through rate, conversion, error rate) before full cutover. Shadow mode deployment runs the new model in parallel with production, logging its predictions without serving them, to validate behavior on real traffic distribution without user impact.
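The fairness slice evaluation can be sketched as a per-slice accuracy computation plus a gap check. The record shape, slice keys, and the 5-percentage-point gap threshold are illustrative assumptions; real gating metrics and thresholds come from the business definition.

```python
from collections import defaultdict

def slice_accuracy(records):
    """Per-slice accuracy for fairness evaluation.
    records: iterable of (slice_key, y_true, y_pred) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for key, y_true, y_pred in records:
        total[key] += 1
        correct[key] += int(y_true == y_pred)
    return {k: correct[k] / total[k] for k in total}

def disparate_slices(slice_metrics: dict, overall: float,
                     max_gap: float = 0.05) -> dict:
    """Slices whose accuracy falls more than max_gap below the overall
    accuracy -- candidates for blocking promotion or deeper review."""
    return {k: v for k, v in slice_metrics.items() if v < overall - max_gap}
```

Running this over the holdout set for every configured slicing dimension (age group, geography, device type) turns disparate impact from a manual audit into an automated gate.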

