As machine learning moves from research to production, companies need platforms that manage the full ML lifecycle: data ingestion, feature engineering, training, evaluation, deployment, and monitoring. Designing an ML platform is a key interview topic at companies such as Meta, Google, and Uber, and more broadly at any organization running models at scale.
The ML Lifecycle
Data → Features → Training → Evaluation → Deployment → Monitoring
  ↑                                                         │
  └──────────── feedback loop (retraining signals) ─────────┘
Feature Store: The Central Hub
A feature store solves the training-serving skew problem — the same feature computation logic must produce identical results during training (offline) and serving (online).
Feature Store Architecture:
Offline Store (batch, historical)
├── Source: Hive, Spark, dbt pipeline
├── Storage: S3 + Parquet (point-in-time correct joins)
└── Use: model training, offline evaluation
Online Store (low-latency, real-time)
├── Source: Kafka stream → Flink materialization
├── Storage: Redis / DynamoDB (< 5ms p99 reads)
└── Use: model serving (inference time feature lookup)
Feature Registry
└── Metadata: name, type, owner, freshness SLA, lineage
Point-in-time correct join (critical for avoiding leakage):
Training data: join user features as of event timestamp
NOT current feature values → prevents future data leakage
Tools: Feast, Tecton, Hopsworks, AWS SageMaker Feature Store
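The point-in-time rule above can be sketched in a few lines of pure Python (feature stores like Feast, or pandas' `merge_asof`, do this at scale; the data shapes here are illustrative):

```python
from bisect import bisect_right

def point_in_time_join(events, feature_history):
    """For each training event, pick the feature value as of the event
    timestamp -- never a later one, which would leak future data."""
    # feature_history: {user_id: sorted list of (timestamp, value)}
    rows = []
    for user_id, event_ts, label in events:
        history = feature_history.get(user_id, [])
        ts_list = [ts for ts, _ in history]
        i = bisect_right(ts_list, event_ts)  # last update at or before event_ts
        value = history[i - 1][1] if i > 0 else None
        rows.append((user_id, event_ts, value, label))
    return rows

events = [("u1", 100, 1), ("u1", 50, 0)]
history = {"u1": [(10, 3.0), (60, 7.0)]}
rows = point_in_time_join(events, history)
# The event at t=50 sees the value 3.0, not the later update 7.0
```

Joining against current feature values instead would silently inflate offline metrics, which is exactly the leakage L21 warns about.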
Model Training Infrastructure
Distributed training patterns:
Data Parallelism (most common):
Split dataset across N GPUs
Each GPU has full model copy
Forward pass → gradients → AllReduce → update weights
Tools: PyTorch DDP, Horovod
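The AllReduce step can be simulated without GPUs: each "worker" computes gradients on its shard, the gradients are averaged elementwise, and every worker applies the identical update. A toy sketch (worker gradients are made up):

```python
def allreduce_mean(per_gpu_grads):
    """Average gradients elementwise across N workers -- what AllReduce
    computes so every replica applies the same update."""
    n = len(per_gpu_grads)
    return [sum(g[i] for g in per_gpu_grads) / n
            for i in range(len(per_gpu_grads[0]))]

def sgd_step(weights, grads, lr=0.1):
    return [w - lr * g for w, g in zip(weights, grads)]

# Two workers, each with gradients from its own data shard
grads = [[1.0, 2.0], [3.0, 4.0]]
avg = allreduce_mean(grads)          # [2.0, 3.0]
weights = sgd_step([0.0, 0.0], avg)  # identical on every worker
```

In PyTorch DDP this averaging happens inside the backward pass, overlapped with gradient computation; the arithmetic is the same.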
Model Parallelism (for models too large for one GPU):
Split model layers across GPUs (pipeline parallelism)
LLaMA 70B: 80 transformer layers across 8 × A100 GPUs
Tools: DeepSpeed, Megatron-LM
Hyperparameter Tuning:
Grid search → Bayesian optimization (TPE algorithm)
→ Population-based training (PBT) for RL
Tools: Ray Tune, Optuna, Weights & Biases Sweeps
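A minimal random-search loop illustrates the baseline that Bayesian methods improve on: sample configs, keep the best by validation score. The objective and search space here are toy stand-ins:

```python
import random

def random_search(objective, space, n_trials=20, seed=0):
    """Sample configs uniformly, keep the best by score. TPE-style
    Bayesian optimizers replace the uniform sampler with a model
    fitted to past trial results."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

space = {"lr": [1e-4, 1e-3, 1e-2], "batch_size": [32, 64, 128]}
# Toy objective: pretend val AUC peaks at lr=1e-3, batch_size=64
toy = lambda c: 0.9 - abs(c["lr"] - 1e-3) - abs(c["batch_size"] - 64) / 1000
best, score = random_search(toy, space)
```

Ray Tune and Optuna wrap this same trial loop with schedulers (early stopping of bad trials) and distributed execution.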
Experiment Tracking (mandatory):
Log: hyperparameters, metrics, artifacts, code version
Tools: MLflow, Weights & Biases, Comet
Schema: run_id, experiment_id, params {lr, batch_size},
metrics {train_loss, val_auc per epoch}, model artifact
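The schema above maps naturally onto a small record type; this sketch uses a dataclass with hypothetical field names (the storage backend and the tracker's real API are out of scope):

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    """One experiment-tracking record: run id, experiment id, params,
    per-epoch metric series, and a pointer to the model artifact."""
    run_id: str
    experiment_id: str
    params: dict
    metrics: dict = field(default_factory=dict)  # name -> values per epoch
    artifact_uri: str = ""

    def log_metric(self, name, value):
        self.metrics.setdefault(name, []).append(value)

run = Run("abc123", "fraud-v2", {"lr": 1e-3, "batch_size": 256})
for epoch_auc in (0.91, 0.94, 0.95):
    run.log_metric("val_auc", epoch_auc)
run.artifact_uri = "s3://ml-artifacts/abc123/model.pkl"
```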
Model Registry and Versioning
Model lifecycle stages:
Staging → Candidate → Production → Archived
Model registry entry:
{
"model_name": "fraud_detector",
"version": 42,
"stage": "Production",
"training_run_id": "abc123",
"framework": "scikit-learn 1.3",
"metrics": {"val_auc": 0.953, "val_f1": 0.881},
"features": ["amount", "merchant_category", "user_30d_txn_count"],
"training_dataset": "s3://ml-data/fraud/2024-01-01/",
"registered_at": "2024-01-15T10:30:00Z",
"promoted_by": "alice@company.com"
}
Promotion workflow:
1. Train model → log to experiment tracker
2. Pass offline evaluation thresholds → register in staging
3. Shadow-mode test (challenger scores live traffic; predictions logged, compared offline vs champion)
4. Shadow passed → promote to Candidate
5. Online A/B test (traffic split) → promote to Production
6. Old model → Archived (retained for rollback)
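The stage transitions above can be enforced with a small state machine. This is an illustrative sketch, not any particular registry's API; the stage names follow the lifecycle in this section:

```python
# Allowed stage transitions from the promotion workflow above
TRANSITIONS = {
    "Staging": {"Candidate"},     # shadow test passed
    "Candidate": {"Production"},  # online A/B test passed
    "Production": {"Archived"},   # superseded by a new champion
}

def promote(registry, model_name, version, to_stage):
    """Move a model version to its next stage, rejecting illegal jumps
    and archiving the displaced production model for rollback."""
    entry = registry[(model_name, version)]
    if to_stage not in TRANSITIONS.get(entry["stage"], set()):
        raise ValueError(f"illegal transition {entry['stage']} -> {to_stage}")
    if to_stage == "Production":
        for (name, _), other in registry.items():
            if name == model_name and other["stage"] == "Production":
                other["stage"] = "Archived"  # retained for rollback
    entry["stage"] = to_stage

registry = {
    ("fraud_detector", 41): {"stage": "Production"},
    ("fraud_detector", 42): {"stage": "Candidate"},
}
promote(registry, "fraud_detector", 42, "Production")
# v42 is now Production; v41 is Archived and available for rollback
```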
Model Serving Architecture
Batch inference (offline predictions):
Trigger: nightly Airflow job
Input: feature table in S3 (yesterday's user features)
Output: prediction table (user_id → score) in S3 → Redis
Latency: hours acceptable; throughput is key metric
Tools: Spark MLlib, SageMaker Batch Transform, Ray Data (batch inference)
Real-time inference (online serving):
Request → Feature Store lookup (Redis, < 5ms)
→ Model Server (TF Serving / Triton / TorchServe)
→ Post-processing (threshold, calibration)
→ Response
Latency: < 20ms p99 target
Scaling: k8s HPA on GPU utilization or request queue depth
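The request path above can be sketched end to end with stubbed stages. Everything here is a placeholder: a real deployment would read features from Redis and call a model server over the network rather than invoke local functions:

```python
import time

def lookup_features(user_id):
    """Stub for the online feature store read (< 5ms p99 target)."""
    return {"user_30d_txn_count": 12, "amount": 89.5}

def model_predict(features):
    """Stub for the model server forward pass."""
    return 0.5 + 0.03 * features["user_30d_txn_count"] / 10

def handle_request(user_id, threshold=0.5):
    """Real-time path: feature lookup -> model -> post-processing."""
    start = time.monotonic()
    features = lookup_features(user_id)
    score = model_predict(features)
    decision = "flag" if score >= threshold else "allow"  # thresholding step
    latency_ms = (time.monotonic() - start) * 1000
    return {"score": score, "decision": decision, "latency_ms": latency_ms}

resp = handle_request("u42")
```

Instrumenting `latency_ms` per stage, not just end to end, is what makes the 20ms budget debuggable when it is blown.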
Model server optimizations:
- Model quantization: FP32 → INT8 (4× smaller, ~2× faster, ~1% accuracy drop)
- Batching: collect N requests → single GPU forward pass (amortize overhead)
- ONNX: convert from PyTorch/TF → unified runtime
- Model caching: warm model in GPU memory (cold start = seconds)
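The quantization item above, in miniature: symmetric per-tensor INT8 maps floats into [-127, 127] with a single scale factor. Plain Python floats stand in for a weight tensor here:

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: one scale for the whole tensor,
    chosen so the largest-magnitude weight maps to +/-127."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)  # close to originals, small rounding error
```

The accuracy cost comes from that rounding: values much smaller than the scale (like 0.003 here) collapse to zero, which is why per-channel scales and calibration data tighten the loss in practice.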
A/B Testing and Shadow Mode
Shadow mode (safe challenger evaluation):
All requests → Champion model → response to user
All requests → Challenger model → prediction logged (not used)
Offline: compare champion vs challenger on same inputs
A/B test (traffic split for online evaluation):
10% traffic → Challenger (new model)
90% traffic → Champion (current model)
Track: CTR, conversion rate, revenue, long-term engagement
Duration: 2+ weeks (statistical significance + seasonality)
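Traffic splitting is typically done with a deterministic hash of the user id, so each user stays in one arm for the whole test window. A sketch (the salt and percentages are illustrative):

```python
import hashlib

def assign_variant(user_id, challenger_pct=10, salt="fraud-exp-1"):
    """Deterministic hash-based split: the same user always lands in the
    same bucket, keeping per-user metrics consistent across the test."""
    h = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(h, 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"

counts = {"challenger": 0, "champion": 0}
for i in range(10_000):
    counts[assign_variant(f"user{i}")] += 1
# Roughly a 10/90 split, and any given user's assignment never changes
```

Changing the salt per experiment re-randomizes assignments, preventing one test's cohorts from bleeding into the next.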
Metrics hierarchy:
Primary: Business metric (revenue, CTR, D7 retention)
Secondary: Model metric (AUC, precision, recall)
Guardrail: Latency p99, error rate, cost-per-prediction
ML Monitoring: Detecting Drift
Types of drift:
Data drift: input feature distribution changes
e.g., user age distribution shifts after new market launch
Concept drift: relationship between features and label changes
e.g., fraud patterns change, model predictions stale
Label drift: outcome distribution changes
e.g., click-through rate drops across the board
Detection methods:
PSI (Population Stability Index): compare feature distributions
PSI < 0.1: stable; 0.1–0.2: moderate shift (investigate); PSI > 0.2: major shift (retrain)
KS test: Kolmogorov-Smirnov statistic for continuous features
Chi-squared: for categorical features
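PSI from the list above fits in a few lines: bin both distributions, then sum (p_actual - p_expected) * ln(p_actual / p_expected) over bins. This sketch uses equal-width bins on [0, 1); production pipelines usually bin by quantiles of the training distribution:

```python
import math

def psi(expected, actual, bins=10, lo=0.0, hi=1.0, eps=1e-6):
    """Population Stability Index over equal-width bins."""
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
            counts[i] += 1
        # eps floor avoids log(0) for empty bins
        return [max(c / len(xs), eps) for c in counts]
    p_e, p_a = hist(expected), hist(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p_e, p_a))

baseline = [i / 1000 for i in range(1000)]                # uniform on [0, 1)
shifted = [min(x * 0.5 + 0.5, 0.999) for x in baseline]   # mass pushed right
# psi(baseline, baseline) is 0; psi(baseline, shifted) lands well above 0.2
```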
Model output monitoring:
Track prediction score distribution daily
Alert if mean score shifts > 2 standard deviations
Monitoring stack:
Model → log predictions + features → Kafka
→ Flink: compute drift metrics per feature
→ Time-series DB (Prometheus/InfluxDB)
→ Grafana dashboard + alert if PSI > threshold
→ Trigger retraining pipeline
Retraining Strategy
| Strategy | Trigger | Cost | Best For |
|---|---|---|---|
| Scheduled | Weekly/monthly cron | Low | Stable, slow-changing domains |
| Triggered | Drift detected (PSI > threshold) | Medium | Dynamic environments |
| Continuous | New data available (streaming) | High | Real-time personalization, fraud |
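The three trigger policies in the table reduce to a small decision function; the thresholds below are illustrative defaults, not prescriptions:

```python
def should_retrain(strategy, days_since_train=0, psi_max=0.0, new_rows=0,
                   schedule_days=7, psi_threshold=0.2, min_rows=10_000):
    """Decide whether to kick off retraining under each strategy
    from the table above."""
    if strategy == "scheduled":
        return days_since_train >= schedule_days   # cron-style cadence
    if strategy == "triggered":
        return psi_max > psi_threshold             # drift-driven
    if strategy == "continuous":
        return new_rows >= min_rows                # data-volume-driven
    raise ValueError(f"unknown strategy: {strategy}")
```

In practice these compose: a scheduled weekly retrain as a floor, plus a drift trigger that can fire earlier.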
ML Platform Component Summary
Data Layer: Kafka → Flink → Feature Store (offline: S3, online: Redis)
Training: Airflow DAG → Spark / PyTorch DDP → MLflow (tracking)
Registry: Model Registry (staging → production pipeline)
Serving: Triton / TF Serving → k8s HPA → < 20ms p99
Monitoring: Prediction logs → drift detection → alert → retrain trigger
Orchestration: Airflow / Kubeflow Pipelines / Metaflow (pipelines as code)
Interview Discussion Points
- Training/serving skew: The #1 production ML bug. Same feature code must run in training and serving. Feature store enforces this by being the single source of feature logic. Without it, teams independently implement features and diverge.
- Online vs batch features: Some features require real-time computation (user’s last 5 actions), others are batch (user lifetime value). Hybrid feature stores serve both from a unified API — online for real-time, offline for training — hiding the implementation difference from model code.
- Model rollback: Always retain the previous champion model in the registry. Rollback = update serving config to point to previous version. Should complete in < 5 minutes. Canary deployment (5% → 20% → 100% traffic) enables early detection of regressions.
- Cold start in ML serving: Loading a large model (GPT-scale) from disk takes 30-120 seconds. Mitigate with: keep model warm in GPU memory, use smaller distilled models for latency-critical paths, preload on startup, readiness probe gates traffic until model loaded.
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is a feature store and why is it important for ML systems?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A feature store is a centralized repository for ML features that solves two critical problems. (1) Training-serving skew: without a feature store, data scientists compute features in Python/Spark for training, and engineers re-implement the same logic in Java/Go for serving, leading to subtle differences that degrade model performance in production. A feature store ensures the same feature computation logic runs in both environments. (2) Feature reuse: instead of each team recomputing the same features (user purchase history, merchant category statistics), the feature store computes them once and makes them available to all models. Feature stores have an offline component (historical features for training, backed by S3/Hive) and an online component (real-time features for serving, backed by Redis/DynamoDB)."
      }
    },
    {
      "@type": "Question",
      "name": "How do you detect and handle model drift in production ML systems?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Model drift occurs when the statistical properties of model inputs or outputs change over time, degrading accuracy. Detect it by monitoring: data drift (feature distribution changes, measured via PSI or KS test; PSI > 0.2 indicates a major shift), concept drift (the relationship between features and labels changes; model predictions stay stable but business outcomes degrade), and output drift (prediction score distribution shifts). Log all model inputs and predictions, compute drift metrics daily via a Flink/Spark job, and alert when drift exceeds thresholds. Response: trigger retraining on labeled recent data. Automated retraining pipelines (Airflow + MLflow) reduce mean time to recovery from days to hours."
      }
    },
    {
      "@type": "Question",
      "name": "What is the difference between batch inference and real-time inference, and when should you use each?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Batch inference runs predictions offline on a large dataset and stores results for later lookup; it is suitable when predictions can be precomputed (e.g., daily personalized email content, weekly risk scores). It uses cheaper CPU compute, handles arbitrary model size/complexity, and scales via Spark/Ray. Real-time inference serves predictions at request time within a latency budget (< 20ms p99); it is required when predictions depend on real-time context (e.g., fraud detection on a live transaction, personalized search results). It requires model servers (Triton, TF Serving) on GPU instances, model optimizations (quantization, batching), and a feature store for fast feature lookup. Many systems combine both: batch predictions cached in Redis as defaults, with real-time inference for high-value or context-sensitive requests."
      }
    }
  ]
}