MLOps interviews test whether you can build and maintain ML systems in production, not just train models in notebooks. Google, Meta, Uber, Airbnb, and any other company with a mature ML platform ask these questions. Expect them alongside ML system design.
What MLOps Interviewers Are Testing
- Can you design a training pipeline that is reproducible and version-controlled?
- Do you understand the difference between online (real-time) and offline (batch) inference patterns?
- Can you describe a model deployment strategy that minimizes risk?
- Do you know how to monitor models in production and trigger retraining?
Training Pipelines
Q: How do you ensure reproducibility in ML training?
Reproducibility requires locking every source of randomness and versioning every artifact:
```python
import random
import subprocess

import mlflow
import numpy as np
import torch


def setup_reproducible_training(seed=42):
    """Lock all sources of randomness."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # Slower but deterministic


def get_git_hash():
    """Current commit hash, so the exact code version is recorded."""
    return subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()


def train_with_tracking(model, train_data, config):
    """Log all artifacts needed to reproduce this run."""
    with mlflow.start_run() as run:
        # Log hyperparameters
        mlflow.log_params(config)
        # Log code version
        mlflow.log_param('git_commit', get_git_hash())
        # Log dataset version
        mlflow.log_param('data_version', train_data.version)
        mlflow.log_artifact(train_data.schema_path)

        # Train model
        model, metrics = train_model(model, train_data, config)

        # Log metrics
        mlflow.log_metrics(metrics)

        # Log model with input/output signature
        signature = mlflow.models.infer_signature(
            train_data.sample_input, model.predict(train_data.sample_input)
        )
        mlflow.pytorch.log_model(model, 'model', signature=signature)

    return run.info.run_id
```
Q: What is a feature store and why do you need one?
A feature store solves two problems:
- Training-serving skew: Features computed differently at training time vs serving time cause silent model degradation. A feature store ensures the same computation runs in both contexts.
- Feature reuse: User embedding features computed for the recommendation model can be reused by the fraud detection model without recomputing.
Architecture:
- Offline store: S3 or Delta Lake — historical feature values for training
- Online store: Redis or DynamoDB — latest feature values for serving (<10ms lookup)
- Feature computation: Spark batch jobs populate offline store; Kafka + Flink stream jobs populate online store
Examples: Feast (open source), Tecton (managed), Vertex AI Feature Store (GCP).
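The offline/online split can be illustrated with a dict-backed toy (the dicts stand in for Delta Lake and Redis; all names are illustrative). The key property is that one feature function feeds both stores, which is what eliminates training-serving skew:

```python
from collections import defaultdict


def compute_features(events):
    """One shared feature definition used by BOTH stores."""
    total = sum(e['amount'] for e in events)
    return {'txn_count': len(events), 'avg_amount': total / len(events)}


class ToyFeatureStore:
    def __init__(self):
        self.offline = defaultdict(list)  # stand-in for S3/Delta Lake: full history
        self.online = {}                  # stand-in for Redis/DynamoDB: latest only

    def ingest(self, user_id, events, as_of):
        features = compute_features(events)              # same code path for both
        self.offline[user_id].append((as_of, features))  # history, for training
        self.online[user_id] = features                  # latest, for serving

    def get_online(self, user_id):
        return self.online[user_id]


store = ToyFeatureStore()
store.ingest('u1', [{'amount': 10}, {'amount': 30}], as_of='2024-01-01')
store.get_online('u1')  # {'txn_count': 2, 'avg_amount': 20.0}
```

A real feature store adds point-in-time-correct joins on the offline side so training rows only see feature values that existed at label time.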
Model Deployment
Q: Compare canary deployment, blue-green deployment, and shadow mode for ML models.
| Strategy | How it works | Best for | Risk |
|---|---|---|---|
| Shadow mode | New model runs alongside current; predictions logged but not served | Validating correctness before any traffic | Zero — users never see new model |
| Canary | Route N% of traffic to new model; ramp up if metrics hold | Gradual rollout with real user feedback | Low — easy to rollback |
| Blue-green | Two identical environments; switch DNS to new after validation | Instant cutover after offline validation | Higher if rollback is slow |
| A/B test | Split traffic between models; run statistical significance test | Measuring business impact, not just technical metrics | Experiment pollution if not isolated |
Recommended sequence: Shadow mode (1 week) → Canary 5% (2 days) → Canary 25% (2 days) → Full rollout.
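The canary ramp above is usually implemented with stable per-user bucketing, so each user consistently sees the same model as the percentage grows. A minimal sketch (hash-mod-100 is one common scheme, not a prescribed standard):

```python
import hashlib


def route_to_canary(user_id: str, canary_fraction: float) -> bool:
    """Deterministically bucket users into 0-99; the lowest buckets get the canary.
    The same user always lands in the same bucket, so users in the 5% canary
    stay in it when the ramp widens to 25%."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < canary_fraction * 100


# Ramp schedule: shadow mode has already validated correctness, so widen gradually
for fraction in (0.05, 0.25, 1.0):
    in_canary = sum(route_to_canary(f'user-{i}', fraction) for i in range(10_000))
    # in_canary / 10_000 tracks the target fraction
```

Bucketing on a stable ID rather than per-request randomness keeps each user's experience consistent and makes canary metrics attributable to a fixed cohort.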
Q: How do you serve ML models at low latency?
- Model format: Export to ONNX for cross-platform inference; TensorRT for GPU-optimized serving
- Batching: Dynamic batching accumulates requests for 10-20ms and sends them as one batch, amortizing per-request GPU overhead
- Model quantization: INT8 quantization gives 3-4x speedup with <1% accuracy drop on most models
- Caching: Cache predictions for identical inputs (deterministic models, immutable features)
- Hardware: GPU for deep models; CPU for tree-based models (XGBoost, LightGBM serve faster on CPU than GPU)
```python
import onnxruntime as ort
import numpy as np


class OptimizedModelServer:
    def __init__(self, model_path: str, num_threads: int = 4):
        # Configure for low-latency inference
        session_options = ort.SessionOptions()
        session_options.inter_op_num_threads = num_threads
        session_options.intra_op_num_threads = num_threads
        session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
        session_options.graph_optimization_level = (
            ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        )
        # Prefer GPU when available, fall back to CPU
        self.session = ort.InferenceSession(
            model_path,
            session_options,
            providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
        )

    def predict(self, features: np.ndarray) -> np.ndarray:
        input_name = self.session.get_inputs()[0].name
        return self.session.run(None, {input_name: features})[0]

    def predict_batch(self, feature_list: list) -> list:
        """Batch multiple requests for efficient GPU utilization."""
        batch = np.stack(feature_list)
        scores = self.predict(batch)
        return scores.tolist()
```
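The dynamic batching described above (accumulate requests for a short window, run one batched call) can be sketched as a toy accumulator. Production servers such as Triton and TorchServe provide this out of the box; the class and parameter names here are illustrative:

```python
import threading


class DynamicBatcher:
    """Hold requests for up to `window_ms`, then run one batched predict call."""

    def __init__(self, predict_batch, window_ms=10, max_batch=32):
        self.predict_batch = predict_batch
        self.window_s = window_ms / 1000
        self.max_batch = max_batch
        self.pending = []  # list of (features, result_slot, done_event)
        self.lock = threading.Lock()

    def predict(self, features):
        slot, done = [None], threading.Event()
        with self.lock:
            self.pending.append((features, slot, done))
            if len(self.pending) == 1:
                # First request in the window: schedule a flush
                threading.Timer(self.window_s, self._flush).start()
            elif len(self.pending) >= self.max_batch:
                # Batch is full: flush immediately
                self._flush_locked()
        done.wait()  # block the caller until its score is ready
        return slot[0]

    def _flush(self):
        with self.lock:
            self._flush_locked()

    def _flush_locked(self):
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        scores = self.predict_batch([f for f, _, _ in batch])
        for score, (_, slot, done) in zip(scores, batch):
            slot[0] = score
            done.set()


batcher = DynamicBatcher(lambda feats: [f * 2 for f in feats], window_ms=5)
```

The trade-off is explicit: each request pays up to `window_ms` of added latency in exchange for fewer, larger model invocations.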
CI/CD for ML
Q: What does a CI/CD pipeline for ML look like?
```
PR → Code Review
  ↓
Automated checks:
  - Unit tests: data validation, feature computation
  - Integration tests: model training on small dataset
  - Data quality tests: schema checks, distribution checks
  ↓
Training job trigger (if model code changed):
  - Train on full dataset
  - Run offline evaluation: AUC, NDCG, RMSE
  - Compare to current champion model
  ↓
If metrics pass threshold:
  - Register model in model registry (MLflow, W&B)
  - Deploy to staging: run shadow mode test
  ↓
If shadow test passes:
  - Canary deployment: 5% traffic
  - Monitor for 24-48 hours
  ↓
Full deployment or rollback
```
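The "compare to current champion model" gate is worth making explicit in code rather than leaving as a judgment call. A sketch with illustrative thresholds (the 0.002 AUC gain and 5ms latency budget are assumptions, not universal values):

```python
def promotion_decision(champion_metrics, challenger_metrics,
                       min_auc_gain=0.002, max_latency_regression_ms=5):
    """Promote only if the challenger is meaningfully better offline
    and does not blow the latency budget."""
    auc_gain = challenger_metrics['auc'] - champion_metrics['auc']
    latency_regress = challenger_metrics['p99_ms'] - champion_metrics['p99_ms']
    if auc_gain < min_auc_gain:
        return 'reject: AUC gain below threshold'
    if latency_regress > max_latency_regression_ms:
        return 'reject: latency regression over budget'
    return 'promote to staging (shadow mode)'


decision = promotion_decision(
    {'auc': 0.910, 'p99_ms': 40},   # champion
    {'auc': 0.915, 'p99_ms': 42},   # challenger
)
```

Encoding the gate this way makes promotions auditable: the pipeline logs the metric deltas and the rule that fired, not just "deployed".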
Q: How do you handle data validation in ML pipelines?
```python
import great_expectations as ge


def validate_training_data(df):
    """Validate data quality before training."""
    gdf = ge.from_pandas(df)

    # Schema validation
    gdf.expect_column_to_exist('user_id')
    gdf.expect_column_to_exist('label')
    gdf.expect_column_values_to_not_be_null('label')

    # Distribution validation
    gdf.expect_column_values_to_be_between('age', min_value=0, max_value=120)
    gdf.expect_column_mean_to_be_between(
        'purchase_amount', min_value=10, max_value=500
    )

    # Completeness
    gdf.expect_column_values_to_not_be_null('user_features', mostly=0.95)

    # Label distribution (detect label drift)
    positive_rate = df['label'].mean()
    if not (0.01 <= positive_rate <= 0.50):
        raise ValueError(f"Unexpected positive label rate: {positive_rate:.3f}")

    result = gdf.validate()
    if not result['success']:
        failed = [r for r in result['results'] if not r['success']]
        raise ValueError(f"Data validation failed: {failed}")
    return True
```
Common MLOps Interview Questions
Q: What is training-serving skew and how do you prevent it?
Training-serving skew occurs when features are computed differently at training vs serving time. Common causes:
- Using raw data at training time but transformed data at serving (different normalization)
- Time-based features computed with future data at training (data leakage)
- Different code paths: Python pandas at training, Java/Go at serving
Prevention: Use a feature store with a shared computation layer. Run the same feature computation code (via WASM, JVM, or gRPC service) in both training and serving.
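A minimal illustration of that prevention: define the transformation once and import it from both the training job and the serving handler (the feature names and module structure here are illustrative):

```python
import math


def transform(raw: dict) -> dict:
    """The ONLY place this normalization is defined. Training and serving
    both import this function, so the computation cannot diverge."""
    return {
        'log_amount': math.log1p(raw['amount']),
        'hour_of_day': raw['timestamp_hour'] % 24,
    }


def build_training_rows(raw_rows):    # batch path, run by the training job
    return [transform(r) for r in raw_rows]


def serve_features(raw_request):      # request path, run by the serving handler
    return transform(raw_request)


row = {'amount': 100.0, 'timestamp_hour': 26}
assert build_training_rows([row])[0] == serve_features(row)  # identical by construction
```

When serving runs in a different language, the same guarantee is kept by serving the transform over gRPC or compiling it (e.g. to WASM) so a single implementation is still the source of truth.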
Q: When would you use batch inference vs. online inference?
| Pattern | Use when | Examples |
|---|---|---|
| Batch (offline) inference | Predictions can be precomputed; low latency not required | Email campaign targeting, next-day churn risk scores |
| Near-real-time (streaming) | Features need to be recent but not instant | Fraud detection with 1-minute feature freshness |
| Online (synchronous) | Must respond to user action in real time | Search ranking, real-time recommendation, ad serving |
Q: What is model versioning and why does it matter?
Track: model weights, hyperparameters, training data version, code commit, evaluation metrics. This enables:
- Rollback if new model regresses in production
- Audit trail for regulated industries (finance, healthcare)
- Reproducibility for debugging production failures
- Lineage tracking: which training data produced which model
Tools: MLflow Model Registry, Weights & Biases, DVC, Vertex AI Model Registry.
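At its core a registry is an append-only version log plus a stage pointer. A toy sketch whose fields mirror the list above (real registries like MLflow and W&B add storage, stages, and access control):

```python
import time


class ModelRegistry:
    def __init__(self):
        self.versions = []       # append-only lineage log: never mutated
        self.production = None   # version number currently serving

    def register(self, weights_uri, data_version, git_commit, metrics):
        self.versions.append({
            'version': len(self.versions) + 1,
            'weights_uri': weights_uri,
            'data_version': data_version,   # which data produced this model
            'git_commit': git_commit,       # which code produced this model
            'metrics': metrics,             # offline eval at registration time
            'registered_at': time.time(),
        })
        return len(self.versions)

    def promote(self, version):
        self.production = version

    def rollback(self):
        """Point production back at the previous version; the weights and
        lineage for every old version are still in the log."""
        if self.production and self.production > 1:
            self.production -= 1

    def current(self):
        return self.versions[self.production - 1]
```

Because the log is append-only, rollback is a pointer move rather than a redeploy-from-scratch, and every production incident can be traced to an exact (code, data, weights) triple.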
Depth Levels
Junior: Explain training vs serving, describe what a feature store is, name deployment strategies.
Senior: Design a full CI/CD pipeline for ML, implement data validation, describe shadow mode deployment.
Staff: Multi-model orchestration with dependencies, distributed training with fault tolerance, cost optimization (spot instances, model distillation for cheaper serving), regulatory compliance (model cards, audit logs).
Related ML Topics
- How to Detect Model Drift in Production — model drift is the primary trigger for retraining pipelines; PSI and performance monitoring feed back into the CI/CD loop
- ML System Design: Build a Fraud Detection System — fraud models have weekly retraining cycles; the champion/challenger deployment pattern and shadow mode are essential
- ML System Design: Build a Spam Classifier — spam classification retraining triggered by spammer adaptation; same CI/CD pipeline with adversarial test cases
- Design a Monitoring and Alerting System — MLOps monitoring is built on the same Prometheus/Grafana/alerting infrastructure as system monitoring; model metrics are just another time series
- ML System Design: Build a Search Ranking System — search ranking CI/CD is among the most complex: A/B testing, interleaved evaluation, position bias in training data all affect the pipeline