MLOps Interview Questions: Pipelines, Monitoring, and Deployment

MLOps interviews test whether you can build and maintain ML systems in production — not just train models in notebooks. Companies like Google, Meta, Uber, Airbnb, and any company with a mature ML platform ask these questions. Expect MLOps questions alongside ML system design.

What MLOps Interviewers Are Testing

  • Can you design a training pipeline that is reproducible and version-controlled?
  • Do you understand the difference between online (real-time) and offline (batch) inference patterns?
  • Can you describe a model deployment strategy that minimizes risk?
  • Do you know how to monitor models in production and trigger retraining?

Training Pipelines

Q: How do you ensure reproducibility in ML training?

Reproducibility requires locking every source of randomness and versioning every artifact:

import random
import subprocess

import mlflow
import numpy as np
import torch

def get_git_hash():
    """Return the current git commit hash, used to tie each run to a code version."""
    return subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()

def setup_reproducible_training(seed=42):
    """Lock all sources of randomness."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # Slower but deterministic

def train_with_tracking(model, train_data, config):
    """Log all artifacts needed to reproduce this run."""
    with mlflow.start_run() as run:
        # Log hyperparameters
        mlflow.log_params(config)

        # Log code version
        mlflow.log_param('git_commit', get_git_hash())

        # Log dataset version
        mlflow.log_param('data_version', train_data.version)
        mlflow.log_artifact(train_data.schema_path)

        # Train model (train_model is the project's own training loop)
        model, metrics = train_model(model, train_data, config)

        # Log metrics
        mlflow.log_metrics(metrics)

        # Log model with signature
        signature = mlflow.models.infer_signature(
            train_data.sample_input, model(train_data.sample_input)
        )
        mlflow.pytorch.log_model(model, 'model', signature=signature)

        return run.info.run_id

Q: What is a feature store and why do you need one?

A feature store solves two problems:

  • Training-serving skew: Features computed differently at training time vs serving time cause silent model degradation. A feature store ensures the same computation runs in both contexts.
  • Feature reuse: User embedding features computed for the recommendation model can be reused by the fraud detection model without recomputing.

Architecture:

  • Offline store: S3 or Delta Lake — historical feature values for training
  • Online store: Redis or DynamoDB — latest feature values for serving (<10ms lookup)
  • Feature computation: Spark batch jobs populate offline store; Kafka + Flink stream jobs populate online store

Examples: Feast (open source), Tecton (managed), Vertex AI Feature Store (GCP).
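The offline/online split above can be sketched in a few lines. This is a toy in-memory version for illustration only (names like MiniFeatureStore are made up, not a real library); the key ideas are a single write path feeding both stores and point-in-time-correct historical reads:

```python
from collections import defaultdict

class MiniFeatureStore:
    """Toy feature store: an append-only offline log for training and a
    latest-value online map for serving, fed by one shared write path."""

    def __init__(self):
        self.offline = defaultdict(list)   # entity_id -> [(timestamp, features)]
        self.online = {}                   # entity_id -> latest features

    def write(self, entity_id, timestamp, features):
        # Single write path keeps offline and online stores consistent
        self.offline[entity_id].append((timestamp, features))
        self.online[entity_id] = features

    def get_online(self, entity_id):
        """Serving path: latest features, O(1) lookup."""
        return self.online[entity_id]

    def get_historical(self, entity_id, as_of):
        """Training path: point-in-time correct lookup (no future leakage)."""
        rows = [f for ts, f in self.offline[entity_id] if ts <= as_of]
        return rows[-1] if rows else None

store = MiniFeatureStore()
store.write('user_1', 100, {'purchases_7d': 2})
store.write('user_1', 200, {'purchases_7d': 5})
print(store.get_online('user_1'))            # latest value, for serving
print(store.get_historical('user_1', 150))   # value as of t=150, for training
```

Real systems like Feast implement the same contract with S3/Delta Lake behind `get_historical` and Redis/DynamoDB behind `get_online`.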

Model Deployment

Q: Compare canary deployment, blue-green deployment, and shadow mode for ML models.

Strategy    | How it works                                                         | Best for                                               | Risk
------------|----------------------------------------------------------------------|--------------------------------------------------------|-----
Shadow mode | New model runs alongside current; predictions logged but not served  | Validating correctness before any traffic              | Zero — users never see new model
Canary      | Route N% of traffic to new model; ramp up if metrics hold            | Gradual rollout with real user feedback                | Low — easy to roll back
Blue-green  | Two identical environments; switch DNS to new after validation       | Instant cutover after offline validation               | Higher if rollback is slow
A/B test    | Split traffic between models; run statistical significance test      | Measuring business impact, not just technical metrics  | Experiment pollution if not isolated

Recommended sequence: Shadow mode (1 week) → Canary 5% (2 days) → Canary 25% (2 days) → Full rollout.
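The canary ramp above is usually driven by deterministic, hash-based routing rather than random sampling, so a given user stays on the same model version as the percentage increases. A minimal sketch (the function name and percentages are illustrative):

```python
import hashlib

def route_to_canary(request_id: str, canary_pct: float) -> bool:
    """Deterministically route roughly canary_pct percent of traffic
    to the new model by hashing a stable ID into 100 buckets."""
    digest = hashlib.md5(request_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100        # stable bucket in [0, 100)
    return bucket < canary_pct

# As canary_pct ramps 5% -> 25% -> 100%, users already on the canary
# stay on it, because their bucket assignment never changes.
for pct in (5, 25, 100):
    share = sum(route_to_canary(f'user_{i}', pct) for i in range(10_000)) / 10_000
    print(f'{pct}% target -> {share:.1%} observed')
```

Hash-based routing also makes rollback clean: dropping canary_pct back to 0 instantly returns all users to the current model.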

Q: How do you serve ML models at low latency?

  • Model format: Export to ONNX for cross-platform inference; TensorRT for GPU-optimized serving
  • Batching: Dynamic batching — accumulate requests for 10-20 ms, then run them as a single batch to amortize per-call GPU overhead
  • Model quantization: INT8 quantization gives 3-4x speedup with <1% accuracy drop on most models
  • Caching: Cache predictions for identical inputs (deterministic models, immutable features)
  • Hardware: GPU for deep models; CPU for tree-based models (XGBoost, LightGBM serve faster on CPU than GPU)

A minimal ONNX Runtime server illustrating several of these options:

import onnxruntime as ort
import numpy as np

class OptimizedModelServer:
    def __init__(self, model_path: str, num_threads: int = 4):
        # Configure threading and graph optimizations for low-latency inference
        session_options = ort.SessionOptions()
        session_options.inter_op_num_threads = num_threads
        session_options.intra_op_num_threads = num_threads
        session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
        session_options.graph_optimization_level = (
            ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        )

        self.session = ort.InferenceSession(
            model_path,
            session_options,
            providers=['CUDAExecutionProvider', 'CPUExecutionProvider']  # GPU if available, else CPU
        )

    def predict(self, features: np.ndarray) -> np.ndarray:
        input_name = self.session.get_inputs()[0].name
        return self.session.run(None, {input_name: features})[0]

    def predict_batch(self, feature_list: list) -> list:
        """Batch multiple requests for efficient GPU utilization."""
        batch = np.stack(feature_list)
        scores = self.predict(batch)
        return scores.tolist()
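The dynamic batching policy described above (flush when the batch is full or when the oldest request has waited the 10-20 ms budget) can be sketched independently of the model server. This is a simplified single-threaded version; the class name and parameters are illustrative, and the clock is injectable so the policy is easy to test:

```python
import time

class DynamicBatcher:
    """Sketch of dynamic batching: flush when the batch is full or when
    the oldest pending request has waited max_wait_ms, whichever is first."""

    def __init__(self, max_batch_size=32, max_wait_ms=15, clock=time.monotonic):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000.0
        self.clock = clock
        self.pending = []          # [(enqueue_time, request)]

    def add(self, request):
        self.pending.append((self.clock(), request))

    def maybe_flush(self):
        """Return a batch of requests to run, or None to keep waiting."""
        if not self.pending:
            return None
        full = len(self.pending) >= self.max_batch_size
        expired = self.clock() - self.pending[0][0] >= self.max_wait_s
        if full or expired:
            batch = [req for _, req in self.pending[:self.max_batch_size]]
            self.pending = self.pending[self.max_batch_size:]
            return batch
        return None
```

Production servers (e.g. Triton Inference Server) implement this policy inside the serving runtime; the tradeoff is a small added tail latency in exchange for much higher GPU throughput.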

CI/CD for ML

Q: What does a CI/CD pipeline for ML look like?

PR → Code Review
  ↓
Automated checks:
  - Unit tests: data validation, feature computation
  - Integration tests: model training on small dataset
  - Data quality tests: schema checks, distribution checks
  ↓
Training job trigger (if model code changed):
  - Train on full dataset
  - Run offline evaluation: AUC, NDCG, RMSE
  - Compare to current champion model
  ↓
If metrics pass threshold:
  - Register model in model registry (MLflow, W&B)
  - Deploy to staging: run shadow mode test
  ↓
If shadow test passes:
  - Canary deployment: 5% traffic
  - Monitor for 24-48 hours
  ↓
Full deployment or rollback
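The "compare to current champion" gate in the pipeline above can be expressed as a simple policy function. This is a sketch; the metric names and thresholds here are illustrative, not universal:

```python
def should_promote(challenger: dict, champion: dict,
                   min_lift: float = 0.002, max_latency_ratio: float = 1.1) -> bool:
    """CI/CD gate: promote the challenger only if it beats the champion's
    offline metric by at least min_lift AND does not regress p99 latency
    by more than max_latency_ratio."""
    lift = challenger['auc'] - champion['auc']
    latency_ok = (challenger['p99_latency_ms']
                  <= champion['p99_latency_ms'] * max_latency_ratio)
    return lift >= min_lift and latency_ok

champion = {'auc': 0.871, 'p99_latency_ms': 40}
challenger = {'auc': 0.875, 'p99_latency_ms': 43}
print(should_promote(challenger, champion))  # True: +0.004 AUC, latency within 10%
```

In practice the gate also checks per-segment metrics (no regression on any key user segment) and runs a statistical test rather than a raw point comparison.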

Q: How do you handle data validation in ML pipelines?

import great_expectations as ge

def validate_training_data(df):
    """Validate data quality before training."""
    gdf = ge.from_pandas(df)

    # Schema validation
    gdf.expect_column_to_exist('user_id')
    gdf.expect_column_to_exist('label')
    gdf.expect_column_values_to_not_be_null('label')

    # Distribution validation
    gdf.expect_column_values_to_be_between('age', min_value=0, max_value=120)
    gdf.expect_column_mean_to_be_between('purchase_amount', min_value=10, max_value=500)

    # Completeness
    gdf.expect_column_values_to_not_be_null('user_features', mostly=0.95)

    # Label distribution (detect label drift)
    positive_rate = df['label'].mean()
    if not (0.01 <= positive_rate <= 0.50):
        raise ValueError(f"Unexpected positive label rate: {positive_rate:.3f}")

    result = gdf.validate()
    if not result['success']:
        failed = [r for r in result['results'] if not r['success']]
        raise ValueError(f"Data validation failed: {failed}")

    return True

Common MLOps Interview Questions

Q: What is training-serving skew and how do you prevent it?

Training-serving skew occurs when features are computed differently at training vs serving time. Common causes:

  • Using raw data at training time but transformed data at serving (different normalization)
  • Time-based features computed with future data at training (data leakage)
  • Different code paths: Python pandas at training, Java/Go at serving

Prevention: Use a feature store with a shared computation layer. Run the same feature computation code (via WASM, JVM, or gRPC service) in both training and serving.
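The shared-computation idea reduces to this: one feature function is the single source of truth, and both the training pipeline and the serving service call it. A sketch with a hypothetical feature (the function name and fields are made up for illustration):

```python
import math

def compute_purchase_features(purchase_amounts: list[float]) -> dict:
    """Single source of truth for this feature's definition, imported by
    both the training pipeline and the serving service."""
    total = sum(purchase_amounts)
    n = len(purchase_amounts)
    return {
        'purchase_count': n,
        'avg_purchase': total / n if n else 0.0,
        'log_total': math.log1p(total),
    }

# Training: applied to historical rows when building the training set
train_row = compute_purchase_features([20.0, 35.0])
# Serving: the exact same function applied to the live request
serve_row = compute_purchase_features([20.0, 35.0])
assert train_row == serve_row   # no skew, by construction
```

When training and serving are in different languages, the same effect comes from packaging this function once (WASM module, JVM library, or a small gRPC service) and calling it from both sides, rather than reimplementing it twice.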

Q: When would you use batch inference vs. online inference?

Pattern                    | Use when                                                  | Examples
---------------------------|-----------------------------------------------------------|---------
Batch (offline)            | Predictions can be precomputed; low latency not required  | Email campaign targeting, next-day churn risk scores
Near-real-time (streaming) | Features need to be recent but not instant                | Fraud detection with 1-minute feature freshness
Online (synchronous)       | Must respond to a user action in real time                | Search ranking, real-time recommendation, ad serving

Q: What is model versioning and why does it matter?

Track: model weights, hyperparameters, training data version, code commit, evaluation metrics. This enables:

  • Rollback if new model regresses in production
  • Audit trail for regulated industries (finance, healthcare)
  • Reproducibility for debugging production failures
  • Lineage tracking: which training data produced which model

Tools: MLflow Model Registry, Weights & Biases, DVC, Vertex AI Model Registry.
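A registry entry boils down to a metadata record plus promote/rollback operations. This toy in-memory sketch (ModelVersion and ModelRegistry are illustrative names, not a real registry client) shows the minimum worth capturing:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelVersion:
    """Minimal record of what a registry entry should capture."""
    name: str
    version: int
    git_commit: str
    data_version: str
    hyperparams: dict
    metrics: dict

class ModelRegistry:
    """Toy in-memory registry supporting promotion and rollback."""
    def __init__(self):
        self.versions = {}      # (name, version) -> ModelVersion
        self.production = {}    # name -> version currently serving

    def register(self, mv: ModelVersion):
        self.versions[(mv.name, mv.version)] = mv

    def promote(self, name: str, version: int):
        self.production[name] = version

    def rollback(self, name: str) -> ModelVersion:
        """Revert production to the previous registered version."""
        self.production[name] = self.production[name] - 1
        return self.versions[(name, self.production[name])]
```

Because each record carries the git commit and data version, rolling back also tells you exactly which code and data produced the model you are reverting to — that is the lineage property.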

Depth Levels

Junior: Explain training vs serving, describe what a feature store is, name deployment strategies.

Senior: Design a full CI/CD pipeline for ML, implement data validation, describe shadow mode deployment.

Staff: Multi-model orchestration with dependencies, distributed training with fault tolerance, cost optimization (spot instances, model distillation for cheaper serving), regulatory compliance (model cards, audit logs).
