AI/ML Interview: MLOps — Model Deployment, Feature Store, Experiment Tracking, Model Monitoring, A/B Testing, Pipeline

MLOps (Machine Learning Operations) bridges the gap between training a model in a notebook and running it reliably in production. Many ML models never make it to production — not because of model quality, but because of engineering challenges. Understanding MLOps is essential for ML engineering interviews at any company deploying ML at scale. This guide covers the systems that make ML production-ready.

ML Pipeline Architecture

A production ML pipeline automates data ingestion, feature engineering, training, evaluation, and deployment.

Pipeline stages:

(1) Data ingestion — pull raw data from sources (databases, event streams, data lakes). Validate schema and quality (no null columns, expected distributions).
(2) Feature engineering — transform raw data into model features: compute aggregations (user purchase count in the last 30 days), encode categoricals (one-hot, label encoding), normalize numerics, and join multiple data sources.
(3) Training — train the model on the feature dataset, with hyperparameter tuning (grid search, Bayesian optimization). Track experiments (parameters, metrics, artifacts) in an experiment tracker.
(4) Evaluation — compute metrics on a held-out test set and compare against the current production model. If the new model is better, promote it to staging.
(5) Deployment — deploy the model to serving infrastructure. Canary deployment: serve 5% of traffic with the new model, 95% with the old. Monitor metrics; if healthy, ramp to 100%.

Pipeline orchestration: Airflow, Kubeflow Pipelines, or Vertex AI Pipelines schedule and manage the pipeline. A retraining pipeline runs daily or weekly: ingest new data, retrain, evaluate, and deploy if improved. This prevents model staleness (model performance degrades as the data distribution shifts over time).
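The stages above can be sketched end to end as plain functions — a toy illustration, not any real orchestrator's API; the schema check, the mean-prediction "model", and the error threshold are all hypothetical stand-ins:

```python
def ingest(raw_rows):
    # (1) Data ingestion: drop rows failing a simple quality check (null amount).
    return [r for r in raw_rows if r.get("amount") is not None]

def engineer_features(rows):
    # (2) Feature engineering: aggregate purchase count per user.
    counts = {}
    for r in rows:
        counts[r["user_id"]] = counts.get(r["user_id"], 0) + 1
    return counts

def train(features):
    # (3) Training stand-in: the "model" just predicts the mean purchase count.
    return sum(features.values()) / len(features)

def evaluate(model, holdout):
    # (4) Evaluation: mean absolute error against held-out counts.
    return sum(abs(model - v) for v in holdout.values()) / len(holdout)

def should_deploy(new_error, prod_error):
    # (5) Deployment gate: promote only if the new model beats production.
    return new_error < prod_error

rows = ingest([
    {"user_id": "a", "amount": 10}, {"user_id": "a", "amount": 5},
    {"user_id": "b", "amount": 7},  {"user_id": "c", "amount": None},
])
features = engineer_features(rows)         # {'a': 2, 'b': 1}
model = train(features)                    # 1.5
err = evaluate(model, {"a": 2, "b": 1})    # 0.5
print(should_deploy(err, prod_error=0.8))  # True
```

In a real pipeline each function would be a separate orchestrated task (an Airflow operator or Kubeflow component) with its own retries and logging; the control flow, however, is exactly this chain.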

Feature Store

A feature store is a centralized repository for ML features. It solves three problems:

(1) Training-serving skew — features computed differently during training (batch, using Spark on historical data) and serving (real-time, using a different code path). The feature store ensures the same feature computation logic is used in both contexts.
(2) Feature reuse — multiple models may use the same features (user purchase count, product embedding). Without a feature store, each team recomputes features independently (duplicated work, inconsistent definitions).
(3) Point-in-time correctness — training features must reflect the state at the time of each training example, not the current state. Using current features for historical examples is data leakage. The feature store handles time-travel queries: "what was user X's purchase count on March 1, 2026?"

Architecture: an offline store (batch features computed by Spark/Flink, stored in a data warehouse or Parquet files on S3) serves training; an online store (a low-latency key-value store: Redis, DynamoDB) serves real-time inference. A materialization job copies features from offline to online, keyed by entity (user_id, product_id). Tools: Feast (open-source), Tecton (managed), Vertex AI Feature Store (GCP), SageMaker Feature Store (AWS). Example feature definition: feature = user_purchase_count_30d, entity = user_id, aggregation = COUNT of purchases WHERE timestamp > now - 30 days.
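Point-in-time correctness is the subtlest of the three, so here is a minimal sketch of what a time-travel query computes. The event log and user IDs are made up for illustration; the key detail is that events at or after the example's timestamp are excluded:

```python
from datetime import datetime, timedelta

# Hypothetical event log: (user_id, purchase timestamp).
purchases = [
    ("u1", datetime(2026, 2, 10)),
    ("u1", datetime(2026, 2, 25)),
    ("u1", datetime(2026, 3, 5)),
]

def purchase_count_30d(user_id, as_of):
    """Point-in-time feature: purchases in the 30 days *before* as_of.
    Events at or after as_of are excluded — including them would leak
    the future into historical training examples."""
    window_start = as_of - timedelta(days=30)
    return sum(1 for uid, ts in purchases
               if uid == user_id and window_start <= ts < as_of)

# A training example dated March 1 sees only the two February purchases;
# the March 5 purchase is in its future and must not leak in.
print(purchase_count_30d("u1", datetime(2026, 3, 1)))   # 2
print(purchase_count_30d("u1", datetime(2026, 3, 10)))  # 3
```

A feature store runs this same windowed aggregation at scale (a point-in-time join between the label table and the event log) so that every training row gets the feature value as of its own timestamp.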

Experiment Tracking

Experiment tracking records every training run: parameters, metrics, code version, data version, and artifacts (model files, plots). Without tracking, "which model configuration produced the best F1 score last week?" is unanswerable.

Tools:

(1) MLflow — open-source. Tracks experiments (parameters + metrics), manages a model registry (staging, production, archived), and packages models for deployment (MLflow Models).
(2) Weights & Biases (W&B) — SaaS. Rich experiment dashboard with real-time training visualization, hyperparameter sweeps, and team collaboration. A standard choice for research teams.
(3) Neptune.ai — SaaS. Similar to W&B, with strong metadata management.

What to track per experiment: hyperparameters (learning_rate, batch_size, epochs, model_architecture), training metrics (loss, accuracy, F1, AUC per epoch), evaluation metrics on the test set, dataset version (a hash of the training data), code version (git commit SHA), environment (Python version, package versions), and artifacts (model checkpoint, confusion matrix plot, feature importances).

Model registry: after training, the best model is registered with a version (v1, v2) and a stage: staging (being validated), production (serving traffic), or archived (retired). Promoting a model from staging to production triggers the deployment pipeline.
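To make the "what to track" list concrete, here is a toy in-memory tracker — a hypothetical sketch of the record real tools like MLflow or W&B keep per run, not any real client API:

```python
import hashlib
import json

class ExperimentTracker:
    """Toy tracker: one dict per run, holding the fields the text lists."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, train_data, git_sha):
        self.runs.append({
            "params": params,        # hyperparameters (lr, batch_size, ...)
            "metrics": metrics,      # evaluation metrics (f1, auc, ...)
            # Dataset version: a content hash of the training data.
            "data_version": hashlib.sha256(
                json.dumps(train_data, sort_keys=True).encode()
            ).hexdigest()[:12],
            "code_version": git_sha,  # git commit SHA
        })

    def best(self, metric):
        # Answers "which configuration produced the best F1 last week?"
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
tracker.log_run({"lr": 0.01, "batch_size": 32}, {"f1": 0.81}, [1, 2, 3], "a1b2c3d")
tracker.log_run({"lr": 0.001, "batch_size": 64}, {"f1": 0.85}, [1, 2, 3], "d4e5f6a")
print(tracker.best("f1")["params"])  # {'lr': 0.001, 'batch_size': 64}
```

The point of hashing the dataset and pinning the commit SHA is reproducibility: the best run can be retrained byte-for-byte, which is exactly what a model registry promotion relies on.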

Model Serving and A/B Testing

Model serving: the model runs as a service responding to prediction requests. Three serving patterns:

(1) Real-time serving — a REST/gRPC API: the request includes features, the response includes the prediction. Frameworks: TorchServe (PyTorch), TensorFlow Serving, Triton Inference Server (NVIDIA, supports multiple frameworks), and BentoML (framework-agnostic, easy deployment). A common latency target is under 50 ms at P99.
(2) Batch serving — process a large dataset of prediction requests at once (nightly recommendations, daily fraud scoring). Use Spark or a batch job with the model loaded in memory.
(3) Edge serving — run the model on the device (mobile, IoT). Use ONNX Runtime, TensorFlow Lite, or Core ML.

A/B testing for models: deploy the new model to a treatment group (e.g., 10% of users) while the control group (90%) uses the current model. Compare business metrics (click-through rate, conversion, revenue) between groups. Statistical significance: run the test for 1-2 weeks with sufficient traffic; if the treatment outperforms control with p < 0.05, promote the new model to 100%.

Shadow mode: before A/B testing, run the new model in shadow mode — it processes real traffic and logs predictions but does not serve them to users. Compare shadow predictions with the production model's; if they diverge significantly, investigate before A/B testing. This catches bugs without user impact.
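The p < 0.05 check above is typically a two-proportion z-test on conversion counts. A simplified sketch (the traffic numbers are invented; a production analysis should also verify sample-size assumptions and guard against peeking at interim results):

```python
from math import erf, sqrt

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test: did group B's conversion rate
    differ significantly from group A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis (no difference).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the standard normal CDF, Phi(z) = (1 + erf(z/sqrt(2))) / 2.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Control (90% of traffic): 5,000 conversions out of 100,000 users.
# Treatment (10% of traffic): 620 conversions out of 11,000 users.
p = two_proportion_pvalue(5000, 100_000, 620, 11_000)
print(p < 0.05)  # True: the treatment's ~5.6% rate beats control's 5.0%
```

Note the asymmetric group sizes: the test is still valid with a 90/10 split, but the smaller treatment group dominates the standard error, which is one reason model A/B tests need to run long enough to accumulate traffic.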

Model Monitoring

Models degrade over time as the data distribution changes (data drift). Monitoring detects degradation before it impacts users.

What to monitor:

(1) Prediction distribution — if the model suddenly predicts 90% positive when it used to predict 50%, something is wrong. Track prediction histograms over time.
(2) Feature distribution (data drift) — compare the distribution of input features against the training distribution. The Kolmogorov-Smirnov test or the Population Stability Index (PSI) detects drift. If a feature's distribution shifts significantly, the model may produce unreliable predictions.
(3) Performance metrics — if labels are available, track accuracy, F1, and AUC over time; a drop indicates model degradation. For many applications (recommendations, search ranking), labels arrive with delay (did the user click? did they purchase?). Monitor proxy metrics (click-through rate) in real time and true metrics (conversion) with delay.
(4) Latency and errors — model serving latency, error rate, and throughput.

Alert on: latency spikes (a model or infrastructure issue), error rate increases, and prediction distribution shifts. When drift or performance degradation is detected: (1) alert the ML team, (2) trigger a retraining pipeline with recent data, and (3) if severe, roll back to the previous model version.

Tools: Evidently AI (open-source drift detection), WhyLabs (managed monitoring), Arize (ML observability), and custom dashboards in Grafana with Prometheus metrics from the serving layer.
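PSI is simple enough to sketch from scratch. This minimal version bins one feature by the training distribution and compares bin frequencies; the samples and the 1e-4 floor for empty bins are illustrative choices, and the thresholds quoted are the common rule of thumb rather than a formal test:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index of one feature: training sample
    (expected) vs. live sample (actual). Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 significant drift."""
    lo, hi = min(expected), max(expected)
    # Equal-width bin edges taken from the *training* distribution.
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-4) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train_sample = [i / 100 for i in range(1000)]   # uniform on [0, 10)
live_same = [i / 100 for i in range(1000)]      # identical distribution
live_shifted = [3 + i / 100 for i in range(1000)]  # shifted by +3
print(round(psi(train_sample, live_same), 3))   # 0.0 (no drift)
print(psi(train_sample, live_shifted) > 0.25)   # True (significant drift)
```

In practice you would compute PSI per feature on a schedule (hourly or daily), export the values as Prometheus metrics, and alert when any feature crosses the drift threshold — which is essentially what Evidently AI automates.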
