Low Level Design: ML Model Serving Service

An ML model serving service provides versioned, low-latency online inference alongside batch prediction, safe deployment strategies, and production monitoring. This design covers the model registry, serving infrastructure, shadow and canary deployments, and drift alerting.

Data Model

ModelVersion Table

ModelVersion
------------
id              BIGINT PRIMARY KEY
model_name      VARCHAR NOT NULL
version         VARCHAR NOT NULL
framework       ENUM('pytorch','tensorflow','sklearn')
artifact_s3_key VARCHAR NOT NULL
metrics         JSONB           -- eval metrics: accuracy, AUC, RMSE, etc.
status          ENUM('staging','production','archived')
created_at      TIMESTAMP
UNIQUE (model_name, version)

InferenceRequest Log

InferenceLog
------------
id            BIGINT PRIMARY KEY
model_name    VARCHAR
version       VARCHAR
input_hash    VARCHAR   -- SHA256 of input for cache lookup
latency_ms    INT
shadow        BOOL
occurred_at   TIMESTAMP

Serving Endpoints

POST /predict/{model_name}
Body: { "inputs": { ... } }
Response: { "prediction": ..., "model_version": "1.4.2", "latency_ms": 23 }

Latency SLO: p99 < 100ms
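A minimal sketch of the handler contract above, assuming a registry dict that maps each model name to its active version and a callable (both names are illustrative, not a real framework API):

```python
import time

def predict_handler(model_name, inputs, registry):
    """Sketch of POST /predict/{model_name}: resolve the active model,
    run inference, and attach version and timing metadata to the response."""
    model = registry[model_name]  # assumed shape: {"version": str, "fn": callable}
    start = time.perf_counter()
    prediction = model["fn"](inputs)
    latency_ms = int((time.perf_counter() - start) * 1000)
    return {
        "prediction": prediction,
        "model_version": model["version"],
        "latency_ms": latency_ms,
    }
```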

POST /predict/batch
Body: { "model_name": "...", "s3_input_uri": "s3://bucket/input.csv" }
Response: { "job_id": "..." }

Worker Architecture

Model Loading

on worker startup:
  1. Read active ModelVersion WHERE model_name=X AND status='production'
  2. Download artifact from S3 to local disk (cached across restarts)
  3. Load model into GPU/CPU memory
  4. Mark worker as ready
  5. Subscribe to model update events → hot-reload on new production version
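Steps 1-3 of the startup sequence can be sketched as follows; `db.query_one` and `s3.download` are stand-ins for the real registry database and object-store clients, not specific library APIs:

```python
import os

def load_active_model(model_name, db, s3, cache_dir="/var/cache/models"):
    """Resolve the production ModelVersion, download its artifact with a
    local disk cache (so restarts skip the download), and return the
    version plus local path ready for framework-specific loading."""
    row = db.query_one(
        "SELECT version, artifact_s3_key FROM ModelVersion "
        "WHERE model_name = %s AND status = 'production'",
        (model_name,),
    )
    local_path = os.path.join(cache_dir, row["artifact_s3_key"].replace("/", "_"))
    if not os.path.exists(local_path):  # cached across restarts
        s3.download(row["artifact_s3_key"], local_path)
    return row["version"], local_path
```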

Worker Pool

Each model has a dedicated worker pool of N GPU or CPU processes. Workers are stateless request handlers. Auto-scaling triggers when the request queue depth exceeds a threshold (e.g., queue depth > 50 → scale out; queue depth < 5 for 3 minutes → scale in). Workers maintain warm model state; cold starts are avoided by pre-warming before traffic shift.
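The queue-depth scaling rule can be sketched as a pure decision function; the thresholds mirror the example values above, and all names are illustrative:

```python
def scale_decision(queue_depth, low_depth_secs,
                   scale_out_depth=50, scale_in_depth=5, scale_in_hold_secs=180):
    """Queue-depth autoscaling rule: scale out immediately on a deep queue,
    scale in only after the queue has stayed shallow for the hold period."""
    if queue_depth > scale_out_depth:
        return "scale_out"
    if queue_depth < scale_in_depth and low_depth_secs >= scale_in_hold_secs:
        return "scale_in"
    return "hold"
```

The hold period on scale-in avoids flapping when traffic is bursty.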

Inference Caching

cache_key = SHA256(model_name + version + serialize(inputs))
Cache store: Redis
TTL: configurable per model (e.g., 60s for time-sensitive, 1h for stable)

Content-addressable caching returns cached predictions for identical inputs without invoking the model. Effective for repeated lookups on shared input spaces (e.g., product recommendation for popular items).
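A minimal sketch of the cache path, assuming a Redis-like client exposing `get`/`setex`; JSON with sorted keys is one way to get a stable `serialize(inputs)` for dict inputs:

```python
import hashlib
import json

def cache_key(model_name, version, inputs):
    """Content-addressable key: SHA256 over model name, version, and a
    stable serialization of the inputs."""
    payload = f"{model_name}:{version}:{json.dumps(inputs, sort_keys=True)}"
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_predict(model_name, version, inputs, model_fn, cache, ttl_s=60):
    """Return (prediction, cache_hit). Checks the cache before invoking
    the model; `cache` is any object with Redis-style get/setex."""
    key = cache_key(model_name, version, inputs)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit), True
    result = model_fn(inputs)
    cache.setex(key, ttl_s, json.dumps(result))
    return result, False
```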

Feature Store Integration

on /predict request:
  1. Extract entity keys from input (e.g., user_id, item_id)
  2. Fetch precomputed features from Feature Store (low-latency online store, e.g., Redis or DynamoDB)
  3. Merge fetched features with request input
  4. Run inference on enriched feature vector

This keeps request payloads small and ensures training/serving feature consistency (no training-serving skew).
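The enrichment flow can be sketched as follows; `get_online_features` and the entity-key names are assumptions for illustration, not a specific feature-store API:

```python
def enrich_and_predict(inputs, feature_store, model_fn):
    """Steps 1-4 above: extract entity keys from the request, fetch their
    precomputed features from the online store, merge, and run inference."""
    features = {}
    for key in ("user_id", "item_id"):  # example entity keys
        if key in inputs:
            features.update(feature_store.get_online_features(key, inputs[key]))
    return model_fn({**inputs, **features})
```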

Shadow Deployment

Traffic split config:
  production: model v1 (100% of responses returned)
  shadow:     model v2 (X% of requests mirrored asynchronously)

Flow:
  1. Request arrives → route to production worker → return response immediately
  2. Async: clone request → route to shadow worker → log shadow output
  3. Compare production vs shadow predictions offline
  4. No shadow latency added to user-facing p99

Shadow deployment validates a new model version against live traffic without user impact before any traffic shift.
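A sketch of the mirroring decision; in a real system the shadow call is dispatched asynchronously (queue or fire-and-forget task) rather than inline as here, so it never touches the user-facing path:

```python
import random

def handle_with_shadow(inputs, prod_fn, shadow_fn, shadow_log, mirror_pct=10):
    """Return the production response; mirror a configurable fraction of
    requests to the shadow model and log its output for offline comparison."""
    response = prod_fn(inputs)
    if random.random() * 100 < mirror_pct:
        shadow_log.append({"inputs": inputs, "shadow_out": shadow_fn(inputs)})
    return response
```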

Canary Deployment

Traffic progression:
  5% → 25% → 100%   (with 30-minute soak at each stage)

Automatic rollback triggers:
  - Error rate increases by >1 percentage point (absolute) vs baseline
  - p99 latency exceeds SLO (100ms)
  - Business metric (e.g., CTR) degrades by >5%

Canary shifts are gated by metric checks. If any rollback condition is met, traffic routes 100% back to the previous version and an alert fires.
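The rollback gates can be expressed as a single check; the metric field names are illustrative, and the error-rate condition is read as an absolute one-point increase:

```python
def should_rollback(canary, baseline, slo_p99_ms=100):
    """True if any rollback trigger fires: error rate up >1 point absolute,
    p99 over the SLO, or business metric (CTR) down >5% relative."""
    if canary["error_rate"] - baseline["error_rate"] > 0.01:
        return True
    if canary["p99_ms"] > slo_p99_ms:
        return True
    if canary["ctr"] < baseline["ctr"] * 0.95:
        return True
    return False
```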

Batch Prediction

POST /predict/batch → enqueue job to SQS
Worker polls SQS:
  1. Download input CSV from S3
  2. Run model inference in mini-batches (e.g., 256 rows)
  3. Write output CSV to S3 output prefix
  4. Update job status: queued → running → completed/failed
  5. Notify caller via SNS or webhook
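Step 2's mini-batching can be sketched as a simple chunked loop, so memory stays bounded regardless of input file size:

```python
def run_batch(rows, model_fn, batch_size=256):
    """Run inference over `rows` in fixed-size mini-batches and collect
    outputs in order; `model_fn` takes and returns a list of rows."""
    outputs = []
    for i in range(0, len(rows), batch_size):
        outputs.extend(model_fn(rows[i:i + batch_size]))
    return outputs
```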

Model Monitoring

Prediction Distribution Drift (PSI)

PSI = sum over histogram bins of (actual_pct - expected_pct) * ln(actual_pct / expected_pct)
PSI < 0.1  → no significant drift
PSI 0.1-0.2 → moderate drift, investigate
PSI > 0.2  → significant drift, alert + retrain trigger
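A minimal PSI implementation over matched histogram bins; the small epsilon guarding empty bins is an assumption here (production code might use smoothing instead):

```python
import math

def psi(expected_pct, actual_pct, eps=1e-6):
    """Population Stability Index between a baseline (expected) and live
    (actual) distribution, each given as per-bin fractions summing to 1."""
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected_pct, actual_pct)
    )
```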

Feature Distribution Drift

Compare live feature value distributions against training baseline using KL divergence or Wasserstein distance per feature. Drift on input features often predicts model performance degradation before labels are available.
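A per-feature KL-divergence check might look like this, comparing binned live and baseline distributions (epsilon avoids log of zero for empty bins):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) over matched histogram bins: P is the live feature
    distribution, Q the training baseline. Larger values mean more drift."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```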

Alerting

Drift metrics computed hourly from inference logs. Alerts sent via PagerDuty on PSI > 0.2 or error rate spike. Dashboard shows per-model drift trend, latency percentiles, and throughput.

Scale Considerations

  • Separate worker pools per model prevent one large model from starving low-latency models.
  • S3 artifact caching on local disk avoids repeated downloads across restarts.
  • Inference logs are sampled (e.g., 10%) for high-throughput models to bound storage cost.
  • Feature store reads must be under 5ms to preserve the 100ms p99 SLO budget.
  • GPU workers use batching within a request window (e.g., 5ms) to amortize GPU kernel launch overhead across concurrent requests.
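The request-window batching idea in the last bullet can be sketched with a standard queue; the window and batch-size values are illustrative:

```python
import time
from queue import Queue, Empty

def collect_window(q, window_ms=5, max_batch=32):
    """Drain requests that arrive within a short window (or until the
    batch is full) so one GPU call can serve several concurrent requests."""
    batch = []
    deadline = time.monotonic() + window_ms / 1000
    while len(batch) < max_batch:
        timeout = deadline - time.monotonic()
        if timeout <= 0:
            break
        try:
            batch.append(q.get(timeout=timeout))
        except Empty:
            break
    return batch
```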

