ML Model Serving Service: Low Level Design
An ML model serving service provides versioned, low-latency online inference alongside batch prediction, safe deployment strategies, and production monitoring. This design covers the model registry, serving infrastructure, shadow and canary deployments, and drift alerting.
Data Model
ModelVersion Table
ModelVersion
------------
id BIGINT PRIMARY KEY
model_name VARCHAR NOT NULL
version VARCHAR NOT NULL
framework ENUM('pytorch','tensorflow','sklearn')
artifact_s3_key VARCHAR NOT NULL
metrics JSONB -- eval metrics: accuracy, AUC, RMSE, etc.
status ENUM('staging','production','archived')
created_at TIMESTAMP
UNIQUE (model_name, version)
InferenceRequest Log
InferenceLog
------------
id BIGINT PRIMARY KEY
model_name VARCHAR
version VARCHAR
input_hash VARCHAR -- SHA256 of input for cache lookup
latency_ms INT
shadow BOOL
occurred_at TIMESTAMP
Serving Endpoints
POST /predict/{model_name}
Body: { "inputs": { ... } }
Response: { "prediction": ..., "model_version": "1.4.2", "latency_ms": 23 }
Latency SLO: p99 < 100ms
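The synchronous endpoint's contract can be sketched as a plain handler function. This is a minimal illustration, not a framework binding: `MODELS` is a hypothetical in-process registry and the lambda stands in for a real loaded model.

```python
import time

# Hypothetical in-process registry of loaded models: {model_name: (version, callable)}.
MODELS = {"ranker": ("1.4.2", lambda inputs: {"score": 0.87})}

def predict(model_name, body):
    """Handle POST /predict/{model_name}: run inference and report latency."""
    version, model_fn = MODELS[model_name]
    start = time.perf_counter()
    prediction = model_fn(body["inputs"])
    latency_ms = int((time.perf_counter() - start) * 1000)
    return {"prediction": prediction, "model_version": version, "latency_ms": latency_ms}
```

The response echoes the serving version so that inference logs and offline comparisons can always attribute a prediction to a specific ModelVersion row.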
POST /predict/batch
Body: { "model_name": "...", "s3_input_uri": "s3://bucket/input.csv" }
Response: { "job_id": "..." }
Worker Architecture
Model Loading
on worker startup:
1. Read active ModelVersion WHERE model_name=X AND status='production'
2. Download artifact from S3 to local disk (cached across restarts)
3. Load model into GPU/CPU memory
4. Mark worker as ready
5. Subscribe to model update events → hot-reload on new production version
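Steps 2-3 can be sketched as a cache-aside artifact loader. `fetch_from_s3` is a hypothetical stand-in for an S3 client download; the local file doubles as the restart-surviving cache.

```python
import os

def load_artifact(artifact_s3_key, cache_dir, fetch_from_s3):
    """Download the model artifact only if the local disk cache misses,
    then read it into memory. fetch_from_s3(key) -> bytes is a stub for
    a real S3 client call."""
    local_path = os.path.join(cache_dir, artifact_s3_key.replace("/", "_"))
    if not os.path.exists(local_path):          # cached across restarts
        with open(local_path, "wb") as f:
            f.write(fetch_from_s3(artifact_s3_key))
    with open(local_path, "rb") as f:           # "load into GPU/CPU memory"
        return f.read()
```

On hot-reload (step 5), the same function runs for the new version's key; because keys differ per version, the old cached artifact is untouched and rollback needs no re-download.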
Worker Pool
Each model has a dedicated worker pool of N GPU or CPU processes. Workers are stateless with respect to requests; the loaded model is their only long-lived state. Auto-scaling triggers on request queue depth (e.g., depth > 50 → scale out; depth < 5 sustained for 3 minutes → scale in). Cold starts are avoided by pre-warming new workers before shifting traffic to them.
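The scaling rule above reduces to a small decision function. The thresholds mirror the example values in the text and would be per-model configuration in practice.

```python
def scale_decision(queue_depth, seconds_below_low):
    """Return +1 (scale out), -1 (scale in), or 0 (hold), using the
    example thresholds: out at depth > 50, in after depth < 5 has been
    sustained for 3 minutes (180 s)."""
    if queue_depth > 50:
        return +1
    if queue_depth < 5 and seconds_below_low >= 180:
        return -1
    return 0
```

Requiring the low-depth condition to persist before scaling in is what prevents flapping on bursty traffic.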
Inference Caching
cache_key = SHA256(model_name + version + serialize(inputs))
Cache store: Redis
TTL: configurable per model (e.g., 60s for time-sensitive, 1h for stable)
Content-addressable caching returns cached predictions for identical inputs without invoking the model. Effective for repeated lookups on shared input spaces (e.g., product recommendation for popular items).
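A sketch of the key derivation and TTL behavior, using an in-memory dict as a stand-in for Redis so the mechanics are visible without a running server. The canonical JSON serialization (sorted keys) is what makes logically identical inputs hash identically.

```python
import hashlib
import json
import time

def cache_key(model_name, version, inputs):
    # Canonical serialization so key order in `inputs` never changes the hash.
    payload = model_name + version + json.dumps(inputs, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# In-memory stand-in for Redis SETEX/GET; values carry their expiry time.
_store = {}

def cache_set(key, prediction, ttl_s):
    _store[key] = (prediction, time.monotonic() + ttl_s)

def cache_get(key):
    hit = _store.get(key)
    if hit is None or time.monotonic() > hit[1]:
        return None          # miss or expired
    return hit[0]
```

Because the version is part of the key, promoting a new model version invalidates the cache implicitly with no explicit flush.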
Feature Store Integration
on /predict request:
1. Extract entity keys from input (e.g., user_id, item_id)
2. Fetch precomputed features from Feature Store (low-latency online store, e.g., Redis or DynamoDB)
3. Merge fetched features with request input
4. Run inference on enriched feature vector
This keeps request payloads small and reduces training-serving skew: the serving path reads the same precomputed features the training pipeline used.
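Steps 1-3 amount to a lookup-and-merge. In this sketch `online_store` is a hypothetical dict-like online store keyed by (entity, id), standing in for Redis or DynamoDB reads.

```python
def enrich(request_inputs, online_store):
    """Extract entity keys, fetch their precomputed features, and merge
    them with the raw request input to build the inference feature vector."""
    features = dict(request_inputs)
    for entity in ("user_id", "item_id"):          # entity keys from step 1
        if entity in request_inputs:
            features.update(online_store.get((entity, request_inputs[entity]), {}))
    return features
```

Missing entities fall through to an empty feature dict here; a real service would decide per model whether to impute defaults or reject the request.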
Shadow Deployment
Traffic split config:
production: model v1 (100% of responses returned)
shadow: model v2 (X% of requests mirrored asynchronously)
Flow:
1. Request arrives → route to production worker → return response immediately
2. Async: clone request → route to shadow worker → log shadow output
3. Compare production vs shadow predictions offline
4. No shadow latency added to user-facing p99
Shadow deployment validates a new model version against live traffic without user impact before any traffic shift.
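The mirroring flow can be sketched as below. Sampling is done deterministically by hashing the request ID, so a given request is always in or out of the mirror set; the shadow call is shown inline for brevity where production would dispatch it to an async queue (step 2).

```python
import hashlib

def should_shadow(request_id, mirror_pct):
    """Deterministically select mirror_pct% of requests for shadowing."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < mirror_pct

def handle(request_id, inputs, prod_model, shadow_model, shadow_log, mirror_pct=10):
    response = prod_model(inputs)      # user-facing path, returned unchanged
    if should_shadow(request_id, mirror_pct):
        # Production would enqueue this so shadow latency never reaches
        # the user-facing p99; called inline here only for illustration.
        shadow_log.append((request_id, shadow_model(inputs)))
    return response
```

The shadow_log entries pair with production InferenceLog rows by request ID for the offline comparison in step 3.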
Canary Deployment
Traffic progression:
5% → 25% → 100% (with 30-minute soak at each stage)
Automatic rollback triggers:
- Error rate increases by more than 1 percentage point vs. baseline
- p99 latency exceeds SLO (100ms)
- Business metric (e.g., CTR) degrades by >5%
Canary shifts are gated by metric checks. If any rollback condition is met, traffic routes 100% back to the previous version and an alert fires.
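The gate check evaluated at each soak stage can be sketched as a pure function over baseline and canary metric snapshots (hypothetical dict shapes assumed here).

```python
def should_rollback(baseline, canary, slo_p99_ms=100):
    """Evaluate the three rollback triggers against metric snapshots,
    each a dict with keys "error_rate", "p99_ms", and "ctr"."""
    if canary["error_rate"] - baseline["error_rate"] > 0.01:  # >1 pt error increase
        return True
    if canary["p99_ms"] > slo_p99_ms:                         # p99 SLO breach
        return True
    if canary["ctr"] < baseline["ctr"] * 0.95:                # >5% CTR degradation
        return True
    return False
```

Comparing against a concurrent baseline cohort, rather than last week's numbers, keeps the check robust to time-of-day and seasonality effects.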
Batch Prediction
POST /predict/batch → enqueue job to SQS
Worker polls SQS:
1. Download input CSV from S3
2. Run model inference in mini-batches (e.g., 256 rows)
3. Write output CSV to S3 output prefix
4. Update job status: queued → running → completed/failed
5. Notify caller via SNS or webhook
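The mini-batching in step 2 is a simple chunking pass over the downloaded rows; a minimal sketch:

```python
def mini_batches(rows, batch_size=256):
    """Yield fixed-size mini-batches of input rows for model inference.
    The final batch may be smaller than batch_size."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]
```

Streaming batches with a generator keeps memory bounded even when the input CSV is far larger than a single worker's RAM budget allows to hold as model inputs at once.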
Model Monitoring
Prediction Distribution Drift (PSI)
PSI = sum over bins of ((actual_pct - expected_pct) * ln(actual_pct / expected_pct))
PSI < 0.1 → no significant drift
PSI 0.1-0.2 → moderate drift, investigate
PSI > 0.2 → significant drift, alert + retrain trigger
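The PSI formula above, computed over pre-binned prediction distributions; the small epsilon guards the log term against empty bins (an implementation detail, not part of the formula itself).

```python
import math

def psi(expected_pct, actual_pct, eps=1e-6):
    """Population Stability Index between two binned distributions,
    each given as a list of per-bin fractions summing to 1."""
    total = 0.0
    for e, a in zip(expected_pct, actual_pct):
        e, a = max(e, eps), max(a, eps)     # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total
```

Identical distributions score 0; in the test below a mass shift across four uniform bins lands above the 0.2 alert threshold.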
Feature Distribution Drift
Compare live feature value distributions against training baseline using KL divergence or Wasserstein distance per feature. Drift on input features often predicts model performance degradation before labels are available.
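For the KL option, a per-feature sketch over binned histograms (same per-bin-fraction representation as the PSI example; epsilon again guards empty bins):

```python
import math

def kl_divergence(p, q, eps=1e-6):
    """KL(P || Q) between a live feature histogram P and the training
    baseline Q, both given as per-bin fractions summing to 1."""
    return sum(max(pi, eps) * math.log(max(pi, eps) / max(qi, eps))
               for pi, qi in zip(p, q))
```

Note KL is asymmetric (live vs. baseline order matters) and has no universal alert threshold, so per-feature thresholds are typically calibrated from historical variation; Wasserstein distance is the common alternative when features are continuous and bin boundaries are awkward.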
Alerting
Drift metrics computed hourly from inference logs. Alerts sent via PagerDuty on PSI > 0.2 or error rate spike. Dashboard shows per-model drift trend, latency percentiles, and throughput.
Scale Considerations
- Separate worker pools per model prevent one large model from starving low-latency models.
- S3 artifact caching on local disk avoids repeated downloads across restarts.
- Inference logs are sampled (e.g., 10%) for high-throughput models to bound storage cost.
- Feature store reads must be under 5ms to preserve the 100ms p99 SLO budget.
- GPU workers use batching within a request window (e.g., 5ms) to amortize GPU kernel launch overhead across concurrent requests.
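The request-window batching in the last bullet can be sketched offline over timestamped arrivals: a batch closes a fixed window after its first request arrives, and everything inside the window shares one GPU invocation. Timestamps and payloads here are illustrative inputs, not a real queue API.

```python
def micro_batches(requests, window_ms=5):
    """Group (arrival_ms, payload) pairs into batches. A batch closes
    window_ms after its first request arrived, so one GPU kernel launch
    is amortized over every request that lands inside the window."""
    batches, current, window_end = [], [], None
    for arrival_ms, payload in requests:
        if current and arrival_ms > window_end:
            batches.append(current)      # window expired: flush the batch
            current = []
        if not current:
            window_end = arrival_ms + window_ms
        current.append(payload)
    if current:
        batches.append(current)
    return batches
```

The window length trades latency for throughput: every request can wait up to window_ms, which must fit inside the 100ms p99 budget alongside feature-store reads and inference itself.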