Image Recognition Service Low-Level Design: Model Serving, Batch Inference, and Confidence Thresholds

Image Recognition Service System Design Overview

An image recognition service accepts raw image input, runs it through one or more ML models to produce labels, bounding boxes, or embeddings, and returns structured results. Production requirements introduce complexity beyond model accuracy: asynchronous batch processing, model version routing, confidence thresholding, and aggressive result caching are all necessary to operate at scale within cost constraints.

Requirements

Functional Requirements

  • Accept image uploads via URL or binary multipart payload.
  • Run inference using configurable models: object detection, scene classification, face detection, NSFW classification.
  • Support synchronous responses for single images under 2 MB and asynchronous batch processing for large jobs.
  • Apply per-model confidence thresholds; suppress low-confidence labels from results.
  • Route requests to the appropriate model version based on caller-specified model_id.

Non-Functional Requirements

  • Synchronous P99 latency under 800ms for single-image inference.
  • Batch job completion within 5 minutes for up to 10,000 images.
  • GPU utilization above 70% during business hours.
  • Result cache hit rate above 60% (many images are submitted multiple times).

Data Model

  • inference_requests: request_id, caller_id, model_id, image_hash, status (pending, processing, done, failed), submitted_at, completed_at
  • inference_results: request_id, model_id, model_version, labels (JSON array of label, score, bounding_box), raw_output_path (S3 URI), created_at
  • model_registry: model_id, version, artifact_s3_path, framework (TorchScript, ONNX), input_shape, default_confidence_threshold, is_active
  • result_cache: Redis key-value entries keyed by (image_hash, model_id, model_version), each storing compressed inference_results JSON with a 7-day TTL.
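
As a rough illustration of the inference_requests record, the sketch below models it as a Python dataclass with the four listed statuses. The allowed status transitions and the advance helper are assumptions made for the example; the source only names the states.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    DONE = "done"
    FAILED = "failed"


@dataclass
class InferenceRequest:
    request_id: str
    caller_id: str
    model_id: str
    image_hash: str
    status: Status = Status.PENDING
    submitted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    completed_at: Optional[datetime] = None


# Assumed transition table (hypothetical): done and failed are terminal.
ALLOWED = {
    Status.PENDING: {Status.PROCESSING, Status.FAILED},
    Status.PROCESSING: {Status.DONE, Status.FAILED},
    Status.DONE: set(),
    Status.FAILED: set(),
}


def advance(req: InferenceRequest, new_status: Status) -> InferenceRequest:
    """Move a request to new_status, stamping completed_at on terminal states."""
    if new_status not in ALLOWED[req.status]:
        raise ValueError(f"cannot go from {req.status.value} to {new_status.value}")
    req.status = new_status
    if new_status in (Status.DONE, Status.FAILED):
        req.completed_at = datetime.now(timezone.utc)
    return req
```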

Core Algorithms

Async Batch Inference Pipeline

Batch jobs are submitted via a REST endpoint and enqueued to a Kafka topic partitioned by caller_id. A fleet of GPU worker nodes consumes from this topic. Each worker pulls a micro-batch of 32 images (configurable), stages them in pinned (page-locked) host memory so the copy to the GPU is fast, runs a single forward pass over the resulting tensor batch, and writes results to the inference_results table and the result cache. Micro-batching is critical: as long as the GPU is not already saturated, a forward pass over 32 images often takes roughly the same wall-clock time as a single-image pass, so throughput can improve by up to about 32x. The actual gain depends on the model and the hardware. Workers use NVIDIA Triton Inference Server as the model runtime, which handles batching, model loading, and GPU memory management internally.
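
The grouping step can be sketched in a few lines of plain Python. This is only an illustration of how a worker might chunk a stream of consumed messages into micro-batches; the function name micro_batches is hypothetical, and the Kafka consumer and the Triton call are deliberately left out.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(items: Iterable[T], batch_size: int = 32) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size items.

    The final batch may be smaller than batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

With the default size of 32, a 10,000-image job becomes 313 forward passes (312 full batches plus one batch of 16) instead of 10,000 single-image passes.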

Confidence Thresholding

Each model defines a default_confidence_threshold in the model registry. After inference, the service filters the raw output scores, retaining only labels with score >= threshold. Callers may override the default by specifying a custom threshold per request. For multi-label models, a secondary setting, min_label_count, ensures that at least one label is always returned even if no label clears the confidence bar: the service lowers the threshold in 0.05 increments until at least one label qualifies. This prevents silent empty results from confusing downstream consumers.
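
A minimal sketch of this filter follows, assuming scores lie in [0, 1] and labels are dicts with a score field. The function name filter_labels and the returned effective threshold are illustrative choices, not part of the described service.

```python
from typing import Any, Dict, List, Tuple

Label = Dict[str, Any]


def filter_labels(
    labels: List[Label],
    threshold: float,
    min_label_count: int = 1,
    step: float = 0.05,
) -> Tuple[List[Label], float]:
    """Keep labels with score >= threshold.

    If fewer than min_label_count labels qualify, lower the threshold
    by `step` at a time until enough qualify or the threshold hits 0.
    Returns the kept labels and the threshold actually applied.
    """
    needed = min(min_label_count, len(labels))  # cannot return more than exist
    current = threshold
    kept = [l for l in labels if l["score"] >= current]
    while len(kept) < needed and current > 0.0:
        # Round to avoid floating-point drift from repeated subtraction.
        current = max(0.0, round(current - step, 10))
        kept = [l for l in labels if l["score"] >= current]
    return kept, current
```

Returning the applied threshold lets a caller see when results came from the fallback path rather than from the threshold it asked for.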

Model Version Routing

The model_registry table tracks multiple versions per model_id. Callers specify model_id only; the service resolves the active version at request time by reading a Redis key, model:active:{model_id}, that caches the current active version string with a 60-second TTL. A model promotion workflow updates this key after a new version passes offline evaluation benchmarks. Canary releases route 5% of traffic to a new version by storing a canary_version and canary_weight alongside the active version; the router samples a uniform random float in [0, 1) and routes to the canary if the sample falls below canary_weight (0.05 for a 5% canary).
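
The canary decision reduces to a single comparison. The sketch below assumes the active version, canary_version, and canary_weight have already been read from the cache; the function name pick_version and the injectable rng parameter are hypothetical and exist only to make the example testable.

```python
import random
from typing import Callable, Optional


def pick_version(
    active_version: str,
    canary_version: Optional[str] = None,
    canary_weight: float = 0.0,
    rng: Callable[[], float] = random.random,
) -> str:
    """Route to the canary when a uniform sample in [0, 1) falls below canary_weight."""
    if canary_version is not None and rng() < canary_weight:
        return canary_version
    return active_version
```

Because each request is sampled independently, the same caller can see both versions during a canary. Sticky routing, for instance by hashing caller_id, would be a different design choice than the one described here.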

Result Caching

Before dispatching to a GPU worker, the service computes a SHA-256 hash of the image bytes and checks the Redis result cache keyed by (image_hash, model_id, model_version). Because the hash covers the raw bytes, only byte-identical submissions hit the cache; a resized or re-encoded copy of the same picture is a miss. Cache hits bypass the GPU queue entirely and return results within 5 ms. On a cache miss, results are written back after inference. Cache entries are compressed with LZ4 (typically around a 4:1 compression ratio on JSON label arrays) to fit more entries in Redis memory. The cache is shared across all callers, so popular stock images or product photos benefit every later caller once the first request has warmed the entry.
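
The read-through flow can be sketched as follows. Several things here are stand-ins so the example runs with the standard library alone: an in-memory dict replaces Redis, zlib replaces LZ4, the 7-day TTL is not modeled, and the result: key prefix is an assumed naming scheme.

```python
import hashlib
import json
import zlib
from typing import Any, Callable, Dict, Optional, Tuple


def cache_key(image_bytes: bytes, model_id: str, model_version: str) -> str:
    """Build a key from the SHA-256 of the raw image bytes plus model identity."""
    image_hash = hashlib.sha256(image_bytes).hexdigest()
    return f"result:{image_hash}:{model_id}:{model_version}"  # prefix is assumed


class ResultCache:
    """In-memory stand-in for the Redis result cache (no TTL handling)."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[Any]:
        blob = self._store.get(key)
        if blob is None:
            return None
        return json.loads(zlib.decompress(blob))  # zlib used in place of LZ4

    def put(self, key: str, result: Any) -> None:
        raw = json.dumps(result, separators=(",", ":")).encode("utf-8")
        self._store[key] = zlib.compress(raw)


def recognize(
    image_bytes: bytes,
    model_id: str,
    model_version: str,
    cache: ResultCache,
    run_inference: Callable[[bytes], Any],
) -> Tuple[Any, bool]:
    """Return (result, cache_hit); run inference only on a miss."""
    key = cache_key(image_bytes, model_id, model_version)
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    result = run_inference(image_bytes)
    cache.put(key, result)
    return result, False
```

Including model_version in the key means a promotion to a new version starts with a cold cache for that version, while entries for the old version simply age out.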

Scalability

GPU workers run as a Kubernetes Deployment whose pods request GPUs as an extended resource (for example nvidia.com/gpu, exposed by the NVIDIA device plugin). A Horizontal Pod Autoscaler scales the worker count on Kafka consumer group lag, which has to be supplied to it as an external metric: each additional 500 pending messages triggers one additional GPU pod, up to a maximum of 50 pods. GPU pods use node affinity to land on A10G or A100 nodes. CPU preprocessing (image decoding, normalization, resizing) is offloaded to a separate CPU worker pool so that GPU cycles are not spent on work that does not need a GPU.
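
One reading of the scaling rule is that the desired pod count is the lag divided by 500, rounded up and clamped to the pod limits. The sketch below works through that arithmetic; the floor of one pod (min_pods) is an assumption, since the source only gives the maximum.

```python
import math


def desired_gpu_pods(
    consumer_lag: int,
    messages_per_pod: int = 500,
    min_pods: int = 1,   # assumed floor; not stated in the design
    max_pods: int = 50,
) -> int:
    """Return the target pod count for a given Kafka consumer group lag."""
    if consumer_lag < 0:
        raise ValueError("consumer_lag must be non-negative")
    wanted = math.ceil(consumer_lag / messages_per_pod)
    return max(min_pods, min(max_pods, wanted))
```

Under this reading, a freshly submitted 10,000-image batch asks for 20 pods, and any backlog above 25,000 messages is capped at 50.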

API Design

POST /v1/recognize

  • Body: image_url or image_data (base64), model_id, confidence_threshold (optional)
  • Response: request_id, status (done if synchronous fast path used), labels array, cache_hit boolean, model_version
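
A request-body check for this endpoint might look like the sketch below. The rules it enforces are assumptions read into the field list: exactly one of image_url or image_data, a required model_id, and a confidence_threshold between 0 and 1. The function name validate_recognize_body is hypothetical.

```python
import base64
from typing import Any, Dict, List


def validate_recognize_body(body: Dict[str, Any]) -> List[str]:
    """Return a list of validation errors; an empty list means the body is acceptable."""
    errors: List[str] = []
    has_url = bool(body.get("image_url"))
    has_data = bool(body.get("image_data"))
    if has_url == has_data:
        errors.append("provide exactly one of image_url or image_data")
    if not body.get("model_id"):
        errors.append("model_id is required")
    if has_data:
        try:
            # validate=True rejects characters outside the base64 alphabet.
            base64.b64decode(body["image_data"], validate=True)
        except (ValueError, TypeError):
            errors.append("image_data must be valid base64")
    threshold = body.get("confidence_threshold")
    if threshold is not None:
        is_number = isinstance(threshold, (int, float)) and not isinstance(threshold, bool)
        if not is_number or not 0.0 <= threshold <= 1.0:
            errors.append("confidence_threshold must be a number in [0, 1]")
    return errors
```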

POST /v1/recognize/batch

  • Body: items array (each with image_url, model_id), callback_url (webhook for completion notification)
  • Response: batch_id, estimated_completion_seconds

GET /v1/recognize/batch/{batch_id}

  • Response: status, completed_count, total_count, results array (when done)
