A retry service centralizes retry policy enforcement so that individual microservices do not each reimplement backoff logic, often incorrectly. This design covers the scheduling model, backoff math, jitter strategies, idempotency enforcement, and budget management.
Requirements
Functional
- Retry failed calls with configurable backoff policy (exponential, linear, fixed).
- Add randomized jitter to prevent thundering-herd retry storms.
- Propagate idempotency keys on every retry attempt.
- Enforce a maximum attempt count and a maximum total elapsed budget.
- Support per-error-code retry eligibility (do not retry 400 Bad Request).
Non-Functional
- Scheduling overhead under 5 ms per attempt.
- Persist retry state for at-least-once delivery across process restarts.
- Emit attempt-level metrics for latency budgets and success rates.
Data Model
- RetryTask: taskId (UUID), idempotencyKey, targetUrl, method, headers (map), bodyRef (S3 key for large payloads), policyId, attemptNumber, nextAttemptAt (epoch ms), createdAt, status (PENDING, IN_FLIGHT, SUCCEEDED, EXHAUSTED).
- RetryPolicy: policyId, maxAttempts, initialDelayMs, multiplier, maxDelayMs, totalBudgetMs, jitterType (FULL, EQUAL, DECORRELATED), retryableStatusCodes (set of ints).
- AttemptLog: taskId, attemptNumber, startedAt, durationMs, responseStatus, errorMessage.
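As a rough illustration, the three records could be modeled as Python dataclasses. The field names mirror the data model above in snake_case; the default values (attempt counter starting at 0, status starting at PENDING, auto-generated UUID) are assumptions made for this sketch, not part of the design.

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, FrozenSet, Optional


class TaskStatus(Enum):
    PENDING = "PENDING"
    IN_FLIGHT = "IN_FLIGHT"
    SUCCEEDED = "SUCCEEDED"
    EXHAUSTED = "EXHAUSTED"


class JitterType(Enum):
    FULL = "FULL"
    EQUAL = "EQUAL"
    DECORRELATED = "DECORRELATED"


@dataclass(frozen=True)
class RetryPolicy:
    policy_id: str
    max_attempts: int
    initial_delay_ms: int
    multiplier: float
    max_delay_ms: int
    total_budget_ms: int
    jitter_type: JitterType
    retryable_status_codes: FrozenSet[int]


@dataclass
class RetryTask:
    idempotency_key: str
    target_url: str
    method: str
    policy_id: str
    headers: Dict[str, str] = field(default_factory=dict)
    body_ref: Optional[str] = None  # object-storage key for large payloads
    attempt_number: int = 0         # assumed to start at 0 before the first attempt
    next_attempt_at: int = 0        # epoch ms
    created_at: int = field(default_factory=lambda: int(time.time() * 1000))
    status: TaskStatus = TaskStatus.PENDING
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass(frozen=True)
class AttemptLog:
    task_id: str
    attempt_number: int
    started_at: int                 # epoch ms
    duration_ms: int
    response_status: Optional[int]  # None when no response was received
    error_message: Optional[str] = None
```

Making RetryPolicy immutable is a design choice of this sketch: a policy shared by many tasks should not change underneath them while they are in flight.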
Core Algorithms
Exponential Backoff with Full Jitter
The canonical formula for the delay before attempt n is: delay = random(0, min(maxDelayMs, initialDelayMs * multiplier^(n-1))). Full jitter means the actual delay is a uniform random value between zero and the computed cap. This spreads retries evenly across the window, eliminating coordinated spikes when many callers fail simultaneously against the same dependency.
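The formula translates almost directly into code. The sketch below is one possible reading of it: the function names and the optional `rng` parameter (added so the randomness can be seeded in tests) are assumptions of this example.

```python
import random


def backoff_cap_ms(attempt, initial_delay_ms, multiplier, max_delay_ms):
    """Upper bound of the delay window before attempt n (1-based)."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    try:
        raw = initial_delay_ms * multiplier ** (attempt - 1)
    except OverflowError:
        # Float exponentiation can overflow for very large attempt numbers;
        # the cap applies in that case anyway.
        raw = max_delay_ms
    return min(max_delay_ms, raw)


def full_jitter_delay_ms(attempt, initial_delay_ms, multiplier, max_delay_ms, rng=random):
    """Uniform random delay between zero and the exponential cap."""
    cap = backoff_cap_ms(attempt, initial_delay_ms, multiplier, max_delay_ms)
    return rng.uniform(0, cap)
```

With initialDelayMs = 100, multiplier = 2, and maxDelayMs = 10000, the cap for attempts 1 through 8 is 100, 200, 400, 800, 1600, 3200, 6400, and 10000 ms; the uncapped value for attempt 8 would be 12800 ms.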
Decorrelated Jitter
Decorrelated jitter derives each delay from the previous delay rather than from the attempt number: delay = random(initialDelayMs, min(maxDelayMs, prevDelay * 3)). It converges to higher delays than full jitter but produces smoother retry distributions under sustained failure.
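A minimal sketch of this formula follows. Two details are assumptions of the example rather than part of the design: the first call is seeded with prevDelay = initialDelayMs, and the upper bound is clamped so it never falls below initialDelayMs.

```python
import random


def decorrelated_jitter_delay_ms(prev_delay_ms, initial_delay_ms, max_delay_ms, rng=random):
    """Next delay drawn from [initialDelayMs, min(maxDelayMs, prevDelay * 3)]."""
    upper = min(max_delay_ms, prev_delay_ms * 3)
    # Guard (assumption): keep the range valid if the cap is below the floor.
    upper = max(upper, initial_delay_ms)
    return rng.uniform(initial_delay_ms, upper)


def delay_sequence(attempts, initial_delay_ms, max_delay_ms, rng=random):
    """Generate successive delays, feeding each result into the next draw."""
    prev = initial_delay_ms  # assumed seed for the first attempt
    delays = []
    for _ in range(attempts):
        prev = decorrelated_jitter_delay_ms(prev, initial_delay_ms, max_delay_ms, rng)
        delays.append(prev)
    return delays
```

Because each draw depends only on the previous delay, the worker needs to persist the last delay with the task rather than recompute it from the attempt number.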
Budget Enforcement
Track elapsedMs = now - task.createdAt before scheduling each attempt. If elapsedMs + estimatedCallMs > totalBudgetMs, mark the task EXHAUSTED immediately. This prevents retries from outliving the upstream caller timeout, which would make successful retries useless.
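The check can be expressed as a small pure function. The optional `next_delay_ms` parameter is an addition of this sketch, not part of the rule above: it lets the scheduler also count the backoff wait that precedes the attempt, and it defaults to 0 so the function matches the stated condition exactly.

```python
def budget_exceeded(now_ms, created_at_ms, estimated_call_ms, total_budget_ms, next_delay_ms=0):
    """True if the next attempt could not finish inside totalBudgetMs."""
    elapsed_ms = now_ms - created_at_ms
    return elapsed_ms + next_delay_ms + estimated_call_ms > total_budget_ms
```

For example, with a 10500 ms budget, a task created 10000 ms ago, and an estimated call time of 500 ms, the attempt is still allowed; with an estimate of 501 ms the task would be marked EXHAUSTED.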
Idempotency Key Propagation
The idempotency key is set once at task creation — typically the original request ID or a caller-supplied token. On every attempt, the key is injected as the Idempotency-Key HTTP header (or placed in the gRPC metadata). The downstream service must use this key to deduplicate processing. The retry service never generates a new key for a retried task.
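One way to enforce this on the HTTP path is to build the outgoing headers from the stored task on every attempt. The helper below is hypothetical; it assumes the task's stored key is authoritative and therefore replaces any Idempotency-Key header the caller may have stored, regardless of case.

```python
def build_attempt_headers(task_headers, idempotency_key):
    """Return a new header map carrying the task's original idempotency key."""
    # Drop any stored variant of the header (HTTP header names are case-insensitive).
    headers = {k: v for k, v in task_headers.items() if k.lower() != "idempotency-key"}
    headers["Idempotency-Key"] = idempotency_key
    return headers
```

Since the function only reads the key from the task and never generates one, every attempt for the same task sends an identical Idempotency-Key value.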
API Design
- POST /retry-tasks: enqueue a new retry task; body includes target, policy ID, idempotency key, payload reference. Returns taskId.
- GET /retry-tasks/{taskId}: current status, attempt count, next scheduled time.
- DELETE /retry-tasks/{taskId}: cancel a pending task before it exhausts.
- POST /retry-policies: register a named policy; idempotent by policy ID.
- GET /retry-tasks/{taskId}/attempts: paginated attempt log with per-attempt latency and status.
Scheduler Architecture
Use a priority queue (min-heap on nextAttemptAt) backed by a durable store. A pool of worker threads polls the heap for tasks due within the next second. Each worker claims a task with an optimistic lock (UPDATE ... WHERE status=PENDING AND version=N), executes the HTTP call with a short connect timeout, writes the attempt log, then either marks the task SUCCEEDED, re-enqueues it with the next backoff delay, or marks it EXHAUSTED.
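The optimistic claim can be sketched with SQLite standing in for the durable store. The table layout, including the `version` column, is an assumption of this example; only the shape of the UPDATE statement comes from the description above.

```python
import sqlite3

# Hypothetical minimal schema; a real store would carry the full RetryTask record.
SCHEMA = """
CREATE TABLE IF NOT EXISTS retry_tasks (
    task_id TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    version INTEGER NOT NULL DEFAULT 0,
    next_attempt_at INTEGER NOT NULL
)
"""


def claim_task(conn, task_id, expected_version):
    """Try to move a task from PENDING to IN_FLIGHT; True if this worker won."""
    cur = conn.execute(
        "UPDATE retry_tasks SET status = 'IN_FLIGHT', version = version + 1 "
        "WHERE task_id = ? AND status = 'PENDING' AND version = ?",
        (task_id, expected_version),
    )
    conn.commit()
    # Exactly one row changes only if no other worker claimed the task first.
    return cur.rowcount == 1
```

A worker that gets False back simply skips the task: another worker has already claimed it or its version has moved on.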
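For high volume, replace the in-process heap with a Redis sorted set scored by nextAttemptAt. Workers claim tasks atomically with a Lua script that checks the lowest score against the current time before removing the member; a bare ZPOPMIN would also pop tasks that are not yet due. Partition by policyId or tenant to scale horizontally.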
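Partitioning needs a routing function that every producer and worker computes identically. The sketch below uses a SHA-256 digest because Python's built-in hash() for strings varies between processes; the `retry:due:` key prefix is a made-up naming convention for this example.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a policyId or tenant ID to a stable partition index."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def due_set_key(key, num_partitions):
    """Name of the sorted set holding due tasks for this key's partition."""
    return f"retry:due:{partition_for(key, num_partitions)}"
```

Note that simple modulo routing remaps most keys when the partition count changes, so a fixed partition count or a consistent-hashing scheme would be worth considering for a production deployment.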
Scalability and Failure Handling
- If a worker crashes mid-attempt, its claim on the task expires after visibilityTimeoutMs and another worker reclaims the task, incrementing attemptNumber.
- Store large request bodies in object storage and keep only the reference in the task record to bound database row size.
- Emit the retry_attempts_total{outcome, policy} counter and the retry_task_age_seconds histogram to detect policy misconfiguration early.
- Dead-letter EXHAUSTED tasks to a separate queue for manual inspection or escalation.
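The crash-recovery rule in the first bullet can be sketched as a check run by a periodic sweeper. The `claimedAt` field is hypothetical (the data model above does not define it), and returning the task to PENDING with nextAttemptAt set to the current time is one possible reclaim behavior, assumed here for illustration.

```python
def reclaim_if_expired(task, now_ms, visibility_timeout_ms):
    """Return a stuck IN_FLIGHT task to PENDING once its claim has timed out."""
    if task["status"] != "IN_FLIGHT":
        return False
    if now_ms - task["claimedAt"] < visibility_timeout_ms:
        return False  # the original worker may still be running
    task["status"] = "PENDING"
    task["attemptNumber"] += 1       # the interrupted attempt counts as used
    task["nextAttemptAt"] = now_ms   # assumption: eligible again immediately
    return True
```

Counting the interrupted attempt keeps a task that repeatedly crashes its worker from retrying forever, and the idempotency key protects the downstream service if the crashed attempt actually went through.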