Why Retry Policies Matter
Transient failures — network blips, momentary overload, upstream restarts — are normal in distributed systems. A well-designed retry policy recovers from them transparently. A poorly designed one amplifies failures: simultaneous retries from thousands of clients overwhelm a recovering service (thundering herd) or retry non-recoverable errors endlessly.
Retryable vs. Non-Retryable Errors
The first rule: only retry errors that may succeed on a subsequent attempt.
Retryable
- 5xx server errors: 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout — server-side transient failures.
- Network timeouts: Connection timeout, read timeout — the server may have processed the request or may not have.
- 429 Too Many Requests: Rate limit exceeded — retry after the Retry-After header value.
Non-Retryable
- 4xx client errors: 400 Bad Request (malformed input), 401 Unauthorized (bad credentials), 403 Forbidden (insufficient permissions), 404 Not Found, 409 Conflict (duplicate resource). Retrying these wastes resources and never succeeds without fixing the request.
- Business logic errors: Insufficient funds, invalid card — retrying does not change the outcome.
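The classification above can be encoded as a simple predicate. A minimal sketch in Python — the set names and function name are illustrative, and unknown status codes default to non-retryable to avoid amplifying failures:

```python
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}      # transient: worth retrying
NON_RETRYABLE_STATUSES = {400, 401, 403, 404, 409}  # permanent: fix the request instead

def is_retryable(status_code: int) -> bool:
    """True only for errors that may succeed on a later attempt."""
    return status_code in RETRYABLE_STATUSES
```

Keeping the decision in one predicate (rather than parsing error-message strings at each call site) makes the policy auditable and easy to change.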
Exponential Backoff
Wait longer between each successive retry to give the downstream service time to recover:
delay = min(base * 2^(attempt - 1), max_delay)   -- attempt >= 1; the initial attempt is immediate
-- Example with base=1s, max=32s:
Attempt 0: immediate
Attempt 1: wait 1s
Attempt 2: wait 2s
Attempt 3: wait 4s
Attempt 4: wait 8s
Attempt 5: wait 16s
Attempt 6: wait 32s (reaches the cap; later attempts also wait 32s)
Without the cap, delays grow unboundedly. Cap at a value that fits your SLA (e.g., 30-60s for interactive flows, longer for background jobs).
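A sketch of the schedule above (attempt 0 is the immediate initial request; the function name and defaults are illustrative):

```python
def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 32.0) -> float:
    """Capped exponential delay before the given attempt number."""
    if attempt <= 0:
        return 0.0                               # initial request goes out immediately
    return min(base * 2 ** (attempt - 1), max_delay)
```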
Jitter: Preventing Thundering Herd
Without jitter, all clients that failed at time T retry at T+1s, T+2s, T+4s simultaneously — a synchronized wave of retries that can overwhelm a recovering service. Jitter randomizes the delay.
Full Jitter
delay = random(0, min(base * 2^attempt, max_delay))
Simple and effective. Spreads retries uniformly across the window. Recommended default.
Decorrelated Jitter
delay = min(max_delay, random(base, prev_delay * 3))
Produces a wider spread with higher average delay. Useful when full jitter produces too many short delays under high concurrency.
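Both strategies fit in a few lines of Python. This is a sketch; the function names are illustrative, and `full_jitter` takes the attempt index used in the full-jitter formula above:

```python
import random

def full_jitter(attempt: int, base: float = 1.0, cap: float = 32.0) -> float:
    """Full jitter: uniform over [0, capped exponential delay]."""
    return random.uniform(0.0, min(base * 2 ** attempt, cap))

def decorrelated_jitter(prev_delay: float, base: float = 1.0, cap: float = 32.0) -> float:
    """Decorrelated jitter: next delay drawn from [base, 3 * previous delay], capped."""
    return min(cap, random.uniform(base, prev_delay * 3))
```

Note that decorrelated jitter is stateful: each call feeds the previous delay back in, which is what produces its wider spread.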
Max Attempts and Max Delay Caps
Always set all three:
- Max attempts: Prevents infinite retry loops. Typically 3-5 for interactive requests, more for background jobs.
- Max delay: Caps individual wait time. Prevents a single request from waiting minutes between retries in interactive flows.
- Total timeout: Bounds the entire retry sequence. A request that has been retrying for 30s should stop even if max attempts are not exhausted.
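The three caps combine naturally in one retry loop. A sketch with full jitter and illustrative defaults — `RetryableError` stands in for whatever transient-error type your client raises:

```python
import random
import time

class RetryableError(Exception):
    """Stand-in for a client's transient-error type."""

def with_retries(call, max_attempts=4, base=0.5, max_delay=8.0, total_timeout=30.0):
    """Run `call`, enforcing all three caps: attempts, per-retry delay, total time."""
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise                            # max attempts exhausted
            delay = random.uniform(0, min(base * 2 ** attempt, max_delay))  # full jitter
            if time.monotonic() - start + delay > total_timeout:
                raise                            # total timeout would be exceeded
            time.sleep(delay)
```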
Deadline Propagation
Do not retry if the client's deadline has already passed. Continuing to retry after the client gave up wastes resources on work no one is waiting for:
deadline = request.start_time + request.timeout

before each retry:
    if now() >= deadline:
        raise DeadlineExceededException()
    remaining = deadline - now()
    if remaining < min_useful_time:
        raise DeadlineExceededException()
Pass the deadline downstream as a header (e.g., X-Request-Deadline or gRPC deadline) so downstream services also stop processing when the deadline passes.
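A sketch of the forwarding side in Python — the Unix-timestamp encoding and `min_useful` threshold are assumptions for illustration; X-Request-Deadline follows the header convention mentioned above:

```python
import time

def forward_with_deadline(headers: dict, deadline_unix: float,
                          min_useful: float = 0.05) -> dict:
    """Attach the absolute deadline to outgoing headers, refusing if too little time remains."""
    if deadline_unix - time.time() < min_useful:
        raise TimeoutError("deadline exceeded; not forwarding the request")
    return {**headers, "X-Request-Deadline": f"{deadline_unix:.3f}"}
```

Sending the deadline as an absolute timestamp (rather than a remaining duration) means each hop can check it without accounting for time already spent in transit, at the cost of requiring reasonably synchronized clocks.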
Idempotency Requirement for Mutation Retries
Retrying a network timeout on a POST request is dangerous: the server may have processed the request but the response was lost. Retrying causes duplicate execution. Only retry mutations when:
- The operation is idempotent (DELETE, PUT with full replacement), or
- The request includes an idempotency key that the server uses to deduplicate.
For payment retries: always include an idempotency key. For search queries (GET): retry freely — reads are naturally idempotent.
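Server-side deduplication can be sketched as a key-to-response cache. The class and field names here are illustrative, and a real implementation would persist the cache with a TTL:

```python
class IdempotentHandler:
    """Executes a charge once per idempotency key; replays the stored response on retries."""

    def __init__(self):
        self._responses = {}    # idempotency key -> first response
        self.executions = 0     # counts real side effects, for illustration

    def charge(self, key: str, amount_cents: int) -> dict:
        if key in self._responses:
            return self._responses[key]      # duplicate retry: no second charge
        self.executions += 1                 # the side effect runs exactly once
        response = {"status": "charged", "amount_cents": amount_cents}
        self._responses[key] = response
        return response
```

The client must generate the key once per logical operation and reuse it on every retry; a fresh key per attempt defeats the deduplication.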
Retry Budgets
At high request volume, even a 10% retry rate doubles effective load on a struggling service. A retry budget limits the fraction of total requests that are retries:
-- Allow at most 10% of requests to be retries
if retry_count / total_count > 0.10:
    reject_retry() -- fail immediately rather than retry
Google's SRE book recommends retry budgets for high-volume services. Implement with a token bucket: each retry consumes a token; tokens replenish at a rate proportional to the budget fraction.
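A token-bucket sketch of that budget — the class name and defaults are illustrative:

```python
class RetryBudget:
    """Token bucket: each request deposits `ratio` tokens; each retry spends one."""

    def __init__(self, ratio: float = 0.10, max_tokens: float = 100.0):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = max_tokens             # start full so early retries are allowed

    def record_request(self) -> None:
        """Call on every original (non-retry) request."""
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def can_retry(self) -> bool:
        """Spend one token if available; otherwise deny the retry."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

When error rates spike, the bucket drains and further retries fail fast, keeping total load near 1 + ratio times the original request rate.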
Circuit Breaker Integration
Retries and circuit breakers work together. When the circuit is OPEN, skip retries entirely — the circuit breaker has already determined the downstream is unavailable. Retry logic should check circuit breaker state before each attempt:
for attempt in range(max_attempts):
    if circuit_breaker.state == OPEN:
        raise CircuitOpenException()
    try:
        return call_downstream()
    except RetryableError:
        if attempt == max_attempts - 1:
            raise  -- attempts exhausted: surface the last error
        sleep(backoff_with_jitter(attempt))
Retry-After Header (429 Handling)
When a server returns 429, it may include a Retry-After header specifying when to retry:
HTTP/1.1 429 Too Many Requests
Retry-After: 15
Honor this header rather than using your own backoff schedule. Ignoring it and retrying immediately will just receive another 429.
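Retry-After may carry either delta-seconds or an HTTP-date, so honoring it requires handling both forms. A stdlib-only parsing sketch in Python:

```python
import email.utils
import time

def parse_retry_after(value: str) -> float:
    """Return seconds to wait from a Retry-After value (delta-seconds or HTTP-date)."""
    try:
        return float(value)                              # e.g. "15"
    except ValueError:
        dt = email.utils.parsedate_to_datetime(value)    # e.g. "Wed, 21 Oct 2026 07:28:00 GMT"
        return max(0.0, dt.timestamp() - time.time())
```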
Dead Letter Queue for Async Retries
For asynchronous message processing, after exhausting retries, move the message to a Dead Letter Queue (DLQ) rather than dropping it. The DLQ allows manual inspection, reprocessing after a fix is deployed, and alerts on stuck messages. Monitor DLQ depth as a health signal.
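A minimal consumer sketch — the names are illustrative, and the `dlq` list stands in for a real dead-letter queue or topic:

```python
class Consumer:
    """Retries a handler per message, then parks failures in a DLQ instead of dropping them."""

    def __init__(self, handler, max_attempts: int = 3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.dlq = []            # stand-in for a real dead-letter queue/topic

    def process(self, message) -> None:
        for _ in range(self.max_attempts):
            try:
                self.handler(message)
                return
            except Exception:
                continue         # in practice: back off with jitter between attempts
        self.dlq.append(message) # retries exhausted: keep the message for inspection
```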