Why Retry Policies Matter
Transient failures — network blips, momentary overload, upstream restarts — are normal in distributed systems. A well-designed retry policy recovers from them transparently. A poorly designed one amplifies failures: simultaneous retries from thousands of clients overwhelm a recovering service (thundering herd) or retry non-recoverable errors endlessly.
Retryable vs. Non-Retryable Errors
The first rule: only retry errors that may succeed on a subsequent attempt.
Retryable
- 5xx server errors: 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout — server-side transient failures.
- Network timeouts: Connection timeout, read timeout — the server may have processed the request or may not have.
- 429 Too Many Requests: Rate limit exceeded — retry after the Retry-After header value.
Non-Retryable
- 4xx client errors: 400 Bad Request (malformed input), 401 Unauthorized (bad credentials), 403 Forbidden (insufficient permissions), 404 Not Found, 409 Conflict (duplicate resource). Retrying these wastes resources and never succeeds without fixing the request.
- Business logic errors: Insufficient funds, invalid card — retrying does not change the outcome.
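The classification above can be encoded as a simple predicate. A minimal sketch in Python — the set names and function name are illustrative, and unknown status codes default to non-retryable to avoid amplifying failures:

```python
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}      # transient: worth retrying
NON_RETRYABLE_STATUSES = {400, 401, 403, 404, 409}  # permanent: fix the request instead

def is_retryable(status_code: int) -> bool:
    """True only for errors that may succeed on a later attempt."""
    return status_code in RETRYABLE_STATUSES
```

Keeping the decision in one predicate (rather than parsing error-message strings at each call site) makes the policy auditable and easy to change.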
Exponential Backoff
Wait longer between each successive retry to give the downstream service time to recover:
delay = min(base * 2^(attempt - 1), max_delay)   -- attempt >= 1; the initial attempt is immediate
-- Example with base=1s, max=32s:
Attempt 0: immediate
Attempt 1: wait 1s
Attempt 2: wait 2s
Attempt 3: wait 4s
Attempt 4: wait 8s
Attempt 5: wait 16s
Attempt 6: wait 32s (reaches the cap; later attempts also wait 32s)
Without the cap, delays grow unboundedly. Cap at a value that fits your SLA (e.g., 30-60s for interactive flows, longer for background jobs).
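A sketch of the schedule above (attempt 0 is the immediate initial request; the function name and defaults are illustrative):

```python
def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 32.0) -> float:
    """Capped exponential delay before the given attempt number."""
    if attempt <= 0:
        return 0.0                               # initial request goes out immediately
    return min(base * 2 ** (attempt - 1), max_delay)
```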
Jitter: Preventing Thundering Herd
Without jitter, all clients that failed at time T retry at T+1s, T+2s, T+4s simultaneously — a synchronized wave of retries that can overwhelm a recovering service. Jitter randomizes the delay.
Full Jitter
delay = random(0, min(base * 2^attempt, max_delay))
Simple and effective. Spreads retries uniformly across the window. Recommended default.
Decorrelated Jitter
delay = min(max_delay, random(base, prev_delay * 3))
Produces a wider spread with higher average delay. Useful when full jitter produces too many short delays under high concurrency.
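Both strategies fit in a few lines of Python. This is a sketch; the function names are illustrative, and `full_jitter` takes the attempt index used in the full-jitter formula above:

```python
import random

def full_jitter(attempt: int, base: float = 1.0, cap: float = 32.0) -> float:
    """Full jitter: uniform over [0, capped exponential delay]."""
    return random.uniform(0.0, min(base * 2 ** attempt, cap))

def decorrelated_jitter(prev_delay: float, base: float = 1.0, cap: float = 32.0) -> float:
    """Decorrelated jitter: next delay drawn from [base, 3 * previous delay], capped."""
    return min(cap, random.uniform(base, prev_delay * 3))
```

Note that decorrelated jitter is stateful: each call feeds the previous delay back in, which is what produces its wider spread.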
Max Attempts and Max Delay Caps
Always set all three:
- Max attempts: Prevents infinite retry loops. Typically 3-5 for interactive requests, more for background jobs.
- Max delay: Caps individual wait time. Prevents a single request from waiting minutes between retries in interactive flows.
- Total timeout: Bounds the entire retry sequence. A request that has been retrying for 30s should stop even if max attempts are not exhausted.
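The three caps combine naturally in one retry loop. A sketch with full jitter and illustrative defaults — `RetryableError` stands in for whatever transient-error type your client raises:

```python
import random
import time

class RetryableError(Exception):
    """Stand-in for a client's transient-error type."""

def with_retries(call, max_attempts=4, base=0.5, max_delay=8.0, total_timeout=30.0):
    """Run `call`, enforcing all three caps: attempts, per-retry delay, total time."""
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise                            # max attempts exhausted
            delay = random.uniform(0, min(base * 2 ** attempt, max_delay))  # full jitter
            if time.monotonic() - start + delay > total_timeout:
                raise                            # total timeout would be exceeded
            time.sleep(delay)
```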
Deadline Propagation
Do not retry if the client's deadline has already passed. Continuing to retry after the client gave up wastes resources on work no one is waiting for:
deadline = request.start_time + request.timeout

before each retry:
    if now() >= deadline:
        raise DeadlineExceededException()
    remaining = deadline - now()
    if remaining < min_useful_time:
        raise DeadlineExceededException()
Pass the deadline downstream as a header (e.g., X-Request-Deadline or gRPC deadline) so downstream services also stop processing when the deadline passes.
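A sketch of the forwarding side in Python — the Unix-timestamp encoding and `min_useful` threshold are assumptions for illustration; X-Request-Deadline follows the header convention mentioned above:

```python
import time

def forward_with_deadline(headers: dict, deadline_unix: float,
                          min_useful: float = 0.05) -> dict:
    """Attach the absolute deadline to outgoing headers, refusing if too little time remains."""
    if deadline_unix - time.time() < min_useful:
        raise TimeoutError("deadline exceeded; not forwarding the request")
    return {**headers, "X-Request-Deadline": f"{deadline_unix:.3f}"}
```

Sending the deadline as an absolute timestamp (rather than a remaining duration) means each hop can check it without accounting for time already spent in transit, at the cost of requiring reasonably synchronized clocks.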
Idempotency Requirement for Mutation Retries
Retrying a network timeout on a POST request is dangerous: the server may have processed the request but the response was lost. Retrying causes duplicate execution. Only retry mutations when:
- The operation is idempotent (DELETE, PUT with full replacement), or
- The request includes an idempotency key that the server uses to deduplicate.
For payment retries: always include an idempotency key. For search queries (GET): retry freely — reads are naturally idempotent.
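Server-side deduplication can be sketched as a key-to-response cache. The class and field names here are illustrative, and a real implementation would persist the cache with a TTL:

```python
class IdempotentHandler:
    """Executes a charge once per idempotency key; replays the stored response on retries."""

    def __init__(self):
        self._responses = {}    # idempotency key -> first response
        self.executions = 0     # counts real side effects, for illustration

    def charge(self, key: str, amount_cents: int) -> dict:
        if key in self._responses:
            return self._responses[key]      # duplicate retry: no second charge
        self.executions += 1                 # the side effect runs exactly once
        response = {"status": "charged", "amount_cents": amount_cents}
        self._responses[key] = response
        return response
```

The client must generate the key once per logical operation and reuse it on every retry; a fresh key per attempt defeats the deduplication.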
Retry Budgets
At high request volume, even a 10% retry rate doubles effective load on a struggling service. A retry budget limits the fraction of total requests that are retries:
-- Allow at most 10% of requests to be retries
if retry_count / total_count > 0.10:
    reject_retry() -- fail immediately rather than retry
Google's SRE book recommends retry budgets for high-volume services. Implement with a token bucket: each retry consumes a token; tokens replenish at a rate proportional to the budget fraction.
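A token-bucket sketch of that budget — the class name and defaults are illustrative:

```python
class RetryBudget:
    """Token bucket: each request deposits `ratio` tokens; each retry spends one."""

    def __init__(self, ratio: float = 0.10, max_tokens: float = 100.0):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = max_tokens             # start full so early retries are allowed

    def record_request(self) -> None:
        """Call on every original (non-retry) request."""
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def can_retry(self) -> bool:
        """Spend one token if available; otherwise deny the retry."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

When error rates spike, the bucket drains and further retries fail fast, keeping total load near 1 + ratio times the original request rate.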
Circuit Breaker Integration
Retries and circuit breakers work together. When the circuit is OPEN, skip retries entirely — the circuit breaker has already determined the downstream is unavailable. Retry logic should check circuit breaker state before each attempt:
for attempt in range(max_attempts):
    if circuit_breaker.state == OPEN:
        raise CircuitOpenException()
    try:
        return call_downstream()
    except RetryableError:
        if attempt == max_attempts - 1:
            raise  -- attempts exhausted: surface the last error
        sleep(backoff_with_jitter(attempt))
Retry-After Header (429 Handling)
When a server returns 429, it may include a Retry-After header specifying when to retry:
HTTP/1.1 429 Too Many Requests
Retry-After: 15
Honor this header rather than using your own backoff schedule. Ignoring it and retrying immediately will just receive another 429.
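Retry-After may carry either delta-seconds or an HTTP-date, so honoring it requires handling both forms. A stdlib-only parsing sketch in Python:

```python
import email.utils
import time

def parse_retry_after(value: str) -> float:
    """Return seconds to wait from a Retry-After value (delta-seconds or HTTP-date)."""
    try:
        return float(value)                              # e.g. "15"
    except ValueError:
        dt = email.utils.parsedate_to_datetime(value)    # e.g. "Wed, 21 Oct 2026 07:28:00 GMT"
        return max(0.0, dt.timestamp() - time.time())
```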
Dead Letter Queue for Async Retries
For asynchronous message processing, after exhausting retries, move the message to a Dead Letter Queue (DLQ) rather than dropping it. The DLQ allows manual inspection, reprocessing after a fix is deployed, and alerts on stuck messages. Monitor DLQ depth as a health signal.
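A minimal consumer sketch — the names are illustrative, and the `dlq` list stands in for a real dead-letter queue or topic:

```python
class Consumer:
    """Retries a handler per message, then parks failures in a DLQ instead of dropping them."""

    def __init__(self, handler, max_attempts: int = 3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.dlq = []            # stand-in for a real dead-letter queue/topic

    def process(self, message) -> None:
        for _ in range(self.max_attempts):
            try:
                self.handler(message)
                return
            except Exception:
                continue         # in practice: back off with jitter between attempts
        self.dlq.append(message) # retries exhausted: keep the message for inspection
```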