A webhook delivery system must guarantee that every event reaches its destination endpoint even when the endpoint is temporarily unavailable. This requires a persistent delivery queue, a principled retry schedule with exponential backoff and jitter, a dead letter queue for permanently failed deliveries, and a circuit breaker to stop hammering broken endpoints.
Retry Schedule
Retries follow an exponential backoff schedule. A typical configuration:
| Attempt | Delay |
|---|---|
| 1 (immediate) | 0s |
| 2 | 30s |
| 3 | 5m |
| 4 | 30m |
| 5 | 2h |
| 6 | 8h |
| 7 | 24h |
After the 7th attempt fails, the delivery moves to the dead letter queue. The schedule is stored in the per-webhook configuration so teams can customize it.
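The table above reduces to a simple lookup. A minimal sketch, assuming the per-webhook configuration stores the schedule as a list of delays in seconds (`DEFAULT_SCHEDULE` and `delay_for_attempt` are illustrative names):

```python
# Delay (in seconds) before each attempt, mirroring the table above.
DEFAULT_SCHEDULE = [0, 30, 300, 1800, 7200, 28800, 86400]

def delay_for_attempt(attempt: int, schedule=DEFAULT_SCHEDULE):
    """Return the delay before the Nth attempt (1-based), or None when
    the schedule is exhausted and the delivery belongs in the DLQ."""
    if attempt < 1 or attempt > len(schedule):
        return None
    return schedule[attempt - 1]
```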
Jitter
Without jitter, all deliveries that failed at roughly the same time (e.g., during an endpoint outage) will schedule their retries simultaneously, creating a retry storm. Full jitter spreads retries uniformly across the backoff window:
```python
import random

def compute_next_retry(attempt: int, base_seconds: int = 30) -> float:
    """Full jitter: uniform random in [0, backoff_seconds]."""
    backoff = base_seconds * (2 ** (attempt - 1))
    return random.uniform(0, backoff)
```
Full jitter is preferred over equal jitter (random in [backoff/2, backoff]) because it produces lower average retry delays and better spreading. The tradeoff is that some retries arrive very quickly, but since they are spread across many deliveries, the aggregate load remains bounded.
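The two strategies differ only in the sampling window. A side-by-side sketch (function names are ours):

```python
import random

def full_jitter(backoff: float) -> float:
    # Sample anywhere in [0, backoff]: lower average delay, best spreading.
    return random.uniform(0, backoff)

def equal_jitter(backoff: float) -> float:
    # Keep a floor of backoff/2: guarantees a minimum wait between retries.
    return backoff / 2 + random.uniform(0, backoff / 2)
```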
Delivery Worker
Workers use `FOR UPDATE SKIP LOCKED` to claim due deliveries concurrently without blocking one another:

```sql
SELECT id, endpoint_id, event_id, payload, attempts
FROM webhook_delivery
WHERE next_retry_at <= NOW()
  AND attempts < max_attempts
  AND status = 'pending'
ORDER BY next_retry_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
```
After claiming a row, the worker performs the HTTP POST. On a 2xx response the delivery is marked completed; on a 4xx (client error — retrying will not help) it is marked permanently_failed and moved to the DLQ; on a 5xx or network timeout it remains pending with an updated next_retry_at.
HTTP Delivery Logic
Key delivery rules:
- Timeout: 10 seconds. Endpoints that hold connections open indefinitely would exhaust worker threads.
- Redirects: follow up to 3 redirects, but record the final URL for observability.
- 2xx = success: any 200-299 status code marks the delivery completed.
- 4xx = permanent failure: the endpoint rejected the payload (bad signature, unknown event type). Retrying the same payload will produce the same result. Move directly to DLQ.
- 5xx or timeout = transient failure: schedule retry with backoff.
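These rules reduce to a small classification function. A sketch, where `None` stands for a timeout or network error (which produces no status code):

```python
def classify_response(status_code):
    """Map an HTTP status code to a delivery outcome per the rules above."""
    if status_code is None:
        return "retry"               # timeout / network error: transient
    if 200 <= status_code < 300:
        return "completed"
    if 400 <= status_code < 500:
        return "permanently_failed"  # client error: retrying won't help
    return "retry"                   # 5xx (and anything else): transient
```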
Dead Letter Queue
The DLQ stores deliveries that exhausted all retry attempts. Each DLQ entry captures the full payload, failure reason, and all attempt timestamps. Monitoring alerts when DLQ depth exceeds a threshold. The DLQ is not a discard bin — it is an actionable queue that operators review and either replay or explicitly discard.
Manual Replay API
Two replay mechanisms:
- Replay from DLQ by delivery ID: re-inserts the item into webhook_delivery with attempts = 0 and next_retry_at = NOW(), effectively starting the retry schedule over.
- Replay by event ID: re-queues all deliveries associated with a specific event, useful when an event was published with a bug and needs to be resent after a fix.
Replay is idempotent: the endpoint receives the same payload and event_id. Endpoint implementations should use event_id as an idempotency key.
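Endpoint-side deduplication takes only a few lines. In this sketch an in-memory set stands in for a durable store of processed event IDs (names are hypothetical):

```python
processed_events: set = set()  # stands in for a durable dedup store

def handle_webhook(event_id: str, payload: dict) -> str:
    """Consumer-side handler using event_id as an idempotency key."""
    if event_id in processed_events:
        return "duplicate_ignored"  # replayed delivery: skip side effects
    # ... perform the actual side effects here ...
    processed_events.add(event_id)
    return "processed"
```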
Circuit Breaker
If an endpoint consistently returns 5xx, the circuit breaker opens and suppresses delivery attempts for a cooldown period. This prevents wasting retry budget on an endpoint that is clearly down:
- Closed (normal): deliveries proceed. Track failure rate per endpoint in a rolling window.
- Open: failure rate exceeded threshold (e.g., 5 consecutive failures or 50% failure rate over 10 minutes). New deliveries are skipped and scheduled after the cooldown (e.g., 5 minutes). next_probe_at is set to cooldown end time.
- Half-open: after cooldown, one probe delivery is attempted. If it succeeds, circuit closes. If it fails, circuit reopens with a longer cooldown (exponential backoff on the circuit itself).
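The three states can be sketched as a small in-memory class (thresholds are illustrative; in practice the state would live in the endpoint_circuit_breaker table, and this assumes a single worker probes at a time):

```python
class CircuitBreaker:
    """Minimal sketch of the closed/open/half-open transitions above."""

    def __init__(self, threshold: int = 5, cooldown: float = 300.0):
        self.threshold = threshold        # consecutive failures before opening
        self.base_cooldown = cooldown     # seconds; doubles on each reopen
        self.state = "closed"
        self.failures = 0
        self.reopens = 0
        self.next_probe_at = 0.0

    def allow(self, now: float) -> bool:
        """Should a delivery attempt proceed at time `now` (epoch seconds)?"""
        if self.state == "closed":
            return True
        if self.state == "open" and now >= self.next_probe_at:
            self.state = "half_open"      # cooldown expired: allow one probe
            return True
        return self.state == "half_open"

    def record(self, success: bool, now: float) -> None:
        """Feed the result of a delivery attempt back into the circuit."""
        if success:
            self.state, self.failures, self.reopens = "closed", 0, 0
            return
        self.failures += 1
        if self.state == "half_open":
            self.reopens += 1             # probe failed: back off the circuit
        if self.state == "half_open" or self.failures >= self.threshold:
            self.state = "open"
            self.next_probe_at = now + self.base_cooldown * (2 ** self.reopens)
```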
SQL Schema
```sql
CREATE TABLE webhook_delivery (
    id            BIGSERIAL PRIMARY KEY,
    endpoint_id   INT NOT NULL,
    event_id      VARCHAR(100) NOT NULL,
    payload       JSONB NOT NULL,
    attempts      INT NOT NULL DEFAULT 0,
    max_attempts  INT NOT NULL DEFAULT 7,
    status        VARCHAR(30) NOT NULL DEFAULT 'pending',
    -- pending | processing | completed | permanently_failed | dlq
    next_retry_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_error    TEXT,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX ON webhook_delivery (next_retry_at, status)
    WHERE status = 'pending';
CREATE INDEX ON webhook_delivery (endpoint_id, status);

CREATE TABLE webhook_dlq (
    id             BIGSERIAL PRIMARY KEY,
    original_id    BIGINT NOT NULL,
    endpoint_id    INT NOT NULL,
    event_id       VARCHAR(100) NOT NULL,
    payload        JSONB NOT NULL,
    attempts       INT NOT NULL,
    failure_reason TEXT,
    last_error     TEXT,
    created_at     TIMESTAMPTZ NOT NULL,
    moved_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE endpoint_circuit_breaker (
    endpoint_id   INT PRIMARY KEY,
    failure_count INT NOT NULL DEFAULT 0,
    state         VARCHAR(20) NOT NULL DEFAULT 'closed',
    -- closed | open | half_open
    opened_at     TIMESTAMPTZ,
    next_probe_at TIMESTAMPTZ,
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
Python Implementation
```python
import random
import time

import psycopg2
import psycopg2.extras
import requests

# Delay (seconds) before the Nth attempt, matching the retry schedule table.
RETRY_BASE_SECONDS = [0, 30, 300, 1800, 7200, 28800, 86400]


def compute_next_retry(attempt: int) -> float:
    """Full jitter over the configured schedule. attempt is 1-based
    (the delay applied after the Nth failure)."""
    if attempt <= 0:
        return 0.0
    backoff = RETRY_BASE_SECONDS[min(attempt, len(RETRY_BASE_SECONDS) - 1)]
    return random.uniform(0, backoff)


def attempt_delivery(conn, delivery_id: int) -> bool:
    """Perform HTTP delivery; update delivery record. Return True on success.
    Persistence helpers (get_delivery, get_endpoint, get_circuit_breaker,
    sign_payload, etc.) are assumed to exist elsewhere."""
    delivery = get_delivery(conn, delivery_id)
    endpoint = get_endpoint(conn, delivery["endpoint_id"])

    # Circuit breaker check (next_probe_at assumed to be epoch seconds here)
    cb = get_circuit_breaker(conn, delivery["endpoint_id"])
    if cb["state"] == "open":
        if cb["next_probe_at"] > time.time():
            # Circuit still open -- skip, do not consume an attempt
            reschedule_delivery(conn, delivery_id, cb["next_probe_at"])
            return False
        set_circuit_state(conn, delivery["endpoint_id"], "half_open")

    # requests.post() has no max_redirects parameter; the limit is a
    # Session attribute.
    session = requests.Session()
    session.max_redirects = 3
    try:
        resp = session.post(
            endpoint["url"],
            json=delivery["payload"],
            headers={"X-Event-ID": delivery["event_id"],
                     "X-Webhook-Signature": sign_payload(delivery["payload"])},
            timeout=10,
            allow_redirects=True,
        )
    except requests.RequestException as exc:
        handle_delivery_failure(conn, delivery, str(exc), transient=True)
        record_circuit_failure(conn, delivery["endpoint_id"])
        return False

    if 200 <= resp.status_code < 300:
        mark_completed(conn, delivery_id)
        reset_circuit_breaker(conn, delivery["endpoint_id"])
        return True
    elif 400 <= resp.status_code < 500:
        # Permanent failure -- move to DLQ immediately
        handle_permanent_failure(conn, delivery,
                                 f"HTTP {resp.status_code}: permanent error")
        return False
    else:
        # 5xx -- transient, schedule retry
        handle_delivery_failure(conn, delivery,
                                f"HTTP {resp.status_code}", transient=True)
        record_circuit_failure(conn, delivery["endpoint_id"])
        return False


def handle_delivery_failure(conn, delivery: dict, error: str, transient: bool):
    new_attempts = delivery["attempts"] + 1
    if not transient or new_attempts >= delivery["max_attempts"]:
        move_to_dlq(conn, delivery, error)
    else:
        delay = compute_next_retry(new_attempts)
        next_retry = time.time() + delay
        with conn.cursor() as cur:
            cur.execute(
                """UPDATE webhook_delivery
                   SET attempts = %s, next_retry_at = to_timestamp(%s),
                       last_error = %s, status = 'pending'
                   WHERE id = %s""",
                (new_attempts, next_retry, error, delivery["id"])
            )
        conn.commit()


def replay_dlq(conn, dlq_id: int) -> int:
    """Re-enqueue a DLQ item as a fresh delivery. Return new delivery_id."""
    dlq_item = get_dlq_item(conn, dlq_id)
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO webhook_delivery
               (endpoint_id, event_id, payload, attempts, next_retry_at)
               VALUES (%s, %s, %s, 0, NOW())
               RETURNING id""",
            (dlq_item["endpoint_id"], dlq_item["event_id"],
             psycopg2.extras.Json(dlq_item["payload"]))
        )
        new_id = cur.fetchone()[0]
    conn.commit()
    return new_id


def schedule_delivery(conn, endpoint_id: int, event_id: str,
                      payload: dict, max_attempts: int = 7) -> int:
    """Create a new pending delivery record."""
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO webhook_delivery
               (endpoint_id, event_id, payload, max_attempts, next_retry_at)
               VALUES (%s, %s, %s, %s, NOW())
               RETURNING id""",
            (endpoint_id, event_id, psycopg2.extras.Json(payload), max_attempts)
        )
        delivery_id = cur.fetchone()[0]
    conn.commit()
    return delivery_id
```
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is the difference between full jitter and equal jitter for retry backoff?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Full jitter selects a random delay uniformly between 0 and the full backoff value. Equal jitter selects between half the backoff and the full backoff. Full jitter produces lower average retry delays and better load spreading across the full window, making it preferable for webhook retries. Equal jitter guarantees a minimum wait between retries, which some systems prefer to avoid immediate clustering. AWS best practices recommend full jitter for most retry scenarios."
      }
    },
    {
      "@type": "Question",
      "name": "How should 4xx vs 5xx HTTP responses be handled differently in webhook delivery?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A 4xx response means the endpoint rejected the request for reasons that will not change: invalid signature, unknown event type, or unauthorized. Retrying the same payload will produce the same 4xx. The delivery should be moved directly to the DLQ. A 5xx response (or timeout) means the endpoint is temporarily unavailable. The payload is likely valid and will succeed when the endpoint recovers, so retry with backoff is appropriate."
      }
    },
    {
      "@type": "Question",
      "name": "How does the circuit breaker integrate with webhook retry logic?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Before each delivery attempt, the worker checks the endpoint's circuit state. If the circuit is open (endpoint is failing), the delivery is rescheduled to next_probe_at without consuming a retry attempt. When the cooldown expires, one probe delivery is attempted. Success closes the circuit. Failure reopens it with a longer cooldown (exponential backoff on the circuit itself). This prevents retry attempts from being wasted on an endpoint that is clearly down."
      }
    },
    {
      "@type": "Question",
      "name": "How does manual DLQ replay work and how is idempotency guaranteed?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Replay re-inserts the DLQ item into webhook_delivery with attempts reset to 0, starting the full retry schedule over. The original event_id is preserved. Endpoint implementations should treat event_id as an idempotency key: if the event was already successfully processed (stored in a processed_events table), the duplicate can be safely ignored. This guarantees at-least-once delivery without double-processing side effects."
      }
    }
  ]
}