A webhook delivery system must guarantee that every event reaches its destination endpoint even when the endpoint is temporarily unavailable. This requires a persistent delivery queue, a principled retry schedule with exponential backoff and jitter, a dead letter queue for permanently failed deliveries, and a circuit breaker to stop hammering broken endpoints.
Retry Schedule
Retries follow an exponential backoff schedule. A typical configuration:
| Attempt | Delay |
|---|---|
| 1 (immediate) | 0s |
| 2 | 30s |
| 3 | 5m |
| 4 | 30m |
| 5 | 2h |
| 6 | 8h |
| 7 | 24h |
After the 7th attempt fails, the delivery moves to the dead letter queue. The schedule is stored in the per-webhook configuration so teams can customize it.
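The table above reduces to a simple lookup. A minimal sketch, assuming the per-webhook configuration stores the schedule as a list of delays in seconds (`DEFAULT_SCHEDULE` and `delay_for_attempt` are illustrative names):

```python
# Delay (in seconds) before each attempt, mirroring the table above.
DEFAULT_SCHEDULE = [0, 30, 300, 1800, 7200, 28800, 86400]

def delay_for_attempt(attempt: int, schedule=DEFAULT_SCHEDULE):
    """Return the delay before the Nth attempt (1-based), or None when
    the schedule is exhausted and the delivery belongs in the DLQ."""
    if attempt < 1 or attempt > len(schedule):
        return None
    return schedule[attempt - 1]
```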
Jitter
Without jitter, all deliveries that failed at roughly the same time (e.g., during an endpoint outage) will schedule their retries simultaneously, creating a retry storm. Full jitter spreads retries uniformly across the backoff window:
```python
import random

def compute_next_retry(attempt: int, base_seconds: int = 30) -> float:
    """Full jitter: uniform random in [0, backoff_seconds]."""
    backoff = base_seconds * (2 ** (attempt - 1))
    return random.uniform(0, backoff)
```
Full jitter is preferred over equal jitter (random in [backoff/2, backoff]) because it produces lower average retry delays and better spreading. The tradeoff is that some retries arrive very quickly, but since they are spread across many deliveries, the aggregate load remains bounded.
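The two strategies differ only in the sampling window. A side-by-side sketch (function names are ours):

```python
import random

def full_jitter(backoff: float) -> float:
    # Sample anywhere in [0, backoff]: lower average delay, best spreading.
    return random.uniform(0, backoff)

def equal_jitter(backoff: float) -> float:
    # Keep a floor of backoff/2: guarantees a minimum wait between retries.
    return backoff / 2 + random.uniform(0, backoff / 2)
```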
Delivery Worker
Workers use `FOR UPDATE SKIP LOCKED` to claim due deliveries concurrently without blocking one another:

```sql
SELECT id, endpoint_id, event_id, payload, attempts
FROM webhook_delivery
WHERE next_retry_at <= NOW()
  AND attempts < max_attempts
  AND status = 'pending'
ORDER BY next_retry_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
```
After claiming a row, the worker performs the HTTP POST. On a 2xx response the delivery is marked completed; on a 4xx (client error — retrying will not help) it is marked permanently_failed and moved to the DLQ; on a 5xx or network timeout it remains pending with an updated next_retry_at.
HTTP Delivery Logic
Key delivery rules:
- Timeout: 10 seconds. Endpoints that hold connections open indefinitely would exhaust worker threads.
- Redirects: follow up to 3 redirects, but record the final URL for observability.
- 2xx = success: any 200-299 status code marks the delivery completed.
- 4xx = permanent failure: the endpoint rejected the payload (bad signature, unknown event type). Retrying the same payload will produce the same result. Move directly to DLQ.
- 5xx or timeout = transient failure: schedule retry with backoff.
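These rules reduce to a small classification function. A sketch, where `None` stands for a timeout or network error (which produces no status code):

```python
def classify_response(status_code):
    """Map an HTTP status code to a delivery outcome per the rules above."""
    if status_code is None:
        return "retry"               # timeout / network error: transient
    if 200 <= status_code < 300:
        return "completed"
    if 400 <= status_code < 500:
        return "permanently_failed"  # client error: retrying won't help
    return "retry"                   # 5xx (and anything else): transient
```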
Dead Letter Queue
The DLQ stores deliveries that exhausted all retry attempts. Each DLQ entry captures the full payload, failure reason, and all attempt timestamps. Monitoring alerts when DLQ depth exceeds a threshold. The DLQ is not a discard bin — it is an actionable queue that operators review and either replay or explicitly discard.
Manual Replay API
Two replay mechanisms:
- Replay from DLQ by delivery ID: re-inserts the item into webhook_delivery with attempts = 0 and next_retry_at = NOW(), effectively starting the retry schedule over.
- Replay by event ID: re-queues all deliveries associated with a specific event, useful when an event was published with a bug and needs to be resent after a fix.
Replay is idempotent: the endpoint receives the same payload and event_id. Endpoint implementations should use event_id as an idempotency key.
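Endpoint-side deduplication takes only a few lines. In this sketch an in-memory set stands in for a durable store of processed event IDs (names are hypothetical):

```python
processed_events: set = set()  # stands in for a durable dedup store

def handle_webhook(event_id: str, payload: dict) -> str:
    """Consumer-side handler using event_id as an idempotency key."""
    if event_id in processed_events:
        return "duplicate_ignored"  # replayed delivery: skip side effects
    # ... perform the actual side effects here ...
    processed_events.add(event_id)
    return "processed"
```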
Circuit Breaker
If an endpoint consistently returns 5xx, the circuit breaker opens and suppresses delivery attempts for a cooldown period. This prevents wasting retry budget on an endpoint that is clearly down:
- Closed (normal): deliveries proceed. Track failure rate per endpoint in a rolling window.
- Open: failure rate exceeded threshold (e.g., 5 consecutive failures or 50% failure rate over 10 minutes). New deliveries are skipped and scheduled after the cooldown (e.g., 5 minutes). next_probe_at is set to cooldown end time.
- Half-open: after cooldown, one probe delivery is attempted. If it succeeds, circuit closes. If it fails, circuit reopens with a longer cooldown (exponential backoff on the circuit itself).
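The three states can be sketched as a small in-memory class (thresholds are illustrative; in practice the state would live in the endpoint_circuit_breaker table, and this assumes a single worker probes at a time):

```python
class CircuitBreaker:
    """Minimal sketch of the closed/open/half-open transitions above."""

    def __init__(self, threshold: int = 5, cooldown: float = 300.0):
        self.threshold = threshold        # consecutive failures before opening
        self.base_cooldown = cooldown     # seconds; doubles on each reopen
        self.state = "closed"
        self.failures = 0
        self.reopens = 0
        self.next_probe_at = 0.0

    def allow(self, now: float) -> bool:
        """Should a delivery attempt proceed at time `now` (epoch seconds)?"""
        if self.state == "closed":
            return True
        if self.state == "open" and now >= self.next_probe_at:
            self.state = "half_open"      # cooldown expired: allow one probe
            return True
        return self.state == "half_open"

    def record(self, success: bool, now: float) -> None:
        """Feed the result of a delivery attempt back into the circuit."""
        if success:
            self.state, self.failures, self.reopens = "closed", 0, 0
            return
        self.failures += 1
        if self.state == "half_open":
            self.reopens += 1             # probe failed: back off the circuit
        if self.state == "half_open" or self.failures >= self.threshold:
            self.state = "open"
            self.next_probe_at = now + self.base_cooldown * (2 ** self.reopens)
```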
SQL Schema
```sql
CREATE TABLE webhook_delivery (
    id            BIGSERIAL PRIMARY KEY,
    endpoint_id   INT NOT NULL,
    event_id      VARCHAR(100) NOT NULL,
    payload       JSONB NOT NULL,
    attempts      INT NOT NULL DEFAULT 0,
    max_attempts  INT NOT NULL DEFAULT 7,
    status        VARCHAR(30) NOT NULL DEFAULT 'pending',
    -- pending | processing | completed | permanently_failed | dlq
    next_retry_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_error    TEXT,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX ON webhook_delivery (next_retry_at, status)
    WHERE status = 'pending';
CREATE INDEX ON webhook_delivery (endpoint_id, status);

CREATE TABLE webhook_dlq (
    id             BIGSERIAL PRIMARY KEY,
    original_id    BIGINT NOT NULL,
    endpoint_id    INT NOT NULL,
    event_id       VARCHAR(100) NOT NULL,
    payload        JSONB NOT NULL,
    attempts       INT NOT NULL,
    failure_reason TEXT,
    last_error     TEXT,
    created_at     TIMESTAMPTZ NOT NULL,
    moved_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE endpoint_circuit_breaker (
    endpoint_id   INT PRIMARY KEY,
    failure_count INT NOT NULL DEFAULT 0,
    state         VARCHAR(20) NOT NULL DEFAULT 'closed',
    -- closed | open | half_open
    opened_at     TIMESTAMPTZ,
    next_probe_at TIMESTAMPTZ,
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
Python Implementation
```python
import random
import time

import psycopg2
import psycopg2.extras
import requests

# Delay (seconds) before the Nth attempt, matching the retry schedule table.
RETRY_BASE_SECONDS = [0, 30, 300, 1800, 7200, 28800, 86400]


def compute_next_retry(attempt: int) -> float:
    """Full jitter over the configured schedule. attempt is 1-based
    (the delay applied after the Nth failure)."""
    if attempt <= 0:
        return 0.0
    backoff = RETRY_BASE_SECONDS[min(attempt, len(RETRY_BASE_SECONDS) - 1)]
    return random.uniform(0, backoff)


def attempt_delivery(conn, delivery_id: int) -> bool:
    """Perform HTTP delivery; update delivery record. Return True on success.
    Persistence helpers (get_delivery, get_endpoint, get_circuit_breaker,
    sign_payload, etc.) are assumed to exist elsewhere."""
    delivery = get_delivery(conn, delivery_id)
    endpoint = get_endpoint(conn, delivery["endpoint_id"])

    # Circuit breaker check (next_probe_at assumed to be epoch seconds here)
    cb = get_circuit_breaker(conn, delivery["endpoint_id"])
    if cb["state"] == "open":
        if cb["next_probe_at"] > time.time():
            # Circuit still open -- skip, do not consume an attempt
            reschedule_delivery(conn, delivery_id, cb["next_probe_at"])
            return False
        set_circuit_state(conn, delivery["endpoint_id"], "half_open")

    # requests.post() has no max_redirects parameter; the limit is a
    # Session attribute.
    session = requests.Session()
    session.max_redirects = 3
    try:
        resp = session.post(
            endpoint["url"],
            json=delivery["payload"],
            headers={"X-Event-ID": delivery["event_id"],
                     "X-Webhook-Signature": sign_payload(delivery["payload"])},
            timeout=10,
            allow_redirects=True,
        )
    except requests.RequestException as exc:
        handle_delivery_failure(conn, delivery, str(exc), transient=True)
        record_circuit_failure(conn, delivery["endpoint_id"])
        return False

    if 200 <= resp.status_code < 300:
        mark_completed(conn, delivery_id)
        reset_circuit_breaker(conn, delivery["endpoint_id"])
        return True
    elif 400 <= resp.status_code < 500:
        # Permanent failure -- move to DLQ immediately
        handle_permanent_failure(conn, delivery,
                                 f"HTTP {resp.status_code}: permanent error")
        return False
    else:
        # 5xx -- transient, schedule retry
        handle_delivery_failure(conn, delivery,
                                f"HTTP {resp.status_code}", transient=True)
        record_circuit_failure(conn, delivery["endpoint_id"])
        return False


def handle_delivery_failure(conn, delivery: dict, error: str, transient: bool):
    new_attempts = delivery["attempts"] + 1
    if not transient or new_attempts >= delivery["max_attempts"]:
        move_to_dlq(conn, delivery, error)
    else:
        delay = compute_next_retry(new_attempts)
        next_retry = time.time() + delay
        with conn.cursor() as cur:
            cur.execute(
                """UPDATE webhook_delivery
                   SET attempts = %s, next_retry_at = to_timestamp(%s),
                       last_error = %s, status = 'pending'
                   WHERE id = %s""",
                (new_attempts, next_retry, error, delivery["id"])
            )
        conn.commit()


def replay_dlq(conn, dlq_id: int) -> int:
    """Re-enqueue a DLQ item as a fresh delivery. Return new delivery_id."""
    dlq_item = get_dlq_item(conn, dlq_id)
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO webhook_delivery
               (endpoint_id, event_id, payload, attempts, next_retry_at)
               VALUES (%s, %s, %s, 0, NOW())
               RETURNING id""",
            (dlq_item["endpoint_id"], dlq_item["event_id"],
             psycopg2.extras.Json(dlq_item["payload"]))
        )
        new_id = cur.fetchone()[0]
    conn.commit()
    return new_id


def schedule_delivery(conn, endpoint_id: int, event_id: str,
                      payload: dict, max_attempts: int = 7) -> int:
    """Create a new pending delivery record."""
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO webhook_delivery
               (endpoint_id, event_id, payload, max_attempts, next_retry_at)
               VALUES (%s, %s, %s, %s, NOW())
               RETURNING id""",
            (endpoint_id, event_id, psycopg2.extras.Json(payload), max_attempts)
        )
        delivery_id = cur.fetchone()[0]
    conn.commit()
    return delivery_id
```
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is the difference between full jitter and equal jitter for retry backoff?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Full jitter selects a random delay uniformly between 0 and the full backoff value. Equal jitter selects between half the backoff and the full backoff. Full jitter produces lower average retry delays and better load spreading across the full window, making it preferable for webhook retries. Equal jitter guarantees a minimum wait between retries, which some systems prefer to avoid immediate clustering. AWS best practices recommend full jitter for most retry scenarios."
      }
    },
    {
      "@type": "Question",
      "name": "How should 4xx vs 5xx HTTP responses be handled differently in webhook delivery?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A 4xx response means the endpoint rejected the request for reasons that will not change: invalid signature, unknown event type, or unauthorized. Retrying the same payload will produce the same 4xx. The delivery should be moved directly to the DLQ. A 5xx response (or timeout) means the endpoint is temporarily unavailable. The payload is likely valid and will succeed when the endpoint recovers, so retry with backoff is appropriate."
      }
    },
    {
      "@type": "Question",
      "name": "How does the circuit breaker integrate with webhook retry logic?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Before each delivery attempt, the worker checks the endpoint's circuit state. If the circuit is open (endpoint is failing), the delivery is rescheduled to next_probe_at without consuming a retry attempt. When the cooldown expires, one probe delivery is attempted. Success closes the circuit. Failure reopens it with a longer cooldown (exponential backoff on the circuit itself). This prevents retry attempts from being wasted on an endpoint that is clearly down."
      }
    },
    {
      "@type": "Question",
      "name": "How does manual DLQ replay work and how is idempotency guaranteed?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Replay re-inserts the DLQ item into webhook_delivery with attempts reset to 0, starting the full retry schedule over. The original event_id is preserved. Endpoint implementations should treat event_id as an idempotency key: if the event was already successfully processed (stored in a processed_events table), the duplicate can be safely ignored. This guarantees at-least-once delivery without double-processing side effects."
      }
    }
  ]
}