Webhook Retry System — Low-Level Design
A webhook retry system reliably delivers HTTP callbacks to external consumer endpoints, handling transient failures through exponential backoff, dead-letter queues, and idempotency guarantees. This design comes up in interviews at companies like Stripe, Shopify, and Twilio, where webhook reliability is a core product feature.
Core Data Model
WebhookEndpoint
id BIGSERIAL PK
account_id BIGINT NOT NULL
url TEXT NOT NULL
secret TEXT NOT NULL -- for HMAC signature
enabled BOOLEAN DEFAULT true
created_at TIMESTAMPTZ
WebhookEvent
id BIGSERIAL PK
endpoint_id BIGINT FK
event_type TEXT NOT NULL -- payment.succeeded, order.created
payload JSONB NOT NULL
idempotency_key TEXT UNIQUE NOT NULL -- lets consumers deduplicate redeliveries
status TEXT DEFAULT 'pending' -- pending, delivered, failed, dead
attempt_count INT DEFAULT 0
next_attempt_at TIMESTAMPTZ
delivered_at TIMESTAMPTZ
created_at TIMESTAMPTZ
WebhookAttempt
id BIGSERIAL PK
event_id BIGINT FK
attempted_at TIMESTAMPTZ
http_status INT
response_body TEXT
duration_ms INT
error_message TEXT
Retry Schedule with Exponential Backoff
A typical retry schedule for webhook delivery:
Attempt 1: immediate
Attempt 2: 1 minute after failure
Attempt 3: 5 minutes after failure
Attempt 4: 30 minutes after failure
Attempt 5: 2 hours after failure
Attempt 6: 8 hours after failure
Attempt 7: 24 hours after failure
After attempt 7: move to dead-letter queue, status=dead
Implementation using next_attempt_at:
RETRY_DELAYS = [0, 60, 300, 1800, 7200, 28800, 86400]  # seconds

def schedule_retry(event_id, attempt_count):
    if attempt_count >= len(RETRY_DELAYS):
        # Retries exhausted: park the event in the dead-letter queue.
        db.execute("""
            UPDATE WebhookEvent
            SET status='dead'
            WHERE id=%(id)s
        """, {'id': event_id})
        return
    delay = RETRY_DELAYS[attempt_count]
    # Reset status to 'pending' so the poller can re-claim the event
    # (the worker set it to 'processing' when it claimed the row).
    # Use make_interval() rather than INTERVAL '%(delay)s seconds':
    # a placeholder inside a quoted string literal is not parameterized safely.
    db.execute("""
        UPDATE WebhookEvent
        SET status='pending',
            next_attempt_at = NOW() + make_interval(secs => %(delay)s),
            attempt_count = %(count)s
        WHERE id=%(id)s
    """, {'delay': delay, 'count': attempt_count, 'id': event_id})
Worker: Polling for Due Events
-- Claim a batch of due events atomically (skip-locked prevents double-processing)
UPDATE WebhookEvent
SET status='processing'
WHERE id IN (
    SELECT id FROM WebhookEvent
    WHERE status='pending'
      AND next_attempt_at <= NOW()
    ORDER BY next_attempt_at
    LIMIT 50
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
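This poll filters on status and orders by next_attempt_at, so it benefits from a supporting index. One plausible choice, assuming PostgreSQL and the schema above (index name is illustrative):

```sql
-- Partial index covering only rows the poller can claim;
-- stays small as delivered/dead rows accumulate over time.
CREATE INDEX idx_webhook_event_due
    ON WebhookEvent (next_attempt_at)
    WHERE status = 'pending';
```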
After claiming, the worker delivers each event and records the attempt:
import time

def deliver_event(event):
    endpoint = db.get(WebhookEndpoint, event.endpoint_id)
    signature = hmac_sign(endpoint.secret, event.payload)
    started = time.monotonic()
    resp = None
    error = None
    try:
        resp = http_post(
            url=endpoint.url,
            body=event.payload,
            headers={
                'X-Webhook-Signature': signature,
                'X-Webhook-Event': event.event_type,
                'X-Idempotency-Key': event.idempotency_key,
            },
            timeout=30,
        )
        success = 200 <= resp.status_code < 300
        if not success:
            error = f'non-2xx response: {resp.status_code}'
    except (TimeoutError, ConnectionError) as exc:
        # Network-level failures (timeouts, refused/reset connections)
        # count as failed attempts, same as non-2xx responses.
        success = False
        error = str(exc) or type(exc).__name__
    db.insert(WebhookAttempt, {
        'event_id': event.id,
        'attempted_at': now(),
        'http_status': resp.status_code if resp else None,
        'response_body': resp.text[:1000] if resp else None,
        'duration_ms': int((time.monotonic() - started) * 1000),
        'error_message': error,
    })
    if success:
        db.execute(
            "UPDATE WebhookEvent SET status='delivered', delivered_at=NOW() WHERE id=%(id)s",
            {'id': event.id},
        )
    else:
        schedule_retry(event.id, event.attempt_count + 1)
HMAC Signature Verification (Consumer Side)
import hashlib
import hmac

def verify_webhook(secret, raw_body, signature_header):
    expected = hmac.new(
        secret.encode(),
        raw_body,  # raw request bytes, before any JSON parsing
        hashlib.sha256
    ).hexdigest()
    # Constant-time comparison to prevent timing attacks
    return hmac.compare_digest(f'sha256={expected}', signature_header)
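On the sender side, `deliver_event` calls an `hmac_sign` helper that is not defined in this document. One plausible implementation matching the verifier's `sha256={hex}` header format (the helper name and signature are assumptions):

```python
import hashlib
import hmac

def hmac_sign(secret, payload_bytes):
    # Sign the exact raw bytes that go on the wire -- the same bytes the
    # consumer will receive and verify before any parsing.
    digest = hmac.new(secret.encode(), payload_bytes, hashlib.sha256).hexdigest()
    return f'sha256={digest}'
```

The consumer's `verify_webhook` recomputes the same digest over the received body and compares in constant time.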
The consumer must verify the signature before processing the payload. Use constant-time comparison to prevent timing-attack extraction of the secret.
Idempotency on the Consumer Side
Webhooks are delivered at-least-once — the consumer must be idempotent. Use the X-Idempotency-Key header:
def handle_payment_succeeded(idempotency_key, payload):
    # Check if already processed
    if ProcessedEvent.exists(idempotency_key=idempotency_key):
        return  # Already handled, safe to ignore
    # Process the event
    order_id = payload['order_id']
    mark_order_paid(order_id)
    # Record as processed AFTER successful handling
    ProcessedEvent.create(
        idempotency_key=idempotency_key,
        processed_at=now()
    )
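Note that a separate exists-then-create check has a small race if the same event is delivered twice concurrently: both handlers can pass the exists check. A common fix is to rely on a unique constraint and insert the key first, inside the same transaction as the processing. A sketch using sqlite3 as a stand-in for the real database (table and helper names are hypothetical):

```python
import sqlite3

# sqlite3 stands in for the production database here.
db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE processed_event (idempotency_key TEXT PRIMARY KEY)")

def handle_once(idempotency_key, process):
    """Run `process` at most once per key. Claiming the key first via the
    unique constraint closes the race in exists-then-create."""
    try:
        db.execute(
            "INSERT INTO processed_event (idempotency_key) VALUES (?)",
            (idempotency_key,),
        )
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: key already claimed
    process()
    db.commit()  # processing succeeded: make the claim durable
    return True
```

If `process()` raises, the uncommitted insert should be rolled back (do so explicitly in production) so the retry can claim the key again.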
Dead-Letter Queue and Manual Replay
Events in status=dead need operator attention. Provide a replay API:
POST /webhooks/events/{event_id}/replay
-- Resets status to 'pending', attempt_count to 0, next_attempt_at to NOW()
GET /webhooks/endpoints/{endpoint_id}/events?status=dead
-- Lists all dead events for investigation
GET /webhooks/events/{event_id}/attempts
-- Full delivery attempt log for debugging
Scaling the Delivery Workers
- Horizontal scaling: FOR UPDATE SKIP LOCKED ensures multiple workers don’t double-deliver the same event. Add workers freely.
- Per-endpoint rate limiting: Limit delivery concurrency per endpoint (max 5 in-flight) to avoid overwhelming consumer servers.
- Timeout per delivery: Cap HTTP requests at 30 seconds. Slow consumers should not block workers.
- Separate queues by priority: New events go to a fast queue; retry events go to a separate queue so fresh deliveries aren’t delayed by a backlog of retries.
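The per-endpoint concurrency cap above can be sketched with one semaphore per endpoint. A minimal non-blocking version (the limit of 5 comes from the bullet above; all names are assumptions):

```python
import threading
from collections import defaultdict

MAX_IN_FLIGHT_PER_ENDPOINT = 5

# One semaphore per endpoint, created lazily on first use.
_endpoint_slots = defaultdict(
    lambda: threading.BoundedSemaphore(MAX_IN_FLIGHT_PER_ENDPOINT))
_slots_lock = threading.Lock()

def try_acquire_slot(endpoint_id):
    """Try to claim one of the endpoint's delivery slots without blocking.
    Returns True if claimed; the caller must call release_slot() when done."""
    with _slots_lock:  # guard lazy creation of the semaphore
        sem = _endpoint_slots[endpoint_id]
    return sem.acquire(blocking=False)

def release_slot(endpoint_id):
    _endpoint_slots[endpoint_id].release()
```

A worker that fails to get a slot can requeue the event rather than blocking, so one slow endpoint never ties up the worker pool.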
Key Interview Points
- At-least-once vs exactly-once: Webhooks are at-least-once. Exactly-once requires distributed consensus — too expensive. Design consumers to be idempotent instead.
- FOR UPDATE SKIP LOCKED: The correct pattern for multi-worker job queues in PostgreSQL. Prevents double-processing without a separate locking system.
- HMAC over payload: Always sign the raw body bytes, not a parsed JSON representation. JSON serialization is not deterministic across implementations.
- Consumer timeout: A consumer that hangs for 30s is treated as failed. Consumers should respond 200 immediately and process asynchronously if needed.
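The raw-bytes point can be demonstrated directly: parsing and re-serializing a payload generally does not reproduce the original bytes (whitespace, key order, unicode escaping, and number formatting can all differ), so an HMAC computed over the re-serialized form fails verification. The secret and payload below are illustrative:

```python
import hashlib
import hmac
import json

secret = b'whsec_test'  # hypothetical secret
raw_body = b'{"amount": 100,  "currency": "usd"}'  # note the double space

sig_raw = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()

# Parse-then-reserialize: json.dumps normalizes whitespace, so the
# bytes no longer match what was on the wire.
reserialized = json.dumps(json.loads(raw_body)).encode()
sig_reserialized = hmac.new(secret, reserialized, hashlib.sha256).hexdigest()

assert reserialized != raw_body
assert sig_raw != sig_reserialized  # verification over re-serialized bytes fails
```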