Webhook Retry System — Low-Level Design
A webhook retry system reliably delivers HTTP callbacks to external consumer endpoints, handling transient failures through exponential backoff, dead-letter queues, and idempotency guarantees. This design comes up in interviews at companies like Stripe, Shopify, and Twilio, where webhook reliability is a core product feature.
Core Data Model
WebhookEndpoint
id BIGSERIAL PK
account_id BIGINT NOT NULL
url TEXT NOT NULL
secret TEXT NOT NULL -- for HMAC signature
enabled BOOLEAN DEFAULT true
created_at TIMESTAMPTZ
WebhookEvent
id BIGSERIAL PK
endpoint_id BIGINT FK
event_type TEXT NOT NULL -- payment.succeeded, order.created
payload JSONB NOT NULL
idempotency_key TEXT UNIQUE NOT NULL -- lets consumers deduplicate redeliveries
status TEXT DEFAULT 'pending' -- pending, delivered, failed, dead
attempt_count INT DEFAULT 0
next_attempt_at TIMESTAMPTZ
delivered_at TIMESTAMPTZ
created_at TIMESTAMPTZ
WebhookAttempt
id BIGSERIAL PK
event_id BIGINT FK
attempted_at TIMESTAMPTZ
http_status INT
response_body TEXT
duration_ms INT
error_message TEXT
Retry Schedule with Exponential Backoff
A typical retry schedule for webhook delivery:
Attempt 1: immediate
Attempt 2: 1 minute after failure
Attempt 3: 5 minutes after failure
Attempt 4: 30 minutes after failure
Attempt 5: 2 hours after failure
Attempt 6: 8 hours after failure
Attempt 7: 24 hours after failure
After attempt 7: move to dead-letter queue, status=dead
Implementation using next_attempt_at:
RETRY_DELAYS = [0, 60, 300, 1800, 7200, 28800, 86400]  # seconds

def schedule_retry(event_id, attempt_count):
    if attempt_count >= len(RETRY_DELAYS):
        # Retries exhausted: park the event in the dead-letter queue.
        db.execute("""
            UPDATE WebhookEvent
            SET status='dead'
            WHERE id=%(id)s
        """, {'id': event_id})
        return
    delay = RETRY_DELAYS[attempt_count]
    # Reset status to 'pending' so the poller can re-claim the event
    # (the worker set it to 'processing' when it claimed the row).
    # Use make_interval() rather than INTERVAL '%(delay)s seconds':
    # a placeholder inside a quoted string literal is not parameterized safely.
    db.execute("""
        UPDATE WebhookEvent
        SET status='pending',
            next_attempt_at = NOW() + make_interval(secs => %(delay)s),
            attempt_count = %(count)s
        WHERE id=%(id)s
    """, {'delay': delay, 'count': attempt_count, 'id': event_id})
Worker: Polling for Due Events
-- Claim a batch of due events atomically (skip-locked prevents double-processing)
UPDATE WebhookEvent
SET status='processing'
WHERE id IN (
    SELECT id FROM WebhookEvent
    WHERE status='pending'
      AND next_attempt_at <= NOW()
    ORDER BY next_attempt_at
    LIMIT 50
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
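This poll filters on status and orders by next_attempt_at, so it benefits from a supporting index. One plausible choice, assuming PostgreSQL and the schema above (index name is illustrative):

```sql
-- Partial index covering only rows the poller can claim;
-- stays small as delivered/dead rows accumulate over time.
CREATE INDEX idx_webhook_event_due
    ON WebhookEvent (next_attempt_at)
    WHERE status = 'pending';
```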
After claiming, the worker delivers each event and records the attempt:
import time

def deliver_event(event):
    endpoint = db.get(WebhookEndpoint, event.endpoint_id)
    signature = hmac_sign(endpoint.secret, event.payload)
    started = time.monotonic()
    resp = None
    error = None
    try:
        resp = http_post(
            url=endpoint.url,
            body=event.payload,
            headers={
                'X-Webhook-Signature': signature,
                'X-Webhook-Event': event.event_type,
                'X-Idempotency-Key': event.idempotency_key,
            },
            timeout=30,
        )
        success = 200 <= resp.status_code < 300
        if not success:
            error = f'non-2xx response: {resp.status_code}'
    except (TimeoutError, ConnectionError) as exc:
        # Network-level failures (timeouts, refused/reset connections)
        # count as failed attempts, same as non-2xx responses.
        success = False
        error = str(exc) or type(exc).__name__
    db.insert(WebhookAttempt, {
        'event_id': event.id,
        'attempted_at': now(),
        'http_status': resp.status_code if resp else None,
        'response_body': resp.text[:1000] if resp else None,
        'duration_ms': int((time.monotonic() - started) * 1000),
        'error_message': error,
    })
    if success:
        db.execute(
            "UPDATE WebhookEvent SET status='delivered', delivered_at=NOW() WHERE id=%(id)s",
            {'id': event.id},
        )
    else:
        schedule_retry(event.id, event.attempt_count + 1)
HMAC Signature Verification (Consumer Side)
import hashlib
import hmac

def verify_webhook(secret, raw_body, signature_header):
    expected = hmac.new(
        secret.encode(),
        raw_body,  # raw request bytes, before any JSON parsing
        hashlib.sha256
    ).hexdigest()
    # Constant-time comparison to prevent timing attacks
    return hmac.compare_digest(f'sha256={expected}', signature_header)
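On the sender side, `deliver_event` calls an `hmac_sign` helper that is not defined in this document. One plausible implementation matching the verifier's `sha256={hex}` header format (the helper name and signature are assumptions):

```python
import hashlib
import hmac

def hmac_sign(secret, payload_bytes):
    # Sign the exact raw bytes that go on the wire -- the same bytes the
    # consumer will receive and verify before any parsing.
    digest = hmac.new(secret.encode(), payload_bytes, hashlib.sha256).hexdigest()
    return f'sha256={digest}'
```

The consumer's `verify_webhook` recomputes the same digest over the received body and compares in constant time.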
The consumer must verify the signature before processing the payload. Use constant-time comparison to prevent timing-attack extraction of the secret.
Idempotency on the Consumer Side
Webhooks are delivered at-least-once — the consumer must be idempotent. Use the X-Idempotency-Key header:
def handle_payment_succeeded(idempotency_key, payload):
    # Check if already processed
    if ProcessedEvent.exists(idempotency_key=idempotency_key):
        return  # Already handled, safe to ignore
    # Process the event
    order_id = payload['order_id']
    mark_order_paid(order_id)
    # Record as processed AFTER successful handling
    ProcessedEvent.create(
        idempotency_key=idempotency_key,
        processed_at=now()
    )
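Note that a separate exists-then-create check has a small race if the same event is delivered twice concurrently: both handlers can pass the exists check. A common fix is to rely on a unique constraint and insert the key first, inside the same transaction as the processing. A sketch using sqlite3 as a stand-in for the real database (table and helper names are hypothetical):

```python
import sqlite3

# sqlite3 stands in for the production database here.
db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE processed_event (idempotency_key TEXT PRIMARY KEY)")

def handle_once(idempotency_key, process):
    """Run `process` at most once per key. Claiming the key first via the
    unique constraint closes the race in exists-then-create."""
    try:
        db.execute(
            "INSERT INTO processed_event (idempotency_key) VALUES (?)",
            (idempotency_key,),
        )
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: key already claimed
    process()
    db.commit()  # processing succeeded: make the claim durable
    return True
```

If `process()` raises, the uncommitted insert should be rolled back (do so explicitly in production) so the retry can claim the key again.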
Dead-Letter Queue and Manual Replay
Events in status=dead need operator attention. Provide a replay API:
POST /webhooks/events/{event_id}/replay
-- Resets status to 'pending', attempt_count to 0, next_attempt_at to NOW()
GET /webhooks/endpoints/{endpoint_id}/events?status=dead
-- Lists all dead events for investigation
GET /webhooks/events/{event_id}/attempts
-- Full delivery attempt log for debugging
Scaling the Delivery Workers
- Horizontal scaling: FOR UPDATE SKIP LOCKED ensures multiple workers don’t double-deliver the same event. Add workers freely.
- Per-endpoint rate limiting: Limit delivery concurrency per endpoint (max 5 in-flight) to avoid overwhelming consumer servers.
- Timeout per delivery: Cap HTTP requests at 30 seconds. Slow consumers should not block workers.
- Separate queues by priority: New events go to a fast queue; retry events go to a separate queue so fresh deliveries aren’t delayed by a backlog of retries.
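The per-endpoint concurrency cap above can be sketched with one semaphore per endpoint. A minimal non-blocking version (the limit of 5 comes from the bullet above; all names are assumptions):

```python
import threading
from collections import defaultdict

MAX_IN_FLIGHT_PER_ENDPOINT = 5

# One semaphore per endpoint, created lazily on first use.
_endpoint_slots = defaultdict(
    lambda: threading.BoundedSemaphore(MAX_IN_FLIGHT_PER_ENDPOINT))
_slots_lock = threading.Lock()

def try_acquire_slot(endpoint_id):
    """Try to claim one of the endpoint's delivery slots without blocking.
    Returns True if claimed; the caller must call release_slot() when done."""
    with _slots_lock:  # guard lazy creation of the semaphore
        sem = _endpoint_slots[endpoint_id]
    return sem.acquire(blocking=False)

def release_slot(endpoint_id):
    _endpoint_slots[endpoint_id].release()
```

A worker that fails to get a slot can requeue the event rather than blocking, so one slow endpoint never ties up the worker pool.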
Key Interview Points
- At-least-once vs exactly-once: Webhooks are at-least-once. Exactly-once requires distributed consensus — too expensive. Design consumers to be idempotent instead.
- FOR UPDATE SKIP LOCKED: The correct pattern for multi-worker job queues in PostgreSQL. Prevents double-processing without a separate locking system.
- HMAC over payload: Always sign the raw body bytes, not a parsed JSON representation. JSON serialization is not deterministic across implementations.
- Consumer timeout: A consumer that hangs for 30s is treated as failed. Consumers should respond 200 immediately and process asynchronously if needed.
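The raw-bytes point can be demonstrated directly: parsing and re-serializing a payload generally does not reproduce the original bytes (whitespace, key order, unicode escaping, and number formatting can all differ), so an HMAC computed over the re-serialized form fails verification. The secret and payload below are illustrative:

```python
import hashlib
import hmac
import json

secret = b'whsec_test'  # hypothetical secret
raw_body = b'{"amount": 100,  "currency": "usd"}'  # note the double space

sig_raw = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()

# Parse-then-reserialize: json.dumps normalizes whitespace, so the
# bytes no longer match what was on the wire.
reserialized = json.dumps(json.loads(raw_body)).encode()
sig_reserialized = hmac.new(secret, reserialized, hashlib.sha256).hexdigest()

assert reserialized != raw_body
assert sig_raw != sig_reserialized  # verification over re-serialized bytes fails
```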