A webhook delivery system sends HTTP callbacks to external URLs when events occur in your system. Stripe notifies subscribers when a payment succeeds, GitHub sends events for push and pull_request, and Shopify sends an event when an order is created. The design must handle delivery reliability (at-least-once), retries with backoff, endpoint failures, and security (letting recipients verify webhook authenticity). At scale (millions of events per day), webhooks require a durable asynchronous delivery pipeline.
Webhook Registration and Event Routing
Webhook configuration: users register endpoint URLs and select which event types to receive. Schema: webhook_subscriptions (id, user_id, endpoint_url, event_types JSONB, secret, status, created_at). The event_types column holds a list such as ["payment.succeeded", "payment.failed"], and the system only sends events that match it. Payload construction: when a payment.succeeded event occurs internally, the delivery system: (1) Queries subscriptions WHERE user_id = ? AND event_types @> '["payment.succeeded"]'. (2) For each matching subscription, creates a delivery record: webhook_deliveries (id, subscription_id, event_id, event_type, payload JSONB, status, next_attempt_at, attempt_count). (3) Enqueues the delivery to a task queue (e.g., Kafka, SQS, or Celery). Fanout for popular event types: if 10,000 subscriptions match an event, creating 10,000 delivery records and enqueuing 10,000 tasks inline, in the code path that produced the event, is slow. Batch fanout: a dedicated fanout worker reads the subscription list in chunks and enqueues deliveries one chunk at a time. Use a durable message queue with at-least-once semantics so that no subscription silently misses the event; ordering guarantees alone do not ensure this.
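The matching and batch fanout steps above can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the Subscription and Delivery classes, the fan_out function, the chunk size of 500, and the enqueue callback are all hypothetical names chosen for the example, the filter on status == "active" is an assumption, and the event_id field on the delivery is an assumed addition that makes later deduplication possible.

```python
import uuid
from dataclasses import dataclass, field
from itertools import islice
from typing import Callable, Iterable, Iterator


@dataclass
class Subscription:
    id: str
    user_id: str
    endpoint_url: str
    event_types: list[str]
    status: str = "active"


@dataclass
class Delivery:
    subscription_id: str
    event_id: str  # assumed field: identifies the event across retries
    event_type: str
    payload: dict
    status: str = "PENDING"
    attempt_count: int = 0
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


def matching_subscriptions(
    subs: Iterable[Subscription], user_id: str, event_type: str
) -> Iterator[Subscription]:
    """In-memory stand-in for: WHERE user_id = ? AND event_types @> '[...]'."""
    return (
        s for s in subs
        if s.user_id == user_id
        and s.status == "active"  # assumption: skip disabled subscriptions
        and event_type in s.event_types
    )


def chunked(items: Iterable, size: int) -> Iterator[list]:
    """Yield lists of at most `size` items until the input is exhausted."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def fan_out(
    subs: Iterable[Subscription],
    user_id: str,
    event_id: str,
    event_type: str,
    payload: dict,
    enqueue: Callable[[list[Delivery]], None],
    chunk_size: int = 500,
) -> int:
    """Create one delivery per matching subscription, enqueued chunk by chunk."""
    total = 0
    for chunk in chunked(matching_subscriptions(subs, user_id, event_type), chunk_size):
        deliveries = [Delivery(s.id, event_id, event_type, payload) for s in chunk]
        enqueue(deliveries)  # hypothetical hook: e.g. a batched queue publish
        total += len(deliveries)
    return total
```

In a real system the enqueue callback would also persist the delivery records, and the subscription list would be paged from the database rather than held in memory.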
Delivery and Retry Logic
Webhook delivery is inherently unreliable: the recipient's server may be down, overloaded, or returning errors. At-least-once delivery: send the webhook and retry until the recipient confirms receipt with an HTTP 2xx response. Retry schedule (escalating, roughly exponential backoff, with random jitter added to each delay so that retries for many deliveries do not all fire at the same instant): attempt 1: immediate. Attempt 2: 5 minutes. Attempt 3: 30 minutes. Attempt 4: 2 hours. Attempt 5: 8 hours. Attempt 6: 24 hours. After 6 failed attempts: mark the delivery as permanently failed and alert the user. Timeout: treat an attempt as failed if the endpoint does not respond within a fixed limit, typically between 5 and 30 seconds. Recipients with long-running webhook handlers should respond 200 immediately and process asynchronously. Delivery tracking: update the webhook_deliveries status on each attempt: PENDING → DELIVERING → DELIVERED | RETRYING | FAILED, where RETRYING returns to DELIVERING once next_attempt_at is reached and FAILED is terminal. Store the response code and response body of the last attempt for debugging. User dashboard: show delivery history per endpoint, including which events were delivered, which failed, and the error (HTTP 503, connection refused, timeout).
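The retry schedule and status transitions can be expressed as a small pure function, which keeps them easy to test. The sketch below makes several assumptions that the text leaves open: each delay is measured from the previous attempt, jitter is ±10% of the base delay, and the names next_attempt_at, status_after_attempt, and JITTER_FRACTION are illustrative.

```python
import random
from datetime import datetime, timedelta
from typing import Optional

# Delay before attempts 2..6, taken from the schedule described above.
RETRY_DELAYS = [
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=8),
    timedelta(hours=24),
]
MAX_ATTEMPTS = len(RETRY_DELAYS) + 1  # 6 attempts in total
JITTER_FRACTION = 0.1  # assumption: spread each delay by +/-10%

_default_rng = random.Random()


def next_attempt_at(
    attempts_made: int, now: datetime, rng: random.Random = _default_rng
) -> Optional[datetime]:
    """Return when the next attempt should run, or None when retries are exhausted.

    attempts_made counts attempts already performed, so it is at least 1.
    """
    if attempts_made < 1:
        raise ValueError("attempts_made must be >= 1")
    if attempts_made >= MAX_ATTEMPTS:
        return None
    base = RETRY_DELAYS[attempts_made - 1]
    factor = 1 + rng.uniform(-JITTER_FRACTION, JITTER_FRACTION)
    return now + base * factor


def status_after_attempt(http_status: Optional[int], attempts_made: int) -> str:
    """Map the outcome of an attempt to a delivery status.

    http_status is None for timeouts and connection errors.
    """
    if http_status is not None and 200 <= http_status < 300:
        return "DELIVERED"
    return "RETRYING" if attempts_made < MAX_ATTEMPTS else "FAILED"
```

A delivery worker would call status_after_attempt after each HTTP request and, when the result is RETRYING, store next_attempt_at(...) on the delivery record.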
Webhook Security and Verification
Webhook recipients must verify that events came from your system, not a malicious third party. HMAC signature: when creating a subscription, generate a secret (32 random bytes, base64 encoded). On delivery, compute HMAC-SHA256(secret, request_body_bytes) and include it in a request header (e.g., X-Webhook-Signature: sha256={hex_hmac}). The recipient recomputes the HMAC over the raw request bytes, before any JSON parsing or re-serialization, using their stored secret, and compares the result with the header. If they match, the request is authentic. The comparison must be constant-time to prevent timing attacks. Timestamp inclusion: include a timestamp in the signed payload (or as a separate X-Webhook-Timestamp header that is covered by the HMAC). Use the time of the delivery attempt rather than the time the event occurred, and re-sign on every retry; otherwise legitimate retries sent hours later would fail the freshness check. Recipients reject requests whose timestamp is more than 5 minutes old, which limits replay attacks (an attacker capturing and resending a webhook). Rotating secrets: provide an API to rotate the webhook secret. During rotation, support two active secrets simultaneously (old and new) for a grace period, so recipients can update their secret without missing webhooks during the transition.
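Signing and verification fit in a few lines using Python's standard hmac module. The sketch below is one possible layout rather than a fixed wire format: signing the string "{timestamp}.{body}" is an assumption (the text only requires that the timestamp be covered by the HMAC), and the sign and verify function names are illustrative. Accepting a list of secrets is one way to support the old and new secret during a rotation grace period.

```python
import hashlib
import hmac
import time
from typing import Iterable, Optional

TOLERANCE_SECONDS = 300  # reject requests older than 5 minutes


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Sender side: sign the timestamp and the raw body together."""
    # Assumed layout: "<unix timestamp>.<raw body bytes>"
    signed_payload = str(timestamp).encode() + b"." + body
    digest = hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify(
    secrets: Iterable[bytes],
    timestamp: int,
    body: bytes,
    signature: str,
    now: Optional[float] = None,
) -> bool:
    """Recipient side: check freshness, then compare in constant time.

    `secrets` holds every currently valid secret, so both the old and the
    new secret are accepted while a rotation is in progress.
    """
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # stale or far-future timestamp: possible replay
    received = signature.encode()
    valid = False
    for secret in secrets:
        expected = sign(secret, timestamp, body).encode()
        # compare_digest avoids leaking the position of the first mismatch
        if hmac.compare_digest(expected, received):
            valid = True
    return valid
```

The recipient must pass the exact bytes it received as body; verifying a re-serialized JSON object will usually fail because key order and whitespace can differ.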
Endpoint Health and Disabling
Repeatedly delivering to a persistently failing endpoint wastes resources. Endpoint health tracking: track consecutive failures per subscription. After N consecutive failures (e.g., 50), mark the subscription as disabled and stop sending events. Notify the user via email that their endpoint has been disabled and why. Reactivation: the user fixes their endpoint and manually re-enables the subscription. On re-enable, optionally replay missed events (useful for critical business events) or resume with new events only (simpler). Circuit breaker per endpoint: if an endpoint returns 5xx for 5 consecutive attempts within 10 minutes, open the circuit and wait 30 minutes before sending to that endpoint again, for any subscription that targets it. This prevents thousands of retries from hammering a struggling server. SSRF protection: prevent users from registering webhook URLs that point to internal services. Validate the endpoint URL by resolving its hostname and rejecting private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), loopback (127.0.0.0/8), and link-local addresses (169.254.0.0/16). Run this check at registration time and again at delivery time, because a hostname that resolved to a public address at registration can later be pointed at an internal one.
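The address checks for SSRF protection can be written with the standard ipaddress module. The following is a sketch under stated assumptions: the function names is_blocked_ip and validate_endpoint_url are illustrative, the IPv6 ranges and the full 127.0.0.0/8 loopback block go slightly beyond the list in the text, and the resolve parameter exists only so the resolver can be swapped out in tests.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# IPv4 ranges from the text, plus assumed IPv6 equivalents.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(net)
    for net in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # private
        "127.0.0.0/8",                                     # loopback
        "169.254.0.0/16",                                  # link-local
        "::1/128", "fc00::/7", "fe80::/10",                # IPv6 (assumption)
    )
]


def is_blocked_ip(addr: str) -> bool:
    """Return True if addr falls in a private, loopback, or link-local range."""
    ip = ipaddress.ip_address(addr.split("%")[0])  # drop any IPv6 zone id
    # Treat IPv4-mapped IPv6 (::ffff:10.0.0.1) as the embedded IPv4 address.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return any(ip in net for net in BLOCKED_NETWORKS)


def validate_endpoint_url(url: str, resolve=socket.getaddrinfo) -> None:
    """Raise ValueError if the URL is malformed or resolves to a blocked address."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("endpoint URL must use http or https")
    if not parts.hostname:
        raise ValueError("endpoint URL has no host")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    infos = resolve(parts.hostname, port, proto=socket.IPPROTO_TCP)
    for info in infos:
        addr = info[4][0]  # sockaddr tuple starts with the IP string
        if is_blocked_ip(addr):
            raise ValueError(f"endpoint resolves to blocked address {addr}")
```

Every resolved address is checked, not just the first, since a hostname can return a mix of public and internal records. At delivery time, the HTTP client should connect to the address that was validated rather than resolving the name a second time.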
Ordering and Deduplication
Webhooks may be delivered out of order: network delays and retries mean event #2 may arrive before event #1 at the recipient. Each webhook event should include an event_id (UUID) and a sequence_number (monotonically increasing per entity). Recipients who need ordered processing buffer events and process them in sequence_number order. Gap detection: if sequence 5 is received but sequence 4 has not arrived, hold 5 until 4 arrives or a timeout expires. This is complex, so most webhook systems do not guarantee ordering and document this explicitly. Deduplication at the recipient: at-least-once delivery means duplicate events are possible. Each event has a unique event_id. Recipients should store processed event_ids (in Redis or a database with a unique constraint) and skip duplicates. The event_id is included in every delivery and retry, so the same event always has the same ID. The delivery system should also deduplicate: if a delivery task is accidentally enqueued twice (for example, when a producer retries after a timeout), a unique constraint on (event_id, subscription_id) in webhook_deliveries prevents creating two records for the same event and subscription pair. This requires storing the event_id on each delivery record.
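Recipient-side deduplication with a unique constraint can be illustrated with SQLite from the Python standard library. This is a minimal sketch: the processed_events table and the open_store and handle_once functions are hypothetical names, and a production recipient would use its own database or Redis as described above. Recording the event_id and running the handler inside one transaction means a failed handler leaves no record behind, so the sender's next retry is processed normally.

```python
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the table that remembers which event_ids were processed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    return conn


def handle_once(
    conn: sqlite3.Connection, event_id: str, handler: Callable[[], None]
) -> bool:
    """Run handler at most once per event_id.

    Returns True if the handler ran, False if the event was a duplicate.
    If the handler raises, the insert is rolled back and the exception
    propagates, so a later retry of the same event is not skipped.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        try:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
        except sqlite3.IntegrityError:
            return False  # primary key violation: already processed
        handler()
    return True
```

The webhook endpoint should still respond 2xx for a duplicate, since from the sender's point of view the event has been received.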