How do you ensure reliable webhook delivery with at-least-once guarantees?

Pipeline: event published to Kafka -> delivery worker sends HTTP POST to subscriber endpoint with HMAC-SHA256 signature -> subscriber returns 2xx within 5 seconds = success. Non-2xx or timeout = schedule retry. Retry strategy: exponential backoff with jitter (1min, 5min, 30min, 2h, 8h, 24h) over 3 days. Redis sorted set schedules delayed retries. After exhausting retries: dead letter queue. At-least-once means duplicates are possible (subscriber received but ack was lost). Subscribers must be idempotent (check event_id before processing). Per-subscription ordering via Kafka partitioning by subscription_id. Circuit breaker: if >90% of last 100 deliveries fail, pause and health-check every 5 minutes. Email subscriber admin about the failure. Auto-disable after 7 days of sustained failure.

How do you secure webhook deliveries against forgery?

HMAC-SHA256 signing: each request includes X-Webhook-Signature = HMAC-SHA256(timestamp + “.” + body, subscriber_secret). The subscriber computes the same HMAC with their stored secret and compares; a match means the request is authentic. Replay prevention: because the timestamp is part of the signed message, subscribers can reject requests with timestamps more than 5 minutes old, which prevents captured webhooks from being replayed later. IP allowlisting: the platform publishes its webhook delivery IP ranges and subscribers restrict their endpoint to those IPs, as an additional defense layer beyond signatures. Secret rotation: during rotation, sign with both the old and new secrets and include both signatures. The subscriber verifies against either. After confirming the new secret works, deprecate the old one. This enables zero-downtime rotation.

System Design: Design Webhook Delivery System — Reliable Delivery, Retry, Dead Letter Queue, Fan-Out, Rate Limiting


Webhooks are the standard mechanism for event-driven integration between services. Stripe notifies merchants of payments, GitHub notifies CI systems of pushes, and Shopify notifies apps of orders — all via webhooks. Designing a reliable webhook delivery system tests your understanding of at-least-once delivery, retry strategies, rate limiting, and monitoring — a focused system design question that appears at companies building platforms with webhook APIs.

Webhook Delivery Pipeline

When an event occurs (a payment succeeds, an order is created): (1) The source service publishes an event to Kafka: {event_id, event_type, payload, webhook_url, created_at}. (2) A delivery worker consumes the event, constructs the HTTP request (a POST to the webhook URL with the event payload as the JSON body), signs it (HMAC-SHA256 with the subscriber's secret key), and sends it. (3) If the subscriber returns 2xx within the timeout (5 seconds), delivery succeeded: mark the event as delivered. (4) If the subscriber returns non-2xx, times out, or the connection fails: schedule a retry. (5) After exhausting all retries: move the event to the dead letter queue. Delivery is at-least-once: the platform keeps attempting each event until the subscriber acknowledges it or the retries run out, so an event is never silently dropped. Duplicate deliveries are possible (the subscriber received the event but the acknowledgment was lost), so the subscriber must handle duplicates idempotently (check the event_id before processing). Ordering: webhooks for the same subscription are delivered in order (the next event waits for the current one to succeed or exhaust its retries), which Kafka partitioning by subscription_id makes possible. The cost is head-of-line blocking: one failing event delays every later event for that subscription. Cross-subscription ordering is not guaranteed.
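The success/retry decision in steps (3) and (4) and the subscriber-side duplicate check can be sketched as follows. This is a minimal illustration, not a production design: `classify_delivery` and `IdempotentReceiver` are hypothetical names, and the in-memory set stands in for whatever durable store a real subscriber would use to remember processed event IDs.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


def classify_delivery(status_code: Optional[int], timed_out: bool = False) -> str:
    """Map one delivery attempt to the pipeline's next step.

    status_code is None when the connection failed before any response arrived.
    """
    if timed_out or status_code is None:
        return "retry"
    return "delivered" if 200 <= status_code < 300 else "retry"


@dataclass
class IdempotentReceiver:
    """Subscriber-side wrapper that processes each event_id at most once."""

    handler: Callable[[dict], None]
    # Assumption: an in-memory set; a real subscriber would persist this.
    seen_event_ids: Set[str] = field(default_factory=set)

    def receive(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self.seen_event_ids:
            return False  # duplicate delivery: acknowledge it, but do no work
        self.handler(event)
        # Record the ID only after the handler succeeds, so a crashed handler
        # is retried rather than skipped.
        self.seen_event_ids.add(event_id)
        return True
```

A duplicate is still acknowledged with a 2xx by the subscriber; returning an error for it would only trigger more retries.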

Retry Strategy

Exponential backoff with jitter: retry after 1 minute, 5 minutes, 30 minutes, 2 hours, 8 hours, and 24 hours. Cap the total at 5-8 retries: the six delays listed add up to about 35 hours, and repeating the 24-hour delay for a seventh and eighth attempt stretches the window to a little over 3 days. Why exponential: if the subscriber is down for an hour of maintenance, retrying every 10 seconds wastes hundreds of requests per event and hits the endpoint with a burst the moment it comes back. Exponential backoff spaces retries out so a recovering subscriber is not overwhelmed. Why jitter: without it, all failed deliveries for the same subscriber retry at the same moment (synchronized retries). Adding random jitter spreads them over time. A common doubling formula is retry_delay = base_delay * 2^attempt + random(0, base_delay); the schedule above grows faster than doubling, so it is most simply implemented as a lookup table of base delays with a random offset added to each. Implementation: use a delayed job queue. After a failed delivery, enqueue a retry job with scheduled execution time = now + backoff_delay. Redis sorted sets work well: ZADD retry_queue {scheduled_time} {event_id}. A scheduler polls with ZRANGEBYSCORE retry_queue 0 {now}, which returns the due retries. Retry metadata: track attempt_count, last_attempt_at, last_status_code, and last_error_message per event. This helps subscribers debug delivery failures via a dashboard: “Your endpoint returned 500 Internal Server Error at 14:30. Next retry at 16:30.”
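A rough sketch of the schedule and the delayed queue, under a few assumptions: the delays are the six values listed above, jitter is up to 10% of each base delay (the fraction is an illustrative choice), and `RetryQueue` is a hypothetical in-memory stand-in for the Redis sorted set so the example runs without a Redis server.

```python
import heapq
import random
from typing import List, Optional, Tuple

# Base delays in seconds, taken from the schedule described above.
RETRY_DELAYS = [60, 300, 1800, 7200, 28800, 86400]


def backoff_delay(attempt: int, jitter_fraction: float = 0.1,
                  rng: Optional[random.Random] = None) -> Optional[float]:
    """Delay before retry number `attempt` (0-based), or None when exhausted."""
    if attempt >= len(RETRY_DELAYS):
        return None  # caller should move the event to the dead letter queue
    source = rng or random
    base = RETRY_DELAYS[attempt]
    # Jitter: a random extra delay so retries do not all fire at once.
    return base + source.uniform(0, base * jitter_fraction)


class RetryQueue:
    """In-memory stand-in for the sorted set (score = scheduled time)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str]] = []

    def schedule(self, event_id: str, run_at: float) -> None:
        # Plays the role of: ZADD retry_queue {scheduled_time} {event_id}
        heapq.heappush(self._heap, (run_at, event_id))

    def pop_due(self, now: float) -> List[str]:
        """Remove and return event IDs whose scheduled time is <= now."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

With real Redis, reading due items and removing them are separate commands, so a scheduler run by several workers needs to make that pair atomic (for example with a script or a transaction) to avoid handing the same retry to two workers.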

Subscriber Health and Circuit Breaking

If a subscriber endpoint is consistently failing (500 errors for days), continuing to send webhooks wastes resources and may overwhelm the subscriber when it recovers (a backlog of thousands of queued events). Circuit breaker per subscription: track the failure rate over the last N deliveries. If more than 90% of the last 100 deliveries failed, open the circuit. In the open state, stop attempting deliveries and queue incoming events. Check the endpoint health every 5 minutes (send a ping or the oldest queued event). If the health check succeeds, close the circuit and begin draining the queued events, with rate limiting to avoid flooding the endpoint. Subscriber notification: when the circuit opens, email the subscriber admin: “Your webhook endpoint https://api.example.com/webhooks has been failing for 24 hours. We have paused deliveries. Events are queued and will be delivered when your endpoint recovers.” This proactive communication heads off “why did we miss 10,000 events?” questions. Endpoint disable after prolonged failure: if the circuit stays open for 7 or more days, disable the subscription. The subscriber must re-enable and verify their endpoint before deliveries resume. This prevents abandoned endpoints from consuming resources indefinitely.
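The open/close logic described above might look like the following sketch. The class and method names are hypothetical, the window and threshold are the values from the text, and the 5-minute health-check timer, the email, and the 7-day disable are left to the caller.

```python
from collections import deque
from typing import Deque


class SubscriptionCircuitBreaker:
    """Opens when more than 90% of the last 100 deliveries failed."""

    def __init__(self, window: int = 100, failure_threshold: float = 0.9) -> None:
        self.window = window
        self.failure_threshold = failure_threshold
        # deque(maxlen=...) silently drops the oldest result when full.
        self.results: Deque[bool] = deque(maxlen=window)
        self.is_open = False

    def record(self, success: bool) -> None:
        """Record one delivery outcome and open the circuit if warranted."""
        self.results.append(success)
        if len(self.results) < self.window:
            return  # not enough history yet to judge the endpoint
        failures = self.results.count(False)
        if failures / self.window > self.failure_threshold:
            self.is_open = True

    def allow_delivery(self) -> bool:
        """False while the circuit is open: queue events instead of sending."""
        return not self.is_open

    def record_health_check(self, success: bool) -> None:
        """Called by the periodic health check while the circuit is open."""
        if success:
            self.is_open = False
            self.results.clear()  # start with a clean window after recovery
```

Waiting for a full window before judging is a design choice in this sketch: it avoids opening the circuit on the first few failures of a new subscription, at the cost of reacting more slowly.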

Security: Signing and Verification

Webhooks are HTTP requests to subscriber endpoints. Without verification, anyone could send fake events to the endpoint. Signing: each webhook request includes a signature header (e.g., X-Webhook-Signature). In its simplest form the signature is HMAC-SHA256(request_body, subscriber_secret), where the subscriber secret is shared during subscription setup. Verification: the subscriber computes the same HMAC over the received body with their stored secret and compares it with the signature header, ideally using a constant-time comparison. If they match, the request is authentic (only the platform and the subscriber know the secret). Replay prevention: a body-only signature stays valid forever, so include a timestamp in the signed message: HMAC-SHA256(timestamp + “.” + body, secret). The header carries both the timestamp and the signature, and the subscriber rejects requests with timestamps older than 5 minutes. This prevents an attacker from capturing a valid webhook and replaying it later. IP allowlisting: some subscribers restrict their endpoint to accept requests only from known IP ranges, so the platform publishes its webhook delivery IP ranges. This provides an additional layer of defense beyond signature verification. Secret rotation: subscribers should be able to rotate their webhook secret without downtime. During rotation, the platform signs with both the old and new secrets and includes both signatures. The subscriber verifies against either. After confirming the new secret works, deprecate the old one.
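Signing, verification, the 5-minute replay window, and dual-secret rotation can be sketched together as below. Function names, the hex encoding of the signature, and the choice to also reject timestamps far in the future are assumptions made for illustration; the source only specifies the signed message format and the 5-minute limit.

```python
import hashlib
import hmac
import time
from typing import Iterable, List, Optional

TOLERANCE_SECONDS = 300  # reject timestamps more than 5 minutes off


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over timestamp + "." + body, hex-encoded (an assumption)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_all(secrets: Iterable[bytes], timestamp: int, body: bytes) -> List[str]:
    """Platform side during rotation: one signature per active secret."""
    return [sign(secret, timestamp, body) for secret in secrets]


def verify(secrets: Iterable[bytes], timestamp: int, body: bytes,
           signatures: Iterable[str], now: Optional[float] = None) -> bool:
    """Subscriber side: accept if any known secret matches any signature."""
    current = time.time() if now is None else now
    # Assumption: abs() also rejects timestamps far in the future.
    if abs(current - timestamp) > TOLERANCE_SECONDS:
        return False
    received = list(signatures)
    for secret in secrets:
        expected = sign(secret, timestamp, body)
        # compare_digest runs in constant time to avoid timing side channels.
        if any(hmac.compare_digest(expected, candidate) for candidate in received):
            return True
    return False
```

The signature must be computed over the raw request bytes as received; re-serializing parsed JSON before verifying can change whitespace or key order and make a valid signature fail.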

Scaling and Monitoring

Scale: a platform at Stripe's scale may deliver billions of webhook events per month. Architecture: (1) Kafka for event ingestion (partitioned by subscription_id for ordering). (2) A fleet of delivery workers consuming from Kafka. Each worker consumes an event, delivers it, and handles success or failure. Workers auto-scale based on queue depth. (3) Redis for retry scheduling (a sorted set of delayed jobs). (4) PostgreSQL for event history and delivery status (subscribers query their delivery history via API or dashboard). Rate limiting per subscriber: some subscriber endpoints cannot handle high throughput. The platform enforces a maximum delivery rate per subscription (e.g., 100 events/second). Events exceeding the rate are queued and delivered at the subscriber's pace. The subscriber can configure their rate limit via the dashboard. Monitoring: (1) Delivery success rate, overall and per subscriber. Alert if the global success rate drops below 95% (this indicates a platform-wide issue, not individual subscriber problems). (2) Delivery latency: time from event creation to successful delivery. P50 should be under 5 seconds and P99 under 30 seconds. (3) Retry rate: percentage of events requiring retries. A high retry rate for a specific subscriber indicates their endpoint is unhealthy. (4) Dead letter queue depth: events that exhausted all retries. This should be near zero; a non-zero depth indicates subscriber endpoints with prolonged failures. Dashboard for subscribers: show delivery history (event type, status, response code, timestamp), failed deliveries with error details, the retry schedule, and a “test webhook” button to verify their endpoint.
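One way to enforce the per-subscription delivery rate is a token bucket, sketched below. The text does not name an algorithm, so the token bucket is a swapped-in technique chosen for illustration; `TokenBucket` and its fields are hypothetical names, and the clock is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Per-subscription limiter: `rate` deliveries/second, bursts up to `capacity`."""

    rate: float            # tokens added per second (e.g. 100 events/second)
    capacity: float        # maximum burst size
    tokens: float = 0.0    # tokens currently available
    updated_at: float = 0.0  # time of the last refill, in seconds

    def try_acquire(self, now: float) -> bool:
        """Return True if one delivery may be sent at time `now`."""
        elapsed = max(0.0, now - self.updated_at)
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: leave the event queued for later
```

A worker would call `try_acquire` before each send and leave the event in the queue when it returns False, which matches the "queued and delivered at the subscriber's pace" behavior described above.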