What is a Webhook?
A webhook is a user-defined HTTP callback. When an event occurs in a platform (payment succeeded, order shipped, file uploaded), the platform sends an HTTP POST request to a URL configured by the user. Webhooks power integrations: Stripe notifies your server of payment events, GitHub notifies CI/CD pipelines of push events, Shopify notifies fulfillment services of new orders.
Requirements
- Allow users to register webhook endpoints (URL, event types to subscribe to)
- Deliver events to registered endpoints within 5 seconds
- At-least-once delivery: retry on failure with exponential backoff
- Secure payloads (HMAC signature), prevent replay attacks
- 10M events/day, 1M registered webhooks
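A quick back-of-envelope check on what these numbers imply for throughput (the 10x peak factor is an assumption for illustration, not part of the requirements):

```python
# Back-of-envelope throughput estimate for the stated scale.
EVENTS_PER_DAY = 10_000_000
SECONDS_PER_DAY = 86_400

avg_eps = EVENTS_PER_DAY / SECONDS_PER_DAY  # ~116 events/second on average
peak_eps = avg_eps * 10                     # assumed 10x peak factor

print(f"avg: {avg_eps:.0f}/s, assumed peak: {peak_eps:.0f}/s")
```

Fanout multiplies this further: if an event matches many endpoints, the delivery-job rate is events/second times matching endpoints per event.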
Data Model
WebhookEndpoint(endpoint_id UUID, user_id, url VARCHAR, secret VARCHAR,
event_types VARCHAR[], status ENUM(ACTIVE,DISABLED),
created_at, last_success_at, failure_count)
WebhookDelivery(delivery_id UUID, endpoint_id, event_id, event_type,
payload JSONB, status ENUM(PENDING,SUCCESS,FAILED,ABANDONED),
attempt_count, next_retry_at, created_at, delivered_at,
response_code, response_body)
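The two tables above map directly onto application-level types; a sketch as Python dataclasses (field names follow the model above; the defaults and enum value strings are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4

class EndpointStatus(Enum):
    ACTIVE = "ACTIVE"
    DISABLED = "DISABLED"

class DeliveryStatus(Enum):
    PENDING = "PENDING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    ABANDONED = "ABANDONED"

@dataclass
class WebhookEndpoint:
    endpoint_id: UUID
    user_id: str
    url: str
    secret: str
    event_types: list[str]
    status: EndpointStatus = EndpointStatus.ACTIVE
    failure_count: int = 0
    last_success_at: Optional[datetime] = None

@dataclass
class WebhookDelivery:
    delivery_id: UUID
    endpoint_id: UUID
    event_id: str
    event_type: str
    payload: dict
    status: DeliveryStatus = DeliveryStatus.PENDING
    attempt_count: int = 0
    next_retry_at: Optional[datetime] = None
    response_code: Optional[int] = None

# Example: a freshly registered endpoint starts ACTIVE with no failures.
ep = WebhookEndpoint(
    endpoint_id=uuid4(), user_id="u_1", url="https://example.com/hooks",
    secret="whsec_demo", event_types=["payment.succeeded"],
)
```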
Delivery Architecture
Event Source → Kafka (event_type topic) → Webhook Fanout Service
→ For each matching endpoint:
WebhookDelivery record created (PENDING)
→ Delivery Queue (Kafka or delayed job)
→ Webhook Delivery Worker
→ HTTP POST to endpoint URL
→ Update delivery status
Fanout: Event to Endpoints
When event E of type payment.succeeded occurs: query all WebhookEndpoints where user_id=owner AND event_types contains payment.succeeded AND status=ACTIVE. For large platforms (many endpoints), cache the endpoint lookup in Redis: key=endpoints:{user_id}:{event_type}, TTL=5min. For each matching endpoint, create a WebhookDelivery record and enqueue a delivery job.
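A minimal sketch of the fanout step, with a plain dict standing in for the Redis cache and a list of dicts standing in for the WebhookEndpoint table (the cache key format follows the text; the helper names and record shapes are illustrative):

```python
import uuid

endpoint_cache = {}  # stands in for Redis; in production each key has TTL=5min

def lookup_endpoints(endpoint_table, user_id, event_type):
    """Return ACTIVE endpoints subscribed to event_type, cache-aside style."""
    key = f"endpoints:{user_id}:{event_type}"  # cache key format from the text
    if key not in endpoint_cache:
        endpoint_cache[key] = [
            ep for ep in endpoint_table
            if ep["user_id"] == user_id
            and ep["status"] == "ACTIVE"
            and event_type in ep["event_types"]
        ]
    return endpoint_cache[key]

def fanout(endpoint_table, delivery_queue, event):
    """Create a PENDING WebhookDelivery per matching endpoint and enqueue it."""
    for ep in lookup_endpoints(endpoint_table, event["user_id"], event["type"]):
        delivery = {
            "delivery_id": str(uuid.uuid4()),
            "endpoint_id": ep["endpoint_id"],
            "event_id": event["event_id"],
            "event_type": event["type"],
            "payload": event["payload"],
            "status": "PENDING",
            "attempt_count": 0,
        }
        delivery_queue.append(delivery)  # persist the record, then enqueue the job
```

Persisting the WebhookDelivery record before enqueueing is what makes retries possible if a worker crashes mid-delivery.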
Delivery and Retry
Delivery worker sends HTTP POST to the endpoint URL with timeout=30s. On success (2xx response): update delivery status=SUCCESS, reset endpoint.failure_count=0. On failure (non-2xx, timeout, DNS failure): increment attempt_count, schedule retry with exponential backoff:
retry_delays = [5s, 30s, 2min, 10min, 30min, 2h, 6h, 24h]
# attempt_count 1: retry after 5s
# attempt_count 2: retry after 30s
# ...
# attempt_count 8: retry after 24h
# attempt_count > 8: status=ABANDONED
After N consecutive failures (e.g., 3 days of failures), disable the endpoint: status=DISABLED, notify the user via email.
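The backoff schedule above can be sketched as a small scheduling helper (dict-based delivery records for brevity; the delays and the ABANDONED cutoff follow the schedule in the text):

```python
from datetime import datetime, timedelta, timezone

# Mirrors the retry_delays schedule above: index = attempt_count - 1.
RETRY_DELAYS = [
    timedelta(seconds=5), timedelta(seconds=30), timedelta(minutes=2),
    timedelta(minutes=10), timedelta(minutes=30), timedelta(hours=2),
    timedelta(hours=6), timedelta(hours=24),
]

def schedule_retry(delivery, now=None):
    """After a failed attempt: bump attempt_count, back off or abandon."""
    now = now or datetime.now(timezone.utc)
    delivery["attempt_count"] += 1
    if delivery["attempt_count"] > len(RETRY_DELAYS):
        delivery["status"] = "ABANDONED"   # retries exhausted
        delivery["next_retry_at"] = None
    else:
        delivery["status"] = "PENDING"     # a scanner re-enqueues when due
        delivery["next_retry_at"] = now + RETRY_DELAYS[delivery["attempt_count"] - 1]
    return delivery
```

A background job then scans for PENDING deliveries with next_retry_at <= now and re-enqueues them.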
Payload Signing (HMAC)
Sign each payload with the endpoint’s secret key so the recipient can verify authenticity:
import hmac, hashlib, time

class ReplayAttack(Exception):
    pass

def sign_payload(secret, payload_bytes, timestamp):
    signed_content = f"{timestamp}.".encode() + payload_bytes
    signature = hmac.new(secret.encode(), signed_content, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={signature}"

# Delivery adds headers:
# Webhook-Signature: t=1620000000,v1=abc123...
# Webhook-Timestamp: 1620000000

# Recipient verifies (signature_str is the full Webhook-Signature header value):
def verify(secret, payload_bytes, timestamp_str, signature_str):
    if abs(time.time() - int(timestamp_str)) > 300:  # 5 min tolerance
        raise ReplayAttack("timestamp outside tolerance window")
    expected = sign_payload(secret, payload_bytes, timestamp_str)
    return hmac.compare_digest(expected, signature_str)
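A quick round trip of the scheme, restating the signing logic inline so the snippet is self-contained (the secret and payload values are illustrative):

```python
import hmac, hashlib, time

def sign(secret, payload, ts):
    """HMAC-SHA256 over '{timestamp}.{payload}', hex-encoded."""
    return hmac.new(secret.encode(), f"{ts}.".encode() + payload,
                    hashlib.sha256).hexdigest()

secret = "whsec_demo"  # shared secret established at endpoint registration
payload = b'{"event_id":"evt_1","type":"payment.succeeded"}'
ts = int(time.time())

# Sender attaches these headers to the POST:
headers = {
    "Webhook-Timestamp": str(ts),
    "Webhook-Signature": f"t={ts},v1={sign(secret, payload, ts)}",
}

# Recipient parses the header, checks freshness, recomputes, compares:
t_part, v1_part = headers["Webhook-Signature"].split(",")
recv_ts, recv_sig = int(t_part[2:]), v1_part[3:]
fresh = abs(time.time() - recv_ts) <= 300             # 5-minute tolerance
valid = hmac.compare_digest(sign(secret, payload, recv_ts), recv_sig)
print(fresh and valid)  # True for an untampered, timely payload
```

The constant-time compare_digest matters here: a plain `==` on hex strings can leak signature bytes through timing differences.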
Ordering and Idempotency
Webhook deliveries may arrive out of order due to retries. Include event_id and created_at in each payload. Recipients should use event_id for idempotency (deduplicate on their end) and created_at to detect out-of-order delivery. Provide a delivery_id to allow idempotent retries — if the recipient processed a delivery but the acknowledgment timed out, the re-delivered payload has the same delivery_id.
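On the recipient side, deduplication can be as simple as a seen-ID check before processing; a sketch (an in-memory set stands in for a durable store such as a database table keyed by event_id):

```python
processed = set()  # in production: a durable store, not process memory

def handle_webhook(event):
    """Recipient-side handler: deduplicate on event_id, then process once."""
    if event["event_id"] in processed:
        return "duplicate"            # a retry of something already handled
    processed.add(event["event_id"])
    # ... business logic (fulfill order, record payment, etc.) goes here ...
    return "processed"
```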
Key Design Decisions
- Kafka for event ingestion — decouples event sources from the fanout service
- WebhookDelivery record per delivery — full audit trail, enables re-delivery
- Exponential backoff with cap — respects failing endpoints, reduces thundering herd
- HMAC-SHA256 signature — recipient can verify authenticity without TLS inspection
- Timestamp in signature — prevents replay attacks with a 5-minute tolerance window
Frequently Asked Questions

How do webhooks differ from polling, and when should you use each?
Polling: the client periodically sends requests to the server to check for new events (GET /events?since=last_id). Simple to implement; the client controls the check frequency. Downsides: latency (up to the polling interval), wasted requests when there are no events, and server load from many polling clients. Webhooks (push): the server sends an HTTP POST to the client when an event occurs. Near-real-time (seconds), no wasted requests, server initiates. Downsides: requires the client to have a publicly reachable URL and to handle retries and failures. Use polling when the client is behind a firewall, events are infrequent and latency is not critical, or the client is a mobile app (push via APNs/FCM is better than HTTP webhooks). Use webhooks when real-time event delivery is needed, the client is a server-side application with a public URL, and you want to avoid the cost of polling at scale.

How do you ensure at-least-once delivery for webhooks?
At-least-once means every event is delivered at least once (possibly more than once on retries). Implementation: (1) Persist every WebhookDelivery record before attempting delivery, so that if the delivery worker crashes the record is still there for retry. (2) Only mark status=SUCCESS after receiving a 2xx response; if the delivery times out or receives a non-2xx, keep status=PENDING and schedule a retry. (3) A background job scans for PENDING deliveries with next_retry_at <= now and re-enqueues them. (4) Use exponential backoff: 5s, 30s, 2min, 10min, 30min, 2h, 6h, 24h. After 72 hours without success, mark ABANDONED. At-least-once also means recipients must be idempotent: include a delivery_id in each payload and deduplicate on the receiving end.

How does HMAC webhook signing work and why is it needed?
Without signing, a recipient cannot verify that a webhook came from the platform rather than from an attacker spoofing the source. HMAC (Hash-based Message Authentication Code) signing: the platform and the user share a secret key when the endpoint is registered. On each delivery, compute HMAC-SHA256(secret, timestamp + "." + payload_body) and include the signature and timestamp in the request headers (e.g., Webhook-Signature: t=1620000000,v1=abc123). The recipient recomputes the HMAC using their secret and compares with hmac.compare_digest() (constant-time comparison, which prevents timing attacks). Verify the timestamp is within 5 minutes to prevent replay attacks: an attacker who captures a valid payload cannot replay it after the window closes. Rotate secrets periodically, and support multiple active secrets during rotation.

How do you handle a webhook endpoint that is consistently failing?
Consecutive failures indicate the endpoint is broken, the receiving server is down, or the user's integration has a bug. Failure handling: (1) Track failure_count on the WebhookEndpoint; increment it on each failed delivery attempt. (2) After 3 consecutive days of failures (all retries exhausted, delivery status=ABANDONED), automatically disable the endpoint: status=DISABLED. (3) Notify the user via email ("Your webhook endpoint has been disabled after 72 hours of consecutive failures"), including a link to re-enable and a log of failed deliveries. (4) When the user re-enables the endpoint, offer to replay recent missed events (the last N events from the event log). (5) Circuit breaker pattern: if an endpoint fails 5 consecutive times in 10 minutes, skip it temporarily (circuit OPEN) and only retry after 30 minutes, which avoids hammering a temporarily down server.

How do you scale webhook delivery to 10 million events per day?
At 10M events/day = ~115 events/second average, with spikes to 10x = 1,150/second. Architecture: (1) Kafka topics per event type: ingest events at high throughput without backpressure on producers. (2) Fanout service: a Kafka consumer that, for each event, looks up matching webhooks (from the Redis cache) and enqueues WebhookDelivery jobs; with 100 matching endpoints per event, that is 115 * 100 = 11,500 delivery jobs/second. (3) Delivery workers: a horizontally scaled pool that consumes from the delivery job queue, makes the HTTP calls, and updates delivery status. HTTP calls are I/O-bound, so each worker can handle 50-100 concurrent requests (async I/O or a thread pool); 100 workers * 50 concurrent = 5,000 concurrent deliveries. (4) Delivery queue: Kafka with multiple partitions for parallelism, or a Redis-backed delayed job queue for retry scheduling. Monitor delivery latency P99, delivery success rate, and queue depth (backlog).