What is a Webhook?
A webhook is a user-defined HTTP callback. When an event occurs in a platform (payment succeeded, order shipped, file uploaded), the platform sends an HTTP POST request to a URL configured by the user. Webhooks power integrations: Stripe notifies your server of payment events, GitHub notifies CI/CD pipelines of push events, Shopify notifies fulfillment services of new orders.
Requirements
- Allow users to register webhook endpoints (URL, event types to subscribe to)
- Deliver events to registered endpoints within 5 seconds
- At-least-once delivery: retry on failure with exponential backoff
- Secure payloads (HMAC signature), prevent replay attacks
- 10M events/day, 1M registered webhooks
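A quick back-of-envelope check on what these numbers imply for throughput (the 10x peak factor is an assumption for illustration, not part of the requirements):

```python
# Back-of-envelope throughput estimate for the stated scale.
EVENTS_PER_DAY = 10_000_000
SECONDS_PER_DAY = 86_400

avg_eps = EVENTS_PER_DAY / SECONDS_PER_DAY  # ~116 events/second on average
peak_eps = avg_eps * 10                     # assumed 10x peak factor

print(f"avg: {avg_eps:.0f}/s, assumed peak: {peak_eps:.0f}/s")
```

Fanout multiplies this further: if an event matches many endpoints, the delivery-job rate is events/second times matching endpoints per event.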
Data Model
WebhookEndpoint(endpoint_id UUID, user_id, url VARCHAR, secret VARCHAR,
event_types VARCHAR[], status ENUM(ACTIVE,DISABLED),
created_at, last_success_at, failure_count)
WebhookDelivery(delivery_id UUID, endpoint_id, event_id, event_type,
payload JSONB, status ENUM(PENDING,SUCCESS,FAILED,ABANDONED),
attempt_count, next_retry_at, created_at, delivered_at,
response_code, response_body)
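The two tables above map directly onto application-level types; a sketch as Python dataclasses (field names follow the model above; the defaults and enum value strings are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4

class EndpointStatus(Enum):
    ACTIVE = "ACTIVE"
    DISABLED = "DISABLED"

class DeliveryStatus(Enum):
    PENDING = "PENDING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    ABANDONED = "ABANDONED"

@dataclass
class WebhookEndpoint:
    endpoint_id: UUID
    user_id: str
    url: str
    secret: str
    event_types: list[str]
    status: EndpointStatus = EndpointStatus.ACTIVE
    failure_count: int = 0
    last_success_at: Optional[datetime] = None

@dataclass
class WebhookDelivery:
    delivery_id: UUID
    endpoint_id: UUID
    event_id: str
    event_type: str
    payload: dict
    status: DeliveryStatus = DeliveryStatus.PENDING
    attempt_count: int = 0
    next_retry_at: Optional[datetime] = None
    response_code: Optional[int] = None

# Example: a freshly registered endpoint starts ACTIVE with no failures.
ep = WebhookEndpoint(
    endpoint_id=uuid4(), user_id="u_1", url="https://example.com/hooks",
    secret="whsec_demo", event_types=["payment.succeeded"],
)
```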
Delivery Architecture
Event Source → Kafka (event_type topic) → Webhook Fanout Service
→ For each matching endpoint:
WebhookDelivery record created (PENDING)
→ Delivery Queue (Kafka or delayed job)
→ Webhook Delivery Worker
→ HTTP POST to endpoint URL
→ Update delivery status
Fanout: Event to Endpoints
When event E of type payment.succeeded occurs: query all WebhookEndpoints where user_id=owner AND event_types contains payment.succeeded AND status=ACTIVE. For large platforms (many endpoints), cache the endpoint lookup in Redis: key=endpoints:{user_id}:{event_type}, TTL=5min. For each matching endpoint, create a WebhookDelivery record and enqueue a delivery job.
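A minimal sketch of the fanout step, with a plain dict standing in for the Redis cache and a list of dicts standing in for the WebhookEndpoint table (the cache key format follows the text; the helper names and record shapes are illustrative):

```python
import uuid

endpoint_cache = {}  # stands in for Redis; in production each key has TTL=5min

def lookup_endpoints(endpoint_table, user_id, event_type):
    """Return ACTIVE endpoints subscribed to event_type, cache-aside style."""
    key = f"endpoints:{user_id}:{event_type}"  # cache key format from the text
    if key not in endpoint_cache:
        endpoint_cache[key] = [
            ep for ep in endpoint_table
            if ep["user_id"] == user_id
            and ep["status"] == "ACTIVE"
            and event_type in ep["event_types"]
        ]
    return endpoint_cache[key]

def fanout(endpoint_table, delivery_queue, event):
    """Create a PENDING WebhookDelivery per matching endpoint and enqueue it."""
    for ep in lookup_endpoints(endpoint_table, event["user_id"], event["type"]):
        delivery = {
            "delivery_id": str(uuid.uuid4()),
            "endpoint_id": ep["endpoint_id"],
            "event_id": event["event_id"],
            "event_type": event["type"],
            "payload": event["payload"],
            "status": "PENDING",
            "attempt_count": 0,
        }
        delivery_queue.append(delivery)  # persist the record, then enqueue the job
```

Persisting the WebhookDelivery record before enqueueing is what makes retries possible if a worker crashes mid-delivery.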
Delivery and Retry
Delivery worker sends HTTP POST to the endpoint URL with timeout=30s. On success (2xx response): update delivery status=SUCCESS, reset endpoint.failure_count=0. On failure (non-2xx, timeout, DNS failure): increment attempt_count, schedule retry with exponential backoff:
retry_delays = [5s, 30s, 2min, 10min, 30min, 2h, 6h, 24h]
# attempt_count 1: retry after 5s
# attempt_count 2: retry after 30s
# ...
# attempt_count 8: retry after 24h
# attempt_count > 8: status=ABANDONED
After N consecutive failures (e.g., 3 days of failures), disable the endpoint: status=DISABLED, notify the user via email.
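The backoff schedule above can be sketched as a small scheduling helper (dict-based delivery records for brevity; the delays and the ABANDONED cutoff follow the schedule in the text):

```python
from datetime import datetime, timedelta, timezone

# Mirrors the retry_delays schedule above: index = attempt_count - 1.
RETRY_DELAYS = [
    timedelta(seconds=5), timedelta(seconds=30), timedelta(minutes=2),
    timedelta(minutes=10), timedelta(minutes=30), timedelta(hours=2),
    timedelta(hours=6), timedelta(hours=24),
]

def schedule_retry(delivery, now=None):
    """After a failed attempt: bump attempt_count, back off or abandon."""
    now = now or datetime.now(timezone.utc)
    delivery["attempt_count"] += 1
    if delivery["attempt_count"] > len(RETRY_DELAYS):
        delivery["status"] = "ABANDONED"   # retries exhausted
        delivery["next_retry_at"] = None
    else:
        delivery["status"] = "PENDING"     # a scanner re-enqueues when due
        delivery["next_retry_at"] = now + RETRY_DELAYS[delivery["attempt_count"] - 1]
    return delivery
```

A background job then scans for PENDING deliveries with next_retry_at <= now and re-enqueues them.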
Payload Signing (HMAC)
Sign each payload with the endpoint’s secret key so the recipient can verify authenticity:
import hmac, hashlib, time

class ReplayAttack(Exception):
    pass

def sign_payload(secret, payload_bytes, timestamp):
    signed_content = f"{timestamp}.".encode() + payload_bytes
    signature = hmac.new(secret.encode(), signed_content, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={signature}"

# Delivery adds headers:
# Webhook-Signature: t=1620000000,v1=abc123...
# Webhook-Timestamp: 1620000000

# Recipient verifies (signature_str is the full Webhook-Signature header value):
def verify(secret, payload_bytes, timestamp_str, signature_str):
    if abs(time.time() - int(timestamp_str)) > 300:  # 5 min tolerance
        raise ReplayAttack("timestamp outside tolerance window")
    expected = sign_payload(secret, payload_bytes, timestamp_str)
    return hmac.compare_digest(expected, signature_str)
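A quick round trip of the scheme, restating the signing logic inline so the snippet is self-contained (the secret and payload values are illustrative):

```python
import hmac, hashlib, time

def sign(secret, payload, ts):
    """HMAC-SHA256 over '{timestamp}.{payload}', hex-encoded."""
    return hmac.new(secret.encode(), f"{ts}.".encode() + payload,
                    hashlib.sha256).hexdigest()

secret = "whsec_demo"  # shared secret established at endpoint registration
payload = b'{"event_id":"evt_1","type":"payment.succeeded"}'
ts = int(time.time())

# Sender attaches these headers to the POST:
headers = {
    "Webhook-Timestamp": str(ts),
    "Webhook-Signature": f"t={ts},v1={sign(secret, payload, ts)}",
}

# Recipient parses the header, checks freshness, recomputes, compares:
t_part, v1_part = headers["Webhook-Signature"].split(",")
recv_ts, recv_sig = int(t_part[2:]), v1_part[3:]
fresh = abs(time.time() - recv_ts) <= 300             # 5-minute tolerance
valid = hmac.compare_digest(sign(secret, payload, recv_ts), recv_sig)
print(fresh and valid)  # True for an untampered, timely payload
```

The constant-time compare_digest matters here: a plain `==` on hex strings can leak signature bytes through timing differences.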
Ordering and Idempotency
Webhook deliveries may arrive out of order due to retries. Include event_id and created_at in each payload. Recipients should use event_id for idempotency (deduplicate on their end) and created_at to detect out-of-order delivery. Provide a delivery_id to allow idempotent retries — if the recipient processed a delivery but the acknowledgment timed out, the re-delivered payload has the same delivery_id.
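On the recipient side, deduplication can be as simple as a seen-ID check before processing; a sketch (an in-memory set stands in for a durable store such as a database table keyed by event_id):

```python
processed = set()  # in production: a durable store, not process memory

def handle_webhook(event):
    """Recipient-side handler: deduplicate on event_id, then process once."""
    if event["event_id"] in processed:
        return "duplicate"            # a retry of something already handled
    processed.add(event["event_id"])
    # ... business logic (fulfill order, record payment, etc.) goes here ...
    return "processed"
```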
Key Design Decisions
- Kafka for event ingestion — decouples event sources from the fanout service
- WebhookDelivery record per delivery — full audit trail, enables re-delivery
- Exponential backoff with cap — respects failing endpoints, reduces thundering herd
- HMAC-SHA256 signature — recipient can verify authenticity without TLS inspection
- Timestamp in signature — prevents replay attacks with a 5-minute tolerance window
Frequently Asked Questions

How do webhooks differ from polling, and when should you use each?
Polling: the client periodically sends requests to the server to check for new events (GET /events?since=last_id). Simple to implement; the client controls the check frequency. Downsides: latency (up to the polling interval), wasted requests when there are no events, and server load from many polling clients. Webhooks (push): the server sends an HTTP POST to the client when an event occurs. Near-real-time (seconds), no wasted requests, server initiates. Downsides: requires the client to have a publicly reachable URL and to handle retries and failures. Use polling when the client is behind a firewall, events are infrequent and latency is not critical, or the client is a mobile app (push via APNs/FCM is better than HTTP webhooks). Use webhooks when real-time event delivery is needed, the client is a server-side application with a public URL, and you want to avoid the cost of polling at scale.

How do you ensure at-least-once delivery for webhooks?
At-least-once means every event is delivered at least once (possibly more than once on retries). Implementation: (1) Persist every WebhookDelivery record before attempting delivery, so that if the delivery worker crashes the record is still there for retry. (2) Only mark status=SUCCESS after receiving a 2xx response; if the delivery times out or receives a non-2xx, keep status=PENDING and schedule a retry. (3) A background job scans for PENDING deliveries with next_retry_at <= now and re-enqueues them. (4) Use exponential backoff: 5s, 30s, 2min, 10min, 30min, 2h, 6h, 24h. After 72 hours without success, mark ABANDONED. At-least-once also means recipients must be idempotent: include a delivery_id in each payload and deduplicate on the receiving end.

How does HMAC webhook signing work and why is it needed?
Without signing, a recipient cannot verify that a webhook came from the platform rather than from an attacker spoofing the source. HMAC (Hash-based Message Authentication Code) signing: the platform and the user share a secret key when the endpoint is registered. On each delivery, compute HMAC-SHA256(secret, timestamp + "." + payload_body) and include the signature and timestamp in the request headers (e.g., Webhook-Signature: t=1620000000,v1=abc123). The recipient recomputes the HMAC using their secret and compares with hmac.compare_digest() (constant-time comparison, which prevents timing attacks). Verify the timestamp is within 5 minutes to prevent replay attacks: an attacker who captures a valid payload cannot replay it after the window closes. Rotate secrets periodically, and support multiple active secrets during rotation.

How do you handle a webhook endpoint that is consistently failing?
Consecutive failures indicate the endpoint is broken, the receiving server is down, or the user's integration has a bug. Failure handling: (1) Track failure_count on the WebhookEndpoint; increment it on each failed delivery attempt. (2) After 3 consecutive days of failures (all retries exhausted, delivery status=ABANDONED), automatically disable the endpoint: status=DISABLED. (3) Notify the user via email ("Your webhook endpoint has been disabled after 72 hours of consecutive failures"), including a link to re-enable and a log of failed deliveries. (4) When the user re-enables the endpoint, offer to replay recent missed events (the last N events from the event log). (5) Circuit breaker pattern: if an endpoint fails 5 consecutive times in 10 minutes, skip it temporarily (circuit OPEN) and only retry after 30 minutes, which avoids hammering a temporarily down server.

How do you scale webhook delivery to 10 million events per day?
At 10M events/day = ~115 events/second average, with spikes to 10x = 1,150/second. Architecture: (1) Kafka topics per event type: ingest events at high throughput without backpressure on producers. (2) Fanout service: a Kafka consumer that, for each event, looks up matching webhooks (from the Redis cache) and enqueues WebhookDelivery jobs; with 100 matching endpoints per event, that is 115 * 100 = 11,500 delivery jobs/second. (3) Delivery workers: a horizontally scaled pool that consumes from the delivery job queue, makes the HTTP calls, and updates delivery status. HTTP calls are I/O-bound, so each worker can handle 50-100 concurrent requests (async I/O or a thread pool); 100 workers * 50 concurrent = 5,000 concurrent deliveries. (4) Delivery queue: Kafka with multiple partitions for parallelism, or a Redis-backed delayed job queue for retry scheduling. Monitor delivery latency P99, delivery success rate, and queue depth (backlog).