A webhook delivery system pushes event notifications to customer-configured HTTP endpoints — the foundation of Stripe, GitHub, Shopify, and Twilio’s integration ecosystem. Core challenges: reliable delivery with retries and exponential backoff, signing payloads so receivers can verify authenticity, handling slow or failing endpoints without blocking other deliveries, and giving customers visibility into delivery attempts and failures.
Core Data Model
CREATE TABLE WebhookEndpoint (
endpoint_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL,
url TEXT NOT NULL,
signing_secret TEXT NOT NULL, -- HMAC key, shown once at creation
event_types TEXT[] NOT NULL DEFAULT '{}', -- subscribed event types
is_active BOOLEAN NOT NULL DEFAULT TRUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_webhook_user ON WebhookEndpoint (user_id);
CREATE TYPE delivery_status AS ENUM ('pending','succeeded','failed','retrying');
CREATE TABLE WebhookDelivery (
delivery_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
endpoint_id UUID NOT NULL REFERENCES WebhookEndpoint(endpoint_id),
event_type TEXT NOT NULL,
event_id TEXT NOT NULL, -- idempotency key
payload JSONB NOT NULL,
status delivery_status NOT NULL DEFAULT 'pending',
attempt_count SMALLINT NOT NULL DEFAULT 0,
next_attempt TIMESTAMPTZ,
last_http_code SMALLINT,
last_error TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
delivered_at TIMESTAMPTZ
);
CREATE INDEX idx_delivery_due ON WebhookDelivery (next_attempt)
WHERE status IN ('pending','retrying');
CREATE UNIQUE INDEX idx_delivery_idempotency ON WebhookDelivery (endpoint_id, event_id);
Enqueuing Deliveries on Event
import hashlib, hmac, json, time
from uuid import uuid4
def enqueue_webhook_deliveries(conn, event_type: str, event_id: str, payload: dict):
"""
Find all active endpoints subscribed to this event type and create
a delivery record for each. Fan-out happens here; actual HTTP happens async.
"""
with conn.cursor() as cur:
cur.execute("""
SELECT endpoint_id FROM WebhookEndpoint
WHERE is_active = TRUE
AND (event_types = '{}' OR %s = ANY(event_types))
""", (event_type,))
endpoints = [row[0] for row in cur.fetchall()]
if not endpoints:
return
now_utc = __import__('datetime').datetime.now(__import__('datetime').timezone.utc)
with conn.cursor() as cur:
for endpoint_id in endpoints:
cur.execute("""
INSERT INTO WebhookDelivery
(delivery_id, endpoint_id, event_type, event_id, payload, next_attempt)
VALUES (%s,%s,%s,%s,%s,%s)
ON CONFLICT (endpoint_id, event_id) DO NOTHING
""", (str(uuid4()), endpoint_id, event_type, event_id,
__import__('psycopg2').extras.Json(payload), now_utc))
conn.commit()
Delivery Worker with Exponential Backoff
import requests, time
from datetime import datetime, timezone, timedelta
RETRY_SCHEDULE = [0, 30, 60, 300, 1800, 7200, 86400] # seconds after first attempt
MAX_ATTEMPTS = len(RETRY_SCHEDULE)
DELIVERY_TIMEOUT = 30 # seconds
def run_delivery_worker(conn):
import time as _time
while True:
deliver_due(conn)
_time.sleep(5)
def deliver_due(conn):
now = datetime.now(timezone.utc)
with conn.cursor() as cur:
cur.execute("""
UPDATE WebhookDelivery
SET status = 'retrying', attempt_count = attempt_count + 1
WHERE delivery_id IN (
SELECT delivery_id FROM WebhookDelivery
WHERE status IN ('pending','retrying')
AND next_attempt <= %s
ORDER BY next_attempt ASC
LIMIT 50
FOR UPDATE SKIP LOCKED
)
RETURNING delivery_id, endpoint_id, payload, event_type, event_id, attempt_count
""", (now,))
jobs = cur.fetchall()
conn.commit()
for delivery_id, endpoint_id, payload, event_type, event_id, attempt in jobs:
endpoint = load_endpoint(conn, endpoint_id)
if not endpoint:
continue
attempt_delivery(conn, delivery_id, endpoint, payload, event_type, event_id, attempt)
def attempt_delivery(conn, delivery_id, endpoint, payload, event_type, event_id, attempt):
url = endpoint['url']
secret = endpoint['signing_secret']
body = json.dumps(payload, default=str)
timestamp = str(int(time.time()))
# HMAC-SHA256 signature
sig_payload = f"{timestamp}.{body}"
signature = hmac.new(secret.encode(), sig_payload.encode(), hashlib.sha256).hexdigest()
headers = {
"Content-Type": "application/json",
"X-Webhook-Timestamp": timestamp,
"X-Webhook-Signature": f"sha256={signature}",
"X-Event-Type": event_type,
"X-Event-ID": event_id,
}
try:
resp = requests.post(url, data=body, headers=headers,
timeout=DELIVERY_TIMEOUT, allow_redirects=False)
success = 200 <= resp.status_code < 300
except requests.exceptions.RequestException as e:
success = False
resp = None
now = datetime.now(timezone.utc)
if success:
with conn.cursor() as cur:
cur.execute("""
UPDATE WebhookDelivery
SET status='succeeded', delivered_at=%s,
last_http_code=%s
WHERE delivery_id=%s
""", (now, resp.status_code if resp else None, delivery_id))
conn.commit()
else:
next_delay = RETRY_SCHEDULE[attempt] if attempt NOW() - interval '7 days'
""", (endpoint_id,))
fail_count = cur.fetchone()[0]
if fail_count >= 5:
with conn.cursor() as cur:
cur.execute(
"UPDATE WebhookEndpoint SET is_active=FALSE WHERE endpoint_id=%s",
(endpoint_id,)
)
conn.commit()
notify_endpoint_disabled(endpoint_id)
Key Interview Points
- Fan-out at event time, deliver async: Creating delivery records synchronously (during the triggering event) is fast — just DB inserts. Actual HTTP delivery is asynchronous via the worker. This means the original event handler returns immediately regardless of how many endpoints are subscribed or how slow they are. Fan-out records serve as the durable queue — no separate message broker needed for webhook delivery.
- Exponential backoff schedule: Retry delays of 0, 30s, 60s, 5m, 30m, 2h, 24h give a slow endpoint time to recover without retrying every second. After 7 attempts (~27 hours total), mark as failed. Notify the user: their endpoint is unreachable and deliveries are being dropped. The retry schedule is stored as a constant array — next_delay = RETRY_SCHEDULE[attempt_count].
- HMAC signature verification: Include the request timestamp in the signed payload (timestamp.body) and validate that the timestamp is within ±5 minutes. This prevents replay attacks: a valid webhook captured and re-sent 10 minutes later fails the timestamp check. Receivers verify: parse the X-Webhook-Signature header, compute HMAC(secret, timestamp.body), compare with constant-time hmac.compare_digest().
- Slow endpoint isolation: A single slow endpoint (30s timeouts) must not delay deliveries to other endpoints. Separate delivery worker pools per endpoint (or per user tier), or use SKIP LOCKED with per-endpoint parallelism limits. Alternatively, use a dedicated async HTTP client (httpx async) with concurrency limits so one slow endpoint doesn’t block worker threads.
- Delivery dashboard for customers: Show recent deliveries with status, HTTP response code, response body (first 500 chars), and retry timeline. Allow manual re-delivery of any delivery (creates a new delivery record for the same event_id — the receiver must be idempotent). This is what makes Stripe’s and GitHub’s webhook UX excellent — debugging failed webhooks is self-service.
Webhook delivery and event notification system design is discussed in Stripe system design interview questions.
Webhook delivery and merchant event notification design is covered in Shopify system design interview preparation.
Webhook delivery and integration event system design is discussed in Atlassian system design interview guide.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering