Webhook Service Low-Level Design: Delivery Guarantees, Retry Logic, and Signature Verification

A webhook service enables event-driven integrations by reliably delivering HTTP POST notifications to subscriber endpoints whenever internal events occur. The core challenges are at-least-once delivery in the face of unreachable endpoints, scaling fan-out when thousands of subscribers listen to a single event type, and providing operators with visibility into delivery health.

Requirements

Functional

  • Accept event publications from internal producers and fan out to all active subscriber endpoints
  • Guarantee at-least-once delivery with configurable retry schedules
  • Sign each delivery with an HMAC-SHA256 signature so subscribers can verify authenticity
  • Record delivery attempts, response codes, and latencies in a queryable delivery log
  • Allow subscribers to register, update, and deactivate webhook endpoints via API
  • Automatically disable endpoints after N consecutive failures (circuit breaker)

Non-Functional

  • First delivery attempt within 2 seconds of event publication at p95
  • Support 10,000 active subscriptions and 5,000 events per second
  • Delivery log retention of 30 days

Data Model

  • subscriptions: subscription_id (UUID), owner_id, event_type (TEXT), endpoint_url (TEXT), secret (BYTEA — stored encrypted), status (ENUM: active, paused, disabled), consecutive_failures (INT), created_at
  • events: event_id (UUID), event_type (TEXT), payload (JSONB), source_service (TEXT), published_at (TIMESTAMP)
  • delivery_attempts: attempt_id (UUID), event_id, subscription_id, attempt_number (INT), scheduled_at, attempted_at, response_code (INT), response_body (TEXT), duration_ms (INT), status (ENUM: pending, success, failed, abandoned)

The events and delivery_attempts tables are both time-series-friendly and partitioned by week. Old partitions are dropped on a rolling basis to enforce the 30-day retention window.
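The rolling drop can be sketched as a small date calculation. This is a minimal illustration, assuming partitions start on Mondays and are identified by their start date; the function names are hypothetical and not part of the design above.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # retention window from the requirements


def week_start(d: date) -> date:
    """Return the Monday of the week containing d (assumed partition boundary)."""
    return d - timedelta(days=d.weekday())


def droppable_partitions(partition_starts, today: date):
    """Return partitions whose newest possible row is older than the window.

    A weekly partition covers [start, start + 7 days). It is safe to drop
    only when that whole range lies before the retention cutoff.
    """
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [s for s in partition_starts if s + timedelta(days=7) <= cutoff]
```

Because a partition is dropped only once its newest row has aged out, the oldest rows in it can live for up to about 37 days under this scheme.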

Core Algorithms

HMAC Signature Generation

Each delivery includes a header X-Webhook-Signature: sha256=HEX. The signature is computed as HMAC-SHA256 over the concatenation of the timestamp (Unix epoch string) and the raw request body, using the subscription-specific secret as the key. The timestamp must travel with the request so that subscribers can recompute the HMAC; they compare the result against the header with a constant-time equality check to prevent timing attacks. Subscribers reject deliveries whose timestamp falls outside a 5-minute window, which blocks replay of captured requests.
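A minimal sketch of signing and verification using only the Python standard library. It assumes the timestamp and body are concatenated directly with no separator, which is one reading of the description; the function names are hypothetical, and how the timestamp reaches the subscriber is left to the caller.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # the 5-minute replay window


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """Return the X-Webhook-Signature header value for one delivery."""
    message = timestamp.encode("ascii") + body  # timestamp followed by raw body
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify(secret: bytes, timestamp: str, body: bytes, signature: str,
           now: Optional[float] = None) -> bool:
    """Subscriber-side check: timestamp is fresh and the signature matches."""
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False  # malformed timestamp
    if abs(now - sent_at) > TOLERANCE_SECONDS:
        return False  # outside the replay window
    expected = sign(secret, timestamp, body)
    # compare_digest avoids leaking where the two values first differ
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The subscriber has to run `verify` against the raw bytes it received, before any JSON parsing, since re-serializing the body can change the bytes and break the signature.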

Exponential Backoff Retry

On delivery failure (non-2xx response or connection timeout), the worker schedules the next attempt with delay = min(initial_delay * 2^(attempt_number - 1) + jitter, max_delay). Default configuration: initial_delay 30 seconds, max_delay 3600 seconds (1 hour), max_attempts 10. Jitter is a random value between 0 and 10 percent of the exponential term, which spreads retries out and prevents a thundering herd across many subscriptions.
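The delay formula can be written out directly. This is a sketch under the default configuration above; the function name and the injectable `rng` parameter are illustrative choices, not part of the design.

```python
import random


def retry_delay(attempt_number: int, initial_delay: float = 30.0,
                max_delay: float = 3600.0, jitter_fraction: float = 0.10,
                rng=random) -> float:
    """Seconds to wait before the next attempt, after attempt_number has failed."""
    base = initial_delay * 2 ** (attempt_number - 1)   # 30, 60, 120, ...
    jitter = rng.uniform(0, jitter_fraction * base)    # 0 to 10% of the base
    return min(base + jitter, max_delay)               # cap applied last
```

With these defaults the base delay runs 30, 60, 120, 240, 480, 960, 1920 seconds and reaches the one-hour cap from the eighth failed attempt onward. Because the cap is applied after the jitter is added, capped retries all wait exactly max_delay, so the jitter only spreads out the earlier attempts.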

Circuit Breaker

After 5 consecutive failures, the subscription status transitions to disabled. An automated re-enable probe runs every 24 hours: it sends a synthetic test event and re-activates the subscription on success. Owners are notified via email when a subscription is disabled.
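The state transitions can be sketched as two small functions. This assumes the counter resets on any successful delivery and that the caller sends the notification email when a subscription is disabled; the class and function names are hypothetical.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 5  # consecutive failures before disabling


@dataclass
class Subscription:
    status: str = "active"          # active, paused, or disabled
    consecutive_failures: int = 0


def record_result(sub: Subscription, success: bool) -> bool:
    """Update the failure counter; return True if this call disabled the subscription."""
    if success:
        sub.consecutive_failures = 0
        return False
    sub.consecutive_failures += 1
    if sub.status == "active" and sub.consecutive_failures >= FAILURE_THRESHOLD:
        sub.status = "disabled"
        return True  # caller would notify the owner here
    return False


def apply_probe(sub: Subscription, probe_succeeded: bool) -> None:
    """Apply the outcome of the periodic re-enable probe."""
    if sub.status == "disabled" and probe_succeeded:
        sub.status = "active"
        sub.consecutive_failures = 0
```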

Scalability and Architecture

Event publication is synchronous to Kafka (producer with acks=all). A fan-out service consumes the event topic, queries active subscriptions matching the event_type, and enqueues one delivery task per subscription into a Redis-backed queue (using sorted sets keyed by scheduled_at for delay support). A pool of delivery workers pops tasks, performs the HTTP POST with a 10-second timeout, and writes the result to delivery_attempts.

  • The fan-out worker is stateless and horizontally scalable; partitioning delivery tasks by subscription_id helps preserve per-subscription ordering, although a retried delivery can still arrive after later events
  • Delivery workers maintain a per-endpoint connection pool (HTTP keep-alive) to reduce TLS overhead for high-frequency deliveries
  • Failed tasks are re-enqueued with the computed backoff delay using ZADD score = Unix timestamp of next attempt
  • A sweeper process scans for pending tasks past their scheduled_at and moves them to the active queue
  • Metrics: Prometheus counters for delivery success rate, histogram for delivery latency, gauge for queue depth per event type
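The delay-queue behaviour of the sorted set can be illustrated with an in-memory stand-in. This sketch uses a heap ordered by scheduled time rather than Redis itself, so it only shows the scheduling logic; the class and method names are hypothetical.

```python
import heapq


class DelayQueue:
    """In-memory stand-in for a sorted set scored by next-attempt timestamp."""

    def __init__(self):
        self._heap = []   # entries are (run_at, sequence, task_id)
        self._seq = 0     # tie-breaker so equal timestamps pop in insertion order

    def schedule(self, task_id: str, run_at: float) -> None:
        """Enqueue a task to become due at run_at (Unix timestamp)."""
        heapq.heappush(self._heap, (run_at, self._seq, task_id))
        self._seq += 1

    def pop_due(self, now: float, limit: int = 100) -> list:
        """Remove and return up to limit task ids whose run_at has passed."""
        due = []
        while self._heap and self._heap[0][0] <= now and len(due) < limit:
            _, _, task_id = heapq.heappop(self._heap)
            due.append(task_id)
        return due
```

A failed delivery would be rescheduled by calling `schedule` again with the current time plus the backoff delay. Unlike a real sorted set, this stand-in does not deduplicate members, and it offers no atomicity across multiple workers.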

API Design

Subscription Management

  • POST /v1/webhooks/subscriptions — body: {event_type, endpoint_url, secret}. Returns subscription_id and a generated signing secret if not provided.
  • GET /v1/webhooks/subscriptions/{subscription_id} — returns subscription metadata (secret is never returned after creation)
  • PATCH /v1/webhooks/subscriptions/{subscription_id} — update endpoint_url or rotate secret
  • DELETE /v1/webhooks/subscriptions/{subscription_id} — deactivate subscription
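Subscription creation and endpoint updates are a natural place to reject URLs that resolve to internal addresses. The sketch below uses the standard-library `ipaddress` module; the function names, the decision to accept both http and https, and the injectable `resolver` are assumptions made for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_internal(ip: str) -> bool:
    """True for private (RFC 1918), loopback, link-local and similar addresses."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return True  # unparseable address: treat as unsafe
    return (addr.is_private or addr.is_loopback or addr.is_link_local
            or addr.is_reserved or addr.is_multicast or addr.is_unspecified)


def resolve_host(host: str) -> list:
    """Resolve a hostname to the set of IP addresses it currently maps to."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})


def is_allowed_endpoint(url: str, resolver=resolve_host) -> bool:
    """Accept only http(s) URLs whose host resolves exclusively to public IPs."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = resolver(parts.hostname)
    except OSError:
        return False  # resolution failed
    return bool(addresses) and not any(is_internal(a) for a in addresses)
```

A check at registration time does not cover DNS records that change later, so repeating the same check in the delivery worker before each POST is a reasonable extension.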

Delivery Logs

GET /v1/webhooks/subscriptions/{subscription_id}/deliveries?start=ISO8601&status=STRING&limit=INT — paginated delivery log with request/response details for debugging.

Manual Retry

POST /v1/webhooks/deliveries/{attempt_id}/retry — re-enqueue a failed delivery immediately, bypassing backoff schedule. Useful for operator-initiated recovery.

Interview Tips

Interviewers often ask about exactly-once vs at-least-once delivery. Explain that effectively exactly-once processing comes from subscriber idempotency (expose event_id in the payload so subscribers can deduplicate) rather than from guarantees at the transport layer. Discuss the ordering problem: if two events for the same resource are delivered out of order, subscribers should use event timestamps rather than arrival order to apply state changes. Also address endpoint security: validate that subscriber-provided URLs do not point to internal addresses such as the RFC 1918 ranges (SSRF prevention) before storing the subscription.
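The subscriber-side half of this advice can be sketched as a small handler that deduplicates on event_id and ignores events older than the state it already holds. The `resource_id` and `data` fields, the numeric timestamps, and the in-memory storage are assumptions for illustration; a real subscriber would persist both structures.

```python
class ResourceStore:
    """Idempotent, order-tolerant event handler for a webhook subscriber."""

    def __init__(self):
        self.seen_event_ids = set()
        self.state = {}  # resource_id -> (published_at, data)

    def handle(self, event: dict) -> str:
        """Apply one event; return 'applied', 'duplicate', or 'stale'."""
        if event["event_id"] in self.seen_event_ids:
            return "duplicate"  # redelivery of an event already processed
        self.seen_event_ids.add(event["event_id"])

        current = self.state.get(event["resource_id"])
        if current is not None and current[0] >= event["published_at"]:
            return "stale"  # a newer event for this resource was already applied

        self.state[event["resource_id"]] = (event["published_at"], event["data"])
        return "applied"
```

Delivering the newer of two events first, then the older one, then the newer one again yields "applied", "stale", and "duplicate", and the stored state stays at the newer value throughout.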

