Notification Delivery Service Low-Level Design: Multi-Channel Dispatch, Priority Queues, and Delivery Tracking

Notification Schema

Every notification is a structured record created before any delivery is attempted:

  • notification_id: UUID or Snowflake — unique identifier, used for deduplication
  • user_id: target recipient
  • type: marketing | transactional | security — determines default channel policy and override permissions
  • title, body: rendered content for display channels
  • channels_requested[]: which channels the sender wants to use (push, email, SMS, in-app)
  • data{}: arbitrary key-value payload for deep-link routing in the receiving app
  • priority: low | normal | high | critical — controls queue routing and retry urgency
  • idempotency_key: caller-supplied key for deduplication — prevents duplicate sends on retry
  • created_at: used for TTL enforcement and analytics
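The schema above can be sketched as a typed record. This is a minimal illustration; field names follow the list above, while the `Priority` enum values and defaults are assumptions:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class Priority(Enum):
    LOW = 0
    NORMAL = 1
    HIGH = 2
    CRITICAL = 3

@dataclass
class Notification:
    user_id: str
    type: str                       # marketing | transactional | security
    title: str
    body: str
    channels_requested: list[str]   # subset of {"push", "email", "sms", "in_app"}
    priority: Priority = Priority.NORMAL
    data: dict = field(default_factory=dict)          # deep-link payload
    idempotency_key: Optional[str] = None             # caller-supplied dedup key
    notification_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```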

Channel Dispatch Flow

When a notification is created, the dispatch pipeline executes before any message leaves the system:

  1. Preference check: load the user's per-channel opt-in/opt-out settings from the preferences store
  2. Channel filtering: intersect channels_requested with user's opted-in channels; security notifications bypass marketing opt-outs
  3. Priority resolution: determine the effective priority; security notifications always escalate to high/critical
  4. Queue routing: publish one task per approved channel to the appropriate priority queue

User preferences are cached in Redis (TTL 5 minutes) to avoid a database read on every notification.
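Steps 2 and 3 of the pipeline (channel filtering and priority resolution) can be sketched as pure functions; the function names are illustrative, and the preference-cache lookup is omitted:

```python
def resolve_channels(notif_type: str, channels_requested: list, opted_in: set) -> list:
    """Step 2: intersect requested channels with the user's opt-ins.
    Security notifications bypass opt-outs."""
    if notif_type == "security":
        return list(channels_requested)
    return [c for c in channels_requested if c in opted_in]

def resolve_priority(notif_type: str, requested: str) -> str:
    """Step 3: security notifications always escalate to at least high."""
    if notif_type == "security" and requested in ("low", "normal"):
        return "high"
    return requested
```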

Priority Queues

Separate queues per priority level ensure critical notifications are never starved by marketing volume:

  • Critical queue: polled continuously; workers assigned exclusively — used for security alerts, 2FA codes, payment failures
  • High queue: polled every few seconds; shared workers with critical fallback
  • Normal queue: polled every 30 seconds; standard transactional notifications
  • Low queue: polled infrequently (minutes); marketing, digest emails, weekly reports

During traffic spikes, low-priority queues grow while critical queues drain immediately. This is intentional — a promotional email arriving 10 minutes late is acceptable; a password reset code delayed by 10 minutes is not.
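The per-priority routing above can be expressed as a small table; the queue names are hypothetical and a poll interval of 0 denotes continuous polling:

```python
# Priority tiers -> queue names and worker poll intervals (illustrative values).
QUEUE_CONFIG = {
    "critical": {"queue": "notify.critical", "poll_interval_s": 0},    # dedicated workers
    "high":     {"queue": "notify.high",     "poll_interval_s": 5},
    "normal":   {"queue": "notify.normal",   "poll_interval_s": 30},
    "low":      {"queue": "notify.low",      "poll_interval_s": 300},
}

def route(priority: str) -> str:
    """Step 4 of the dispatch pipeline: pick the queue for a resolved priority."""
    return QUEUE_CONFIG[priority]["queue"]
```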

Push Notification Worker

Push workers interface with mobile platform providers:

  • APNs (Apple Push Notification service): HTTP/2 API with JWT token authentication (rotate tokens every 60 minutes); connection pooling is critical because APNs limits connections per app; each connection can carry up to 1000 notifications per second
  • FCM (Firebase Cloud Messaging): HTTP v1 API with OAuth 2.0 service account; supports topic messaging for broadcast use cases
  • Token lifecycle: APNs returns status 410 (Gone) for unregistered tokens — remove from DB immediately to avoid repeated failed sends; FCM provides a new canonical registration token in the response when a token is refreshed
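The token-lifecycle rules above can be sketched as a decision helper. The function name and the ("remove" / "replace" / "keep") action tuples are illustrative stand-ins for real DB operations:

```python
def handle_push_response(provider: str, status: int, token: str, canonical_token=None):
    """Decide token bookkeeping from a provider response.
    APNs 410 (Gone) -> token is unregistered, delete it immediately.
    FCM may return a new canonical registration token -> store it."""
    if provider == "apns" and status == 410:
        return ("remove", token)
    if provider == "fcm" and canonical_token and canonical_token != token:
        return ("replace", canonical_token)
    return ("keep", token)
```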

Email and SMS Workers

Email worker: sends via SES or SendGrid transactional API. Rate limits are per sending domain (e.g., 14 sends/second on SES default). Handle bounce callbacks via SNS webhook — hard bounces must be removed from the active address list immediately to protect domain reputation.
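The hard-bounce rule can be sketched as follows. The event shape is simplified (real SES notifications nest recipients under `bounce.bouncedRecipients`); SES does distinguish "Permanent" and "Transient" bounce types:

```python
def process_bounce(bounce_type: str, recipients: list, suppression_list: set) -> None:
    """Hard ('Permanent') bounces go straight to the suppression list to protect
    domain reputation; soft ('Transient') bounces are left eligible for retry."""
    if bounce_type == "Permanent":
        suppression_list.update(recipients)  # never send to these addresses again
```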

SMS worker: sends via Twilio or Amazon SNS. Phone numbers must be in E.164 format (+14155551234). Delivery receipts arrive asynchronously via webhook — update notification status on receipt. Maintain an opt-out registry: numbers that replied STOP must never be messaged again (regulatory requirement in most jurisdictions).
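The E.164 check and opt-out gate can be combined into one pre-send guard; the regex encodes E.164's shape (a "+", a non-zero first digit, at most 15 digits total):

```python
import re

E164_RE = re.compile(r"^\+[1-9]\d{1,14}$")  # '+', then 2-15 digits, no leading zero

def can_send_sms(phone: str, opt_out_registry: set) -> bool:
    """Reject malformed numbers and any number that replied STOP."""
    return bool(E164_RE.match(phone)) and phone not in opt_out_registry
```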

In-App Notification Worker

In-app notifications are stored in a notifications table and delivered over WebSocket if the user is connected:

  • Write the notification record with status PENDING
  • Check the presence service — if the user has an active WebSocket connection, push immediately and mark DELIVERED
  • If offline, leave as PENDING; client fetches unread notifications on next app open
  • In-app notifications do not require provider integration and have effectively zero delivery cost
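The four steps above can be sketched against a single in-memory stand-in for the presence service, notifications table, and WebSocket server (illustration only; real code would hit three separate services):

```python
class InMemoryStore:
    """Stand-in for presence service + notifications table + WebSocket server."""
    def __init__(self, online_users):
        self.online = set(online_users)
        self.status = {}    # notification_id -> status
        self.pushed = []    # (user_id, notification_id) pairs sent over WebSocket

def deliver_in_app(store: InMemoryStore, notification_id: str, user_id: str) -> None:
    store.status[notification_id] = "PENDING"            # write the record first
    if user_id in store.online:                          # presence check
        store.pushed.append((user_id, notification_id))  # push over WebSocket
        store.status[notification_id] = "DELIVERED"
    # else: stays PENDING; the client fetches unread items on next app open
```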

Delivery Status Tracking

Each notification per channel follows a state machine:

PENDING → DISPATCHED → DELIVERED
                    ↘ FAILED
  • PENDING: queued, not yet sent to provider
  • DISPATCHED: submitted to provider API successfully; awaiting delivery confirmation
  • DELIVERED: provider confirmed delivery (push/SMS delivery receipt, email open event, in-app ACK)
  • FAILED: non-retryable error or max retries exhausted

Status transitions are written to a notification_events log table for auditability and analytics.
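The state machine and event log can be sketched as a transition table plus a guard; the `VALID_TRANSITIONS` name and list-based log are illustrative (a real system would append to the notification_events table):

```python
VALID_TRANSITIONS = {
    "PENDING":    {"DISPATCHED"},
    "DISPATCHED": {"DELIVERED", "FAILED"},
    "DELIVERED":  set(),   # terminal
    "FAILED":     set(),   # terminal
}

def transition(events_log: list, notification_id: str, channel: str,
               old: str, new: str) -> str:
    """Apply one state transition, recording it for audit/analytics."""
    if new not in VALID_TRANSITIONS[old]:
        raise ValueError(f"illegal transition {old} -> {new}")
    events_log.append((notification_id, channel, old, new))
    return new
```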

Retry Policy with Exponential Backoff

Transient failures (provider rate limit, network timeout) are retried with exponential backoff:

  • Retry delays: 1s, 2s, 4s, 8s, 16s — max 5 attempts before marking FAILED
  • Add jitter (±20%) to prevent thundering herd when many notifications retry simultaneously after a provider outage

Non-retryable errors terminate immediately without retry:

  • Invalid device token (push) — token is stale, remove from DB
  • User opted out (email/SMS) — add to suppression list
  • Invalid phone number format
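The retry schedule, jitter, and non-retryable short-circuit can be sketched together. The error-code strings are hypothetical names for the cases listed above:

```python
import random

BASE_DELAYS = [1, 2, 4, 8, 16]  # seconds; max 5 attempts before marking FAILED
NON_RETRYABLE = {"invalid_token", "opted_out", "invalid_phone"}  # illustrative codes

def retry_delay(error_code: str, attempt: int, jitter: float = 0.2):
    """Return the delay before retry `attempt` (0-indexed), with +/-20% jitter
    to avoid a thundering herd. Returns None when the error is non-retryable
    or retries are exhausted -- the caller then marks the send FAILED."""
    if error_code in NON_RETRYABLE or attempt >= len(BASE_DELAYS):
        return None
    return BASE_DELAYS[attempt] * random.uniform(1 - jitter, 1 + jitter)
```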

Deduplication and Analytics

Deduplication: before dispatching any channel task, check Redis with SET notification:{idempotency_key} 1 NX EX 86400. If the key already exists, the notification was already sent — skip silently. This prevents duplicate sends caused by upstream retries or at-least-once queue semantics.
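The SET NX EX check can be sketched as below. With real redis-py, `client.set(key, 1, nx=True, ex=86400)` has the same contract (truthy on first set, None when the key already exists); the `FakeRedis` class is an in-memory stand-in for illustration:

```python
import time

class FakeRedis:
    """In-memory stand-in for Redis SET ... NX EX -- illustration only."""
    def __init__(self):
        self.store = {}  # key -> (value, expiry_deadline or None)

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry and entry[1] is not None and now >= entry[1]:
            del self.store[key]                      # evict expired key
        if nx and key in self.store:
            return None                              # SET NX fails: duplicate
        self.store[key] = (value, now + ex if ex else None)
        return True

def should_dispatch(redis_client, idempotency_key: str) -> bool:
    """True on first sight of the key; False means skip silently (duplicate)."""
    return bool(redis_client.set(f"notification:{idempotency_key}", 1,
                                 nx=True, ex=86400))
```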

Analytics dashboard: aggregate delivery events to compute per-channel delivery rate, per-notification-type open rate, failure breakdown by error code, and latency percentiles from created_at to DELIVERED. Sudden delivery rate drops indicate provider outages or certificate expiry.
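One of the dashboard metrics, the created_at-to-DELIVERED latency percentile, can be computed with a simple nearest-rank method (one of several percentile definitions; the function name is illustrative):

```python
import math
from datetime import datetime, timedelta

def delivery_latency_percentile(events, q: float):
    """Nearest-rank q-th percentile of created_at -> DELIVERED latency, in seconds.
    `events` is an iterable of (created_at, delivered_at) datetime pairs."""
    lats = sorted((delivered - created).total_seconds()
                  for created, delivered in events)
    if not lats:
        return None
    rank = max(1, math.ceil(q / 100 * len(lats)))   # 1-indexed nearest rank
    return lats[rank - 1]
```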

Frequently Asked Questions

How do you design a multi-channel notification dispatcher that supports push, email, SMS, and in-app?

Model the dispatch pipeline as: (1) a `notification_requests` intake API that accepts (user_id, event_type, payload) and writes to an intake queue; (2) a router service that loads user channel preferences and opted-in channels from a preferences store (Redis cache backed by Postgres), then fans out to per-channel queues (push_queue, email_queue, sms_queue, inapp_queue); (3) per-channel workers that call the respective provider (FCM, SES, Twilio, internal WebSocket server). Store per-channel worker output in a `notification_deliveries` table (notification_id, channel, status, provider_message_id, timestamp). Channel preference logic handles fallback: if push is not available (no device token), fall back to email. De-duplicate at intake using an idempotency key so upstream retries don't produce duplicate notifications.

How do you implement priority queues for notifications so that critical alerts aren't delayed by bulk sends?

Assign each notification a priority (e.g., CRITICAL, HIGH, NORMAL, LOW) at intake based on event_type. Route to separate queues per priority (critical_push_queue, bulk_push_queue) rather than a single queue with priority headers — separate queues allow independent scaling of consumers and prevent head-of-line blocking. Allocate consumer thread pools proportionally: e.g., 8 workers on critical, 2 on bulk. For Kafka-backed systems, use separate topics per priority and assign more partitions + consumer instances to higher-priority topics. Apply rate limiting on the bulk tier (e.g., 1000/s per channel) without throttling critical. In the database, index `notification_requests` on (priority DESC, created_at ASC) for any polling-based workers. Monitor queue depth and consumer lag per priority tier as primary SLIs; alert if critical queue lag exceeds 5 seconds.

How do you track delivery status across multiple providers and surface it to callers?

Each channel worker writes an initial `notification_deliveries` row with status='sent' and the provider's message ID immediately after a successful API call. Providers deliver status callbacks (FCM delivery receipts, SES SNS bounce/delivery events, Twilio webhooks) to a callback ingestion endpoint, which updates the delivery row to 'delivered' or 'failed' with a failure reason code. For providers that don't support callbacks (some SMS carriers), poll the provider status API with exponential backoff up to a maximum staleness window (e.g., 24h), then mark as 'unknown'. Expose a status API: GET /notifications/{id}/status returns a rollup across all channels. Implement a dead-letter queue for failed deliveries with a retry policy (3 attempts with exponential backoff); after exhausting retries, emit a `notification.failed` event for upstream alerting or fallback channel escalation.

How do you prevent notification storms and implement user-level rate limiting?

Apply two layers of rate limiting. First, per-user frequency capping: use a sliding window counter in Redis (INCR + EXPIRE or a sorted set with timestamps) keyed by (user_id, channel, window). If a user has already received N notifications of a given priority in the window (e.g., 5 push/hour for NORMAL), suppress or defer the new notification and log the suppression. Second, per-channel global throughput limiting: a token bucket in Redis (or a rate-limiting sidecar like Envoy) enforces provider SLA limits (e.g., FCM: 600K/min). For bulk campaigns that could produce millions of notifications simultaneously, use a scheduled-dispatch pattern: write all intended notifications to a `scheduled_notifications` table and drain them through a controlled worker at a rate that respects global limits, rather than enqueuing all at once. Alert on suppression rate spikes as a leading indicator of upstream event loop bugs.
