Notification Delivery Service Low-Level Design: Multi-Channel Dispatch, Priority Queues, and Delivery Tracking

Notification Schema

Every notification is a structured record created before any delivery is attempted:

  • notification_id: UUID or Snowflake — unique identifier, used for deduplication
  • user_id: target recipient
  • type: marketing | transactional | security — determines default channel policy and override permissions
  • title, body: rendered content for display channels
  • channels_requested[]: which channels the sender wants to use (push, email, SMS, in-app)
  • data{}: arbitrary key-value payload for deep-link routing in the receiving app
  • priority: low | normal | high | critical — controls queue routing and retry urgency
  • idempotency_key: caller-supplied key for deduplication — prevents duplicate sends on retry
  • created_at: used for TTL enforcement and analytics
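The schema above can be sketched as a typed record. This is a minimal illustration; field names follow the list above, while the `Priority` enum values and defaults are assumptions:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class Priority(Enum):
    LOW = 0
    NORMAL = 1
    HIGH = 2
    CRITICAL = 3

@dataclass
class Notification:
    user_id: str
    type: str                       # marketing | transactional | security
    title: str
    body: str
    channels_requested: list[str]   # subset of {"push", "email", "sms", "in_app"}
    priority: Priority = Priority.NORMAL
    data: dict = field(default_factory=dict)          # deep-link payload
    idempotency_key: Optional[str] = None             # caller-supplied dedup key
    notification_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```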

Channel Dispatch Flow

When a notification is created, the dispatch pipeline executes before any message leaves the system:

  1. Preference check: load the user's per-channel opt-in/opt-out settings from the preferences store
  2. Channel filtering: intersect channels_requested with user's opted-in channels; security notifications bypass marketing opt-outs
  3. Priority resolution: determine the effective priority; security notifications always escalate to high/critical
  4. Queue routing: publish one task per approved channel to the appropriate priority queue

User preferences are cached in Redis (TTL 5 minutes) to avoid a database read on every notification.
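Steps 2 and 3 of the pipeline (channel filtering and priority resolution) can be sketched as pure functions; the function names are illustrative, and the preference-cache lookup is omitted:

```python
def resolve_channels(notif_type: str, channels_requested: list, opted_in: set) -> list:
    """Step 2: intersect requested channels with the user's opt-ins.
    Security notifications bypass opt-outs."""
    if notif_type == "security":
        return list(channels_requested)
    return [c for c in channels_requested if c in opted_in]

def resolve_priority(notif_type: str, requested: str) -> str:
    """Step 3: security notifications always escalate to at least high."""
    if notif_type == "security" and requested in ("low", "normal"):
        return "high"
    return requested
```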

Priority Queues

Separate queues per priority level ensure critical notifications are never starved by marketing volume:

  • Critical queue: polled continuously; workers assigned exclusively — used for security alerts, 2FA codes, payment failures
  • High queue: polled every few seconds; shared workers with critical fallback
  • Normal queue: polled every 30 seconds; standard transactional notifications
  • Low queue: polled infrequently (minutes); marketing, digest emails, weekly reports

During traffic spikes, low-priority queues grow while critical queues drain immediately. This is intentional — a promotional email arriving 10 minutes late is acceptable; a password reset code delayed by 10 minutes is not.
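The per-priority routing above can be expressed as a small table; the queue names are hypothetical and a poll interval of 0 denotes continuous polling:

```python
# Priority tiers -> queue names and worker poll intervals (illustrative values).
QUEUE_CONFIG = {
    "critical": {"queue": "notify.critical", "poll_interval_s": 0},    # dedicated workers
    "high":     {"queue": "notify.high",     "poll_interval_s": 5},
    "normal":   {"queue": "notify.normal",   "poll_interval_s": 30},
    "low":      {"queue": "notify.low",      "poll_interval_s": 300},
}

def route(priority: str) -> str:
    """Step 4 of the dispatch pipeline: pick the queue for a resolved priority."""
    return QUEUE_CONFIG[priority]["queue"]
```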

Push Notification Worker

Push workers interface with mobile platform providers:

  • APNs (Apple Push Notification service): HTTP/2 API with JWT token authentication (rotate tokens every 60 minutes); connection pooling is critical because APNs limits connections per app; each connection can carry up to 1000 notifications per second
  • FCM (Firebase Cloud Messaging): HTTP v1 API with OAuth 2.0 service account; supports topic messaging for broadcast use cases
  • Token lifecycle: APNs returns status 410 (Gone) for unregistered tokens — remove from DB immediately to avoid repeated failed sends; FCM provides a new canonical registration token in the response when a token is refreshed
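The token-lifecycle rules above can be sketched as a decision helper. The function name and the ("remove" / "replace" / "keep") action tuples are illustrative stand-ins for real DB operations:

```python
def handle_push_response(provider: str, status: int, token: str, canonical_token=None):
    """Decide token bookkeeping from a provider response.
    APNs 410 (Gone) -> token is unregistered, delete it immediately.
    FCM may return a new canonical registration token -> store it."""
    if provider == "apns" and status == 410:
        return ("remove", token)
    if provider == "fcm" and canonical_token and canonical_token != token:
        return ("replace", canonical_token)
    return ("keep", token)
```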

Email and SMS Workers

Email worker: sends via SES or SendGrid transactional API. Rate limits are per sending domain (e.g., 14 sends/second on SES default). Handle bounce callbacks via SNS webhook — hard bounces must be removed from the active address list immediately to protect domain reputation.
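The hard-bounce rule can be sketched as follows. The event shape is simplified (real SES notifications nest recipients under `bounce.bouncedRecipients`); SES does distinguish "Permanent" and "Transient" bounce types:

```python
def process_bounce(bounce_type: str, recipients: list, suppression_list: set) -> None:
    """Hard ('Permanent') bounces go straight to the suppression list to protect
    domain reputation; soft ('Transient') bounces are left eligible for retry."""
    if bounce_type == "Permanent":
        suppression_list.update(recipients)  # never send to these addresses again
```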

SMS worker: sends via Twilio or Amazon SNS. Phone numbers must be in E.164 format (+14155551234). Delivery receipts arrive asynchronously via webhook — update notification status on receipt. Maintain an opt-out registry: numbers that replied STOP must never be messaged again (regulatory requirement in most jurisdictions).
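The E.164 check and opt-out gate can be combined into one pre-send guard; the regex encodes E.164's shape (a "+", a non-zero first digit, at most 15 digits total):

```python
import re

E164_RE = re.compile(r"^\+[1-9]\d{1,14}$")  # '+', then 2-15 digits, no leading zero

def can_send_sms(phone: str, opt_out_registry: set) -> bool:
    """Reject malformed numbers and any number that replied STOP."""
    return bool(E164_RE.match(phone)) and phone not in opt_out_registry
```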

In-App Notification Worker

In-app notifications are stored in a notifications table and delivered over WebSocket if the user is connected:

  • Write the notification record with status PENDING
  • Check the presence service — if the user has an active WebSocket connection, push immediately and mark DELIVERED
  • If offline, leave as PENDING; client fetches unread notifications on next app open
  • In-app notifications do not require provider integration and have effectively zero delivery cost
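The four steps above can be sketched against a single in-memory stand-in for the presence service, notifications table, and WebSocket server (illustration only; real code would hit three separate services):

```python
class InMemoryStore:
    """Stand-in for presence service + notifications table + WebSocket server."""
    def __init__(self, online_users):
        self.online = set(online_users)
        self.status = {}    # notification_id -> status
        self.pushed = []    # (user_id, notification_id) pairs sent over WebSocket

def deliver_in_app(store: InMemoryStore, notification_id: str, user_id: str) -> None:
    store.status[notification_id] = "PENDING"            # write the record first
    if user_id in store.online:                          # presence check
        store.pushed.append((user_id, notification_id))  # push over WebSocket
        store.status[notification_id] = "DELIVERED"
    # else: stays PENDING; the client fetches unread items on next app open
```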

Delivery Status Tracking

Each notification per channel follows a state machine:

PENDING → DISPATCHED → DELIVERED
                    ↘ FAILED
  • PENDING: queued, not yet sent to provider
  • DISPATCHED: submitted to provider API successfully; awaiting delivery confirmation
  • DELIVERED: provider confirmed delivery (push/SMS delivery receipt, email open event, in-app ACK)
  • FAILED: non-retryable error or max retries exhausted

Status transitions are written to a notification_events log table for auditability and analytics.
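The state machine and event log can be sketched as a transition table plus a guard; the `VALID_TRANSITIONS` name and list-based log are illustrative (a real system would append to the notification_events table):

```python
VALID_TRANSITIONS = {
    "PENDING":    {"DISPATCHED"},
    "DISPATCHED": {"DELIVERED", "FAILED"},
    "DELIVERED":  set(),   # terminal
    "FAILED":     set(),   # terminal
}

def transition(events_log: list, notification_id: str, channel: str,
               old: str, new: str) -> str:
    """Apply one state transition, recording it for audit/analytics."""
    if new not in VALID_TRANSITIONS[old]:
        raise ValueError(f"illegal transition {old} -> {new}")
    events_log.append((notification_id, channel, old, new))
    return new
```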

Retry Policy with Exponential Backoff

Transient failures (provider rate limit, network timeout) are retried with exponential backoff:

  • Retry delays: 1s, 2s, 4s, 8s, 16s — max 5 attempts before marking FAILED
  • Add jitter (±20%) to prevent thundering herd when many notifications retry simultaneously after a provider outage

Non-retryable errors terminate immediately without retry:

  • Invalid device token (push) — token is stale, remove from DB
  • User opted out (email/SMS) — add to suppression list
  • Invalid phone number format
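The retry schedule, jitter, and non-retryable short-circuit can be sketched together. The error-code strings are hypothetical names for the cases listed above:

```python
import random

BASE_DELAYS = [1, 2, 4, 8, 16]  # seconds; max 5 attempts before marking FAILED
NON_RETRYABLE = {"invalid_token", "opted_out", "invalid_phone"}  # illustrative codes

def retry_delay(error_code: str, attempt: int, jitter: float = 0.2):
    """Return the delay before retry `attempt` (0-indexed), with +/-20% jitter
    to avoid a thundering herd. Returns None when the error is non-retryable
    or retries are exhausted -- the caller then marks the send FAILED."""
    if error_code in NON_RETRYABLE or attempt >= len(BASE_DELAYS):
        return None
    return BASE_DELAYS[attempt] * random.uniform(1 - jitter, 1 + jitter)
```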

Deduplication and Analytics

Deduplication: before dispatching any channel task, check Redis with SET notification:{idempotency_key} 1 NX EX 86400. If the key already exists, the notification was already sent — skip silently. This prevents duplicate sends caused by upstream retries or at-least-once queue semantics.
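The SET NX EX check can be sketched as below. With real redis-py, `client.set(key, 1, nx=True, ex=86400)` has the same contract (truthy on first set, None when the key already exists); the `FakeRedis` class is an in-memory stand-in for illustration:

```python
import time

class FakeRedis:
    """In-memory stand-in for Redis SET ... NX EX -- illustration only."""
    def __init__(self):
        self.store = {}  # key -> (value, expiry_deadline or None)

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry and entry[1] is not None and now >= entry[1]:
            del self.store[key]                      # evict expired key
        if nx and key in self.store:
            return None                              # SET NX fails: duplicate
        self.store[key] = (value, now + ex if ex else None)
        return True

def should_dispatch(redis_client, idempotency_key: str) -> bool:
    """True on first sight of the key; False means skip silently (duplicate)."""
    return bool(redis_client.set(f"notification:{idempotency_key}", 1,
                                 nx=True, ex=86400))
```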

Analytics dashboard: aggregate delivery events to compute per-channel delivery rate, per-notification-type open rate, failure breakdown by error code, and latency percentiles from created_at to DELIVERED. Sudden delivery rate drops indicate provider outages or certificate expiry.
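One of the dashboard metrics, the created_at-to-DELIVERED latency percentile, can be computed with a simple nearest-rank method (one of several percentile definitions; the function name is illustrative):

```python
import math
from datetime import datetime, timedelta

def delivery_latency_percentile(events, q: float):
    """Nearest-rank q-th percentile of created_at -> DELIVERED latency, in seconds.
    `events` is an iterable of (created_at, delivered_at) datetime pairs."""
    lats = sorted((delivered - created).total_seconds()
                  for created, delivered in events)
    if not lats:
        return None
    rank = max(1, math.ceil(q / 100 * len(lats)))   # 1-indexed nearest rank
    return lats[rank - 1]
```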

Frequently Asked Questions

How do you design a multi-channel notification dispatcher that supports push, email, SMS, and in-app?

Model the dispatch pipeline as: (1) a `notification_requests` intake API that accepts (user_id, event_type, payload) and writes to an intake queue; (2) a router service that loads user channel preferences and opted-in channels from a preferences store (Redis cache backed by Postgres), then fans out to per-channel queues (push_queue, email_queue, sms_queue, inapp_queue); (3) per-channel workers that call the respective provider (FCM, SES, Twilio, internal WebSocket server). Store per-channel worker output in a `notification_deliveries` table (notification_id, channel, status, provider_message_id, timestamp). Channel preference logic handles fallback: if push is not available (no device token), fall back to email. De-duplicate at intake using an idempotency key so upstream retries don't produce duplicate notifications.

How do you implement priority queues for notifications so that critical alerts aren't delayed by bulk sends?

Assign each notification a priority (e.g., CRITICAL, HIGH, NORMAL, LOW) at intake based on event_type. Route to separate queues per priority (critical_push_queue, bulk_push_queue) rather than a single queue with priority headers — separate queues allow independent scaling of consumers and prevent head-of-line blocking. Allocate consumer thread pools proportionally: e.g., 8 workers on critical, 2 on bulk. For Kafka-backed systems, use separate topics per priority and assign more partitions + consumer instances to higher-priority topics. Apply rate limiting on the bulk tier (e.g., 1000/s per channel) without throttling critical. In the database, index `notification_requests` on (priority DESC, created_at ASC) for any polling-based workers. Monitor queue depth and consumer lag per priority tier as primary SLIs; alert if critical queue lag exceeds 5 seconds.

How do you track delivery status across multiple providers and surface it to callers?

Each channel worker writes an initial `notification_deliveries` row with status='sent' and the provider's message ID immediately after a successful API call. Providers deliver status callbacks (FCM delivery receipts, SES SNS bounce/delivery events, Twilio webhooks) to a callback ingestion endpoint, which updates the delivery row to 'delivered' or 'failed' with a failure reason code. For providers that don't support callbacks (some SMS carriers), poll the provider status API with exponential backoff up to a maximum staleness window (e.g., 24h), then mark as 'unknown'. Expose a status API: GET /notifications/{id}/status returns a rollup across all channels. Implement a dead-letter queue for failed deliveries with a retry policy (3 attempts with exponential backoff); after exhausting retries, emit a `notification.failed` event for upstream alerting or fallback channel escalation.

How do you prevent notification storms and implement user-level rate limiting?

Apply two layers of rate limiting. First, per-user frequency capping: use a sliding window counter in Redis (INCR + EXPIRE or a sorted set with timestamps) keyed by (user_id, channel, window). If a user has already received N notifications of a given priority in the window (e.g., 5 push/hour for NORMAL), suppress or defer the new notification and log the suppression. Second, per-channel global throughput limiting: a token bucket in Redis (or a rate-limiting sidecar like Envoy) enforces provider SLA limits (e.g., FCM: 600K/min). For bulk campaigns that could produce millions of notifications simultaneously, use a scheduled-dispatch pattern: write all intended notifications to a `scheduled_notifications` table and drain them through a controlled worker at a rate that respects global limits, rather than enqueuing all at once. Alert on suppression rate spikes as a leading indicator of upstream event loop bugs.
