System Design Interview: Notification System (Push, Email, SMS)

Notification systems are invisible infrastructure that every major app depends on. A good design handles multiple channels (push, email, SMS), user preferences, rate limits, deduplication, and failure retry, all at millions of notifications per day with bursts of thousands per second.

Functional Requirements

  • Send notifications via multiple channels: iOS push (APNs), Android push (FCM), email, SMS
  • Support notification types: transactional (OTP, order confirmation), marketing (promotions), alerts (security, system)
  • User notification preferences: opt-out per channel and per category
  • Deduplication: a user should not receive the same notification twice, even when requests or deliveries are retried
  • Delivery tracking: sent, delivered, opened

Non-Functional Requirements

Metric                    Target
Throughput                10 million notifications/day (~115/s average, 10K/s peak marketing blast)
Latency (transactional)   < 1s end-to-end
Latency (marketing)       Best effort, minutes acceptable
Availability              99.9%
At-least-once delivery    Required (with dedup at destination)
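
The average rate in the table follows directly from the daily volume. A quick check of the arithmetic, which also shows the burst factor the peak figure implies (roughly 86 times the average):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_volume = 10_000_000
average_per_second = daily_volume / SECONDS_PER_DAY
peak_per_second = 10_000  # peak figure from the requirements table

# Ratio of peak to average load: the queue has to absorb this burst
burst_factor = peak_per_second / average_per_second

print(f"average: {average_per_second:.1f}/s")
print(f"burst factor: {burst_factor:.0f}x")
```

The gap between average and peak is the main argument for putting a queue between the API and the channel workers.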

High-Level Architecture

  [Notification Service API]
         |
    [Validator + Enricher]  ← fetches user prefs, device tokens
         |
    [Priority Queue (Kafka)]
    /         |          \
[Push Worker] [Email Worker] [SMS Worker]
    |              |              |
  APNs/FCM    SendGrid/SES    Twilio/SNS
         |
   [Delivery Tracker] → [Analytics DB]

Notification Service API

POST /v1/notify
{
  "recipient_id": "user_123",
  "type": "order_shipped",
  "priority": "high",           // high | normal | low
  "channels": ["push", "email"], // or ["all"]
  "template_id": "order_shipped_v2",
  "data": {
    "order_id": "ORD-456",
    "tracking_url": "https://..."
  },
  "dedup_key": "order_shipped:ORD-456"  // idempotency key
}
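
The API would normally reject malformed requests before they reach the enricher. A minimal validation sketch follows; the field names come from the payload above, while the function name `validate_request`, the set of required fields, and the allowed channel values are assumptions made for illustration.

```python
ALLOWED_PRIORITIES = {"high", "normal", "low"}
ALLOWED_CHANNELS = {"push", "email", "sms", "all"}
# Assumption: "data" is optional, everything else is required
REQUIRED_FIELDS = ("recipient_id", "type", "priority", "channels", "template_id", "dedup_key")


def validate_request(body: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the request is valid."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in body]
    if errors:
        return errors
    if body["priority"] not in ALLOWED_PRIORITIES:
        errors.append(f"invalid priority: {body['priority']}")
    channels = body["channels"]
    if not isinstance(channels, list) or not channels:
        errors.append("channels must be a non-empty list")
    else:
        unknown = [c for c in channels if c not in ALLOWED_CHANNELS]
        if unknown:
            errors.append(f"unknown channels: {unknown}")
    return errors
```

Returning every error at once, rather than failing on the first, tends to make the API easier for calling services to debug.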

Validator and Enricher

Before queuing, the enricher:

  1. Looks up the user record — check opt-out preferences per channel and category
  2. Fetches device tokens (for push) from the device registry
  3. Checks dedup store (Redis SET NX with TTL on dedup_key) — skip if already sent
  4. Renders the template using the provided data
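  5. Assigns priority and routes to the appropriate Kafka topic

def enrich_and_queue(notification):
    user = user_service.get(notification.recipient_id)

    # Check opt-out per channel; keep only the channels the user allows
    channels = [c for c in notification.channels
                if user.preferences.allows(notification.type, c)]
    if not channels:
        return {"status": "suppressed", "reason": "user_opt_out"}

    # Deduplication: SET NX succeeds only for the first request with this key
    dedup_key = f"notif:dedup:{notification.dedup_key}"
    if not redis.set(dedup_key, "1", nx=True, ex=86400):
        return {"status": "suppressed", "reason": "duplicate"}

    message = notification.to_dict()
    message["channels"] = channels

    # Fetch device tokens only when push is one of the allowed channels
    if "push" in channels:
        message["tokens"] = device_registry.get_tokens(user.id)

    # Render the template (template_engine is a placeholder for the rendering service)
    message["body"] = template_engine.render(notification.template_id, notification.data)

    # Route to the Kafka topic for this priority tier
    topic = f"notifications.{notification.priority}"
    kafka.produce(topic, message)
    return {"status": "queued"}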

Priority Queues

Use separate Kafka topics per priority tier:

  • notifications.high — OTPs, security alerts. Dedicated consumers, low lag SLO (<1s)
  • notifications.normal — order updates, friend requests
  • notifications.low — marketing blasts, newsletters. Can lag significantly

Marketing campaigns produce to notifications.low in large batches. The low-priority consumers are rate-limited to avoid overwhelming email/SMS providers.
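
One common way to rate-limit the low-priority consumers is a token bucket. The sketch below is an in-memory illustration, not a production limiter; the class name `TokenBucket` and its parameters are assumptions, and a multi-instance deployment would need shared state (for example in Redis).

```python
import time


class TokenBucket:
    """Allow roughly `rate` sends per second, with bursts of at most `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A low-priority consumer would call `try_acquire()` before each provider call and pause polling when it returns False, which lets lag build up in Kafka instead of at the provider.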

Channel Workers

Push Worker (APNs / FCM)

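def send_push(notification):
    # Assumes each token record exposes .platform and .value, since one
    # user can have both iOS and Android devices
    for i, token in enumerate(notification.tokens):
        try:
            if token.platform == "ios":
                apns_client.send(token.value, notification.payload)
            else:
                fcm_client.send(token.value, notification.payload)
        except InvalidTokenError:
            # Token is permanently invalid: remove it from the device registry, do not retry
            device_registry.delete(token.value)
        except RateLimitError:
            # Back off and retry only the tokens not yet sent, so devices that
            # already received the push are not notified twice
            requeue_with_delay(notification, tokens=notification.tokens[i:], delay=30)
            break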

Email Worker

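Use SendGrid or AWS SES. For marketing, batch recipients per API call up to the provider's limit (SendGrid accepts up to 1,000 recipients per request; SES bulk sends accept fewer, so check current quotas). Handle bounces and unsubscribes via webhooks, and update the user opt-out table immediately to comply with CAN-SPAM and GDPR.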
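
Batching for a marketing send can be sketched as a filter followed by a chunking step. The function name `build_batches` and the idea of passing the suppression list as a set are assumptions for illustration; the default batch size mirrors the figure mentioned above and should be set to whatever the chosen provider allows.

```python
from typing import Sequence


def build_batches(recipients: Sequence[str], suppressed: set[str],
                  batch_size: int = 1000) -> list[list[str]]:
    """Drop suppressed addresses, then split the rest into provider-sized batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Addresses that bounced or unsubscribed must never be mailed again
    eligible = [r for r in recipients if r not in suppressed]
    return [eligible[i:i + batch_size] for i in range(0, len(eligible), batch_size)]
```

Filtering against the suppression list at send time, not only at campaign creation, covers users who unsubscribe while a campaign is still queued.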

SMS Worker

Twilio or AWS SNS. SMS is expensive (~$0.01/message) — gate aggressively on user opt-in. Use E.164 phone number format. Handle delivery receipts asynchronously via webhooks.
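
Checking the E.164 format before calling the provider avoids paying for requests that are certain to fail. A minimal format check, assuming the usual interpretation of E.164 as a plus sign followed by at most 15 digits with a non-zero first digit (the helper name `is_e164` is illustrative):

```python
import re

# "+" followed by 2 to 15 digits, the first of which is not zero
E164_PATTERN = re.compile(r"\+[1-9]\d{1,14}")


def is_e164(number: str) -> bool:
    """Return True if the string is shaped like an E.164 phone number."""
    return E164_PATTERN.fullmatch(number) is not None
```

This only validates the shape of the string. It does not confirm that the number is assigned or can receive SMS, which still depends on the provider's delivery receipt.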

Retry and Dead Letter Queue

Retry policy (increasing backoff):
  Attempt 1: immediate
  Attempt 2: after 30s
  Attempt 3: after 5 min
  Attempt 4: after 30 min
  After attempt 4 fails: move to DLQ → alert on-call

DLQ processing:
  - Manual inspection dashboard
  - Automated replay after provider outage clears
  - Metrics: DLQ depth, DLQ growth rate
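
The retry schedule above can be expressed as a small lookup that a worker consults after each failure. This is a sketch; the name `next_delay` and the convention of returning None to signal the DLQ are assumptions.

```python
from typing import Optional

# Delay in seconds before attempts 1 to 4, taken from the schedule above
RETRY_DELAYS = [0, 30, 5 * 60, 30 * 60]


def next_delay(failed_attempts: int) -> Optional[int]:
    """Return seconds to wait before the next attempt, or None to route to the DLQ."""
    if failed_attempts < 0:
        raise ValueError("failed_attempts must be non-negative")
    if failed_attempts < len(RETRY_DELAYS):
        return RETRY_DELAYS[failed_attempts]
    return None  # all attempts used: hand off to the dead letter queue
```

In practice a small random jitter is often added to each delay so that a provider outage does not produce synchronized retry waves; that is omitted here to keep the schedule identical to the one listed.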

Deduplication

Three layers prevent duplicate sends:

  1. Enricher: SET NX on dedup_key before queuing (24h window)
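  2. Worker: use provider-side collapse identifiers (APNs apns-collapse-id, FCM collapse_key) so a retried push replaces the earlier one on the device instead of appearing twice; email and SMS retries rely on the worker's own idempotency check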
  3. Delivery tracker: record sent events; skip if already recorded for this dedup_key
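
For unit tests it can help to have a stand-in with the same first-writer-wins behavior as Redis SET NX with a TTL. The class below is a hypothetical in-memory sketch (the name `InMemoryDedupStore` and the `claim` method are not from any library) and is not safe across processes.

```python
import time


class InMemoryDedupStore:
    """Test stand-in for SET NX with expiry: the first caller for a key wins until the TTL lapses."""

    def __init__(self, clock=time.monotonic):
        self._expires_at = {}
        self._clock = clock

    def claim(self, key: str, ttl_seconds: float) -> bool:
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # key still held: treat the request as a duplicate
        self._expires_at[key] = now + ttl_seconds
        return True
```

The TTL defines the dedup window: a second request with the same key is suppressed inside the window and accepted again once it has expired.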

Delivery Tracking

CREATE TABLE notification_events (
    id          BIGINT PRIMARY KEY,
    dedup_key   VARCHAR(255),
    channel     ENUM('push','email','sms'),
    status      ENUM('queued','sent','delivered','opened','failed'),
    provider_id VARCHAR(255),  -- APNs/FCM/SendGrid message ID
    created_at  TIMESTAMP,
    updated_at  TIMESTAMP
);

-- Provider webhooks update status asynchronously
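
Because webhooks can arrive late or out of order, the tracker needs a rule for which status wins. The sketch below shows one possible policy and is an assumption, not part of the schema above: status only moves forward through queued, sent, delivered, opened; a failure never overrides a confirmed delivery; and a later success replaces an earlier failure.

```python
STATUS_RANK = {"queued": 0, "sent": 1, "delivered": 2, "opened": 3}


def apply_status(current: str, incoming: str) -> str:
    """Return the status to store after an event reports `incoming`."""
    for status in (current, incoming):
        if status != "failed" and status not in STATUS_RANK:
            raise ValueError(f"unknown status: {status}")
    if incoming == "failed":
        # A failure report never overrides a confirmed delivery or open
        return current if current in ("delivered", "opened") else "failed"
    if current == "failed":
        # A later success (for example after a retry) replaces the failure
        return incoming
    # Otherwise only move forward, so a late "sent" cannot undo "delivered"
    return incoming if STATUS_RANK[incoming] > STATUS_RANK[current] else current
```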

Rate Limiting Per User

Users should not receive spam even from legitimate traffic:

  • Max 3 push notifications per hour per user
  • Max 1 marketing email per day per user
  • Max 5 SMS per day per user

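Implement with Redis counters keyed by time bucket: INCR notif:{user_id}:{channel}:{hour} with TTL = 1 hour for the hourly push limit, and a {day} bucket with TTL = 24 hours for the daily email and SMS limits. Reject if over limit. This is a fixed-window counter, so a burst that straddles a bucket boundary can briefly exceed the limit; use a true sliding window (for example a Redis sorted set of send timestamps) if that matters.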
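
A true sliding window tracks individual send timestamps instead of a per-bucket counter. The following in-memory sketch illustrates the logic only; the class name `SlidingWindowLimiter` is an assumption, and a real deployment would keep the timestamps in a shared store such as Redis.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._events = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self._clock()
        events = self._events[key]
        # Drop timestamps that have fallen out of the window
        while events and events[0] <= now - self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True
```

For the push limit listed above, the limiter would be constructed as `SlidingWindowLimiter(limit=3, window_seconds=3600)` and keyed by something like `f"{user_id}:push"`.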

Scaling to Billions

  • Kafka partitioned by user_id shard — keeps ordering per user
  • Push workers scale horizontally; APNs/FCM allow thousands of concurrent connections
  • Email: SendGrid handles burst; use dedicated IPs for high-volume senders to protect reputation
  • User preferences cache in Redis — preferences change rarely
  • Marketing blasts: fan-out from batch job → Kafka at controlled rate, not all at once
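
Per-user ordering depends on every message for a user landing on the same partition. The sketch below illustrates the idea with a stable CRC32 hash. It is not Kafka's own partitioner (Kafka clients hash the message key themselves when one is supplied), and the function name `partition_for` is an assumption.

```python
import zlib


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user to a partition with a hash that is stable across processes."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Python's built-in hash() is randomized per process for strings, so use CRC32
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions
```

With a real Kafka producer the usual approach is to set the message key to the user ID and let the client choose the partition. Note that changing the partition count remaps users, so ordering is only guaranteed while the count stays fixed.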

Interview Discussion Points

  • How do you handle APNs token invalidation? Listen for BadDeviceToken and Unregistered responses and delete the token from the registry
  • How do you prevent spam during a marketing blast to 50M users? Rate limit Kafka produce rate; stagger by timezone
  • How do you guarantee OTP delivery? Retry aggressively across SMS providers (primary + fallback), escalate to voice call if SMS fails
  • What if the enricher is down? Accept the request, queue a raw event, and enrich lazily in the worker
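
The OTP answer above (primary provider plus fallback) can be sketched as an ordered list of senders tried in turn. Everything here is hypothetical scaffolding: `ProviderError`, `send_with_fallback`, and the sender callables stand in for real provider clients.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a sender when the provider rejects or fails the request."""


def send_with_fallback(senders: Sequence[tuple[str, Callable[[str, str], None]]],
                       phone: str, message: str) -> str:
    """Try each provider in order; return the name of the first that succeeds."""
    errors = []
    for name, send in senders:
        try:
            send(phone, message)
            return name
        except ProviderError as exc:
            # Record the failure and move on to the next provider
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

A voice-call sender could be appended as the last entry in the list to cover the escalation case. Because an OTP is short-lived, the total time spent across providers should stay well inside the code's validity period.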

Frequently Asked Questions

How do you handle notification deduplication at scale?

Three layers: (1) At intake, use Redis SET NX on a dedup_key (e.g., "order_shipped:ORD-456") with a 24-hour TTL; if the key already exists, suppress the notification before it enters the queue. (2) In the channel worker, use provider-side collapse identifiers (APNs apns-collapse-id, FCM collapse_key) so a retried push replaces the earlier one on the device instead of appearing twice. (3) In the delivery tracker, record sent events by dedup_key and skip if already recorded. This three-layer approach handles crashes, retries, and duplicate API calls without relying on any single layer.

How do you prioritize notifications so OTPs are never delayed by marketing blasts?

Use separate Kafka topics (or queues) per priority tier: notifications.high for transactional messages (OTPs, security alerts) with dedicated consumers and a strict lag SLO under 1 second; notifications.normal for operational messages; notifications.low for marketing campaigns. High-priority consumers have more instances and are never starved. Marketing blasts produce to the low-priority topic at a rate-limited pace — they never block the high-priority path. Monitor consumer lag per topic and alert on any high-priority lag above threshold.

How do you handle APNs/FCM token invalidation in a notification system?

When APNs returns BadDeviceToken or Unregistered, or FCM returns NotRegistered (UNREGISTERED in the HTTP v1 API), the device token is no longer valid (user uninstalled the app or re-registered). Your push worker must catch these specific error codes and immediately delete the invalid token from the device registry. Do NOT retry delivery, because the token is permanently invalid. For FCM, also handle token refreshes: the legacy HTTP API could return a canonical registration_id to replace the old token, while with the HTTP v1 API the client app reports its refreshed token to your backend. Run a periodic cleanup job to remove tokens that have not received a successful delivery in 90+ days.
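
The periodic cleanup job mentioned above reduces to a cutoff comparison. A minimal sketch, assuming the device registry can supply a mapping from token to the time of its last successful delivery (the function name `stale_tokens` and that mapping are illustrative):

```python
from datetime import datetime, timedelta


def stale_tokens(last_success: dict[str, datetime], now: datetime,
                 max_age_days: int = 90) -> list[str]:
    """Return tokens whose last successful delivery is max_age_days or more ago."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(token for token, seen in last_success.items() if seen <= cutoff)
```

The job would pass the returned tokens to the registry's delete operation. Running it during off-peak hours avoids competing with the push workers for registry capacity.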

