System Design Interview: Notification System (Push, Email, SMS)

Notification systems are invisible infrastructure that every major app depends on. A good design handles multiple channels (push, email, SMS), user preferences, rate limits, deduplication, and failure retries, at tens of millions of notifications per day with bursts of thousands per second.

Functional Requirements

Send notifications via multiple channels: iOS push (APNs), Android push (FCM), email, SMS
Support notification types: transactional (OTP, order confirmation), marketing (promotions), alerts (security, system)
User notification preferences: opt-out per channel and per category
Deduplication: never send the same notification twice
Delivery tracking: sent, delivered, opened

Non-Functional Requirements

Metric	Target
Throughput	10 million notifications/day (~115/s average, 10K/s peak marketing blast)
Latency (transactional)	< 1s end-to-end
Latency (marketing)	Best effort, minutes acceptable
Availability	99.9%
At-least-once delivery	Required (with dedup at destination)
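
The throughput figures in the table can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the variable names are illustrative, and the numbers come from the table above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

daily_volume = 10_000_000       # notifications per day (from the table)
peak_rate = 10_000              # notifications per second during a blast

average_rate = daily_volume / SECONDS_PER_DAY
peak_to_average = peak_rate / average_rate

print(f"average: {average_rate:.1f}/s")
print(f"peak is {peak_to_average:.0f}x the average")
```

The gap between average and peak load is the reason the design leans on queues and rate limiting rather than sizing every component for the peak.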

High-Level Architecture

  [Notification Service API]
         |
    [Validator + Enricher]  ← fetches user prefs, device tokens
         |
    [Priority Queue (Kafka)]
    /         |          \
[Push Worker] [Email Worker] [SMS Worker]
    |              |              |
  APNs/FCM    SendGrid/SES    Twilio/SNS
         |
   [Delivery Tracker] → [Analytics DB]

Notification Service API

POST /v1/notify
{
  "recipient_id": "user_123",
  "type": "order_shipped",
  "priority": "high",           // high | normal | low
  "channels": ["push", "email"], // or ["all"]
  "template_id": "order_shipped_v2",
  "data": {
    "order_id": "ORD-456",
    "tracking_url": "https://..."
  },
  "dedup_key": "order_shipped:ORD-456"  // idempotency key
}

Validator and Enricher

Before queuing, the enricher:

Looks up the user record and checks opt-out preferences per channel and category
Fetches device tokens (for push) from the device registry
Checks the dedup store (Redis SET NX with TTL on dedup_key) and skips the notification if the key already exists
Renders the template using the provided data
Assigns priority and routes to the matching Kafka topic

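def enrich_and_queue(notification):
    """Validate, deduplicate, and enqueue one notification request.

    user_service, device_registry, redis (a redis-py client), and kafka
    are clients assumed to be defined elsewhere.
    """
    user = user_service.get(notification.recipient_id)

    # Check opt-out per requested channel; keep only the allowed ones
    channels = [
        ch for ch in notification.channels
        if user.preferences.allows(notification.type, ch)
    ]
    if not channels:
        return {"status": "suppressed", "reason": "user_opt_out"}

    # Deduplication: SET NX returns None if the key already exists
    dedup_key = f"notif:dedup:{notification.dedup_key}"
    if not redis.set(dedup_key, "1", nx=True, ex=86400):
        return {"status": "suppressed", "reason": "duplicate"}

    # Fetch tokens only when push is allowed (each token is assumed
    # to carry its own platform)
    tokens = device_registry.get_tokens(user.id) if "push" in channels else []

    # Route to the Kafka topic for this priority tier
    topic = f"notifications.{notification.priority}"
    message = {**notification.to_dict(), "channels": channels, "tokens": tokens}
    try:
        kafka.produce(topic, message)
    except Exception:
        # Release the dedup key so a client retry is not suppressed
        redis.delete(dedup_key)
        raise
    return {"status": "queued"}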

Priority Queues

Use separate Kafka topics per priority tier:

notifications.high — OTPs, security alerts. Dedicated consumers, low lag SLO (<1s)
notifications.normal — order updates, friend requests
notifications.low — marketing blasts, newsletters. Can lag significantly

Marketing campaigns produce to notifications.low in large batches. The low-priority consumers are rate-limited to avoid overwhelming email/SMS providers.
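
One common way to rate-limit the low-priority consumers is a token bucket. The sketch below is a minimal in-process version, not the article's implementation; the class name, parameters, and injectable clock are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow up to `rate_per_sec` sends per second, with bursts up to `capacity`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity      # start with a full bucket
        self.clock = clock          # injectable so the bucket can be tested
        self.last = clock()

    def try_acquire(self, n=1):
        """Take n tokens if available; return False when the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A low-priority consumer would call try_acquire() before each provider request and pause polling when it returns False, which lets lag build up in notifications.low instead of overloading the provider.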

Channel Workers

Push Worker (APNs / FCM)

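def send_push(notification):
    # Each token record is assumed to expose .platform, since one user
    # may have both iOS and Android devices registered.
    throttled = []
    for token in notification.tokens:
        try:
            if token.platform == "ios":
                apns_client.send(token, notification.payload)
            else:
                fcm_client.send(token, notification.payload)
        except InvalidTokenError:
            # Token is permanently invalid: remove it and do not retry
            device_registry.delete(token)
        except RateLimitError:
            # Remember the token; devices already reached must not be re-sent
            throttled.append(token)

    if throttled:
        # Back off and retry only the tokens that were rate-limited
        notification.tokens = throttled
        requeue_with_delay(notification, delay=30)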
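
The exception types in the worker above have to be derived from provider responses. A possible mapping is sketched below, assuming the error strings APNs returns in its "reason" field and FCM HTTP v1 returns as its error code; the function name and the returned action labels are hypothetical, and the sets should be checked against current provider documentation.

```python
# Provider error strings that mean the token will never work again
PERMANENT_TOKEN_ERRORS = {
    "apns": {"BadDeviceToken", "Unregistered"},
    "fcm": {"UNREGISTERED"},
}

# Provider error strings that mean "slow down and try again later"
THROTTLE_ERRORS = {
    "apns": {"TooManyRequests"},
    "fcm": {"QUOTA_EXCEEDED"},
}


def classify_push_error(provider, error_code):
    """Map a provider error code to an action for the push worker."""
    if error_code in PERMANENT_TOKEN_ERRORS.get(provider, set()):
        return "delete_token"
    if error_code in THROTTLE_ERRORS.get(provider, set()):
        return "retry_later"
    return "unknown"
```

Anything classified as "unknown" would typically go through the normal retry policy rather than being dropped or deleting the token.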

Email Worker

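Use SendGrid or AWS SES. For marketing, batch recipients per API call within the provider's limit (SendGrid accepts up to 1,000 personalizations per request; SES bulk sends are limited to 50 destinations per call). Handle bounces and unsubscribes via webhooks, and update the user opt-out table immediately to comply with CAN-SPAM and GDPR.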
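
Batching for a marketing send can be as simple as slicing the recipient list. The helper below is a generic sketch; the name chunked and the batch size are illustrative, and the size should be set to whatever the chosen provider permits per call.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


recipients = [f"user{i}@example.com" for i in range(2500)]
batches = list(chunked(recipients, 1000))
print([len(batch) for batch in batches])  # [1000, 1000, 500]
```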

SMS Worker

Twilio or AWS SNS. SMS is expensive (~$0.01/message) — gate aggressively on user opt-in. Use E.164 phone number format. Handle delivery receipts asynchronously via webhooks.
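
A quick format check before calling the SMS provider avoids paying for requests that are certain to fail. The sketch below validates only the E.164 shape (a plus sign, a non-zero leading digit, at most 15 digits in total); it does not verify that the number exists. The function name is an assumption.

```python
import re

# "+" followed by 2 to 15 digits, the first of which is not zero
E164_PATTERN = re.compile(r"\+[1-9]\d{1,14}")


def is_e164(number):
    """Return True if `number` is shaped like an E.164 phone number."""
    return E164_PATTERN.fullmatch(number) is not None
```

For example, is_e164("+14155552671") is True, while "4155552671" (no plus sign) and "+1 415 555 2671" (spaces) are rejected.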

Retry and Dead Letter Queue

Retry policy (escalating backoff):
  Attempt 1: immediate
  Attempt 2: after 30s
  Attempt 3: after 5 min
  Attempt 4: after 30 min
  After attempt 4 fails: DLQ → alert on-call

DLQ processing:
  - Manual inspection dashboard
  - Automated replay after provider outage clears
  - Metrics: DLQ depth, DLQ growth rate
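
The retry schedule above can be encoded as a small lookup so that workers and tests agree on it. This is a sketch under the assumption that each queued message carries a count of failed attempts; the function and constant names are hypothetical.

```python
# Delay in seconds before attempts 1 through 4, taken from the schedule above
RETRY_DELAYS_SECONDS = [0, 30, 5 * 60, 30 * 60]


def next_action(failed_attempts):
    """Decide what to do with a message that has failed `failed_attempts` times.

    Returns ("retry", delay_seconds) while attempts remain, else ("dlq", None).
    """
    if failed_attempts < len(RETRY_DELAYS_SECONDS):
        return ("retry", RETRY_DELAYS_SECONDS[failed_attempts])
    return ("dlq", None)
```

A worker would call next_action after each failure and either requeue with the returned delay or publish the message to the dead letter queue.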

Deduplication

Three layers prevent duplicate sends:

Enricher: SET NX on dedup_key before queuing (24h window)
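Worker: idempotent delivery using provider-side collapse identifiers (APNs apns-collapse-id, FCM collapse_key), so a retried send replaces the earlier notification on the device instead of appearing twice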
Delivery tracker: record sent events; skip if already recorded for this dedup_key

Delivery Tracking

CREATE TABLE notification_events (
    id          BIGINT PRIMARY KEY,
    dedup_key   VARCHAR(255),
    channel     ENUM('push','email','sms'),
    status      ENUM('queued','sent','delivered','opened','failed'),
    provider_id VARCHAR(255),  -- APNs/FCM/SendGrid message ID
    created_at  TIMESTAMP,
    updated_at  TIMESTAMP,
    KEY idx_dedup_channel (dedup_key, channel)  -- supports the "already sent?" lookup
);

-- Provider webhooks update status asynchronously
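
Because webhooks arrive asynchronously, they can also arrive out of order (for example, "opened" before "delivered"). One way to keep the status column consistent is to only move it forward. The sketch below uses the status values from the table; the forward-only rule and the handling of "failed" are assumptions, not something the schema enforces.

```python
# Progress order of the non-failure statuses
STATUS_RANK = {"queued": 0, "sent": 1, "delivered": 2, "opened": 3}


def next_status(current, incoming):
    """Return the status to store after a webhook reports `incoming`."""
    if current == "failed":
        return current  # assumption: a failed event is terminal
    if incoming == "failed":
        # Assumption: only events not yet delivered can be marked failed
        return "failed" if current in ("queued", "sent") else current
    # Ignore stale webhooks that would move the status backwards
    if STATUS_RANK[incoming] > STATUS_RANK[current]:
        return incoming
    return current
```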

Rate Limiting Per User

Users should not receive spam even from legitimate traffic:

Max 3 push notifications per hour per user
Max 1 marketing email per day per user
Max 5 SMS per day per user

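Implement with Redis fixed-window counters: INCR notif:{user_id}:{channel}:{window}, where {window} is the current hour for hourly limits or the current date for daily limits, and set the key's TTL to the window length on the first increment. Reject if the count exceeds the limit. A fixed window can admit up to twice the limit across a window boundary; use a sliding-window log or counter if that matters.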
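
The counter scheme can be sketched without Redis by using a dictionary in place of the key space. In the sketch below the dictionary stands in for Redis keys (which would expire via TTL rather than accumulate), and the class name, the LIMITS table, and the injectable clock are assumptions for illustration.

```python
import time

# channel -> (max notifications, window length in seconds), from the list above
LIMITS = {"push": (3, 3600), "email": (1, 86400), "sms": (5, 86400)}


class FixedWindowLimiter:
    def __init__(self, clock=time.time):
        self.clock = clock
        self.counters = {}  # stand-in for Redis; real keys would carry a TTL

    def allow(self, user_id, channel):
        """Count this notification and report whether it is within the limit."""
        limit, window = LIMITS[channel]
        bucket = int(self.clock() // window)       # index of the current window
        key = f"notif:{user_id}:{channel}:{bucket}"
        count = self.counters.get(key, 0) + 1      # Redis equivalent: INCR key
        self.counters[key] = count
        return count <= limit
```

With Redis, the INCR and the expiry should be issued together (for example in a pipeline or a Lua script) so that a crash between the two calls cannot leave a counter without a TTL.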

Scaling to Billions

Kafka partitioned by user_id shard — keeps ordering per user
Push workers scale horizontally; APNs (HTTP/2) and FCM accept many concurrent requests, so each worker can hold persistent connections open and send in parallel
Email: SendGrid handles burst; use dedicated IPs for high-volume senders to protect reputation
User preferences cache in Redis — preferences change rarely
Marketing blasts: fan-out from batch job → Kafka at controlled rate, not all at once
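
Partitioning by user_id works because a stable hash always sends the same user to the same partition, which is what preserves per-user ordering. The sketch below illustrates the idea with MD5; it is not Kafka's own partitioner (the Java client hashes the record key with murmur2), and the function name is an assumption.

```python
import hashlib


def partition_for(user_id, num_partitions):
    """Map a user ID to a partition index in [0, num_partitions)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    # Interpret the first 4 bytes as an unsigned integer, then reduce
    return int.from_bytes(digest[:4], "big") % num_partitions
```

In practice it is usually enough to set user_id as the message key and let the Kafka client choose the partition. Note that changing the partition count remaps users, so ordering is only guaranteed while the count stays fixed.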

Interview Discussion Points

How do you handle APNs token invalidation? Treat BadDeviceToken and Unregistered responses as permanent failures and delete the token from the registry
How do you prevent spam during a marketing blast to 50M users? Rate-limit the Kafka produce rate; stagger by timezone
How do you guarantee OTP delivery? Retry aggressively across SMS providers (primary + fallback), escalate to voice call if SMS fails
What if the enricher is down? Accept the request, queue a raw event, and enrich lazily in the worker
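
The OTP fallback idea (primary SMS provider, then a fallback, then voice) can be sketched as an ordered list of senders tried in turn. Everything here is hypothetical: the function name, the (name, callable) provider shape, and the message text are assumptions, and real provider SDK calls would sit inside the callables.

```python
def send_otp(phone, code, providers):
    """Try each provider in order until one accepts the message.

    `providers` is an ordered list of (name, send) pairs, where send(phone,
    message) raises an exception on failure. A voice-call sender can be the
    last entry in the list.
    """
    message = f"Your verification code is {code}"
    errors = []
    for name, send in providers:
        try:
            send(phone, message)
            return {"status": "sent", "provider": name}
        except Exception as exc:
            # Record the failure and fall through to the next provider
            errors.append((name, str(exc)))
    return {"status": "failed", "errors": errors}
```

Each provider call should have a short timeout so that a hung primary does not consume the whole latency budget for the OTP.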

Frequently Asked Questions

How do you handle notification deduplication at scale?

Three layers: (1) At intake, use Redis SET NX on a dedup_key (e.g., "order_shipped:ORD-456") with a 24-hour TTL; if the key already exists, suppress the notification before it enters the queue. (2) In the channel worker, use provider-side collapse identifiers (APNs apns-collapse-id, FCM collapse_key) so a retried send replaces the earlier notification on the device instead of appearing twice. (3) In the delivery tracker, record sent events by dedup_key and skip if already recorded. This three-layer approach handles crashes, retries, and duplicate API calls without relying on any single point.

How do you prioritize notifications so OTPs are never delayed by marketing blasts?

Use separate Kafka topics (or queues) per priority tier: notifications.high for transactional messages (OTPs, security alerts) with dedicated consumers and a strict lag SLO under 1 second; notifications.normal for operational messages; notifications.low for marketing campaigns. High-priority consumers have more instances and are never starved. Marketing blasts produce to the low-priority topic at a rate-limited pace — they never block the high-priority path. Monitor consumer lag per topic and alert on any high-priority lag above threshold.

How do you handle APNs/FCM token invalidation in a notification system?

When APNs returns BadDeviceToken (HTTP 400) or Unregistered (HTTP 410), or FCM returns UNREGISTERED (NotRegistered on the legacy API), the device token is no longer valid (for example, the user uninstalled the app or the app re-registered). Your push worker must catch these specific error codes and immediately delete the invalid token from the device registry. Do NOT retry delivery: the token is permanently invalid. On the legacy FCM HTTP API, also handle canonical registration_id responses, where FCM returns a new token to replace the old one. Run a periodic cleanup job to remove tokens that have not received a successful delivery in 90+ days.
