System Design: Notification System (Push, Email, SMS)

A notification system is a common question in senior system design interviews because it surfaces real architectural challenges: fanout at scale, multi-channel delivery, rate limiting, retry logic, and provider reliability. Large consumer platforms such as Facebook, Twitter, and Uber send very high notification volumes through systems that look similar at their core.

Step 1: Clarify Requirements

  • Channels: Push notifications (iOS APNs, Android FCM), email, SMS — which do we support?
  • Scale: Notifications per day? Triggered by user actions, or also scheduled/marketing?
  • Latency: Real-time (< 1 second for transaction alerts) vs. near-real-time (< 1 min for marketing) vs. batch (nightly digest)?
  • Reliability: At-least-once? What’s the acceptable loss rate?
  • Preferences: Users can opt out of channels/types?
  • Templates: Parameterized content (“Hi {name}, your order {order_id} shipped”) or raw strings?

Assume: All three channels (push, email, SMS), 10M notifications/day (115/sec average, 1,000/sec peak), real-time transactional + batch marketing, at-least-once delivery, user preference controls, template-based content.
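
The average rate in the assumptions follows directly from the daily volume. A minimal sketch of the arithmetic, taking the 1,000/sec peak as a given assumption rather than a derived number:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(per_day: int) -> float:
    """Average notifications per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


daily_volume = 10_000_000
avg = average_rate(daily_volume)
peak = 1_000  # assumed peak from the requirements above

print(f"average: {avg:.1f}/sec")              # average: 115.7/sec
print(f"peak-to-average: {peak / avg:.1f}x")  # peak-to-average: 8.6x
```

The peak-to-average ratio is the number that sizes the queue and worker pool: workers can be provisioned closer to the average if the queue absorbs the difference.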

Step 2: High-Level Flow

Trigger source (order service, etc.)
    ↓
Notification Service API
    ↓
Kafka "notification_requests" topic
    ↓
Notification Workers (consumer group)
    ├─ Fetch user preferences (does user want this notification? on which channels?)
    ├─ Fetch user device tokens / email / phone from User Service
    ├─ Render template with personalized content
    └─ Route to channel-specific queues:
         ├─ Kafka "push_notifications"  → Push Worker  → APNs / FCM
         ├─ Kafka "email_notifications" → Email Worker → SendGrid / SES
         └─ Kafka "sms_notifications"   → SMS Worker   → Twilio / SNS

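The Kafka layer is critical. It decouples triggering services from delivery, absorbs traffic spikes (a flash sale triggering 5M emails at once), and supports retries: when workers commit offsets only after processing, a message that was in flight during a worker crash is redelivered, which gives at-least-once processing.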

Step 3: Data Model

-- Notification events
CREATE TABLE notifications (
    notification_id UUID PRIMARY KEY,
    user_id         BIGINT,
    type            VARCHAR(50),          -- 'order_shipped', 'friend_request', etc.
    template_id     VARCHAR(50),
    params          JSONB,                -- {"order_id": "123", "carrier": "UPS"}
    priority        SMALLINT,             -- 1=critical, 2=normal, 3=marketing
    created_at      TIMESTAMP DEFAULT NOW()
);

-- Delivery tracking (one row per channel attempt)
CREATE TABLE notification_deliveries (
    delivery_id      UUID PRIMARY KEY,
    notification_id  UUID REFERENCES notifications,
    channel          VARCHAR(10),         -- 'push', 'email', 'sms'
    status           VARCHAR(20),         -- 'pending', 'sent', 'delivered', 'failed'
    provider         VARCHAR(20),         -- 'apns', 'fcm', 'sendgrid', 'twilio'
    sent_at          TIMESTAMP,
    delivered_at     TIMESTAMP,
    failure_reason   TEXT,
    attempt_count    SMALLINT DEFAULT 0
);

-- User notification preferences
CREATE TABLE user_notification_prefs (
    user_id           BIGINT,
    notification_type VARCHAR(50),
    channel           VARCHAR(10),
    enabled           BOOLEAN DEFAULT TRUE,
    PRIMARY KEY (user_id, notification_type, channel)
);

-- Device tokens (one row per device; a user may have several)
CREATE TABLE device_tokens (
    user_id    BIGINT,
    platform   VARCHAR(10),              -- 'ios', 'android'
    token      VARCHAR(256),
    created_at TIMESTAMP,
    PRIMARY KEY (user_id, token)
);
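
The preferences table only needs rows for users who changed a setting. The sketch below shows one way to read it, under the assumption (not stated in the schema itself) that a missing row means "enabled". It uses an in-memory SQLite database with simplified column types so it runs anywhere; `is_enabled` is an illustrative helper, not part of any library.

```python
import sqlite3

# In-memory SQLite stands in for the production database (an assumption
# made so the sketch is self-contained); column types are simplified.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_notification_prefs (
        user_id           INTEGER,
        notification_type TEXT,
        channel           TEXT,
        enabled           INTEGER DEFAULT 1,
        PRIMARY KEY (user_id, notification_type, channel)
    )
""")
# User 42 has opted out of promotional email.
conn.execute(
    "INSERT INTO user_notification_prefs VALUES (?, ?, ?, ?)",
    (42, "promo", "email", 0),
)


def is_enabled(user_id: int, notification_type: str, channel: str) -> bool:
    """Return the stored preference, or True when no row exists (assumed default)."""
    row = conn.execute(
        "SELECT enabled FROM user_notification_prefs "
        "WHERE user_id = ? AND notification_type = ? AND channel = ?",
        (user_id, notification_type, channel),
    ).fetchone()
    return True if row is None else bool(row[0])
```

Treating absence as opt-in keeps the table small, but marketing channels in some regions may need the opposite default.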

Step 4: Preference Checking and Fanout

Before sending, the Notification Worker must check:

  1. User opt-out: Has the user disabled this notification type or channel?
  2. Global suppression: Is the user on a do-not-disturb list (e.g., legal hold)?
  3. Quiet hours: Is it 2 AM in the user’s timezone? (Defer non-critical notifications.)
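  4. Unsubscribed: Did the user unsubscribe from marketing emails?

def process_notification(notification):
    user_prefs = preference_cache.get(notification.user_id)  # Redis cache

    # Check 2: global suppression overrides everything else.
    # (is_suppressed is a hypothetical helper on the cached prefs object.)
    if user_prefs.is_suppressed():
        return

    channels_to_send = []
    for channel in ['push', 'email', 'sms']:
        # Checks 1 and 4: per-type and per-channel opt-outs, including
        # marketing-email unsubscribes.
        if not user_prefs.is_enabled(notification.type, channel):
            continue
        # Check 3: quiet hours apply to non-critical notifications only.
        if notification.priority != CRITICAL and is_quiet_hours(notification.user_id, channel):
            continue  # a full implementation would reschedule instead of skipping
        channels_to_send.append(channel)

    if not channels_to_send:
        return

    rendered = render_template(notification)  # render once, reuse per channel
    for channel in channels_to_send:
        channel_queue.publish(channel, {
            'notification_id': notification.id,
            'user_id': notification.user_id,
            'rendered_content': rendered,
            'channel': channel,
        })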

User preferences are cached in Redis — they’re read on every notification but change rarely.
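
The quiet-hours check (item 3 above) depends on the user's local time, not the server's. The following is a minimal sketch assuming a fixed 22:00 to 08:00 window; the window, the function names, and the idea of passing the timezone in as a `tzinfo` object are all illustrative choices. In practice the timezone would likely come from `zoneinfo.ZoneInfo` using a zone name stored on the user profile.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

QUIET_START = time(22, 0)  # assumed quiet window: 22:00 to 08:00 local time
QUIET_END = time(8, 0)


def in_quiet_window(now_utc: datetime, user_tz: tzinfo) -> bool:
    """True when the user's local wall-clock time falls in the quiet window."""
    local = now_utc.astimezone(user_tz).time()
    # The window wraps past midnight, so it is the union of two ranges.
    return local >= QUIET_START or local < QUIET_END


def quiet_window_end(now_utc: datetime, user_tz: tzinfo) -> datetime:
    """Next local QUIET_END, returned in UTC, for deferring a notification."""
    local_now = now_utc.astimezone(user_tz)
    end = local_now.replace(
        hour=QUIET_END.hour, minute=QUIET_END.minute, second=0, microsecond=0
    )
    if end <= local_now:
        end += timedelta(days=1)  # today's 08:00 already passed
    return end.astimezone(timezone.utc)
```

A deferred notification can be re-enqueued with `quiet_window_end(...)` as its earliest send time, so it is delayed instead of dropped.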

Step 5: Channel Workers and Third-Party Providers

Each channel worker sends to a third-party provider:

Push Notifications

iOS  → Apple Push Notification Service (APNs) — HTTP/2 API
Android → Firebase Cloud Messaging (FCM) — HTTP API

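def send_push(notification, device_token, platform, attempt=1):
    # `attempt` is the 1-based attempt number carried on the queued message.
    if platform == 'ios':
        response = apns.send(device_token, notification.payload)
    else:
        response = fcm.send(device_token, notification.payload)

    if response.status == 'InvalidToken':
        # Token expired or app uninstalled: remove from DB, don't retry
        device_token_db.delete(device_token)
    elif response.status == 'success':
        # The provider accepted the message; arrival on the device is not
        # confirmed at this point, so record 'sent' rather than 'delivered'.
        mark_sent(notification.id, 'push')
    else:
        # Transient failure: requeue with backoff and bump the attempt count
        push_queue.requeue(notification, attempt=attempt + 1,
                           delay=exponential_backoff(attempt))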

Email

Use a transactional email provider (SendGrid, AWS SES, Mailgun). Avoid running your own SMTP server: good deliverability depends on IP reputation that takes a long time to build.

def send_email(notification, user_email):
    sendgrid.send(
        to=user_email,
        subject=render(notification.template.subject, notification.params),
        html_body=render(notification.template.html, notification.params),
        unsubscribe_link=generate_unsub_link(notification.user_id)
    )
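
The `generate_unsub_link` helper above is not defined in this article. One plausible approach is a signed link, so that nobody can unsubscribe another user by guessing an ID. The sketch below assumes an HMAC-SHA256 signature; the secret, the `example.com` URL, and the `uid`/`sig` query parameter names are placeholders.

```python
import hashlib
import hmac
from urllib.parse import urlencode

SECRET_KEY = b"replace-with-a-real-secret"     # placeholder, not a real key
BASE_URL = "https://example.com/unsubscribe"   # hypothetical endpoint


def _sign(user_id: int) -> str:
    """Hex HMAC-SHA256 of the user id under the server-side secret."""
    return hmac.new(SECRET_KEY, str(user_id).encode(), hashlib.sha256).hexdigest()


def generate_unsub_link(user_id: int) -> str:
    """Build a link whose signature proves it was issued for this user."""
    return f"{BASE_URL}?{urlencode({'uid': user_id, 'sig': _sign(user_id)})}"


def verify_unsub_request(user_id: int, sig: str) -> bool:
    """Constant-time check that the signature matches the user id."""
    return hmac.compare_digest(_sign(user_id), sig)
```

The unsubscribe endpoint would call `verify_unsub_request` before flipping the preference row, and the link works without requiring the user to log in.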

SMS

Use Twilio or AWS SNS. SMS has a per-message cost (roughly $0.0075 per message; actual pricing varies by provider, country, and carrier), so reserve it for high-priority notifications. Always include opt-out instructions (“Reply STOP to unsubscribe”); these are required by law or carrier rules in many jurisdictions.

Step 6: Retry Logic and Dead Letter Queue

Transient failures (provider outage, rate limiting, network blip) must be retried. Permanent failures (invalid token, unsubscribed email) must not be retried.

Retry strategy:
  Attempt 1: immediate
  Attempt 2: 30 seconds
  Attempt 3: 5 minutes
  Attempt 4: 30 minutes
  Attempt 5: 2 hours
  After 5 attempts: → Dead Letter Queue (DLQ)

DLQ:
  - Alerts on-call engineer
  - Stores failed notifications for manual inspection
  - Allows bulk replay after provider outage recovers

A DLQ is essential for any notification system that cannot afford to silently drop messages.
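
The retry schedule and the transient/permanent split can be combined into one decision function. This is a sketch under assumptions: the failure labels and the `next_action` name are illustrative, and the delays are copied from the schedule above.

```python
from typing import Optional, Tuple

# Delay in seconds before each attempt, taken from the schedule above.
# Attempt 1 is immediate; attempts 2 to 5 wait 30 s, 5 min, 30 min, 2 h.
RETRY_DELAYS = [0, 30, 5 * 60, 30 * 60, 2 * 60 * 60]

# Assumed labels for failures that should never be retried.
PERMANENT_FAILURES = {"invalid_token", "unsubscribed"}


def next_action(failure: str, attempts_made: int) -> Tuple[str, Optional[int]]:
    """Decide what to do after a failed attempt.

    Returns ("drop", None), ("dlq", None), or ("retry", delay_seconds).
    """
    if failure in PERMANENT_FAILURES:
        return ("drop", None)       # retrying cannot succeed
    if attempts_made >= len(RETRY_DELAYS):
        return ("dlq", None)        # all 5 attempts used
    return ("retry", RETRY_DELAYS[attempts_made])
```

For example, `next_action("timeout", 1)` yields a 30-second retry, while a fifth failed attempt goes to the DLQ. Adding random jitter to each delay is a common refinement that keeps retries from synchronizing after an outage.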

Step 7: Rate Limiting

Two dimensions of rate limiting:

Per-user rate limiting: Prevent spamming a single user. No more than 10 push notifications per user per hour for marketing; critical notifications bypass the limit.

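# Redis fixed-window counter per user per channel
def can_notify(user_id, channel, priority):
    if priority == CRITICAL:
        return True          # always allow critical

    key = f"notif_rate:{user_id}:{channel}"
    count = redis.incr(key)
    if count == 1:
        # The first hit in a window starts the 1-hour expiry. INCR and EXPIRE
        # are separate commands; run them in a Lua script or MULTI/EXEC if a
        # crash between them must not leave the key without a TTL.
        redis.expire(key, 3600)
    return count <= RATE_LIMITS[channel]  # e.g., 10/hr for push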

Provider rate limiting: APNs, FCM, and Twilio all have rate limits. The channel workers respect these by using leaky bucket rate limiting against the provider API.
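
A leaky bucket for a provider can be sketched as a meter: each outgoing request adds one unit to the bucket, the bucket drains at the provider's allowed rate, and a request that would overflow is held back. The class below is an in-process illustration with an injectable clock; the name `LeakyBucket` and its parameters are assumptions, and a multi-worker deployment would need shared state (for example in Redis) rather than a local object.

```python
import time
from typing import Callable


class LeakyBucket:
    """Leaky bucket used as a meter: each request adds one unit of 'water',
    which drains at a fixed rate. A request is rejected if it would overflow."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.level = 0.0
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drain whatever leaked out since the last call.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 > self.capacity:
            return False  # caller should wait or requeue, not call the provider
        self.level += 1
        return True
```

A channel worker would call `try_acquire()` before each provider request and delay the message when it returns `False`, which keeps the worker under the provider's limit instead of relying on the provider's throttling responses.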

Step 8: Batch / Scheduled Notifications

Marketing notifications (weekly digest, promotional email to 50M users) cannot be sent all at once: the workers would have to push 50M messages through rate-limited providers within minutes.

Solution: Segment users into batches of ~10K. Schedule each batch to a time window staggered across hours. Store scheduled notifications in a time-indexed table:

scheduled_notifications (
    schedule_id  UUID,
    campaign_id  UUID,
    user_segment VARCHAR(50),
    send_after   TIMESTAMP,      -- staggered across 6 hours
    status       VARCHAR(20)
)

A scheduler service polls this table and enqueues batches into Kafka as their send_after time arrives.
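
The staggering step can be sketched as a function that splits a user list into batches and assigns each one an evenly spaced `send_after` time. The function name and defaults are illustrative; the 10K batch size and 6-hour window come from the description above.

```python
from datetime import datetime, timedelta
from typing import Iterator, List, Sequence, Tuple


def stagger_batches(
    user_ids: Sequence[int],
    start: datetime,
    window: timedelta = timedelta(hours=6),
    batch_size: int = 10_000,
) -> Iterator[Tuple[datetime, List[int]]]:
    """Yield (send_after, batch) pairs spread evenly across the window."""
    if not user_ids:
        return
    num_batches = -(-len(user_ids) // batch_size)  # ceiling division
    step = window / num_batches                    # gap between batches
    for i in range(num_batches):
        batch = list(user_ids[i * batch_size:(i + 1) * batch_size])
        yield start + i * step, batch
```

With 25,000 users and the defaults, this produces three batches scheduled two hours apart. Each pair would become one row in `scheduled_notifications`.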

High-Level Architecture

Trigger Services (Order, Auth, Social)
    ↓ REST/gRPC
Notification API
    ↓
Kafka "notification_requests"
    ↓
Notification Workers
    ├─ Preference check (Redis cache)
    ├─ Template rendering
    └─ Fan out to channel queues
         ├─ Kafka "push"  → Push Worker  → APNs / FCM
         ├─ Kafka "email" → Email Worker → SendGrid / SES
         └─ Kafka "sms"   → SMS Worker   → Twilio
              ↓ (all channels)
         Delivery DB (Cassandra/Postgres)
              ↓ failures
         Dead Letter Queue → Alerts + Replay

Follow-up Questions

Q: How do you handle provider outages (e.g., APNs goes down for 2 hours)?
Messages queue up in Kafka, which retains them durably as long as the topic's retention period exceeds the outage. When APNs recovers, workers drain the backlog. For long outages, fall back to an alternative channel: for critical notifications, send email if push keeps failing.
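
The fallback rule can be sketched as a small routing function. Everything here is an assumption made for illustration: the `pick_channel` name, the idea of a `healthy` set maintained by some health check, and a fallback map that contains only the push-to-email case mentioned above.

```python
from typing import Dict, Optional, Set

CRITICAL = 1  # matches priority 1 in the notifications table

# Assumed fallback order; the text only specifies push -> email.
FALLBACK_CHANNEL: Dict[str, str] = {"push": "email"}


def pick_channel(channel: str, priority: int, healthy: Set[str],
                 enabled: Set[str]) -> Optional[str]:
    """Return the channel to send on now, or None to leave the message queued."""
    if channel in healthy:
        return channel
    fallback = FALLBACK_CHANNEL.get(channel)
    # Only critical notifications switch channels, and only to a channel
    # the user has enabled and whose provider is currently up.
    if priority == CRITICAL and fallback in healthy and fallback in enabled:
        return fallback
    return None
```

Non-critical messages return `None` and simply wait in the queue, which avoids surprising users with marketing email because a push provider was down.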

Q: How do you track whether a user actually read the notification?
For push: the app reports an “opened” event back to the analytics service when the user taps. For email: embed a 1×1 tracking pixel; when the email client loads the image, it hits your server, recording an open. Treat email open counts as approximate, because some clients block images and others prefetch them. For SMS: no open tracking (delivery receipt only).

Q: How do you handle multi-device users?
The device_tokens table has one row per device, multiple devices per user. The Push Worker sends to all active devices for the user. If one device’s token is invalid (old phone), remove it. Deduplicate on the client to avoid showing the same notification twice.
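
A sketch of the multi-device send loop is shown below. The `send` callable is a stand-in for the APNs or FCM client and is assumed to return `'success'`, `'InvalidToken'`, or some other string for a transient error; the function only sorts tokens by outcome and leaves deletion and requeueing to the caller.

```python
from typing import Callable, Dict, List


def send_to_all_devices(
    tokens: List[str],
    payload: dict,
    send: Callable[[str, dict], str],
) -> Dict[str, List[str]]:
    """Send to every token and group the tokens by outcome."""
    result: Dict[str, List[str]] = {"sent": [], "invalid": [], "retry": []}
    for token in tokens:
        status = send(token, payload)
        if status == "success":
            result["sent"].append(token)
        elif status == "InvalidToken":
            result["invalid"].append(token)   # caller deletes these rows
        else:
            result["retry"].append(token)     # caller requeues with backoff
    return result
```

Including the `notification_id` in the payload gives the client a key to deduplicate on when the same notification reaches several devices or is redelivered.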

Summary

A notification system is an async fanout pipeline. Kafka decouples ingestion from delivery and handles spikes. Channel workers send through third-party providers (APNs, FCM, SendGrid, Twilio) rather than in-house delivery infrastructure. Rate limiting prevents spamming users; user preference checks prevent sending unwanted notifications. Retry transient failures with increasing backoff, do not retry permanent failures, and route messages that exhaust their retries to a DLQ. Batch marketing campaigns are staggered to avoid a thundering herd against providers.
