Designing a notification system comes up in almost every senior system design interview because it surfaces real architectural challenges: fanout at scale, multi-channel delivery, rate limiting, retry logic, and provider reliability. Facebook, Twitter, and Uber all send billions of notifications per day through systems that look similar at their core.
Step 1: Clarify Requirements
- Channels: Push notifications (iOS APNs, Android FCM), email, SMS — which do we support?
- Scale: Notifications per day? Triggered by user actions, or also scheduled/marketing?
- Latency: Real-time (< 1 second for transaction alerts) vs. near-real-time (< 1 min for marketing) vs. batch (nightly digest)?
- Reliability: At-least-once? What’s the acceptable loss rate?
- Preferences: Users can opt out of channels/types?
- Templates: Parameterized content (“Hi {name}, your order {order_id} shipped”) or raw strings?
Assume: All three channels (push, email, SMS), 10M notifications/day (115/sec average, 1,000/sec peak), real-time transactional + batch marketing, at-least-once delivery, user preference controls, template-based content.
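The throughput figures above follow from straightforward arithmetic, and stating the back-of-envelope math explicitly earns credit in an interview. A quick sanity check (the 10x peak factor is an assumed burst multiplier, not a measured value):

```python
# Back-of-envelope check for the stated scale assumptions.
NOTIFICATIONS_PER_DAY = 10_000_000
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

average_rps = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY   # ~115.7/sec average

PEAK_FACTOR = 10                        # assumed burst multiplier (flash sale)
peak_rps = average_rps * PEAK_FACTOR    # ~1,157/sec, rounded to ~1,000 above
```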
Step 2: High-Level Flow
Trigger source (order service, etc.)
↓
Notification Service API
↓
Kafka "notification_requests" topic
↓
Notification Workers (consumer group)
├─ Fetch user preferences (does user want this notification? on which channels?)
├─ Fetch user device tokens / email / phone from User Service
├─ Render template with personalized content
└─ Route to channel-specific queues:
├─ Kafka "push_notifications" → Push Worker → APNs / FCM
├─ Kafka "email_notifications" → Email Worker → SendGrid / SES
└─ Kafka "sms_notifications" → SMS Worker → Twilio / SNS
The Kafka layer is critical. It decouples triggering services from delivery, absorbs traffic spikes (a flash sale triggering 5M emails simultaneously), and provides natural retry semantics.
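One way to make the decoupling concrete is to pin down the message envelope published to the notification_requests topic. The dataclass below is an illustrative assumption, not a fixed wire format; its fields mirror the notifications table in Step 3:

```python
import json
from dataclasses import asdict, dataclass, field
from uuid import uuid4

@dataclass
class NotificationRequest:
    """Value schema for the 'notification_requests' Kafka topic (assumed)."""
    user_id: int
    type: str            # e.g. 'order_shipped'
    template_id: str
    params: dict         # e.g. {"order_id": "123", "carrier": "UPS"}
    priority: int = 2    # 1=critical, 2=normal, 3=marketing
    notification_id: str = field(default_factory=lambda: str(uuid4()))

    def to_kafka_value(self) -> bytes:
        # Kafka payloads are opaque bytes; JSON keeps them debuggable.
        return json.dumps(asdict(self)).encode("utf-8")

msg = NotificationRequest(
    user_id=42,
    type="order_shipped",
    template_id="order_shipped_v2",
    params={"order_id": "123", "carrier": "UPS"},
)
decoded = json.loads(msg.to_kafka_value())
```

With a client such as kafka-python, the producer call would look like `producer.send("notification_requests", value=msg.to_kafka_value(), key=str(msg.user_id).encode())`; keying by user_id keeps one user's notifications ordered within a partition.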
Step 3: Data Model
-- Notification events
notifications (
notification_id UUID PRIMARY KEY,
user_id BIGINT,
type VARCHAR(50), -- 'order_shipped', 'friend_request', etc.
template_id VARCHAR(50),
params JSONB, -- {"order_id": "123", "carrier": "UPS"}
priority SMALLINT, -- 1=critical, 2=normal, 3=marketing
created_at TIMESTAMP DEFAULT NOW()
)
-- Delivery tracking (one row per channel attempt)
notification_deliveries (
delivery_id UUID PRIMARY KEY,
notification_id UUID REFERENCES notifications,
channel VARCHAR(10), -- 'push', 'email', 'sms'
status VARCHAR(20), -- 'pending', 'sent', 'delivered', 'failed'
provider VARCHAR(20), -- 'apns', 'fcm', 'sendgrid', 'twilio'
sent_at TIMESTAMP,
delivered_at TIMESTAMP,
failure_reason TEXT,
attempt_count SMALLINT DEFAULT 0
)
-- User notification preferences
user_notification_prefs (
user_id BIGINT,
notification_type VARCHAR(50),
channel VARCHAR(10),
enabled BOOLEAN DEFAULT TRUE,
PRIMARY KEY (user_id, notification_type, channel)
)
-- Device tokens
device_tokens (
user_id BIGINT,
platform VARCHAR(10), -- 'ios', 'android'
token VARCHAR(256),
created_at TIMESTAMP,
PRIMARY KEY (user_id, token)
)
Step 4: Preference Checking and Fanout
Before sending, the Notification Worker must check:
- User opt-out: Has the user disabled this notification type or channel?
- Global suppression: Is the user on a do-not-disturb list (e.g., legal hold)?
- Quiet hours: Is it 2 AM in the user’s timezone? (Defer non-critical notifications.)
- Unsubscribed: Did the user unsubscribe from marketing emails?
def process_notification(notification):
    user_prefs = preference_cache.get(notification.user_id)  # Redis cache
    channels_to_send = []
    for channel in ['push', 'email', 'sms']:
        if not user_prefs.is_enabled(notification.type, channel):
            continue
        if is_quiet_hours(notification.user_id, channel) and notification.priority != CRITICAL:
            # Defer non-critical notifications until morning -- don't drop them
            defer_until_quiet_hours_end(notification, channel)
            continue
        channels_to_send.append(channel)
    for channel in channels_to_send:
        channel_queue.publish(channel, {
            'notification_id': notification.id,
            'user_id': notification.user_id,
            'rendered_content': render_template(notification),
            'channel': channel,
        })
User preferences are cached in Redis — they’re read on every notification but change rarely.
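The `is_quiet_hours` check used in the pseudocode above can be sketched with the standard zoneinfo module. The 10 PM to 8 AM window and passing the user's timezone name directly (rather than a user_id lookup) are assumptions for the sketch:

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

QUIET_START = time(22, 0)   # assumed window: 10 PM ...
QUIET_END = time(8, 0)      # ... to 8 AM local time

def is_quiet_hours(user_tz: str, now_utc: datetime) -> bool:
    """True when the user's local time is inside the overnight window."""
    local = now_utc.astimezone(ZoneInfo(user_tz)).time()
    # The window wraps midnight: quiet if after 10 PM OR before 8 AM.
    return local >= QUIET_START or local < QUIET_END

# 06:00 UTC on June 1 is 02:00 in New York (EDT): quiet hours.
assert is_quiet_hours("America/New_York", datetime(2024, 6, 1, 6, 0, tzinfo=ZoneInfo("UTC")))
```

A production worker would re-enqueue deferred notifications with a delay until the window ends rather than dropping them.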
Step 5: Channel Workers and Third-Party Providers
Each channel worker sends to a third-party provider:
Push Notifications
iOS → Apple Push Notification Service (APNs) — HTTP/2 API
Android → Firebase Cloud Messaging (FCM) — HTTP API
def send_push(notification, device_token, platform, attempt=1):
    if platform == 'ios':
        response = apns.send(device_token, notification.payload)
    else:
        response = fcm.send(device_token, notification.payload)

    if response.status == 'InvalidToken':
        # Token expired (app uninstalled, device wiped) -- remove from DB, don't retry
        device_token_db.delete(device_token)
    elif response.status == 'success':
        mark_delivered(notification.id, 'push')
    else:
        # Transient failure -- requeue with backoff
        push_queue.requeue(notification, delay=exponential_backoff(attempt))
Email
Use a transactional email provider (SendGrid, AWS SES, Mailgun). Never run your own SMTP server — deliverability requires years of IP reputation building.
def send_email(notification, user_email):
sendgrid.send(
to=user_email,
subject=render(notification.template.subject, notification.params),
html_body=render(notification.template.html, notification.params),
unsubscribe_link=generate_unsub_link(notification.user_id)
)
SMS
Use Twilio or AWS SNS. SMS has per-message cost (~$0.0075/message) — only send for high-priority notifications. Always include opt-out instructions (“Reply STOP to unsubscribe”) — required by law in most jurisdictions.
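The cost and compliance constraints translate into a simple gate in the SMS worker. A sketch, with the helper names and the critical-only policy as assumptions:

```python
CRITICAL, NORMAL, MARKETING = 1, 2, 3

def should_send_sms(priority: int, opted_out: bool) -> bool:
    """Gate SMS sends: STOP opt-outs are absolute, and the per-message
    cost (~$0.0075) means only critical notifications justify the channel."""
    if opted_out:
        return False              # 'Reply STOP' opt-outs must always be honored
    return priority == CRITICAL

def render_sms(body: str) -> str:
    # Append opt-out instructions, required by law in most jurisdictions.
    return f"{body} Reply STOP to unsubscribe."
```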
Step 6: Retry Logic and Dead Letter Queue
Transient failures (provider outage, rate limiting, network blip) must be retried. Permanent failures (invalid token, unsubscribed email) must not be retried.
Retry strategy:
Attempt 1: immediate
Attempt 2: 30 seconds
Attempt 3: 5 minutes
Attempt 4: 30 minutes
Attempt 5: 2 hours
After 5 attempts: → Dead Letter Queue (DLQ)
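The schedule above is small enough to encode as a lookup table; attempt numbers past the table route to the DLQ:

```python
# Delays in seconds matching the schedule above: immediate, 30s, 5m, 30m, 2h.
RETRY_DELAYS = [0, 30, 300, 1800, 7200]
MAX_ATTEMPTS = len(RETRY_DELAYS)

def next_action(attempt: int):
    """attempt is 1-based. Returns ('retry', delay_seconds) or ('dlq', None)."""
    if attempt <= MAX_ATTEMPTS:
        return ("retry", RETRY_DELAYS[attempt - 1])
    return ("dlq", None)
```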
DLQ:
- Alerts on-call engineer
- Stores failed notifications for manual inspection
- Allows bulk replay after provider outage recovers
The DLQ (see Message Queues) is essential for any notification system that cannot afford to silently drop messages.
Step 7: Rate Limiting
Two dimensions of rate limiting:
Per-user rate limiting: Prevent spamming a single user. No more than 10 push notifications per user per hour for marketing; critical notifications bypass the limit.
# Redis fixed-window counter per user per channel
def can_notify(user_id, channel, priority):
    if priority == CRITICAL:
        return True  # always allow critical
    key = f"notif_rate:{user_id}:{channel}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, 3600)  # 1-hour window
    return count <= RATE_LIMITS[channel]  # e.g., 10/hr for push
Provider rate limiting: APNs, FCM, and Twilio all have rate limits. The channel workers respect these by using leaky bucket rate limiting against the provider API.
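A minimal in-process leaky bucket for the provider-facing side is sketched below. Real deployments share this state across workers (e.g., in Redis), and the rate here is a configuration knob, not a documented provider limit:

```python
import time

class LeakyBucket:
    """Smooths outbound calls to a provider API to a steady drip rate."""

    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec    # drain rate (requests/sec)
        self.capacity = capacity    # max burst the bucket tolerates
        self.level = 0.0
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Drain the bucket in proportion to elapsed time.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False                # caller requeues or sleeps

bucket = LeakyBucket(rate_per_sec=1, capacity=5)
results = [bucket.try_acquire() for _ in range(6)]  # burst of 6
```

The first five calls in the burst succeed immediately; the sixth is rejected until the bucket drains.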
Step 8: Batch / Scheduled Notifications
Marketing notifications (weekly digest, a promotional email to 50M users) cannot be enqueued all at once: the spike would swamp the workers, blow past provider rate limits, and starve real-time transactional traffic sharing the pipeline.
Solution: Segment users into batches of ~10K and assign each batch a send window staggered across several hours. Store scheduled notifications in a time-indexed table:
scheduled_notifications (
schedule_id UUID,
campaign_id UUID,
user_segment VARCHAR(50),
send_after TIMESTAMP, -- staggered across 6 hours
status VARCHAR(20)
)
A scheduler service polls this table and enqueues batches into Kafka as their send_after time arrives.
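The writer side of that scheduler might populate the table as follows. The 10K batch size comes from the text; the even 6-hour stagger and the in-memory row format are assumptions for the sketch:

```python
from datetime import datetime, timedelta

BATCH_SIZE = 10_000
STAGGER_WINDOW = timedelta(hours=6)

def schedule_campaign(user_ids: list, start: datetime) -> list:
    """Split a campaign into batches with evenly staggered send_after times."""
    batches = [user_ids[i:i + BATCH_SIZE]
               for i in range(0, len(user_ids), BATCH_SIZE)]
    step = STAGGER_WINDOW / max(len(batches), 1)
    return [
        {"batch": batch, "send_after": start + i * step, "status": "pending"}
        for i, batch in enumerate(batches)
    ]

rows = schedule_campaign(list(range(25_000)), datetime(2024, 6, 1, 9, 0))
# Three batches: two full 10K batches plus a 5K remainder, 2 hours apart.
```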
High-Level Architecture
Trigger Services (Order, Auth, Social)
↓ REST/gRPC
Notification API
↓
Kafka "notification_requests"
↓
Notification Workers
├─ Preference check (Redis cache)
├─ Template rendering
└─ Fan out to channel queues
├─ Kafka "push" → Push Worker → APNs / FCM
├─ Kafka "email" → Email Worker → SendGrid / SES
└─ Kafka "sms" → SMS Worker → Twilio
↓ (all channels)
Delivery DB (Cassandra/Postgres)
↓ failures
Dead Letter Queue → Alerts + Replay
Follow-up Questions
Q: How do you handle provider outages (e.g., APNs goes down for 2 hours)?
Messages queue up in Kafka (durable, retained). When APNs recovers, workers drain the backlog. For long outages, fall back to alternative channels — if push fails, send email instead for critical notifications.
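The cross-channel fallback can be a static escalation map. Restricting it to critical notifications (an assumption here) avoids turning a failed marketing push into unwanted email:

```python
CRITICAL = 1
# Assumed escalation order when a channel's provider is down.
FALLBACK = {"push": "email", "email": "sms", "sms": None}

def fallback_channel(failed_channel: str, priority: int):
    """Return the next channel to try, or None to wait for recovery."""
    if priority != CRITICAL:
        return None        # non-critical traffic just waits in Kafka's backlog
    return FALLBACK.get(failed_channel)
```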
Q: How do you track whether a user actually read the notification?
For push: the app reports an “opened” event back to the analytics service when the user taps. For email: embed a 1×1 tracking pixel — when the email client loads the image, it hits your server, recording an open (note this undercounts clients that block images and overcounts clients that prefetch them, such as Apple's Mail Privacy Protection). For SMS: no open tracking (delivery receipt only).
Q: How do you handle multi-device users?
The device_tokens table has one row per device, multiple devices per user. The Push Worker sends to all active devices for the user. If one device’s token is invalid (old phone), remove it. Deduplicate on the client to avoid showing the same notification twice.
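Client-side dedup is a small amount of state keyed on notification_id. A device-side sketch (the class and method names are illustrative):

```python
class NotificationInbox:
    """Per-device display log: suppress a notification_id seen before,
    e.g. one delivered once via push and again via a fallback channel."""

    def __init__(self):
        self._seen = set()

    def should_display(self, notification_id: str) -> bool:
        if notification_id in self._seen:
            return False
        self._seen.add(notification_id)
        return True

inbox = NotificationInbox()
```

Each device keeps its own set, so a multi-device user still sees the notification once per device, which is the intended behavior.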
Summary
A notification system is an async fanout pipeline. Kafka decouples ingestion from delivery and handles spikes. Channel workers send to third-party providers (APNs, FCM, SendGrid, Twilio) — never build your own. Rate limiting prevents spamming users; user preference checks prevent sending unwanted notifications. Retry with exponential backoff for transient failures; route permanent failures to a DLQ. Batch marketing campaigns are staggered to avoid thundering herd against providers.
Related System Design Topics
- Message Queues — Kafka fan-out is the backbone of notification delivery
- Caching Strategies — rate-limit state and user preference lookups
- API Design — REST endpoints for subscription management
- Load Balancing — distribute worker pods across channels (email/push/SMS)
- Database Sharding — sharding the notification log by user_id
Companies That Ask This System Design Question
This problem type commonly appears in interviews at:
- Uber Interview Guide
- Airbnb Interview Guide
- Meta Interview Guide
- Twitch Interview Guide
- LinkedIn Interview Guide
See our company interview guides for the full interview process, compensation data, and preparation tips.