Notification System Requirements
A notification system routes messages from producers (application events) to consumers (users) via multiple channels: push notifications (iOS/Android), email, SMS, in-app notifications, and Slack/Teams webhooks. Scale requirements for a large platform (100M users): tens of millions of notifications per hour during peak events (breaking news, sports scores, marketing campaigns), delivery latency < 5 seconds for real-time notifications, high deliverability (98%+ for email, 95%+ for push), and per-user preference management (users opt out of specific notification types).
Notification Channels
- Push notifications (iOS): sent via Apple Push Notification service (APNs). The server authenticates with APNs using a JWT or certificate, sends a JSON payload with the device token, and APNs delivers to the device. Device tokens are per-app-install and change when the app is reinstalled. Token churn must be handled: with the HTTP/2 API, APNs reports invalid tokens via error responses (410 Gone / BadDeviceToken), which should trigger removal from the database. (The legacy binary-protocol feedback service is deprecated.)
- Push notifications (Android): sent via Firebase Cloud Messaging (FCM). Similar to APNs; registration tokens replace device tokens. FCM also supports topic subscriptions (broadcast to all subscribers of “sports:nba”) and device groups.
- Email: sent via SMTP or transactional email APIs (SendGrid, Mailgun, SES). Deliverability is complex: SPF/DKIM/DMARC records, sender reputation, bounce handling, unsubscribe management (CAN-SPAM, GDPR requirements).
- SMS: sent via Twilio, AWS SNS, or direct carrier connections. Most expensive per-message. Use for critical alerts (security codes, two-factor authentication, payment confirmations) where push/email may not be seen quickly.
Architecture: Fanout Service
# Notification flow:
1. Event producer publishes event to Kafka topic (e.g., "user.liked.your.post")
2. Notification Service consumes event, evaluates:
   - Does user X have notifications enabled for this event type?
   - Is user X in the rate limit window? (max N notifications/hour)
   - Is user X in a quiet hours window? (no notifications 11 PM - 7 AM)
3. If eligible, look up user's device tokens, email, phone
4. Fan out to delivery workers via channel-specific queues:
   Kafka topic "notifications.push"  → APNs/FCM worker
   Kafka topic "notifications.email" → SendGrid worker
   Kafka topic "notifications.sms"   → Twilio worker
5. Workers deliver and record status in notifications table

# For large-scale fanout (e.g., broadcast to 50M users):
# Do NOT generate 50M individual notification records synchronously
# Instead: create one "broadcast" record with a template + recipient criteria
# Workers query the database: SELECT device_tokens FROM users WHERE ... LIMIT 1000
# Process in batches with parallel workers; use FCM multicast and multiplexed APNs connections
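The batched fanout loop above can be sketched as follows. This is a sketch, not the system's actual implementation: `fetch_user_batch` and `publish_batch` are hypothetical stand-ins for the cursor-paginated database query and the Kafka producer.

```python
def fan_out_broadcast(fetch_user_batch, publish_batch, batch_size=1000):
    """Walk the user table with a cursor and publish one message per batch.

    fetch_user_batch(last_id, limit) stands in for e.g.
      SELECT id, device_token FROM users WHERE id > %s ORDER BY id LIMIT %s
    publish_batch(tokens) stands in for the Kafka producer.
    """
    last_id = 0
    batches = 0
    while True:
        rows = fetch_user_batch(last_id, batch_size)
        if not rows:
            break  # cursor exhausted: every matching user has been enqueued
        publish_batch([token for _, token in rows])
        last_id = rows[-1][0]  # advance the cursor past the last seen id
        batches += 1
    return batches
```

Cursor pagination (`WHERE id > last_id`) rather than `OFFSET` keeps each batch query an index seek, so batch N is as cheap as batch 1.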
Delivery Tracking and Receipts
-- Notifications table
CREATE TABLE notifications (
    id UUID PRIMARY KEY,
    user_id BIGINT NOT NULL,
    type VARCHAR(50) NOT NULL,     -- 'like', 'comment', 'marketing'
    channel VARCHAR(20) NOT NULL,  -- 'push', 'email', 'sms'
    payload JSONB,
    status VARCHAR(20) DEFAULT 'pending',  -- pending/sent/delivered/failed
    sent_at TIMESTAMPTZ,
    delivered_at TIMESTAMPTZ,
    failed_reason TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Index for user notification history:
CREATE INDEX idx_notifications_user ON notifications(user_id, created_at DESC);
-- Delivery receipt handling:
-- APNs: the HTTP/2 response status confirms acceptance by APNs, not delivery to the device
-- FCM: the legacy API reported canonical_ids; the HTTP v1 API returns per-token error codes
-- Email: track opens (1px tracking pixel) and clicks (link redirect)
-- SMS: Twilio sends delivery status webhooks (delivered/undelivered/failed)
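Each provider reports status in its own vocabulary, so workers typically normalize receipts into the notifications table's pending/sent/delivered/failed states. The mapping below is illustrative; the status names are assumed from Twilio and SendGrid webhook conventions, not a complete list.

```python
# Hypothetical mapping from provider callback statuses to our internal states.
PROVIDER_STATUS = {
    # Twilio SMS status-callback values
    "delivered": "delivered",
    "undelivered": "failed",
    "failed": "failed",
    "sent": "sent",
    # SendGrid-style email event names
    "bounce": "failed",
    "dropped": "failed",
    "open": "delivered",  # an open implies the mail reached the inbox
}

def normalize_receipt(provider_status: str) -> str:
    """Map a provider callback status to an internal state; unknown -> 'sent'."""
    return PROVIDER_STATUS.get(provider_status, "sent")
```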
User Preferences and Rate Limiting
-- User notification preferences
CREATE TABLE notification_preferences (
    user_id BIGINT NOT NULL,
    notification_type VARCHAR(50) NOT NULL,  -- 'social', 'marketing', 'security'
    channel VARCHAR(20) NOT NULL,            -- 'push', 'email', 'sms'
    enabled BOOLEAN DEFAULT TRUE,
    quiet_hours_start TIME,  -- local time
    quiet_hours_end TIME,
    timezone VARCHAR(50),
    PRIMARY KEY (user_id, notification_type, channel)
);
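Most users never change the defaults, so one common refinement is to store only per-user overrides and fall back to organizational defaults for everyone else. The sketch below illustrates the lookup order; `is_enabled`, the default values, and the override shape are all hypothetical.

```python
# Organizational defaults per (notification_type, channel); illustrative values.
DEFAULTS = {
    ("security", "push"): True,
    ("security", "sms"): True,
    ("social", "push"): True,
    ("marketing", "email"): True,
    ("marketing", "push"): False,
}

def is_enabled(user_overrides: dict, notification_type: str, channel: str) -> bool:
    """Resolve a preference: explicit user override wins, then org default."""
    key = (notification_type, channel)
    if key in user_overrides:
        return user_overrides[key]      # the user made an explicit choice
    return DEFAULTS.get(key, False)     # unknown type/channel pairs default off
```

Only storing overrides shrinks the preference table dramatically and keeps the cache small.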
# Rate limiting per user per channel:
# Security alerts: unlimited (always send)
# Social (likes, comments): max 10/hour, max 50/day
# Marketing: max 1/day, max 5/week
# System: max 3/hour
import datetime

def should_send(user_id, notification_type, channel):
    prefs = db.get_preferences(user_id, notification_type, channel)
    if not prefs.enabled:
        return False, "user_opt_out"
    # Check quiet hours
    user_local_time = get_local_time(prefs.timezone)
    if is_quiet_hours(user_local_time, prefs.quiet_hours_start, prefs.quiet_hours_end):
        return False, "quiet_hours"  # or: schedule for after quiet hours
    # Check rate limits (Redis counters); the key rotates every hour
    current_hour = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d%H")
    hourly_key = f"notif_rate:{user_id}:{channel}:{current_hour}"
    count = redis.incr(hourly_key)
    if count == 1:
        redis.expire(hourly_key, 3600)  # set TTL when the window opens, not only on overflow
    if count > prefs.max_per_hour:
        return False, "rate_limited"
    return True, None
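The `is_quiet_hours` helper used above is left undefined; the subtle case is a window like 11 PM - 7 AM that crosses midnight, where a naive `start <= now < end` check fails. One possible implementation:

```python
from datetime import time

def is_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside the quiet window.

    Handles both same-day windows (13:00-15:00) and windows that
    cross midnight (23:00-07:00).
    """
    if start is None or end is None:
        return False  # no quiet hours configured
    if start <= end:
        return start <= now < end          # same-day window
    return now >= start or now < end       # window wraps past midnight
```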
Deduplication
Duplicate notifications happen when the producer publishes the same event multiple times (at-least-once Kafka delivery) or when retries send the same notification twice. Deduplication: generate a deterministic idempotency key for each notification (e.g., hash of event_id + user_id + channel + hour). Before sending, check if this key exists in Redis. If it does, skip sending. After sending, set the key in Redis with a TTL of 24 hours.
import hashlib
import redis

r = redis.Redis()

def send_notification_deduped(event_id, user_id, channel, payload):
    # Deterministic key: the same input always produces the same key
    dedup_key = hashlib.sha256(
        f"{event_id}:{user_id}:{channel}".encode()
    ).hexdigest()
    # SET NX: only set if the key doesn't exist (atomic check-and-set)
    if not r.set(f"notif_dedup:{dedup_key}", 1, nx=True, ex=86400):
        return "duplicate_skipped"
    # Send the notification
    deliver(user_id, channel, payload)
    return "sent"
Handling Push Token Invalidation
Device tokens become invalid when users reinstall the app, clear app data, or disable notifications. Sending to invalid tokens wastes resources and can damage sender reputation with APNs/FCM.
# APNs HTTP/2 response handling:
# 410 Gone: token is permanently invalid — delete from database immediately
# 400 Bad device token: invalid format — remove from database
# 429 Too Many Requests: back off exponentially
# FCM error handling:
# UNREGISTERED: token invalid — delete from database
# INVALID_ARGUMENT: token malformed — delete
# SENDER_ID_MISMATCH: token belongs to different project
import time

def process_apns_response(device_token, status_code, error_code):
    if status_code == 410 or error_code in ("BadDeviceToken", "Unregistered"):
        db.execute("DELETE FROM device_tokens WHERE token = %s", (device_token,))
        return "token_removed"
    if status_code == 429:
        time.sleep(exponential_backoff())  # exponential_backoff() per your retry policy
        return "rate_limited"
    if status_code == 200:
        return "delivered"
    return "retry_later"  # other 4xx/5xx: requeue for retry
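A companion handler for the FCM HTTP v1 error names listed above might look like the following. This is a sketch: `db.delete_token` is a hypothetical helper, and the transient-error list is not exhaustive.

```python
# Error names from the FCM HTTP v1 API that mean the token is dead.
FCM_DEAD_TOKEN_ERRORS = {"UNREGISTERED", "INVALID_ARGUMENT"}

def process_fcm_response(device_token, error_code, db):
    if error_code is None:
        return "delivered"  # FCM accepted the message
    if error_code in FCM_DEAD_TOKEN_ERRORS:
        db.delete_token(device_token)  # remove the dead token immediately
        return "token_removed"
    if error_code == "SENDER_ID_MISMATCH":
        # Token was issued for a different Firebase project; retrying won't help
        db.delete_token(device_token)
        return "wrong_project"
    return "retry_later"  # transient errors (e.g. QUOTA_EXCEEDED, UNAVAILABLE)
```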
Email Deliverability
- SPF (Sender Policy Framework): DNS TXT record listing IP addresses authorized to send email for your domain. Receiving mail servers may reject or spam-folder mail from unauthorized IPs, especially when your DMARC policy enforces it.
- DKIM (DomainKeys Identified Mail): cryptographic signature in email headers, verified using a public key in DNS. Proves the email was not tampered with in transit.
- DMARC: policy that tells ISPs what to do with emails failing SPF/DKIM (reject, quarantine, or none). Prevents spoofing of your domain in phishing attacks.
- Bounce handling: hard bounces (invalid address — 5xx) must be removed immediately. Soft bounces (mailbox full, server busy — 4xx) can be retried. High bounce rate damages sender reputation.
- Unsubscribe: CAN-SPAM requires a working unsubscribe link in marketing emails. Honor opt-outs within 10 business days (legal requirement). List-Unsubscribe header enables one-click unsubscribe in email clients.
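The bounce rules above can be sketched as a classifier over SMTP reply codes. This is a simplification: real providers also emit enhanced status codes (e.g. 5.1.1) and webhook event types that give finer-grained reasons.

```python
def classify_bounce(smtp_code: int) -> str:
    """Classify an SMTP reply code per the hard/soft bounce rules above."""
    if 500 <= smtp_code <= 599:
        return "hard"   # permanent failure (invalid address): suppress immediately
    if 400 <= smtp_code <= 499:
        return "soft"   # transient failure (mailbox full, server busy): retry with backoff
    return "none"       # 2xx/3xx: accepted, not a bounce
```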
Interview Questions
- Design a notification system for a social media platform with 500M users
- How do you handle sending a marketing blast to 50M users within 1 hour?
- A user receives the same notification 5 times — debug and fix the root cause
- How do you implement notification grouping (bundle 5 “like” notifications into one)?
- Design the preference system so users can control notification frequency without database bottlenecks
Frequently Asked Questions
How do you design a notification fanout system for 50 million users?
Sending a notification to 50 million users (a marketing campaign, breaking news alert) requires a distributed fanout architecture: you cannot generate 50M individual messages synchronously. The pipeline: (1) Create a single notification campaign record in the database: {campaign_id, template, target_criteria, scheduled_at}. (2) A campaign orchestrator queries users matching the target criteria in batches of 10,000-50,000 users using cursor-based pagination (WHERE user_id > last_processed_id ORDER BY user_id LIMIT 50000). For each batch, it publishes one "batch notification" message to a Kafka topic per channel (push, email, SMS). (3) Channel-specific fanout workers consume from Kafka. Each worker takes a batch of user IDs, fetches their device tokens/email addresses from the database (or a dedicated device registry service), and calls the delivery API efficiently: APNs accepts one device token per request, but HTTP/2 multiplexing allows many concurrent streams per connection (Apple permits up to 1,000), so a small connection pool sustains a high request rate; Firebase FCM offers a multicast send (up to 500 tokens per call); SendGrid supports batch email sends. (4) Delivery status is tracked asynchronously via webhooks and stored in a partitioned notifications_status table. (5) Throttling: most platforms limit your send rate. APNs throttles based on app traffic patterns; email providers rate-limit by IP reputation. Use token bucket rate limiting at the worker level to stay within limits. At 50M device tokens and a sustained aggregate rate of 50,000 APNs requests/second across the worker fleet, the push leg takes roughly 1,000 seconds (about 17 minutes), acceptable for non-time-sensitive campaigns.
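The worker-level token bucket mentioned in step (5) can be sketched in-process as below. A production worker fleet would typically share the bucket state via Redis rather than keep it per-process; the class and parameter names are illustrative.

```python
import time

class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # sustained sends per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        """Take n tokens if available; return False (caller should wait) otherwise."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A worker calls `try_acquire()` before each delivery request and sleeps briefly on `False`, which caps its sustained rate at `rate` while still allowing short bursts up to `capacity`.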
How do you handle notification deduplication and prevent users from receiving duplicate alerts?
Duplicate notifications happen when: Kafka at-least-once delivery re-delivers a message; a worker times out and the job is retried; a race condition causes two workers to process the same event; or a bug in the producer publishes the same event twice. Prevention architecture: (1) Idempotency key generation: create a deterministic key for each notification based on immutable properties: sha256(event_id + user_id + channel + notification_type). This ensures the same event always maps to the same key, regardless of how many times the event is delivered from Kafka. (2) Redis deduplication check: before sending, SET notification_dedup:{key} 1 NX EX 86400 (SET if Not eXists, expire in 24 hours). If the SET returns nil (key already exists), skip sending and acknowledge the Kafka message. The NX flag makes this check atomic — two concurrent workers trying to process the same event will both attempt the SET; only one will succeed, and the other will see the key already exists. (3) Database-level idempotency: for financial notifications where Redis may not be enough, insert into a notifications_sent table with a UNIQUE constraint on (user_id, event_id, channel). A duplicate attempt will fail the UNIQUE constraint rather than send twice. (4) Time-based deduplication window: set the TTL on the deduplication key to match the maximum retry window (24 hours for most systems). Keys expire automatically, avoiding unbounded Redis growth. Monitor duplicate rates in your metrics — a spike indicates a bug in the producer or Kafka consumer.
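Point (3) above, database-level idempotency, can be demonstrated end to end with SQLite standing in for the production database; the table and function names are illustrative.

```python
import sqlite3

def record_send(conn, user_id, event_id, channel):
    """Return True if this is the first send; False if the UNIQUE constraint
    rejects it as a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO notifications_sent (user_id, event_id, channel) "
                "VALUES (?, ?, ?)",
                (user_id, event_id, channel),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate: the row already exists

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notifications_sent ("
    " user_id INTEGER, event_id TEXT, channel TEXT,"
    " UNIQUE (user_id, event_id, channel))"
)
```

Unlike the Redis check, the constraint holds even if the cache is flushed, which is why it suits financial or security notifications.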
How do you implement notification preference management without bottlenecking on a shared database?
User notification preferences are read on every notification send (high read volume) and written infrequently (users change preferences rarely). The naive approach — a database query per user per notification — becomes a bottleneck at scale. Scalable architecture: (1) Preference caching: store user preferences in Redis with a 1-hour TTL. On cache miss, read from the database and populate the cache. Cache invalidation: when a user updates preferences, delete their cache key immediately. With 10M daily active users sending 100M notifications/day, a 1-hour cache TTL reduces database reads by ~99% (most users do not change preferences within an hour). (2) Preference service: a dedicated microservice owns preference data. This service maintains its own in-memory cache (LRU cache per instance) and a Redis shared cache. The notification service calls the preference service via gRPC — the preference service batches lookups (fetch preferences for 1,000 users in one call) rather than per-user queries. (3) Preference event log: preferences are stored as an event log (user_id, event_type, preferences, timestamp). The preference service computes current preferences by replaying the event log, cached in memory per user. Event sourcing means updates are always appends — no locking, highly concurrent writes. (4) Default preferences: define organizational defaults for each notification type. Most users never change defaults — only store and look up overrides. This reduces the preference dataset by 90%+ and simplifies the cache design.
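The read-through cache in point (1) can be sketched with an in-process dict standing in for Redis; `load_from_db` is a hypothetical loader and the class name is illustrative.

```python
import time

class PreferenceCache:
    """Read-through cache with TTL expiry and explicit invalidation on writes."""

    def __init__(self, load_from_db, ttl_seconds=3600):
        self.load = load_from_db
        self.ttl = ttl_seconds
        self.store = {}      # user_id -> (expires_at, prefs)
        self.db_reads = 0    # instrumentation: count cache misses

    def get(self, user_id):
        entry = self.store.get(user_id)
        if entry and entry[0] > time.monotonic():
            return entry[1]                # cache hit: no database round trip
        prefs = self.load(user_id)         # cache miss: read from the database
        self.db_reads += 1
        self.store[user_id] = (time.monotonic() + self.ttl, prefs)
        return prefs

    def invalidate(self, user_id):
        # Call when the user updates preferences so the next read is fresh
        self.store.pop(user_id, None)
```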