Notification systems are invisible infrastructure that every major app depends on. A good design handles multiple channels (push, email, SMS), user preferences, rate limits, deduplication, and failure retries, all while sustaining bursts of thousands of notifications per second.
Functional Requirements
- Send notifications via multiple channels: iOS push (APNs), Android push (FCM), email, SMS
- Support notification types: transactional (OTP, order confirmation), marketing (promotions), alerts (security, system)
- User notification preferences: opt-out per channel and per category
- Deduplication: never send the same notification twice
- Delivery tracking: sent, delivered, opened
Non-Functional Requirements
| Metric | Target |
|---|---|
| Throughput | 10 million notifications/day (~115/s average, 10K/s peak marketing blast) |
| Latency (transactional) | < 1s end-to-end |
| Latency (marketing) | Best effort, minutes acceptable |
| Availability | 99.9% |
| At-least-once delivery | Required (with dedup at destination) |
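The throughput row can be sanity-checked with a few lines of arithmetic. This is a small sketch; the function names are illustrative, and the blast size of 10 million recipients is an assumption chosen to match the daily volume.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(per_day: int) -> float:
    """Average notifications per second for a given daily volume."""
    return per_day / SECONDS_PER_DAY


def blast_duration_minutes(recipients: int, peak_rate: int) -> float:
    """Minutes needed to drain a blast at a sustained peak rate."""
    return recipients / peak_rate / 60


avg = average_rate(10_000_000)
blast = blast_duration_minutes(10_000_000, 10_000)
print(f"average rate: {avg:.1f}/s")
print(f"10M-recipient blast at 10K/s: {blast:.1f} min")
```

The gap between the average and the peak is the reason the rest of the design separates marketing traffic from transactional traffic.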
High-Level Architecture
            [Notification Service API]
                        |
              [Validator + Enricher]  ← fetches user prefs, device tokens
                        |
             [Priority Queue (Kafka)]
              /         |          \
  [Push Worker]   [Email Worker]   [SMS Worker]
        |               |               |
    APNs/FCM       SendGrid/SES     Twilio/SNS
                        |
         [Delivery Tracker] → [Analytics DB]
Notification Service API
POST /v1/notify
{
  "recipient_id": "user_123",
  "type": "order_shipped",
  "priority": "high",                      // high | normal | low
  "channels": ["push", "email"],           // or ["all"]
  "template_id": "order_shipped_v2",
  "data": {
    "order_id": "ORD-456",
    "tracking_url": "https://..."
  },
  "dedup_key": "order_shipped:ORD-456"     // idempotency key
}
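A request body like the one above could be validated before it reaches the enricher. The sketch below is one possible shape, not a prescribed API: the `NotifyRequest` class, the `parse_request` helper, the default priority of `normal`, and the rule that `"all"` expands to push, email, and SMS are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

VALID_PRIORITIES = {"high", "normal", "low"}
ALL_CHANNELS = ("push", "email", "sms")
REQUIRED_FIELDS = ("recipient_id", "type", "template_id", "dedup_key")


@dataclass
class NotifyRequest:
    recipient_id: str
    type: str
    template_id: str
    dedup_key: str
    priority: str = "normal"
    channels: list = field(default_factory=lambda: list(ALL_CHANNELS))
    data: dict = field(default_factory=dict)


def parse_request(body: dict) -> NotifyRequest:
    """Validate a POST /v1/notify body and return a typed request."""
    missing = [name for name in REQUIRED_FIELDS if not body.get(name)]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    priority = body.get("priority", "normal")
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"invalid priority: {priority!r}")

    channels = list(body.get("channels", ["all"]))
    if "all" in channels:
        channels = list(ALL_CHANNELS)  # assumed meaning of ["all"]
    unknown = set(channels) - set(ALL_CHANNELS)
    if unknown:
        raise ValueError(f"unknown channels: {sorted(unknown)}")

    return NotifyRequest(
        recipient_id=body["recipient_id"],
        type=body["type"],
        template_id=body["template_id"],
        dedup_key=body["dedup_key"],
        priority=priority,
        channels=channels,
        data=dict(body.get("data", {})),
    )
```

Rejecting malformed requests at the API boundary keeps bad messages out of Kafka, where they would otherwise fail repeatedly and end up in the dead letter queue.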
Validator and Enricher
Before queuing, the enricher:
- Looks up the user record and checks opt-out preferences per channel and category
- Fetches device tokens (for push) from the device registry
- Checks the dedup store (Redis SET NX with TTL on dedup_key) and skips the notification if it was already sent
- Renders the template using the provided data
- Assigns priority and routes to the appropriate Kafka topic
def enrich_and_queue(notification):
    user = user_service.get(notification.recipient_id)

    # Check opt-out per channel; keep only the channels the user allows
    channels = [
        channel for channel in notification.channels
        if user.preferences.allows(notification.type, channel)
    ]
    if not channels:
        return {"status": "suppressed", "reason": "user_opt_out"}

    # Deduplication: SET NX succeeds only for the first request with this key
    dedup_key = f"notif:dedup:{notification.dedup_key}"
    if not redis.set(dedup_key, "1", nx=True, ex=86400):
        return {"status": "suppressed", "reason": "duplicate"}

    # Fetch device tokens so the push worker does not have to look them up
    tokens = device_registry.get_tokens(user.id, platform=notification.platform)

    # Route to Kafka by priority, attaching the enriched fields
    topic = f"notifications.{notification.priority}"
    message = {**notification.to_dict(), "channels": channels, "tokens": tokens}
    kafka.produce(topic, message)
    return {"status": "queued"}
Priority Queues
Use separate Kafka topics per priority tier:
- notifications.high: OTPs, security alerts. Dedicated consumers, low lag SLO (<1s)
- notifications.normal: order updates, friend requests
- notifications.low: marketing blasts, newsletters. Can lag significantly
Marketing campaigns produce to notifications.low in large batches. The low-priority consumers are rate-limited to avoid overwhelming email/SMS providers.
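One common way to rate-limit the low-priority consumers is a token bucket. The sketch below is a minimal single-process version under stated assumptions: the `TokenBucket` class and its parameters are illustrative, and a real deployment with many consumer instances would need a shared limiter or a per-instance share of the provider quota.

```python
import time


class TokenBucket:
    """Allows up to `rate` sends per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self._clock = clock
        self._last = clock()

    def try_acquire(self, n: float = 1) -> bool:
        """Take n tokens if available; return False without blocking if not."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A low-priority consumer would call `try_acquire()` before each provider call and pause polling when it returns False, which lets lag build up in notifications.low instead of at the email or SMS provider.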
Channel Workers
Push Worker (APNs / FCM)
def send_push(notification):
    for token in notification.tokens:
        try:
            if notification.platform == "ios":
                apns_client.send(token, notification.payload)
            else:
                fcm_client.send(token, notification.payload)
        except InvalidTokenError:
            # Token expired: remove it from the device registry, do not retry
            device_registry.delete(token)
        except RateLimitError:
            # Back off and retry later. Stop the loop so the throttled
            # provider is not hit again and the notification is requeued once.
            requeue_with_delay(notification, delay=30)
            break
Email Worker
Use SendGrid or AWS SES. Batch recipients for marketing sends, up to the provider's per-call limit (for example, up to 1,000 per SendGrid API call). Handle bounces and unsubscribes via webhooks, and update the user opt-out table immediately to comply with CAN-SPAM and GDPR.
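Batching a marketing audience comes down to slicing the recipient list into fixed-size chunks. This is a minimal sketch; the `batches` helper is illustrative, and the default size of 1000 is taken from the text and should be set to whatever the chosen provider actually allows.

```python
from typing import Iterator, Sequence


def batches(recipients: Sequence[str], size: int = 1000) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` recipients."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(recipients), size):
        yield recipients[start:start + size]


audience = [f"user{i}@example.com" for i in range(2500)]
sizes = [len(batch) for batch in batches(audience)]
print(sizes)
```

Each yielded batch would become one provider API call, so a 2,500-recipient campaign costs three calls instead of 2,500.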
SMS Worker
Twilio or AWS SNS. SMS is expensive (~$0.01/message) — gate aggressively on user opt-in. Use E.164 phone number format. Handle delivery receipts asynchronously via webhooks.
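E.164 numbers are a plus sign followed by up to 15 digits, with a country code that does not start with zero. The sketch below shows one way to check that shape before calling the SMS provider; the `normalize_e164` helper is hypothetical, and it only strips common punctuation. It does not infer a missing country code or verify that the number is actually assigned.

```python
import re

# "+" followed by 2 to 15 digits, first digit non-zero
E164_PATTERN = re.compile(r"\+[1-9]\d{1,14}")


def is_e164(number: str) -> bool:
    """Return True if the string is already in E.164 shape."""
    return E164_PATTERN.fullmatch(number) is not None


def normalize_e164(raw: str) -> str:
    """Strip spaces, dashes, dots, and parentheses, then validate."""
    cleaned = re.sub(r"[\s\-().]", "", raw)
    if not is_e164(cleaned):
        raise ValueError(f"not a valid E.164 number: {raw!r}")
    return cleaned
```

Rejecting malformed numbers before the provider call avoids paying for, or retrying, messages that can never be delivered.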
Retry and Dead Letter Queue
Retry policy (escalating backoff):
Attempt 1: immediate
Attempt 2: 30s
Attempt 3: 5 min
Attempt 4: 30 min
Attempt 5: DLQ → alert on-call
DLQ processing:
- Manual inspection dashboard
- Automated replay after provider outage clears
- Metrics: DLQ depth, DLQ growth rate
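The retry schedule above can be expressed as a small lookup. This is a sketch that mirrors the listed delays; the `next_action` function and its return shape are assumptions for illustration.

```python
# Delay in seconds before attempts 1 through 4, as listed in the retry policy
RETRY_DELAYS_SECONDS = [0, 30, 5 * 60, 30 * 60]


def next_action(attempt: int):
    """Decide what to do for a 1-based attempt number.

    Returns ("send", delay_seconds) while attempts remain,
    or ("dlq", None) once the schedule is exhausted.
    """
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt <= len(RETRY_DELAYS_SECONDS):
        return ("send", RETRY_DELAYS_SECONDS[attempt - 1])
    return ("dlq", None)
```

A worker would store the attempt count on the message, call `next_action(attempt + 1)` after a failure, and either requeue with the returned delay or publish to the dead letter topic.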
Deduplication
Three layers prevent duplicate sends:
- Enricher: SET NX on dedup_key before queuing (24h window)
- Worker: idempotent delivery using provider-side collapse keys (APNs collapse-id, FCM collapse_key), so a retried message replaces the earlier one on the device instead of appearing twice
- Delivery tracker: record sent events; skip if already recorded for this dedup_key
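The first layer relies on set-if-absent semantics with an expiry. The sketch below imitates that behavior in memory so the logic can be followed without a Redis server; the `DedupStore` class is a stand-in written for illustration, not a Redis client, and unlike Redis it is neither shared across processes nor atomic under concurrency.

```python
import time


class DedupStore:
    """In-memory imitation of SET key value NX EX ttl."""

    def __init__(self, clock=time.monotonic):
        self._expires_at = {}
        self._clock = clock

    def set_if_absent(self, key: str, ttl_seconds: float) -> bool:
        """Return True if the key was claimed, False if it is still held."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate inside the dedup window
        self._expires_at[key] = now + ttl_seconds
        return True
```

The first caller for a given dedup_key gets True and proceeds to queue the notification; every later caller inside the 24-hour window gets False and suppresses it.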
Delivery Tracking
CREATE TABLE notification_events (
    id BIGINT PRIMARY KEY,
    dedup_key VARCHAR(255),
    channel ENUM('push','email','sms'),
    status ENUM('queued','sent','delivered','opened','failed'),
    provider_id VARCHAR(255), -- APNs/FCM/SendGrid message ID
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);
-- Provider webhooks update status asynchronously
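Because provider webhooks arrive asynchronously, they can also arrive out of order, for example a "delivered" event after "opened". The sketch below shows one possible policy for applying them; the ranking and the treatment of "failed" are assumptions made for illustration, not rules stated by any provider.

```python
# Forward progress of a notification; higher rank means further along
STATUS_RANK = {"queued": 0, "sent": 1, "delivered": 2, "opened": 3}


def next_status(current: str, incoming: str) -> str:
    """Return the status to store after a webhook reports `incoming`."""
    if incoming == "failed":
        # A failure report never overrides evidence of delivery
        return "failed" if current in ("queued", "sent") else current
    if current == "failed":
        # Assumed policy: a later retry may move the row forward again
        return incoming
    # Otherwise only move forward, ignoring late or repeated events
    if STATUS_RANK[incoming] > STATUS_RANK[current]:
        return incoming
    return current
```

With this policy a late "delivered" webhook leaves an "opened" row untouched, which keeps the analytics counts from moving backwards.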
Rate Limiting Per User
Users should not receive spam even from legitimate traffic:
- Max 3 push notifications per hour per user
- Max 1 marketing email per day per user
- Max 5 SMS per day per user
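Implement with Redis fixed-window counters: INCR notif:{user_id}:{channel}:{hour} with TTL = 1 hour for the hourly limit, and an equivalent {day} bucket with TTL = 24 hours for the daily limits. Reject if over limit. A fixed window can admit up to twice the limit across a window boundary; switch to a sliding-window log or counter if that matters.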
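The counter-per-time-bucket scheme can be followed in memory before wiring it to Redis. The sketch below is illustrative only: the `FixedWindowLimiter` class and the `LIMITS` table are assumptions, the dictionary stands in for INCR, and old buckets are never evicted here, which is the job the TTL does in Redis.

```python
import time
from collections import defaultdict

# channel -> (max sends, window length in seconds), from the limits above
LIMITS = {
    "push": (3, 3600),
    "marketing_email": (1, 86400),
    "sms": (5, 86400),
}


class FixedWindowLimiter:
    def __init__(self, clock=time.time):
        self._counts = defaultdict(int)
        self._clock = clock

    def allow(self, user_id: str, channel: str) -> bool:
        """Count one send against the current window; False if over limit."""
        limit, window = LIMITS[channel]
        bucket = int(self._clock() // window)  # the {hour} or {day} key part
        key = (user_id, channel, bucket)
        if self._counts[key] >= limit:
            return False
        self._counts[key] += 1
        return True
```

Whether transactional messages such as OTPs should bypass these limits is a product decision; the sketch applies the limit to whatever is passed in.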
Scaling to Billions
- Kafka partitioned by user_id shard — keeps ordering per user
- Push workers scale horizontally; APNs/FCM allow thousands of concurrent connections
- Email: SendGrid handles burst; use dedicated IPs for high-volume senders to protect reputation
- User preferences cache in Redis — preferences change rarely
- Marketing blasts: fan-out from batch job → Kafka at controlled rate, not all at once
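Per-user ordering depends on every message for a user landing on the same partition. The sketch below illustrates the idea with a stable hash; the `partition_for` helper is hypothetical, and in practice setting user_id as the Kafka message key lets the client's own partitioner do this (the Java client uses a murmur2 hash rather than MD5).

```python
import hashlib


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user to a partition with a hash that is stable across processes."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Python's built-in hash() is randomized per process, so use hashlib
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Changing the partition count remaps users to different partitions, so ordering is only guaranteed between such changes.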
Interview Discussion Points
- How do you handle APNs token invalidation? Listen for BadDeviceToken responses and delete from registry
- How do you prevent spam during a marketing blast to 50M users? Rate limit Kafka produce rate; stagger by timezone
- How do you guarantee OTP delivery? Retry aggressively across SMS providers (primary + fallback), escalate to voice call if SMS fails
- What if the enricher is down? Accept the request, queue a raw event, and enrich lazily in the worker
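The OTP answer above amounts to trying an ordered list of delivery routes until one succeeds. The sketch below shows that control flow under stated assumptions: `DeliveryError`, `send_with_fallback`, and the provider callables are hypothetical names, and real provider clients would be wrapped so that their failures raise `DeliveryError`.

```python
from typing import Callable, List, Sequence, Tuple


class DeliveryError(Exception):
    """Raised by a provider wrapper when a send attempt fails."""


Provider = Tuple[str, Callable[[str, str], None]]


def send_with_fallback(phone: str, message: str, providers: Sequence[Provider]) -> str:
    """Try each provider in order; return the name of the one that succeeded."""
    errors: List[str] = []
    for name, send in providers:
        try:
            send(phone, message)
            return name
        except DeliveryError as exc:
            errors.append(f"{name}: {exc}")  # remember why, then move on
    raise DeliveryError("all providers failed: " + "; ".join(errors))
```

The list would typically be ordered as primary SMS, fallback SMS, then voice call, matching the escalation described in the answer.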
Frequently Asked Questions
How do you handle notification deduplication at scale?
Three layers: (1) At intake, use Redis SET NX on a dedup_key (e.g., "order_shipped:ORD-456") with a 24-hour TTL; if the key already exists, suppress the notification before it enters the queue. (2) In the channel worker, use provider-side collapse keys (APNs collapse-id, FCM collapse_key) so a retried message replaces the earlier one on the device instead of appearing twice. (3) In the delivery tracker, record sent events by dedup_key and skip if already recorded. This three-layer approach handles crashes, retries, and duplicate API calls without relying on any single point.
How do you prioritize notifications so OTPs are never delayed by marketing blasts?
Use separate Kafka topics (or queues) per priority tier: notifications.high for transactional messages (OTPs, security alerts) with dedicated consumers and a strict lag SLO under 1 second; notifications.normal for operational messages; notifications.low for marketing campaigns. High-priority consumers have more instances and are never starved. Marketing blasts produce to the low-priority topic at a rate-limited pace — they never block the high-priority path. Monitor consumer lag per topic and alert on any high-priority lag above threshold.
How do you handle APNs/FCM token invalidation in a notification system?
When Apple APNs or Google FCM returns a BadDeviceToken or NotRegistered error, the device token is no longer valid (user uninstalled the app or re-registered). Your push worker must catch these specific error codes and immediately delete the invalid token from the device registry. Do NOT retry delivery — the token is permanently invalid. For FCM, also handle registration_id updates where FCM returns a new canonical token to replace the old one. Run a periodic cleanup job to remove tokens that have not received a successful delivery in 90+ days.