System Design: Notification System — Push, Email, SMS, In-App, Fanout, Delivery Tracking, User Preferences

A notification system delivers timely information to users through multiple channels — push notifications, email, SMS, and in-app messages. Companies like Facebook, Uber, and Amazon send billions of notifications daily. Designing a notification system that is reliable, scalable, and respectful of user preferences is a classic system design interview question. This guide covers the end-to-end architecture from event generation to delivery tracking.

High-Level Architecture

Components:
(1) Event producers: services that generate notification triggers. The order service emits OrderShipped, the payment service emits PaymentFailed, and the social service emits NewFollower. Events are published to Kafka topics.
(2) Notification service: consumes events, determines which users to notify, checks user preferences, renders the notification content, and routes it to the appropriate delivery channel.
(3) User preference service: stores each user's notification preferences, including which event types they want to receive, through which channels (push, email, SMS), quiet hours (for example, nothing between 10 PM and 8 AM), and frequency caps (for example, at most 5 marketing emails per week).
(4) Template service: stores notification templates with placeholders, such as “Hi {user_name}, your order {order_id} has shipped!” Templates are versioned and support localization (English, Spanish, Japanese).
(5) Delivery services: channel-specific services that handle the actual sending. These are the push notification service (APNs for iOS, FCM for Android), the email service (SES, SendGrid), the SMS service (Twilio), and the in-app notification service (WebSocket or polling).
(6) Delivery tracking: records the status of each notification: queued, sent, delivered, opened, clicked, bounced, failed.
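As a rough illustration of how the notification service, preference service, and template service fit together, the sketch below filters channels by user preference and renders a localized template. The `TEMPLATES` dictionary, the `UserPreferences` class, and the function names are hypothetical stand-ins for the real services, and the Spanish text is an invented example.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory template store: (template_id, locale) -> text.
# A real template service would also track versions.
TEMPLATES = {
    ("order_shipped", "en"): "Hi {user_name}, your order {order_id} has shipped!",
    ("order_shipped", "es"): "Hola {user_name}, tu pedido {order_id} ha sido enviado.",
}


@dataclass
class UserPreferences:
    """Assumed shape of one user's record in the preference service."""
    locale: str = "en"
    # event type -> set of channels the user opted into for that event
    channels_by_event: dict = field(default_factory=dict)


def render(template_id, locale, params, default_locale="en"):
    """Render a template, falling back to the default locale if needed."""
    template = TEMPLATES.get((template_id, locale))
    if template is None:
        template = TEMPLATES[(template_id, default_locale)]
    return template.format(**params)


def plan_deliveries(event_type, prefs, params):
    """Return (channel, text) pairs for the channels the user allows."""
    channels = prefs.channels_by_event.get(event_type, set())
    if not channels:
        return []
    text = render(event_type, prefs.locale, params)
    return [(channel, text) for channel in sorted(channels)]
```

A user with no entry for an event type receives nothing, which treats notifications as opt-in. That default is an assumption of this sketch; a real system might default some event types to on.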

Notification Fanout

Fanout is the process of expanding a single event into individual notifications for each affected user. Types:
(1) Single-user notification: OrderShipped affects one user. The notification service looks up the user, checks preferences, and sends one notification.
(2) Group notification: a message in a group chat notifies all group members. The notification service fetches the group membership list and creates one notification per member (excluding the sender). Moderate fanout: 10-100 users.
(3) Broadcast notification: a popular user posts a new photo, notifying all 10 million followers. This is massive fanout and cannot be done synchronously, so the notification service must queue the work.
Implementation of broadcast fanout: read the follower list from the database (or a pre-computed follower cache), batch followers into chunks of 1000, and enqueue each chunk as a separate notification job. Workers process chunks in parallel. For celebrity users with millions of followers, the fanout can take minutes, which is acceptable because users do not expect instant delivery of a social notification. Prioritize delivery to active users (those who opened the app in the last 24 hours) ahead of inactive users, so the notification reaches the most engaged users quickly. Use separate queues for high-priority notifications (payment alerts, security) and low-priority ones (social, marketing) so that a celebrity fanout backlog does not delay critical notifications.
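The chunking and active-first ordering described above can be sketched as follows. This is a minimal in-memory version under stated assumptions: `plan_fanout`, the job dictionary fields, and the `active_ids` set are hypothetical, and a real system would stream followers from storage and publish each job to a queue instead of returning a list.

```python
def chunked(ids, size=1000):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in ids:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def plan_fanout(event_id, follower_ids, active_ids, chunk_size=1000):
    """Build fanout jobs, with chunks of active followers ordered first.

    `active_ids` is assumed to be a set of users who opened the app in
    the last 24 hours.
    """
    active = [f for f in follower_ids if f in active_ids]
    inactive = [f for f in follower_ids if f not in active_ids]
    jobs = []
    for tier, group in (("active", active), ("inactive", inactive)):
        for index, chunk in enumerate(chunked(group, chunk_size)):
            jobs.append({
                "event_id": event_id,
                "tier": tier,
                "chunk_index": index,
                "user_ids": chunk,
            })
    return jobs
```

With 2,500 followers of whom half are active, this produces four jobs: two active chunks (1,000 and 250 users) followed by two inactive chunks of the same sizes.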

Delivery Channels

Push notifications (mobile): sent through Apple Push Notification service (APNs) for iOS and Firebase Cloud Messaging (FCM) for Android. The app obtains a device token from the platform and sends it to the backend. The push service sends the notification payload to APNs or FCM along with the device token, and APNs or FCM delivers it to the device. Challenges: device tokens change (for example, when the user reinstalls the app), tokens become invalid (the user uninstalled the app), and delivery is not guaranteed (the device may be offline).

Email: sent through Amazon SES, SendGrid, or Mailgun. Challenges: deliverability (avoiding spam filters), bounce handling (removing invalid addresses), and unsubscribe management. CAN-SPAM requires a working opt-out mechanism in every commercial email, and major mailbox providers also expect bulk senders to support one-click unsubscribe.

SMS: sent through Twilio or AWS SNS. It is the most expensive channel (roughly $0.01-0.05 per message), but it works without an internet connection. Reserve it for critical notifications: security codes (2FA), delivery updates, payment confirmations. Do not use it for marketing without explicit consent.

In-app notifications: stored in a notifications table and displayed when the user opens the app. The app polls the notification API on load, or receives real-time updates over WebSocket or SSE. In-app notifications typically have the highest engagement rate because the user is already in the app.

Channel selection: for each notification type, define a priority-ordered list of channels. For example, a security alert goes to SMS, push, and email; an order update goes to push and in-app; a marketing message goes to email only, unless the user opted in to push marketing.
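The device-token challenges above suggest keeping a small registry that replaces tokens when they rotate and drops them when the provider reports them as permanently invalid. The sketch below is an in-memory illustration only: `DeviceTokenRegistry` and its methods are hypothetical, and the `permanently_invalid` flag stands in for whatever error response the push provider actually returns.

```python
class DeviceTokenRegistry:
    """Hypothetical in-memory map of user_id -> set of device tokens."""

    def __init__(self):
        self._tokens = {}

    def register(self, user_id, token, old_token=None):
        """Store a token; if the app reports a rotated token, drop the old one."""
        tokens = self._tokens.setdefault(user_id, set())
        if old_token is not None:
            tokens.discard(old_token)
        tokens.add(token)

    def tokens_for(self, user_id):
        """Return the user's current tokens in a stable order."""
        return sorted(self._tokens.get(user_id, set()))

    def handle_send_result(self, user_id, token, permanently_invalid):
        """Remove a token the provider rejected as permanently invalid."""
        if permanently_invalid:
            self._tokens.get(user_id, set()).discard(token)
```

Pruning invalid tokens as soon as the provider rejects them keeps later sends from wasting requests on uninstalled apps.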

Rate Limiting and User Experience

Notification fatigue is the biggest risk. Too many notifications cause users to disable notifications entirely or uninstall the app. Rate limiting strategies:
(1) Per-user frequency caps: for example, at most 5 push notifications per day and 2 marketing emails per week. Track notification counts per user in Redis (INCR with a TTL).
(2) Quiet hours: do not send non-critical notifications between 10 PM and 8 AM in the user's local time zone. Queue them and send at the next available window.
(3) Aggregation: instead of sending a notification for each new follower, send a single aggregated notification such as “3 people followed you today”. Implement this by buffering events for a time window (for example, 1 hour) and sending one notification per window.
(4) Priority levels: critical notifications (payment failed, security alert) bypass rate limits and quiet hours. High-priority notifications (order updates) respect quiet hours but are exempt from frequency caps. Low-priority notifications (marketing, social) respect all limits.
(5) Smart delivery: an ML model predicts the best time to deliver a notification based on the user's historical engagement patterns, and the system sends when the user is most likely to open it.
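The interaction between priority levels, quiet hours, and frequency caps can be sketched as a single decision function. Everything here is illustrative: `decide`, `DailyCounter`, and the "send" / "defer" / "suppress" outcomes are assumed names, the in-memory counter stands in for Redis INCR with a TTL, and suppressing over-cap notifications (instead of queuing them) is an assumption the article does not specify.

```python
from datetime import datetime, time, timedelta, timezone

QUIET_START = time(22, 0)  # 10 PM local time
QUIET_END = time(8, 0)     # 8 AM local time


def fixed_offset(hours):
    """Demo helper: a fixed-offset time zone (no daylight saving rules)."""
    return timezone(timedelta(hours=hours))


def in_quiet_hours(now_utc, user_tz):
    """True if the user's local time falls in the 10 PM to 8 AM window."""
    local = now_utc.astimezone(user_tz).time()
    return local >= QUIET_START or local < QUIET_END


class DailyCounter:
    """In-memory stand-in for a Redis counter keyed by (user, local day)."""

    def __init__(self):
        self._counts = {}

    def incr(self, user_id, day):
        key = (user_id, day)
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]


def decide(priority, now_utc, user_tz, user_id, counter, daily_cap=5):
    """Return "send", "defer" (queue until quiet hours end), or "suppress"."""
    if priority == "critical":
        return "send"  # bypasses quiet hours and frequency caps
    if in_quiet_hours(now_utc, user_tz):
        return "defer"
    if priority == "low":
        day = now_utc.astimezone(user_tz).date()
        if counter.incr(user_id, day) > daily_cap:
            return "suppress"
    return "send"  # high priority is exempt from frequency caps
```

In a real deployment the `user_tz` argument would more likely be a `zoneinfo.ZoneInfo` built from the user's stored time zone name, so that daylight saving changes are handled; the fixed-offset helper only keeps the sketch dependency-free.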

Reliability and Exactly-Once Delivery

Notification delivery must be reliable: a missed payment failure notification can cost the user money.

At-least-once delivery: the notification service publishes to Kafka with acks=all. The delivery worker consumes the message, sends the notification, and then commits the offset. If the worker crashes after sending but before committing, the notification is resent on restart, so the user may receive the same notification twice.

Deduplication: assign each notification a unique notification_id. Before sending, check whether this ID has already been sent (a lookup in Redis or the delivery tracking database) and skip it if so. Otherwise send, then record the ID with a TTL (for example, 24 hours). Together, at-least-once delivery and deduplication approximate exactly-once delivery from the user's point of view.

Retry strategy: if the delivery channel returns a transient error (APNs timeout, SES throttling), retry with exponential backoff (1s, 2s, 4s). After 3 retries, move the notification to the dead letter queue for investigation. If the error is permanent (invalid device token, bounced email address), do not retry. Mark the delivery as failed and update the user's contact record (mark the token as invalid or the email as bounced).

Delivery tracking: record the status of every notification: queued, sent, delivered (when the channel reports a delivery receipt), opened (email open tracking pixel, push notification open callback), clicked (link tracking). This data feeds analytics dashboards and ML models for notification optimization.
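The deduplication and retry rules above can be sketched in one worker function. This is a simplified illustration under stated assumptions: `deliver`, `TransientError`, and `PermanentError` are hypothetical names, a Python set stands in for the Redis lookup, a list stands in for the dead letter queue, and `send` is any callable that raises one of the two error types on failure.

```python
import time


class TransientError(Exception):
    """Assumed marker for retryable failures (timeouts, throttling)."""


class PermanentError(Exception):
    """Assumed marker for non-retryable failures (invalid token, bounce)."""


def deliver(notification_id, send, sent_ids, dead_letters,
            max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Send once per notification_id, retrying transient errors with backoff.

    Returns "duplicate", "sent", "failed", or "dead_lettered".
    """
    if notification_id in sent_ids:
        return "duplicate"  # already delivered, skip
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        try:
            send()
            sent_ids.add(notification_id)  # record only after a successful send
            return "sent"
        except PermanentError:
            return "failed"  # caller should update the user's contact record
        except TransientError:
            if attempt == max_retries:
                break
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s by default
    dead_letters.append(notification_id)
    return "dead_lettered"
```

Passing `sleep` as a parameter makes the backoff schedule easy to check without waiting. Note that the check and the record are not atomic here, so a crash between the send and the record can still produce a rare duplicate.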
