Question 1

How do you design a token management system for push notifications across FCM and APNs?

Accepted Answer

Store device tokens in a `device_tokens` table: (token_id PK, user_id, platform ENUM('fcm','apns'), device_fingerprint, token TEXT, app_version, created_at, last_seen_at, is_valid BOOL, invalidated_at). Index on (user_id, platform) for fan-out queries. Invalidate a token when: (1) FCM reports it as unregistered (the 'registration-token-not-registered' error in the Admin SDK, UNREGISTERED in the v1 HTTP API): set is_valid=false immediately; (2) APNs returns 410 with a timestamp of when the token stopped being valid: invalidate only if the token's last registration predates that timestamp, since a later registration means the device has already re-registered; (3) a new token is registered for the same (user_id, platform, device_fingerprint): invalidate the old one. Clients re-send their token on app open and whenever the FCM SDK reports a new token (FCM can replace a token, for example after a reinstall or when app data is cleared). Run a nightly cleanup job to hard-delete tokens that have been invalid for more than 30 days, using invalidated_at to measure the age. Token volume can be large (billions for a major app); shard the table by user_id hash if a single Postgres instance can't handle the write rate.
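The three invalidation rules and the nightly purge can be sketched in a few functions. This is a minimal in-memory illustration, not a production data layer: the `DeviceToken` class, the helper names, and the `invalidated_at` field are assumptions made for the example, and `last_seen_at` is assumed to be updated each time the client re-registers its token.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class DeviceToken:
    user_id: int
    platform: str                 # "fcm" or "apns"
    device_fingerprint: str
    token: str
    last_seen_at: datetime        # assumed: bumped whenever the client re-registers
    is_valid: bool = True
    invalidated_at: Optional[datetime] = None  # assumed field, drives the 30-day purge


def invalidate(tok: DeviceToken, now: datetime) -> None:
    """Soft-delete: keep the row so the nightly job can purge it later."""
    if tok.is_valid:
        tok.is_valid = False
        tok.invalidated_at = now


def handle_apns_410(tok: DeviceToken, unregistered_at: datetime, now: datetime) -> bool:
    """Rule 2: invalidate only if the last registration predates the APNs timestamp."""
    if tok.last_seen_at < unregistered_at:
        invalidate(tok, now)
        return True
    return False


def register(store: List[DeviceToken], new: DeviceToken, now: datetime) -> None:
    """Rule 3: a new token for the same device supersedes the old one."""
    new_key = (new.user_id, new.platform, new.device_fingerprint)
    already_known = False
    for old in store:
        if (old.user_id, old.platform, old.device_fingerprint) != new_key:
            continue
        if old.token == new.token:
            # Same token seen again: refresh it instead of adding a duplicate row.
            old.last_seen_at, old.is_valid, old.invalidated_at = now, True, None
            already_known = True
        else:
            invalidate(old, now)
    if not already_known:
        store.append(new)


def purge(store: List[DeviceToken], now: datetime,
          retention: timedelta = timedelta(days=30)) -> List[DeviceToken]:
    """Nightly job: drop tokens that have been invalid for longer than `retention`."""
    return [t for t in store
            if t.is_valid or t.invalidated_at is None
            or now - t.invalidated_at <= retention]
```

In a real system each function would map to a single UPDATE or DELETE against `device_tokens`; the in-memory list only keeps the rules easy to read.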

Question 2

How do you abstract FCM and APNs behind a unified push gateway interface?

Accepted Answer

Define a provider interface: send(token, payload) → (provider_message_id, error). Implement FCM and APNs adapters behind this interface. The gateway's dispatch layer selects the adapter from the token's platform field, so callers never reference FCM or APNs directly. Maintain separate connection pools and credential managers per provider. FCM (v1 HTTP API) uses OAuth2 access tokens that expire after about 3600s and should be refreshed shortly before expiry. APNs uses either TLS client certificates or JWT provider tokens (ES256, rejected once older than 60 minutes) over persistent HTTP/2 connections; APNs requires HTTP/2, and reusing multiplexed connections is critical for throughput. Abstract the payload differences as well: FCM uses `notification` + `data` keys, while APNs uses `aps` + custom keys. Each adapter translates a unified `PushPayload` struct into its provider-specific JSON. This lets you add new providers (e.g., Huawei HMS) by adding an adapter without touching upstream callers.
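A minimal sketch of this interface in Python follows. The class names (`PushProvider`, `FcmAdapter`, `ApnsAdapter`, `PushGateway`) and the injected `transport` callable are hypothetical, chosen for illustration; the transport stands in for the real HTTP client, credential manager, and connection pool so the example runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Protocol, Tuple

# (provider_message_id, error); exactly one of the two is expected to be set.
SendResult = Tuple[Optional[str], Optional[str]]
# Hypothetical transport: takes (token, json_body) and performs the HTTP call.
Transport = Callable[[str, dict], SendResult]


@dataclass
class PushPayload:
    title: str
    body: str
    data: Dict[str, str] = field(default_factory=dict)


class PushProvider(Protocol):
    def send(self, token: str, payload: PushPayload) -> SendResult: ...


class FcmAdapter:
    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    @staticmethod
    def to_json(token: str, payload: PushPayload) -> dict:
        # FCM: display fields under `notification`, custom fields under `data`.
        return {"message": {
            "token": token,
            "notification": {"title": payload.title, "body": payload.body},
            "data": dict(payload.data),
        }}

    def send(self, token: str, payload: PushPayload) -> SendResult:
        return self._transport(token, self.to_json(token, payload))


class ApnsAdapter:
    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    @staticmethod
    def to_json(payload: PushPayload) -> dict:
        # APNs: display fields under `aps`, custom keys alongside it.
        body: dict = dict(payload.data)
        body["aps"] = {"alert": {"title": payload.title, "body": payload.body}}
        return body

    def send(self, token: str, payload: PushPayload) -> SendResult:
        return self._transport(token, self.to_json(payload))


class PushGateway:
    """Dispatches on the token's platform; callers never see a provider type."""

    def __init__(self, adapters: Dict[str, PushProvider]) -> None:
        self._adapters = adapters

    def send(self, platform: str, token: str, payload: PushPayload) -> SendResult:
        adapter = self._adapters.get(platform)
        if adapter is None:
            return (None, f"unsupported platform: {platform}")
        return adapter.send(token, payload)
```

Adding another provider under this sketch means writing one more adapter class and registering it in the `adapters` dict; `PushGateway.send` and its callers stay unchanged.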

Question 3

How do you design a fanout system to send push notifications to millions of devices efficiently?

Accepted Answer

For a broadcast to M users with an average of D devices each, a naive approach makes M×D sequential API calls, which is far too slow. Instead: (1) partition the target user set into batches (e.g., 10K users each); (2) for each batch, query device_tokens WHERE user_id IN (...) AND is_valid=true; (3) dispatch tokens to a Kafka topic partitioned by token prefix hash; (4) push workers consume from partitions and call FCM/APNs. FCM's older batch and multicast sends (up to 500 tokens per request) are deprecated; the v1 HTTP API takes one token per request and relies on high parallelism. APNs likewise requires one HTTP/2 request per token but multiplexes many concurrent streams per connection (often cited as up to 1000; honor the limit the server advertises). Size the push worker fleet for peak throughput using Little's law: at 10K sends/sec per worker with a 50ms average APNs RTT, each worker needs 10,000 × 0.05 = 500 requests in flight, so use async I/O (e.g., asyncio, Netty) rather than thread-per-request. Monitor tokens-per-second and error rate per provider as the primary throughput SLIs.
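The sizing arithmetic and the bounded-concurrency send loop can be sketched with asyncio. The function names here (`required_in_flight`, `chunked`, `send_all`) are made up for the example, and `send_one` is a placeholder for whatever coroutine actually calls FCM or APNs.

```python
import asyncio
import math
from typing import Awaitable, Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def required_in_flight(sends_per_sec: int, rtt_ms: int) -> int:
    """Little's law: concurrency = throughput x latency."""
    return math.ceil(sends_per_sec * rtt_ms / 1000)


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Split the target set into fixed-size batches (e.g., 10K user IDs each)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def send_all(
    tokens: Sequence[str],
    send_one: Callable[[str], Awaitable[R]],
    max_in_flight: int,
) -> List[R]:
    """Send to every token while keeping at most `max_in_flight` requests open."""
    sem = asyncio.Semaphore(max_in_flight)

    async def guarded(token: str) -> R:
        async with sem:  # blocks once the in-flight budget is used up
            return await send_one(token)

    # gather() preserves input order in its result list.
    return await asyncio.gather(*(guarded(t) for t in tokens))


# The figures from the answer: 10K sends/sec at 50 ms RTT.
print(required_in_flight(10_000, 50))  # 500
```

A real worker would pull tokens from its Kafka partition incrementally rather than holding a whole batch in memory, but the semaphore pattern for capping in-flight requests stays the same.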

Question 4

How do you handle APNs and FCM errors at scale and maintain token validity without manual intervention?

Accepted Answer

Classify provider errors into three categories: (1) Transient (e.g., FCM: 500/503; APNs: 429 TooManyRequests, 503 ServiceUnavailable): retry with exponential backoff and jitter, up to 3 attempts, and back-pressure the upstream queue if the retry rate exceeds a threshold. (2) Permanent-token (FCM: 'registration-token-not-registered'; APNs: 410, whose response body carries a `timestamp` field, not the apns-id header, which only identifies the notification): immediately mark the token is_valid=false in the DB and publish a `token.invalidated` event for analytics; do not retry. (3) Permanent-credential (FCM: 401, APNs: 403): alert on-call, pause the provider adapter, and do not retry individual sends. Process error responses asynchronously: push workers write raw provider responses to an `error_log` Kafka topic, and a separate error processor applies the classification logic and DB updates, which decouples error-handling latency from send throughput. Measure token churn rate (invalidations/day ÷ total valid tokens) as a health signal; a spike can indicate a client bug causing token re-registration loops.
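The classification and retry policy might look like the sketch below. It maps only the status codes discussed in this answer, plus APNs 503 as an additional retryable status and FCM's v1 `UNREGISTERED` reason as an alias; a real error processor would cover the providers' full error tables. `ErrorClass`, `classify`, `should_retry`, and `backoff_delay` are illustrative names, and the backoff uses the "full jitter" variant as one reasonable choice.

```python
import random
from enum import Enum
from typing import Callable


class ErrorClass(Enum):
    TRANSIENT = "transient"
    PERMANENT_TOKEN = "permanent_token"
    PERMANENT_CREDENTIAL = "permanent_credential"
    UNKNOWN = "unknown"


# Reasons that mean "this token is gone" on FCM (Admin SDK code and v1 API code).
FCM_UNREGISTERED = {"registration-token-not-registered", "UNREGISTERED"}


def classify(provider: str, status: int, reason: str = "") -> ErrorClass:
    """Map a raw provider response to one of the three handling categories."""
    if provider == "fcm":
        if reason in FCM_UNREGISTERED:
            return ErrorClass.PERMANENT_TOKEN
        if status in (500, 503):
            return ErrorClass.TRANSIENT
        if status == 401:
            return ErrorClass.PERMANENT_CREDENTIAL
    elif provider == "apns":
        if status == 410:
            return ErrorClass.PERMANENT_TOKEN
        if status in (429, 503):  # TooManyRequests, ServiceUnavailable
            return ErrorClass.TRANSIENT
        if status == 403:
            return ErrorClass.PERMANENT_CREDENTIAL
    return ErrorClass.UNKNOWN  # surface for manual review rather than guessing


def should_retry(error_class: ErrorClass, attempt: int, max_attempts: int = 3) -> bool:
    """Only transient errors are retried; `attempt` counts sends already made."""
    return error_class is ErrorClass.TRANSIENT and attempt < max_attempts


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt))."""
    return rng() * min(cap, base * (2 ** attempt))
```

Keeping `classify` a pure function of (provider, status, reason) makes it easy to run inside the asynchronous error processor and to unit-test against recorded provider responses.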

Push Notification Gateway Low-Level Design: Token Management, Provider Abstraction, and Fanout at Scale

Device Token Storage

Token Lifecycle Management

Provider Abstraction Layer

Fanout at Scale

Priority and Collapse Keys

Payload Size Limits and Silent Push

Bulk Push Job Architecture

Delivery Rate Monitoring and Certificate Rotation