What Is an Online Status Service?
An online status service exposes a queryable view of whether users are currently active, idle, or offline. Unlike a raw presence service that manages connection-level state, an online status service adds a semantic layer: it aggregates multi-device presence, applies user-defined visibility rules (e.g., appear offline), and serves status queries from other services such as friend lists, profile pages, and notification routing. It is a read-heavy, eventually consistent system where slight staleness is acceptable.
Data Model
Redis and SQL serve different access patterns: Redis holds the hot read cache, and SQL provides durable storage and a fallback for reads.
Redis (hot read cache)
status:{user_id}            HASH    fields: status, last_active_ms, device_count
status:visible:{user_id}    STRING  value = 'true' | 'false', no TTL (user preference)
status:bulk:{shard}         ZSET    score = last_active_ms, member = user_id
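The key layout above can be captured in a few helper functions. This is a minimal sketch: the function names and the shard count are hypothetical, and the rule for assigning a user to a status:bulk shard (user ID modulo the shard count) is an assumption, since the text does not say how shards are chosen.

```python
NUM_SHARDS = 64  # assumed shard count; not specified in the text


def status_key(user_id: int) -> str:
    """Key of the per-user status hash."""
    return f"status:{user_id}"


def visible_key(user_id: int) -> str:
    """Key of the per-user visibility preference string."""
    return f"status:visible:{user_id}"


def bulk_key(user_id: int, num_shards: int = NUM_SHARDS) -> str:
    """Key of the sorted set that tracks last activity for this user's shard.

    Assumes shard = user_id mod num_shards.
    """
    return f"status:bulk:{user_id % num_shards}"
```

Centralizing key construction in one place keeps the aggregator, the idle scanner, and the read path from drifting apart on naming.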
SQL Schema
CREATE TABLE user_status (
    user_id        BIGINT PRIMARY KEY,
    status         ENUM('online','idle','offline') NOT NULL DEFAULT 'offline',
    last_active_at TIMESTAMP NOT NULL,
    device_count   SMALLINT NOT NULL DEFAULT 0,
    visible        BOOLEAN NOT NULL DEFAULT TRUE,
    updated_at     TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    INDEX idx_last_active (last_active_at)
);
Core Algorithm: Aggregation and Idle Detection
A user may be logged in on multiple devices simultaneously. The status service must synthesize a single canonical status.
- Presence events arrive from the underlying presence service via the Kafka topic presence.updates.
- Status aggregator reads the current device_count from Redis. On a session_connected event it increments the counter; on session_disconnected it decrements. Status transitions to offline only when device_count reaches zero.
- Idle detection: A background job scans the status:bulk:{shard} sorted set for users whose last_active_ms score is older than 5 minutes. It transitions those users to idle by updating their Redis hash and publishing a status_changed event.
- Visibility filter: Before any status is returned to callers, the service checks status:visible:{user_id}. If the user has chosen to appear offline, the API always returns offline regardless of true status.
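The aggregation, idle-detection, and visibility steps can be sketched in plain Python. This is an illustrative model, not the service's implementation: in-memory dictionaries stand in for the Redis hashes and the status:bulk sorted set, a list stands in for published status_changed events, and the class and method names are hypothetical.

```python
from dataclasses import dataclass

IDLE_AFTER_MS = 5 * 60 * 1000  # 5-minute idle threshold from the text


@dataclass
class UserStatus:
    status: str = "offline"
    last_active_ms: int = 0
    device_count: int = 0


class StatusAggregator:
    def __init__(self):
        self.statuses = {}           # user_id -> UserStatus (stand-in for status:{user_id})
        self.appear_offline = set()  # users whose visibility flag is 'false'
        self.published = []          # (user_id, new_status) stand-in for status_changed

    def _set_status(self, user_id, new_status):
        entry = self.statuses[user_id]
        if entry.status != new_status:  # publish only on an actual transition
            entry.status = new_status
            self.published.append((user_id, new_status))

    def handle_event(self, user_id, event_type, now_ms):
        entry = self.statuses.setdefault(user_id, UserStatus())
        if event_type == "session_connected":
            entry.device_count += 1
            entry.last_active_ms = now_ms
            self._set_status(user_id, "online")
        elif event_type == "session_disconnected":
            entry.device_count = max(0, entry.device_count - 1)
            if entry.device_count == 0:  # offline only when no device remains
                self._set_status(user_id, "offline")

    def scan_idle(self, now_ms):
        """Mark online users idle if their last activity is older than the threshold."""
        cutoff = now_ms - IDLE_AFTER_MS
        for user_id, entry in self.statuses.items():
            if entry.status == "online" and entry.last_active_ms < cutoff:
                self._set_status(user_id, "idle")

    def visible_status(self, user_id):
        """Status as seen by callers, after the visibility filter."""
        if user_id in self.appear_offline:
            return "offline"
        entry = self.statuses.get(user_id)
        return entry.status if entry else "offline"
```

In this model a user with two connected devices stays online after one disconnects, and adding the user to appear_offline masks the true status without changing the stored one.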
Read Path: Bulk Status Queries
Friend lists, contact pages, and notification services query the status of many users in a single call. The read path is optimized for bulk access:
- Caller sends a list of up to 1,000 user IDs.
- Status service issues a Redis HMGET pipeline across the relevant status:{user_id} hashes in a single round trip.
- Cache misses (cold users) are fetched from SQL in a single WHERE user_id IN (...) query and backfilled into Redis with a 60-second TTL.
- Results are filtered through visibility rules before being returned.
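The read path above can be sketched as a single function. This is a simplified model under stated assumptions: `cache` is a dictionary standing in for Redis (user ID mapped to a status and an expiry time), `db` is a dictionary standing in for the SQL table, and `hidden` is the set of users who chose to appear offline. The function name and the choice to report unknown users as offline are illustrative, not from the text.

```python
MAX_BATCH = 1000     # per-call limit from the text
BACKFILL_TTL_S = 60  # TTL applied to entries backfilled from SQL


def bulk_status(user_ids, cache, db, hidden, now_s):
    """Return {user_id: status} for a batch of users.

    cache:  dict user_id -> (status, expires_at_s), stand-in for Redis
    db:     dict user_id -> status, stand-in for the SQL user_status table
    hidden: set of user IDs who appear offline
    """
    if len(user_ids) > MAX_BATCH:
        raise ValueError(f"at most {MAX_BATCH} user IDs per call")

    found = {}
    misses = []
    # Stand-in for the pipelined HMGET: one pass over all requested keys.
    for uid in user_ids:
        hit = cache.get(uid)
        if hit is not None and hit[1] > now_s:
            found[uid] = hit[0]
        else:
            misses.append(uid)

    # Stand-in for one SELECT ... WHERE user_id IN (...) over all misses.
    for uid in misses:
        status = db.get(uid, "offline")  # assumption: unknown users are offline
        cache[uid] = (status, now_s + BACKFILL_TTL_S)
        found[uid] = status

    # The visibility filter is applied last, so the cache keeps the true status.
    return {uid: ("offline" if uid in hidden else s) for uid, s in found.items()}
```

Applying the visibility filter only on the way out means a user who toggles "appear offline" takes effect immediately, without waiting for cached status entries to expire.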
Failure Handling
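- Aggregator crash: The Kafka consumer group rebalances and the aggregator replays from the last committed offset. Setting a status is idempotent (writing the same status twice has no net effect), so replaying status transitions is safe. The device_count increments and decrements are not idempotent on their own, so replayed session events must be deduplicated (for example, by session ID) to keep the counter accurate.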
- Redis unavailability: The service falls back to SQL for reads and queues writes to a local buffer (or a secondary Redis replica). A brief period of stale status data is acceptable.
- Idle scanner failure: Users remain in online state longer than intended. The impact is cosmetic; correctness is restored when the scanner recovers or the presence service eventually sends an offline event.
- Split-brain multi-device: If two devices send conflicting events concurrently, a compare-and-swap on the device_count field with Redis WATCH/MULTI prevents counter corruption.
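Replaying session events after a crash can double-count a plain counter. One way to keep device counting safe under replay is to track the set of active sessions per user and derive the count from it. The sketch below assumes each event carries a session identifier, which the text does not specify; the class name and the `session_id` field are hypothetical.

```python
class DeviceTracker:
    """Tracks active sessions per user so replayed events do not skew the count."""

    def __init__(self):
        self.sessions = {}  # user_id -> set of active session IDs

    def apply(self, user_id, session_id, event_type):
        """Apply one event and return the resulting device count."""
        active = self.sessions.setdefault(user_id, set())
        if event_type == "session_connected":
            active.add(session_id)      # adding the same session twice is a no-op
        elif event_type == "session_disconnected":
            active.discard(session_id)  # removing a missing session is a no-op
        return len(active)
```

Because set insertion and removal are idempotent, applying the same ordered batch of events a second time leaves the count unchanged. In Redis the same idea maps naturally onto a set per user, with the count read from the set's cardinality.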
Scalability Considerations
The online status service is read-heavy: status is written once per session event but read thousands of times per second for friend lists, badge counts, and routing decisions.
- Read replicas: Route all bulk read queries to Redis read replicas. Replicate SQL status table to read replicas for fallback queries.
- Sharded aggregators: Partition the presence.updates Kafka topic by user_id mod N. Each aggregator shard owns a non-overlapping set of user IDs, eliminating cross-shard coordination.
- CDN-cached public profiles: For public-facing status (e.g., creator profiles), cache the status response at the CDN edge with a 30-second TTL. This absorbs spikes without hitting the origin.
- Rate-limit polling clients: Mobile clients that poll status via REST rather than maintaining a WebSocket are throttled to one request per 30 seconds per user to prevent thundering-herd patterns.
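Two of the points above reduce to a few lines of logic: mapping a user to an aggregator partition, and throttling polling clients to one request per 30 seconds. The sketch below is illustrative; the function and class names are hypothetical, and the throttle keeps its state in process memory, whereas a real deployment would likely need shared state across API servers.

```python
def partition_for(user_id: int, num_partitions: int) -> int:
    """Partition that owns this user, following the user_id mod N rule."""
    return user_id % num_partitions


class PollThrottle:
    """Allows at most one status poll per user per interval."""

    def __init__(self, min_interval_s: float = 30.0):
        self.min_interval_s = min_interval_s
        self.last_allowed = {}  # user_id -> time of the last allowed request

    def allow(self, user_id: int, now_s: float) -> bool:
        last = self.last_allowed.get(user_id)
        if last is not None and now_s - last < self.min_interval_s:
            return False  # too soon; the caller would typically respond with HTTP 429
        self.last_allowed[user_id] = now_s
        return True
```

Rejected requests do not reset the timer, so a client that polls too eagerly is still allowed through once 30 seconds have passed since its last accepted request.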
Summary
An online status service sits above raw presence data and adds multi-device aggregation, idle detection, visibility controls, and a bulk-read-optimized query layer. Redis is the primary store for hot status data; SQL provides durability and fallback. Kafka decouples the write path (status aggregation) from the read path (query serving), allowing each to scale independently. The result is an eventually consistent, horizontally scalable service capable of serving millions of status queries per second with sub-10-millisecond p99 latency.