Low Level Design: Online Status Service

What Is an Online Status Service?

An online status service exposes a queryable view of whether users are currently active, idle, or offline. Unlike a raw presence service that manages connection-level state, an online status service adds a semantic layer: it aggregates multi-device presence, applies user-defined visibility rules (e.g., appear offline), and serves status queries from other services such as friend lists, profile pages, and notification routing. It is a read-heavy, eventually consistent system where slight staleness is acceptable.

Data Model

Three storage tiers serve different access patterns.

Redis (hot read cache)

status:{user_id}            HASH    fields: status, last_active_ms, device_count
status:visible:{user_id}    STRING  value = 'true' | 'false', no TTL (user preference)
status:bulk:{shard}         ZSET    score = last_active_ms, member = user_id

SQL Schema

CREATE TABLE user_status (
    user_id          BIGINT PRIMARY KEY,
    status           ENUM('online','idle','offline') NOT NULL DEFAULT 'offline',
    last_active_at   TIMESTAMP NOT NULL,
    device_count     SMALLINT NOT NULL DEFAULT 0,
    visible          BOOLEAN NOT NULL DEFAULT TRUE,
    updated_at       TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    INDEX idx_last_active (last_active_at)
);
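The schema above can be exercised end to end. The sketch below uses SQLite rather than MySQL (so ENUM becomes a CHECK constraint) and a hypothetical record_session_event helper, not part of the original design, to show the upsert a session event would perform:

```python
import sqlite3
import time

# Simplified SQLite version of the user_status table; the article's schema
# targets MySQL, so ENUM is replaced by a CHECK constraint here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_status (
        user_id        INTEGER PRIMARY KEY,
        status         TEXT NOT NULL DEFAULT 'offline'
                       CHECK (status IN ('online','idle','offline')),
        last_active_at INTEGER NOT NULL,
        device_count   INTEGER NOT NULL DEFAULT 0,
        visible        INTEGER NOT NULL DEFAULT 1
    )
""")

def record_session_event(user_id: int, delta: int) -> None:
    """Upsert on connect (+1) / disconnect (-1); offline at zero devices."""
    now_ms = int(time.time() * 1000)
    conn.execute("""
        INSERT INTO user_status (user_id, status, last_active_at, device_count)
        VALUES (?, CASE WHEN ? > 0 THEN 'online' ELSE 'offline' END, ?, MAX(?, 0))
        ON CONFLICT(user_id) DO UPDATE SET
            device_count   = MAX(device_count + ?, 0),
            status         = CASE WHEN MAX(device_count + ?, 0) = 0
                                  THEN 'offline' ELSE 'online' END,
            last_active_at = excluded.last_active_at
    """, (user_id, delta, now_ms, delta, delta, delta))

record_session_event(42, +1)   # phone connects
record_session_event(42, +1)   # laptop connects
record_session_event(42, -1)   # phone disconnects -> still online
row = conn.execute(
    "SELECT status, device_count FROM user_status WHERE user_id = 42"
).fetchone()
```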

Core Algorithm: Aggregation and Idle Detection

A user may be logged in on multiple devices simultaneously. The status service must synthesize a single canonical status.

  1. Presence events arrive from the underlying presence service via Kafka topic presence.updates.
  2. The status aggregator reads the current device_count from Redis. A session_connected event increments the counter; a session_disconnected event decrements it. Status transitions to offline only when device_count reaches zero.
  3. Idle detection: A background job scans the status:bulk:{shard} sorted set for users whose last_active_ms score is older than 5 minutes. It transitions those users to idle by updating their Redis hash and publishing a status_changed event.
  4. Visibility filter: Before any status is returned to callers, the service checks status:visible:{user_id}. If the user has chosen to appear offline, the API always returns offline regardless of the true status.
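The aggregation and idle-detection steps above can be sketched in plain Python. The Redis hash and sorted set are simulated with in-process dicts, and the function names (apply_presence_event, scan_for_idle) are illustrative, not part of the original design:

```python
IDLE_AFTER_MS = 5 * 60 * 1000  # 5 minutes, per the idle-detection rule

# In-process stand-ins for the Redis structures described above:
# the status:{user_id} hash and the status:bulk:{shard} sorted set.
status_hash: dict[int, dict] = {}  # user_id -> {status, last_active_ms, device_count}
last_active: dict[int, int] = {}   # user_id -> last_active_ms (ZSET stand-in)

def apply_presence_event(user_id: int, event: str, now_ms: int) -> str:
    """Fold one presence event into the canonical status (step 2)."""
    entry = status_hash.setdefault(
        user_id, {"status": "offline", "last_active_ms": 0, "device_count": 0})
    if event == "session_connected":
        entry["device_count"] += 1
    elif event == "session_disconnected":
        entry["device_count"] = max(entry["device_count"] - 1, 0)
    entry["last_active_ms"] = now_ms
    last_active[user_id] = now_ms
    # Offline only when the last device disconnects.
    entry["status"] = "online" if entry["device_count"] > 0 else "offline"
    return entry["status"]

def scan_for_idle(now_ms: int) -> list[int]:
    """Background scan (step 3): online users inactive > 5 min become idle."""
    transitioned = []
    for user_id, ts in last_active.items():
        entry = status_hash[user_id]
        if entry["status"] == "online" and now_ms - ts > IDLE_AFTER_MS:
            entry["status"] = "idle"
            transitioned.append(user_id)  # a real scanner would also publish status_changed
    return transitioned
```

Two session_connected events followed by one session_disconnected leave device_count at 1, so the user stays online; only the last disconnect produces offline.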

Read Path: Bulk Status Queries

Friend lists, contact pages, and notification services query the status of many users in a single call. The read path is optimized for bulk access:

  1. Caller sends a list of up to 1,000 user IDs.
  2. The status service issues one HMGET per status:{user_id} hash through a Redis pipeline, so all lookups complete in a single round trip.
  3. Cache misses (cold users) are fetched from SQL in a single WHERE user_id IN (...) query and backfilled into Redis with a 60-second TTL.
  4. Results are filtered through visibility rules before being returned.
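The bulk read path above can be sketched as follows. Redis and SQL are stood in by plain dicts, and get_bulk_status is an assumed name, not from the original text:

```python
# Stand-ins for the real stores: a Redis cache, the SQL table, and the
# set of users who chose "appear offline".
cache: dict[int, str] = {}   # status:{user_id} stand-in
db: dict[int, str] = {}      # user_status SQL table stand-in
hidden: set[int] = set()     # visible = FALSE users

MAX_BATCH = 1000

def get_bulk_status(user_ids: list[int]) -> dict[int, str]:
    if len(user_ids) > MAX_BATCH:
        raise ValueError("at most 1,000 user IDs per call")  # step 1
    # Step 2: one pipelined cache lookup (here: plain dict gets).
    result = {uid: cache.get(uid) for uid in user_ids}
    # Step 3: fetch misses from SQL in one IN (...) query and backfill.
    misses = [uid for uid, s in result.items() if s is None]
    for uid in misses:
        s = db.get(uid, "offline")
        cache[uid] = s            # a real system would SETEX with a 60 s TTL
        result[uid] = s
    # Step 4: visibility filter before returning.
    return {uid: ("offline" if uid in hidden else s)
            for uid, s in result.items()}
```

Note that the visibility filter runs last, so a hidden user reads as offline even when the cache holds their true status.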

Failure Handling

  • Aggregator crash: The Kafka consumer group rebalances and the aggregator replays from the last committed offset. Setting the same status twice is idempotent, so replaying status writes is safe; device_count increments and decrements are not, so replayed connect/disconnect events should be deduplicated (e.g., by session ID) to avoid skewing the counter.
  • Redis unavailability: The service falls back to SQL for reads and queues writes to a local buffer (or a secondary Redis replica). A brief period of stale status data is acceptable.
  • Idle scanner failure: Users remain in online state longer than intended. The impact is cosmetic; correctness is restored when the scanner recovers or the presence service eventually sends an offline event.
  • Split-brain multi-device: If two devices send conflicting events concurrently, an optimistic check-and-set on the device_count field using Redis WATCH/MULTI prevents counter corruption.
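The WATCH/MULTI pattern for device_count amounts to an optimistic retry loop: read the value, attempt a write that fails if anything changed underneath, and retry on failure. The VersionedCounter class below is a stand-in for a Redis key, not real redis-py code:

```python
class VersionedCounter:
    """Simulates a Redis key guarded by WATCH: writes fail if the key
    changed since it was read, forcing the caller to retry."""
    def __init__(self) -> None:
        self.value = 0
        self.version = 0

    def read(self) -> tuple[int, int]:
        return self.value, self.version

    def compare_and_set(self, expected_version: int, new_value: int) -> bool:
        if self.version != expected_version:
            return False  # concurrent write won; like EXEC returning None
        self.value, self.version = new_value, self.version + 1
        return True

def adjust_device_count(counter: VersionedCounter, delta: int) -> int:
    """Apply a connect (+1) / disconnect (-1), never going below zero."""
    while True:  # retry loop, like re-issuing WATCH/MULTI/EXEC
        value, version = counter.read()
        if counter.compare_and_set(version, max(value + delta, 0)):
            return counter.value
```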

Scalability Considerations

The online status service is read-heavy: status is written once per session event but read thousands of times per second for friend lists, badge counts, and routing decisions.

  • Read replicas: Route all bulk read queries to Redis read replicas. Replicate SQL status table to read replicas for fallback queries.
  • Sharded aggregators: Partition the presence.updates Kafka topic by user_id mod N. Each aggregator shard owns a non-overlapping set of user IDs, eliminating cross-shard coordination.
  • CDN-cached public profiles: For public-facing status (e.g., creator profiles), cache the status response at the CDN edge with a 30-second TTL. This absorbs spikes without hitting the origin.
  • Rate-limit polling clients: Mobile clients that poll status via REST rather than maintaining a WebSocket are throttled to one request per 30 seconds per user to prevent thundering-herd patterns.

Summary

An online status service sits above raw presence data and adds multi-device aggregation, idle detection, visibility controls, and a bulk-read-optimized query layer. Redis is the primary store for hot status data; SQL provides durability and fallback. Kafka decouples the write path (status aggregation) from the read path (query serving), allowing each to scale independently. The result is an eventually consistent, horizontally scalable service capable of serving millions of status queries per second with sub-10-millisecond p99 latency.

Frequently Asked Questions

How do you design an online status system for a social platform?

An online status system typically relies on clients sending periodic heartbeats (e.g., every 30 seconds) to a presence server over a persistent connection such as WebSocket. The server stores status in a fast in-memory store like Redis with a TTL slightly longer than the heartbeat interval. If a heartbeat is missed, the TTL expires and the user is marked offline. Status changes are propagated to followers or friends via a pub/sub layer.

How do you scale online status reads for users with millions of followers?

For celebrity or high-fan-out users, direct push of status changes to all followers is impractical. Instead, systems use a pull-on-demand model where a follower's client fetches status only when the user opens a conversation or profile. Caching status at the CDN or application layer with a short TTL (e.g., 60 seconds) further reduces read load. Fan-out on write is reserved for users with manageable follower counts.

How do privacy controls affect online status system design?

Privacy controls add an authorization layer between status reads and the data store. When a user sets their status to hidden or restricts visibility to friends only, the system must check the requesting user's relationship before returning real status. This is typically enforced at the API gateway or presence service layer using a cached friends/permissions graph. Some platforms return a fake 'offline' status rather than a permission-denied response to avoid leaking that the user is hiding their status.

What is the difference between online status and last-seen timestamps?

Online status is a binary or categorical real-time signal (online, offline, idle) updated continuously via heartbeats. Last-seen is a historical timestamp recorded when the user was last active or disconnected, and it persists even when the user is offline. Systems often store both: online status in Redis with a TTL for fast expiry, and last-seen as a durable write to a relational or NoSQL database updated on each disconnection event.

