On-Call Management System Low-Level Design: Schedule Rotation, Escalation Policy, and Incident Assignment

What Is an On-Call Management System?

An on-call management system determines which engineer is responsible for responding to incidents at any given time. It handles rotation scheduling and ad-hoc overrides, enforces escalation policies when the primary responder does not acknowledge an incident within a defined window, and integrates with alerting systems to automatically assign incoming incidents to the correct person.

Requirements

Functional Requirements

  • Define on-call schedules as rotation layers: who is on call, in what order, for what duration.
  • Support override entries: specific individuals who take over a rotation slot temporarily.
  • Determine the current on-call engineer for a given schedule and time via an API lookup.
  • Define multi-tier escalation policies: if no acknowledgment within N minutes, escalate to the next tier.
  • Automatically assign incoming incidents to the on-call engineer per the relevant escalation policy.
  • Send notifications (push, SMS, phone call) through integrated channels when an incident is assigned or escalated.

Non-Functional Requirements

  • On-call lookup must return in under 100 ms to avoid blocking incident creation in the alerting pipeline.
  • Schedule changes must take effect immediately without requiring system restarts.
  • The system must deliver escalation notifications within 30 seconds of the escalation trigger time.
  • Incident assignment and escalation state must be durable across service restarts.

Data Model

Schedule

  • schedule_id (UUID), name, team_id
  • layers (JSONB array of layer definitions, ordered by priority)

RotationLayer (within the JSONB array)

  • layer_id (UUID), name
  • rotation_type (ENUM: daily, weekly, custom)
  • handoff_time (time of day for rotation handoff, e.g. “09:00”)
  • participants (ordered array of user_ids)
  • rotation_start (timestamp: when the rotation began, used to compute current slot)

Override

  • override_id (UUID), schedule_id, layer_id
  • user_id (who is covering)
  • start_time, end_time

EscalationPolicy

  • policy_id (UUID), name, team_id
  • steps (JSONB array: each step has timeout_minutes, target_type (schedule/user/team), target_id)

Incident

  • incident_id (UUID), title, severity
  • policy_id (escalation policy to apply)
  • current_step (integer index into policy steps)
  • assigned_to (user_id of current assignee)
  • status (ENUM: triggered, acknowledged, resolved)
  • triggered_at, acknowledged_at, resolved_at
  • next_escalation_at (timestamp when escalation fires if no acknowledgment)
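The incident-side portion of this model can be sketched as Python dataclasses. Field names mirror the tables above; the enum values and defaults are assumptions consistent with the listed ENUMs:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4

class IncidentStatus(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"

@dataclass
class EscalationStep:
    timeout_minutes: int
    target_type: str          # "schedule" | "user" | "team"
    target_id: UUID

@dataclass
class Incident:
    title: str
    severity: str
    policy_id: UUID
    incident_id: UUID = field(default_factory=uuid4)
    current_step: int = 0                         # index into policy steps
    assigned_to: Optional[UUID] = None
    status: IncidentStatus = IncidentStatus.TRIGGERED
    triggered_at: Optional[datetime] = None
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    next_escalation_at: Optional[datetime] = None  # escalation fires if unacked
```

A freshly created incident starts at step 0 in the TRIGGERED state, matching the state machine described below.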

Core Algorithms

On-Call Slot Computation

For a given schedule and query time T, the system determines the on-call engineer by:

  • For each layer (in priority order), compute the current slot: elapsed_intervals = floor((T − rotation_start) / interval_duration), then participant_index = elapsed_intervals mod len(participants).
  • Check if any Override record covers this layer and time T. If so, the override user replaces the rotation participant.
  • The highest-priority layer with a covered time window determines the final on-call user for that schedule.

Multiple layers allow complex schedules: for example, a primary layer covers weekdays and a secondary layer covers weekends, with the system selecting the appropriate layer based on T.
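A minimal sketch of this lookup for a single layer. Override records are simplified to tuples here; a full implementation would also honor per-layer coverage windows:

```python
from datetime import datetime, timedelta

def oncall_for_layer(participants, rotation_start, interval, overrides, t):
    """Resolve the on-call user for one rotation layer at time t.

    overrides: iterable of (user_id, start, end) tuples; a covering
    override replaces the computed rotation participant.
    """
    # Overrides win over the rotation computation.
    for user_id, start, end in overrides:
        if start <= t < end:
            return user_id
    if not participants:
        return None  # misconfigured layer: nobody to page
    # floor((T - rotation_start) / interval_duration) mod len(participants)
    elapsed_intervals = (t - rotation_start) // interval
    return participants[elapsed_intervals % len(participants)]
```

For a weekly rotation of ["alice", "bob", "carol"] starting Monday 09:00, a query nine days in resolves to "bob" unless an override covers that instant.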

Escalation State Machine

When an incident is created, the system assigns it to the on-call engineer for step 0 of the escalation policy and sets next_escalation_at = triggered_at + step[0].timeout_minutes. A background escalation runner queries for incidents where next_escalation_at has passed and status is still TRIGGERED. For each such incident, it:

  • Increments current_step.
  • If current_step exceeds the last policy step, flags the incident as unacknowledged-max-escalation and notifies the team manager; no further timer is set.
  • Otherwise, resolves the target for the new step (a schedule lookup, a specific user, or all members of a team).
  • Updates assigned_to and sends notifications to the new assignee.
  • Sets next_escalation_at = now + step[current_step].timeout_minutes.
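The runner's per-incident transition can be sketched as a pure function. The incident is a plain dict here, and resolve_target stands in for the schedule/user/team lookup; the step keys are illustrative:

```python
from datetime import datetime, timedelta

def escalate(incident, policy_steps, resolve_target, now):
    """Advance one overdue TRIGGERED incident to its next escalation step.

    resolve_target(step): maps a step definition to the user_id to page.
    Returns a new incident dict; notification dispatch is omitted.
    """
    updated = dict(incident)
    updated["current_step"] += 1
    if updated["current_step"] >= len(policy_steps):
        # Past the last step: flag for manual attention, stop the timer.
        updated["flag"] = "unacknowledged-max-escalation"
        updated["next_escalation_at"] = None
        return updated
    step = policy_steps[updated["current_step"]]
    updated["assigned_to"] = resolve_target(step)
    updated["next_escalation_at"] = now + timedelta(minutes=step["timeout_minutes"])
    return updated
```

Keeping the transition side-effect-free makes it easy to unit-test the state machine separately from notification delivery.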

Scalability

On-call schedule lookups are computationally cheap (a few arithmetic operations) and can be served from an in-memory cache of schedule definitions populated at startup and invalidated via a Redis Pub/Sub channel on any schedule change. This allows sub-millisecond lookups without database reads on the hot path.
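A sketch of that cache layer. The Redis Pub/Sub subscriber that calls invalidate on schedule-change messages is omitted; loader is a stand-in for the database fetch:

```python
class ScheduleCache:
    """In-process cache of schedule definitions for the hot lookup path."""

    def __init__(self, loader):
        self._loader = loader   # fetches a schedule definition from the database
        self._cache = {}

    def get(self, schedule_id):
        # Hot path: no database read once the schedule is cached.
        if schedule_id not in self._cache:
            self._cache[schedule_id] = self._loader(schedule_id)
        return self._cache[schedule_id]

    def invalidate(self, schedule_id):
        # Wired to the Redis Pub/Sub listener: any schedule change
        # publishes its schedule_id, evicting the stale entry here.
        self._cache.pop(schedule_id, None)
```

Because invalidation evicts rather than refreshes, a changed schedule is re-read lazily on the next lookup, so changes take effect immediately without a restart.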

Escalation timers are managed by storing next_escalation_at in the Incident table. A polling loop, running every 10 seconds, performs a simple indexed query: SELECT * FROM incidents WHERE next_escalation_at <= NOW() AND status = 'triggered'. This avoids distributed timer complexity; the composite index on (next_escalation_at, status) keeps the query fast even with millions of historical incidents.
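The overdue-incident query can be demonstrated end to end with SQLite (timestamps stored as Unix epochs; Postgres would use timestamptz and NOW()):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE incidents (
    incident_id        TEXT PRIMARY KEY,
    status             TEXT NOT NULL,
    next_escalation_at REAL)""")
# Composite index serving the runner's polling query.
conn.execute("CREATE INDEX idx_escalation ON incidents (next_escalation_at, status)")

now = time.time()
conn.executemany("INSERT INTO incidents VALUES (?, ?, ?)", [
    ("inc-1", "triggered",    now - 60),   # overdue: must escalate
    ("inc-2", "acknowledged", now - 60),   # past timer but acked: skip
    ("inc-3", "triggered",    now + 600),  # timer not yet expired
])

overdue = conn.execute(
    "SELECT incident_id FROM incidents "
    "WHERE next_escalation_at <= ? AND status = 'triggered'",
    (now,),
).fetchall()
# overdue contains only inc-1
```

Only the triggered incident whose timer has passed is returned, which is exactly the set the runner must advance.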

API Design

  • POST /schedules — create a schedule with rotation layers.
  • PATCH /schedules/{schedule_id} — update rotation participants, handoff times, or add layers.
  • GET /schedules/{schedule_id}/oncall?time={ISO8601} — return the on-call user for the given time (defaults to now).
  • POST /schedules/{schedule_id}/overrides — create an override for a specific time window.
  • POST /policies — create an escalation policy with step definitions.
  • POST /incidents — create an incident and trigger the associated escalation policy.
  • POST /incidents/{incident_id}/acknowledge — acknowledge an incident, halting further escalation.
  • POST /incidents/{incident_id}/resolve — resolve the incident.
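The acknowledge endpoint's core state transition might look like the following sketch (HTTP framing omitted; the 409 status for an invalid transition is an assumption, not specified above):

```python
from datetime import datetime

def acknowledge(incident, now):
    """Core of POST /incidents/{incident_id}/acknowledge.

    Halts further escalation by clearing the pending timer.
    Returns (updated_incident, http_status).
    """
    if incident["status"] != "triggered":
        return incident, 409  # already acknowledged or resolved
    updated = dict(incident,
                   status="acknowledged",
                   acknowledged_at=now,
                   next_escalation_at=None)
    return updated, 200
```

Clearing next_escalation_at is what stops the polling runner from ever picking the incident up again.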

Failure Modes

  • Escalation runner crash: On restart, the runner immediately queries for overdue escalations and catches up. A brief delay in escalation is acceptable; missing escalations entirely is not. The indexed query guarantees recovery within one polling interval (10 seconds).
  • Notification channel unavailable: Notifications are retried with backoff. The incident assignment is recorded regardless of notification delivery success, so the assignee can see their assignment on next login even if they did not receive the push notification.
  • Schedule misconfiguration (no participants): The on-call lookup returns a null result. The incident creation API rejects incident creation with a policy that resolves to a null assignee and returns a descriptive error, prompting the caller to fix the schedule or specify a fallback user.

Observability

Track the following metrics:

  • Mean time to acknowledgment (MTTA) per team and severity.
  • Escalation rate: the percentage of incidents requiring escalation beyond step 0.
  • Notification delivery latency.
  • Override coverage percentage: the fraction of rotation time covered by overrides, which can indicate schedule gaps.
  • On-call lookup latency.

Alert when MTTA exceeds the team-configured SLA or when any escalation policy has a step targeting a schedule with zero active participants.


