What Is an On-Call Management Service?
An on-call management service determines which engineer is responsible for responding to an alert at any given moment. It manages rotation schedules, escalation policies, and acknowledgment workflows. PagerDuty and Opsgenie are the canonical commercial implementations. Building one requires careful handling of time zones, schedule overrides, escalation chains, and deduplicated (effectively exactly-once) notification delivery — all under the constraint that the system must remain available even during the incidents it is meant to route.
Data Model
CREATE TABLE teams (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE schedules (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  team_id UUID REFERENCES teams(id),
  name VARCHAR(100) NOT NULL,
  timezone VARCHAR(50) NOT NULL -- e.g. 'America/New_York'
);

CREATE TABLE rotation_layers (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  schedule_id UUID REFERENCES schedules(id),
  rotation_type VARCHAR(20) NOT NULL, -- 'weekly', 'daily', 'custom'
  handoff_time TIME NOT NULL, -- local time in schedule timezone
  participants JSONB NOT NULL, -- ordered list of user IDs
  layer_order INT NOT NULL
);

CREATE TABLE overrides (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  schedule_id UUID REFERENCES schedules(id),
  user_id UUID NOT NULL,
  starts_at TIMESTAMPTZ NOT NULL,
  ends_at TIMESTAMPTZ NOT NULL
);

CREATE TABLE escalation_policies (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  team_id UUID REFERENCES teams(id),
  name VARCHAR(100) NOT NULL
);

CREATE TABLE escalation_steps (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  policy_id UUID REFERENCES escalation_policies(id),
  step_order INT NOT NULL,
  target_type VARCHAR(20) NOT NULL, -- 'schedule', 'user', 'team'
  target_id UUID NOT NULL,
  escalate_after INT NOT NULL -- seconds before escalating
);

CREATE TABLE incidents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  alert_id UUID NOT NULL,
  policy_id UUID REFERENCES escalation_policies(id),
  current_step INT NOT NULL DEFAULT 0,
  status VARCHAR(20) NOT NULL, -- 'triggered', 'acknowledged', 'resolved'
  triggered_at TIMESTAMPTZ NOT NULL,
  acked_at TIMESTAMPTZ,
  resolved_at TIMESTAMPTZ
);
Core Algorithm and Workflow
When an alert fires, the system creates an incident and begins executing the escalation policy:
- Resolve on-call user: Given a schedule and the current UTC timestamp, compute which rotation layer is active, identify the participant at the correct index (accounting for the number of full rotation periods elapsed since a reference start date), then check if any override covers the current time window. Overrides always win.
- Notify: Send the incident to the resolved user via their configured contact methods (SMS, push, phone call, Slack) in priority order.
- Start escalation timer: Enqueue a delayed job (e.g., via a durable task queue like Celery or a DB-backed scheduler) set to fire after escalate_after seconds.
- Acknowledgment: If the on-call engineer acknowledges before the timer fires, cancel the timer and mark the incident as acknowledged. Acknowledged incidents pause further escalation.
- Escalate: If the timer fires and the incident is still in the triggered state, increment current_step, resolve the next step target, and repeat. If all steps are exhausted, notify the team as a whole and mark the incident as unacknowledged-at-final-step.
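The resolve step above can be sketched in Python. This is a minimal illustration, not a full implementation: the dataclass fields (participants, rotation_days, reference_start) are simplified stand-ins for the rotation_layers and overrides rows, and zoneinfo supplies the IANA timezone data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

@dataclass
class RotationLayer:
    participants: list[str]      # ordered user IDs; index 0 is first on call
    rotation_days: int           # 7 for weekly, 1 for daily
    reference_start: datetime    # aware datetime of the first handoff

@dataclass
class Override:
    user_id: str
    starts_at: datetime
    ends_at: datetime

def resolve_on_call(layer: RotationLayer, overrides: list[Override],
                    now: datetime) -> str:
    """Return the user ID on call at `now` (an aware UTC timestamp)."""
    # Overrides always win over the rotation.
    for o in overrides:
        if o.starts_at <= now < o.ends_at:
            return o.user_id
    # Count full rotation periods elapsed since the reference handoff,
    # then index into the ordered participant list modulo its length.
    periods = (now - layer.reference_start) // timedelta(days=layer.rotation_days)
    return layer.participants[periods % len(layer.participants)]

# Example: weekly rotation handing off Mondays 09:00 America/New_York.
tz = ZoneInfo("America/New_York")
layer = RotationLayer(["alice", "bob", "carol"], 7,
                      datetime(2024, 1, 1, 9, 0, tzinfo=tz))
print(resolve_on_call(layer, [], datetime(2024, 1, 9, 12, 0, tzinfo=timezone.utc)))
# -> "bob" (one full week has elapsed since the reference handoff)
```

Because reference_start is timezone-aware, the arithmetic stays correct across DST transitions: the library, not the modulo math, absorbs the offset change.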
Failure Handling and Reliability
- Scheduler durability: Escalation timers must survive service restarts. Store pending timers in the database (a scheduled_jobs table with a run_at timestamp) and poll with a SELECT FOR UPDATE SKIP LOCKED pattern to ensure exactly-once execution across multiple scheduler instances.
- Notification idempotency: Each notification attempt is logged with an idempotency key derived from (incident_id, step, contact_method). Retries skip already-delivered contacts.
- High availability: Run multiple instances of the on-call resolver and scheduler. Use leader election (via a DB advisory lock or Redis SETNX with TTL) only for the scheduler to prevent duplicate escalations. The notify path is stateless and safe to run in parallel.
- Time zone correctness: Always store timestamps in UTC. Convert to the schedule’s local timezone only at resolution time, using a timezone library backed by the IANA database (e.g., Python’s built-in zoneinfo, or pytz on older versions). DST transitions are handled by the library, not by your code.
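The idempotency-key scheme from the bullets above can be sketched in plain Python. The class names and the in-memory set are illustrative stand-ins; a real system would back the delivery log with a database table carrying a UNIQUE constraint on the key.

```python
import hashlib

def idempotency_key(incident_id: str, step: int, contact_method: str) -> str:
    """Deterministic key: the same (incident, step, contact) always hashes identically."""
    raw = f"{incident_id}:{step}:{contact_method}"
    return hashlib.sha256(raw.encode()).hexdigest()

class NotificationLog:
    """In-memory stand-in for a DB table with a UNIQUE index on the key."""
    def __init__(self) -> None:
        self._delivered: set[str] = set()

    def try_deliver(self, incident_id: str, step: int,
                    contact_method: str, send) -> bool:
        key = idempotency_key(incident_id, step, contact_method)
        if key in self._delivered:
            return False  # retry hit an already-delivered contact: skip it
        send(contact_method)          # perform the actual delivery
        self._delivered.add(key)      # record success for future retries
        return True
```

Note the ordering trade-off: recording the key after sending means a crash between the two steps yields a duplicate notification on retry (at-least-once), which is the safer failure mode for paging than recording first and risking a dropped page.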
Scalability Considerations
- Read-heavy schedule resolution: The on-call lookup is computed on every incoming alert. Cache the result per (schedule_id, 5-minute bucket) in Redis, invalidated whenever a rotation or override changes.
- Large organizations: Teams with hundreds of schedules and thousands of overrides benefit from an index on overrides(schedule_id, starts_at, ends_at) and a partial index filtering to future/current overrides only.
- Multi-region: Replicate the schedule and escalation data globally (read replicas or a multi-region DB like CockroachDB). Incident state needs strong consistency; schedule reads can tolerate a few seconds of lag.
- Audit trail: Every state transition of an incident (step change, ack, resolve) should be appended to an immutable incident_events log table. This supports post-incident review, and append-only writes are cheap.
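The 5-minute cache bucket from the first bullet can be derived purely from the timestamp, so every resolver instance computes the same Redis key without coordination. The key format here is an assumption for illustration:

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 300  # 5-minute buckets

def schedule_cache_key(schedule_id: str, now: datetime) -> str:
    # Floor the UNIX timestamp to the bucket boundary so all instances
    # agree on the key; entries naturally expire as time advances.
    bucket = int(now.timestamp()) // BUCKET_SECONDS
    return f"oncall:{schedule_id}:{bucket}"
```

Explicit invalidation on rotation or override changes can then delete the current bucket's key (or bump a per-schedule version embedded in the key) rather than waiting for the bucket to roll over.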
Summary
On-call management is fundamentally a scheduling and state machine problem. The hardest parts are computing the correct on-call person across rotation layers, overrides, and time zones, and ensuring escalation timers fire exactly once even under partial failures. Use a DB-backed job queue for durability, cache schedule resolution aggressively, and log every state transition for auditability. The system must itself be highly available — it is the last line of defense when everything else is broken.