What Is an On-Call Management Service?
An on-call management service determines which engineer is responsible for responding to an alert at any given moment. It manages rotation schedules, escalation policies, and acknowledgment workflows. PagerDuty and Opsgenie are the canonical commercial implementations. Building one requires careful handling of time zones, schedule overrides, escalation chains, and deduplicated (effectively exactly-once) notification delivery — all under the constraint that the system must remain available even during the incidents it is meant to route.
Data Model
CREATE TABLE teams (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE schedules (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  team_id UUID REFERENCES teams(id),
  name VARCHAR(100) NOT NULL,
  timezone VARCHAR(50) NOT NULL -- e.g. 'America/New_York'
);

CREATE TABLE rotation_layers (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  schedule_id UUID REFERENCES schedules(id),
  rotation_type VARCHAR(20) NOT NULL, -- 'weekly', 'daily', 'custom'
  handoff_time TIME NOT NULL, -- local time in schedule timezone
  participants JSONB NOT NULL, -- ordered list of user IDs
  layer_order INT NOT NULL
);

CREATE TABLE overrides (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  schedule_id UUID REFERENCES schedules(id),
  user_id UUID NOT NULL,
  starts_at TIMESTAMPTZ NOT NULL,
  ends_at TIMESTAMPTZ NOT NULL
);

CREATE TABLE escalation_policies (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  team_id UUID REFERENCES teams(id),
  name VARCHAR(100) NOT NULL
);

CREATE TABLE escalation_steps (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  policy_id UUID REFERENCES escalation_policies(id),
  step_order INT NOT NULL,
  target_type VARCHAR(20) NOT NULL, -- 'schedule', 'user', 'team'
  target_id UUID NOT NULL,
  escalate_after INT NOT NULL -- seconds before escalating
);

CREATE TABLE incidents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  alert_id UUID NOT NULL,
  policy_id UUID REFERENCES escalation_policies(id),
  current_step INT NOT NULL DEFAULT 0,
  status VARCHAR(20) NOT NULL, -- 'triggered', 'acknowledged', 'resolved'
  triggered_at TIMESTAMPTZ NOT NULL,
  acked_at TIMESTAMPTZ,
  resolved_at TIMESTAMPTZ
);
Core Algorithm and Workflow
When an alert fires, the system creates an incident and begins executing the escalation policy:
- Resolve on-call user: Given a schedule and the current UTC timestamp, compute which rotation layer is active, identify the participant at the correct index (accounting for the number of full rotation periods elapsed since a reference start date), then check if any override covers the current time window. Overrides always win.
- Notify: Send the incident to the resolved user via their configured contact methods (SMS, push, phone call, Slack) in priority order.
- Start escalation timer: Enqueue a delayed job (e.g., via a durable task queue like Celery or a DB-backed scheduler) set to fire after escalate_after seconds.
- Acknowledgment: If the on-call engineer acknowledges before the timer fires, cancel the timer and mark the incident as acknowledged. Acknowledged incidents pause further escalation.
- Escalate: If the timer fires and the incident is still in the triggered state, increment current_step, resolve the next step target, and repeat. If all steps are exhausted, notify the team as a whole and mark the incident as unacknowledged-at-final-step.
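The resolve step above can be sketched in Python. This is a minimal illustration, not a full implementation: the dataclass fields (participants, rotation_days, reference_start) are simplified stand-ins for the rotation_layers and overrides rows, and zoneinfo supplies the IANA timezone data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

@dataclass
class RotationLayer:
    participants: list[str]      # ordered user IDs; index 0 is first on call
    rotation_days: int           # 7 for weekly, 1 for daily
    reference_start: datetime    # aware datetime of the first handoff

@dataclass
class Override:
    user_id: str
    starts_at: datetime
    ends_at: datetime

def resolve_on_call(layer: RotationLayer, overrides: list[Override],
                    now: datetime) -> str:
    """Return the user ID on call at `now` (an aware UTC timestamp)."""
    # Overrides always win over the rotation.
    for o in overrides:
        if o.starts_at <= now < o.ends_at:
            return o.user_id
    # Count full rotation periods elapsed since the reference handoff,
    # then index into the ordered participant list modulo its length.
    periods = (now - layer.reference_start) // timedelta(days=layer.rotation_days)
    return layer.participants[periods % len(layer.participants)]

# Example: weekly rotation handing off Mondays 09:00 America/New_York.
tz = ZoneInfo("America/New_York")
layer = RotationLayer(["alice", "bob", "carol"], 7,
                      datetime(2024, 1, 1, 9, 0, tzinfo=tz))
print(resolve_on_call(layer, [], datetime(2024, 1, 9, 12, 0, tzinfo=timezone.utc)))
# -> "bob" (one full week has elapsed since the reference handoff)
```

Because reference_start is timezone-aware, the arithmetic stays correct across DST transitions: the library, not the modulo math, absorbs the offset change.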
Failure Handling and Reliability
- Scheduler durability: Escalation timers must survive service restarts. Store pending timers in the database (a scheduled_jobs table with a run_at timestamp) and poll with a SELECT FOR UPDATE SKIP LOCKED pattern to ensure exactly-once execution across multiple scheduler instances.
- Notification idempotency: Each notification attempt is logged with an idempotency key derived from (incident_id, step, contact_method). Retries skip already-delivered contacts.
- High availability: Run multiple instances of the on-call resolver and scheduler. Use leader election (via a DB advisory lock or Redis SETNX with TTL) only for the scheduler to prevent duplicate escalations. The notify path is stateless and safe to run in parallel.
- Time zone correctness: Always store timestamps in UTC. Convert to the schedule’s local timezone only at resolution time, using a timezone library backed by the IANA database (e.g., Python’s built-in zoneinfo, or pytz on older versions). DST transitions are handled by the library, not by your code.
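The idempotency-key scheme from the bullets above can be sketched in plain Python. The class names and the in-memory set are illustrative stand-ins; a real system would back the delivery log with a database table carrying a UNIQUE constraint on the key.

```python
import hashlib

def idempotency_key(incident_id: str, step: int, contact_method: str) -> str:
    """Deterministic key: the same (incident, step, contact) always hashes identically."""
    raw = f"{incident_id}:{step}:{contact_method}"
    return hashlib.sha256(raw.encode()).hexdigest()

class NotificationLog:
    """In-memory stand-in for a DB table with a UNIQUE index on the key."""
    def __init__(self) -> None:
        self._delivered: set[str] = set()

    def try_deliver(self, incident_id: str, step: int,
                    contact_method: str, send) -> bool:
        key = idempotency_key(incident_id, step, contact_method)
        if key in self._delivered:
            return False  # retry hit an already-delivered contact: skip it
        send(contact_method)          # perform the actual delivery
        self._delivered.add(key)      # record success for future retries
        return True
```

Note the ordering trade-off: recording the key after sending means a crash between the two steps yields a duplicate notification on retry (at-least-once), which is the safer failure mode for paging than recording first and risking a dropped page.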
Scalability Considerations
- Read-heavy schedule resolution: The on-call lookup is computed on every incoming alert. Cache the result per (schedule_id, 5-minute bucket) in Redis, invalidated whenever a rotation or override changes.
- Large organizations: Teams with hundreds of schedules and thousands of overrides benefit from an index on overrides(schedule_id, starts_at, ends_at) and a partial index filtering to future/current overrides only.
- Multi-region: Replicate the schedule and escalation data globally (read replicas or a multi-region DB like CockroachDB). Incident state needs strong consistency; schedule reads can tolerate a few seconds of lag.
- Audit trail: Every state transition of an incident (step change, ack, resolve) should be appended to an immutable incident_events log table. This supports post-incident review, and append-only writes are cheap.
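The 5-minute cache bucket from the first bullet can be derived purely from the timestamp, so every resolver instance computes the same Redis key without coordination. The key format here is an assumption for illustration:

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 300  # 5-minute buckets

def schedule_cache_key(schedule_id: str, now: datetime) -> str:
    # Floor the UNIX timestamp to the bucket boundary so all instances
    # agree on the key; entries naturally expire as time advances.
    bucket = int(now.timestamp()) // BUCKET_SECONDS
    return f"oncall:{schedule_id}:{bucket}"
```

Explicit invalidation on rotation or override changes can then delete the current bucket's key (or bump a per-schedule version embedded in the key) rather than waiting for the bucket to roll over.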
Summary
On-call management is fundamentally a scheduling and state machine problem. The hardest parts are computing the correct on-call person across rotation layers, overrides, and time zones, and ensuring escalation timers fire exactly once even under partial failures. Use a DB-backed job queue for durability, cache schedule resolution aggressively, and log every state transition for auditability. The system must itself be highly available — it is the last line of defense when everything else is broken.