SRE: Incident Management and On-Call — Rotation, Runbooks, Postmortems, Blameless Culture, PagerDuty, Escalation

Incident management is the process of detecting, responding to, and resolving production outages. A well-designed incident management process minimizes downtime, reduces stress on engineers, and turns incidents into learning opportunities. This guide covers on-call rotations, incident response workflows, runbooks, and blameless postmortems — essential knowledge for SRE and engineering manager interviews.

On-Call Rotation Design

On-call rotation ensures an engineer is always available to respond to production incidents. Design principles: (1) Rotation length — 1 week is the most common. Shorter rotations (2-3 days) reduce fatigue but increase handoff frequency. (2) Minimum team size — at least 6-8 engineers in the rotation to prevent burnout (on-call every 6-8 weeks). If fewer, consider combining teams or reducing on-call scope. (3) Primary and secondary — the primary on-call is the first responder. If the primary does not acknowledge within 5 minutes, escalate to the secondary. (4) Follow-the-sun — for global teams, hand off on-call to the team in the next timezone. No engineer is paged at 3 AM. (5) Compensation — on-call should be compensated (additional pay, time off, or both). Uncompensated on-call leads to resentment and attrition. (6) Handoff — at rotation change, the outgoing on-call briefs the incoming on-call on current issues, recent deployments, and known risks. PagerDuty, Opsgenie, and Grafana OnCall manage rotations, escalation policies, and paging (phone call, SMS, push notification, Slack).
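A weekly primary/secondary rotation like the one described above can be computed deterministically from a roster and a start date, so everyone can see who is on call weeks in advance. This is a minimal sketch; the roster names, start date, and 7-day length are illustrative assumptions, and real tools like PagerDuty manage this for you.

```python
from datetime import date

# Hypothetical roster of 6 engineers (the minimum the text recommends).
ROSTER = ["alice", "bob", "carol", "dave", "erin", "frank"]
ROTATION_START = date(2026, 1, 5)  # a Monday; handoff happens weekly
ROTATION_DAYS = 7                  # 1-week rotation, the most common length

def on_call(today: date) -> tuple[str, str]:
    """Return (primary, secondary) for a given date.

    The secondary is the next engineer in the rotation, so the
    escalation target is always known in advance.
    """
    weeks = (today - ROTATION_START).days // ROTATION_DAYS
    primary = ROSTER[weeks % len(ROSTER)]
    secondary = ROSTER[(weeks + 1) % len(ROSTER)]
    return primary, secondary
```

With a 6-person roster, each engineer is primary one week in six, matching the burnout guidance above.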

Incident Severity Levels

Severity levels standardize the response based on impact. (1) SEV-1 (Critical) — complete service outage or data loss affecting all users. Response: page on-call immediately, open an incident channel, assemble an incident response team, communicate to leadership every 15 minutes. Target resolution: under 1 hour. (2) SEV-2 (Major) — partial outage or severe degradation affecting a significant subset of users. Response: page on-call, open an incident channel, communicate to stakeholders every 30 minutes. Target resolution: under 4 hours. (3) SEV-3 (Minor) — minor issue affecting a small number of users or a non-critical feature. Response: create a ticket, on-call investigates during business hours. Target resolution: 1 business day. (4) SEV-4 (Low) — cosmetic issue, non-urgent improvement. Response: backlog ticket. Severity determines: who is paged (SEV-1: on-call + manager + VP engineering; SEV-2: on-call; SEV-3: ticket only), communication cadence, and escalation path. Document severity definitions in an incident response playbook so the on-call engineer can quickly classify without judgment calls at 3 AM.
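The severity definitions above are effectively a lookup table plus a small decision tree, which is exactly what lets the 3 AM on-call classify without judgment calls. A sketch, with the policy values taken from the text and the triage-question names invented for illustration:

```python
# Response policy per severity, mirroring the definitions above.
SEVERITY_POLICY = {
    "SEV-1": {"page": ["on-call", "manager", "vp-engineering"],
              "update_cadence_min": 15, "target_resolution": "1 hour"},
    "SEV-2": {"page": ["on-call"],
              "update_cadence_min": 30, "target_resolution": "4 hours"},
    "SEV-3": {"page": [],  # ticket only, investigated in business hours
              "update_cadence_min": None, "target_resolution": "1 business day"},
    "SEV-4": {"page": [],  # backlog ticket
              "update_cadence_min": None, "target_resolution": "backlog"},
}

def classify(all_users_affected: bool,
             significant_subset: bool,
             user_facing: bool) -> str:
    """Map triage answers about impact to a severity level."""
    if all_users_affected:
        return "SEV-1"
    if significant_subset:
        return "SEV-2"
    if user_facing:
        return "SEV-3"
    return "SEV-4"
```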

Incident Response Workflow

Structured incident response: (1) Detection — alert fires from monitoring (PagerDuty page) or customer report. The on-call engineer acknowledges the alert within 5 minutes. (2) Triage — assess severity and impact. How many users affected? Which services are impacted? Is data at risk? Assign severity level. (3) Communication — open a dedicated Slack channel (#incident-2026-04-20-api-outage). Post the initial assessment: what is known, what is unknown, severity level. For SEV-1/2, notify stakeholders (engineering leadership, customer support, communications team). (4) Investigation and mitigation — the on-call engineer investigates using dashboards, logs, and traces. Priority is mitigation (restore service), not root cause. Common mitigations: roll back the latest deployment, scale up capacity, restart unhealthy services, enable a feature flag to disable the broken feature. (5) Resolution — service is restored. Confirm with monitoring that metrics have returned to baseline. Communicate all-clear. (6) Follow-up — update the incident ticket with a timeline. Schedule a postmortem within 48 hours. Roles during a SEV-1 incident: Incident Commander (coordinates), Communications Lead (updates stakeholders), and Technical Lead (drives investigation).
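The workflow above is a small state machine: an incident moves from detection through triage and mitigation to resolution, accumulating a timeline along the way. This sketch models that; the state names and class shape are illustrative, though the channel naming convention comes from the text.

```python
from datetime import datetime

class Incident:
    """Minimal incident record following the workflow above."""

    # Allowed lifecycle transitions (illustrative state names).
    VALID_TRANSITIONS = {
        "detected": {"triaged"},
        "triaged": {"mitigating"},
        "mitigating": {"resolved"},
        "resolved": {"postmortem-scheduled"},
    }

    def __init__(self, slug: str, detected_at: datetime):
        self.slug = slug
        self.detected_at = detected_at
        self.state = "detected"
        self.timeline = [(detected_at, "detected")]  # feeds the postmortem

    @property
    def channel(self) -> str:
        # Dedicated Slack channel, e.g. #incident-2026-04-20-api-outage
        return f"#incident-{self.detected_at:%Y-%m-%d}-{self.slug}"

    def advance(self, new_state: str, at: datetime) -> None:
        if new_state not in self.VALID_TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.timeline.append((at, new_state))
```

Recording every transition with a timestamp means the follow-up timeline is captured as a side effect of the response, not reconstructed from memory afterward.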

Runbooks

A runbook is a step-by-step guide for diagnosing and mitigating a specific type of incident. Purpose: enable any on-call engineer (even one unfamiliar with the service) to respond effectively. Runbook template: (1) Alert name and description — “API Error Rate > 5%”. (2) Impact — what user-facing functionality is affected. (3) Quick check — is this a known issue? Check the incident channel and recent deployment history. (4) Diagnosis steps — check the error rate dashboard (link), check recent deployments (link to deployment history), check database health (link to RDS metrics), check dependency health (link to upstream service dashboard). (5) Mitigation steps — if caused by a deployment: roll back using (specific command or link). If the database is overloaded: enable read-only mode. If an upstream dependency is down: enable the circuit breaker fallback. (6) Escalation — if the above steps do not resolve the issue, page the service team lead (name, PagerDuty handle). Runbook maintenance: review runbooks quarterly. After every incident, update the runbook if the diagnosis or mitigation steps were missing or incorrect. Stale runbooks are worse than no runbook (they waste time on wrong steps). Link every alert to its runbook in the alert metadata (PagerDuty custom details).
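The quarterly-review rule above is easy to enforce mechanically: track when each runbook was last reviewed and flag the ones past the interval. A minimal sketch; the runbook records and 90-day interval are illustrative assumptions.

```python
from datetime import date

REVIEW_INTERVAL_DAYS = 90  # "review runbooks quarterly"

def stale_runbooks(runbooks: list[dict], today: date) -> list[str]:
    """Return names of runbooks whose last review exceeds the interval.

    Each runbook record is assumed to carry a 'name' and a
    'last_reviewed' date (hypothetical schema).
    """
    return [rb["name"] for rb in runbooks
            if (today - rb["last_reviewed"]).days > REVIEW_INTERVAL_DAYS]
```

Wiring a check like this into CI or a weekly report catches stale runbooks before the 3 AM page does.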

Blameless Postmortems

A postmortem is a structured review conducted after every SEV-1 and SEV-2 incident. Purpose: understand what happened, why it happened, and how to prevent recurrence. “Blameless” means focusing on systems and processes, not individual mistakes. Postmortem template: (1) Summary — one paragraph describing the incident. (2) Impact — duration, users affected, business impact (revenue lost, SLA breach). (3) Timeline — detailed chronology from detection to resolution. (4) Root cause — the technical root cause (a missing database index caused a full table scan under load). (5) Contributing factors — process failures that amplified the impact (the monitoring alert threshold was too high, the runbook was outdated). (6) What went well — what worked during the response (fast detection, effective rollback). (7) Action items — specific, assigned, time-bound tasks to prevent recurrence. “Add the missing index (owner: Alice, due: April 25)” not “improve database performance.” Review the postmortem in a team meeting. Discuss openly. The goal is learning, not punishment. If an engineer made a mistake, ask: what about the system allowed this mistake to cause an outage? The fix is in the system (better testing, better guards), not in the person. For example, instead of “Bob deployed a bad config,” write “the deployment process allowed a config change to ship without validation” — with the action item “add schema validation to the config deployment pipeline.”
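The "specific, assigned, time-bound" rule for action items can itself be checked programmatically. A sketch, assuming a simple record shape and an illustrative list of vague phrases to reject:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    due: date

    def is_overdue(self, today: date) -> bool:
        return today > self.due

# Illustrative blocklist of vague phrasing; a real check might be richer.
VAGUE_WORDS = ("improve", "investigate further", "be more careful")

def validate(item: ActionItem) -> None:
    """Reject unowned or vague action items."""
    if not item.owner:
        raise ValueError("action item needs a named owner")
    if any(w in item.description.lower() for w in VAGUE_WORDS):
        raise ValueError("action item is too vague; name a concrete change")
```

Reviewing open action items for overdue entries at each team meeting keeps postmortems from becoming write-only documents.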
