SRE: Incident Management and On-Call — Rotation, Runbooks, Postmortems, Blameless Culture, PagerDuty, Escalation

Incident management is the process of detecting, responding to, and resolving production outages. A well-designed incident management process minimizes downtime, reduces stress on engineers, and turns incidents into learning opportunities. This guide covers on-call rotations, incident response workflows, runbooks, and blameless postmortems — essential knowledge for SRE and engineering manager interviews.

On-Call Rotation Design

On-call rotation ensures an engineer is always available to respond to production incidents. Design principles: (1) Rotation length — 1 week is the most common. Shorter rotations (2-3 days) reduce fatigue but increase handoff frequency. (2) Minimum team size — at least 6-8 engineers in the rotation to prevent burnout (on-call every 6-8 weeks). If fewer, consider combining teams or reducing on-call scope. (3) Primary and secondary — the primary on-call is the first responder. If the primary does not acknowledge within 5 minutes, escalate to the secondary. (4) Follow-the-sun — for global teams, hand off on-call to the team in the next timezone. No engineer is paged at 3 AM. (5) Compensation — on-call should be compensated (additional pay, time off, or both). Uncompensated on-call leads to resentment and attrition. (6) Handoff — at rotation change, the outgoing on-call briefs the incoming on-call on current issues, recent deployments, and known risks. PagerDuty, Opsgenie, and Grafana OnCall manage rotations, escalation policies, and paging (phone call, SMS, push notification, Slack).
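A weekly primary/secondary rotation like the one described above can be computed deterministically from a roster and a start date, so everyone can see who is on call weeks in advance. This is a minimal sketch; the roster names, start date, and 7-day length are illustrative assumptions, and real tools like PagerDuty manage this for you.

```python
from datetime import date

# Hypothetical roster of 6 engineers (the minimum the text recommends).
ROSTER = ["alice", "bob", "carol", "dave", "erin", "frank"]
ROTATION_START = date(2026, 1, 5)  # a Monday; handoff happens weekly
ROTATION_DAYS = 7                  # 1-week rotation, the most common length

def on_call(today: date) -> tuple[str, str]:
    """Return (primary, secondary) for a given date.

    The secondary is the next engineer in the rotation, so the
    escalation target is always known in advance.
    """
    weeks = (today - ROTATION_START).days // ROTATION_DAYS
    primary = ROSTER[weeks % len(ROSTER)]
    secondary = ROSTER[(weeks + 1) % len(ROSTER)]
    return primary, secondary
```

With a 6-person roster, each engineer is primary one week in six, matching the burnout guidance above.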

Incident Severity Levels

Severity levels standardize the response based on impact. (1) SEV-1 (Critical) — complete service outage or data loss affecting all users. Response: page on-call immediately, open an incident channel, assemble an incident response team, communicate to leadership every 15 minutes. Target resolution: under 1 hour. (2) SEV-2 (Major) — partial outage or severe degradation affecting a significant subset of users. Response: page on-call, open an incident channel, communicate to stakeholders every 30 minutes. Target resolution: under 4 hours. (3) SEV-3 (Minor) — minor issue affecting a small number of users or a non-critical feature. Response: create a ticket, on-call investigates during business hours. Target resolution: 1 business day. (4) SEV-4 (Low) — cosmetic issue, non-urgent improvement. Response: backlog ticket. Severity determines: who is paged (SEV-1: on-call + manager + VP engineering; SEV-2: on-call; SEV-3: ticket only), communication cadence, and escalation path. Document severity definitions in an incident response playbook so the on-call engineer can quickly classify without judgment calls at 3 AM.
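The severity definitions above are effectively a lookup table plus a small decision tree, which is exactly what lets the 3 AM on-call classify without judgment calls. A sketch, with the policy values taken from the text and the triage-question names invented for illustration:

```python
# Response policy per severity, mirroring the definitions above.
SEVERITY_POLICY = {
    "SEV-1": {"page": ["on-call", "manager", "vp-engineering"],
              "update_cadence_min": 15, "target_resolution": "1 hour"},
    "SEV-2": {"page": ["on-call"],
              "update_cadence_min": 30, "target_resolution": "4 hours"},
    "SEV-3": {"page": [],  # ticket only, investigated in business hours
              "update_cadence_min": None, "target_resolution": "1 business day"},
    "SEV-4": {"page": [],  # backlog ticket
              "update_cadence_min": None, "target_resolution": "backlog"},
}

def classify(all_users_affected: bool,
             significant_subset: bool,
             user_facing: bool) -> str:
    """Map triage answers about impact to a severity level."""
    if all_users_affected:
        return "SEV-1"
    if significant_subset:
        return "SEV-2"
    if user_facing:
        return "SEV-3"
    return "SEV-4"
```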

Incident Response Workflow

Structured incident response: (1) Detection — alert fires from monitoring (PagerDuty page) or customer report. The on-call engineer acknowledges the alert within 5 minutes. (2) Triage — assess severity and impact. How many users affected? Which services are impacted? Is data at risk? Assign severity level. (3) Communication — open a dedicated Slack channel (#incident-2026-04-20-api-outage). Post the initial assessment: what is known, what is unknown, severity level. For SEV-1/2, notify stakeholders (engineering leadership, customer support, communications team). (4) Investigation and mitigation — the on-call engineer investigates using dashboards, logs, and traces. Priority is mitigation (restore service), not root cause. Common mitigations: roll back the latest deployment, scale up capacity, restart unhealthy services, enable a feature flag to disable the broken feature. (5) Resolution — service is restored. Confirm with monitoring that metrics have returned to baseline. Communicate all-clear. (6) Follow-up — update the incident ticket with a timeline. Schedule a postmortem within 48 hours. Roles during a SEV-1 incident: Incident Commander (coordinates), Communications Lead (updates stakeholders), and Technical Lead (drives investigation).
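The workflow above is a small state machine: an incident moves from detection through triage and mitigation to resolution, accumulating a timeline along the way. This sketch models that; the state names and class shape are illustrative, though the channel naming convention comes from the text.

```python
from datetime import datetime

class Incident:
    """Minimal incident record following the workflow above."""

    # Allowed lifecycle transitions (illustrative state names).
    VALID_TRANSITIONS = {
        "detected": {"triaged"},
        "triaged": {"mitigating"},
        "mitigating": {"resolved"},
        "resolved": {"postmortem-scheduled"},
    }

    def __init__(self, slug: str, detected_at: datetime):
        self.slug = slug
        self.detected_at = detected_at
        self.state = "detected"
        self.timeline = [(detected_at, "detected")]  # feeds the postmortem

    @property
    def channel(self) -> str:
        # Dedicated Slack channel, e.g. #incident-2026-04-20-api-outage
        return f"#incident-{self.detected_at:%Y-%m-%d}-{self.slug}"

    def advance(self, new_state: str, at: datetime) -> None:
        if new_state not in self.VALID_TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.timeline.append((at, new_state))
```

Recording every transition with a timestamp means the follow-up timeline is captured as a side effect of the response, not reconstructed from memory afterward.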

Runbooks

A runbook is a step-by-step guide for diagnosing and mitigating a specific type of incident. Purpose: enable any on-call engineer (even one unfamiliar with the service) to respond effectively. Runbook template: (1) Alert name and description — “API Error Rate > 5%”. (2) Impact — what user-facing functionality is affected. (3) Quick check — is this a known issue? Check the incident channel and recent deployment history. (4) Diagnosis steps — check the error rate dashboard (link), check recent deployments (link to deployment history), check database health (link to RDS metrics), check dependency health (link to upstream service dashboard). (5) Mitigation steps — if caused by a deployment: roll back using (specific command or link). If the database is overloaded: enable read-only mode. If an upstream dependency is down: enable the circuit breaker fallback. (6) Escalation — if the above steps do not resolve the issue, page the service team lead (name, PagerDuty handle). Runbook maintenance: review runbooks quarterly. After every incident, update the runbook if the diagnosis or mitigation steps were missing or incorrect. Stale runbooks are worse than no runbook (they waste time on wrong steps). Link every alert to its runbook in the alert metadata (PagerDuty custom details).
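The quarterly-review rule above is easy to enforce mechanically: track when each runbook was last reviewed and flag the ones past the interval. A minimal sketch; the runbook records and 90-day interval are illustrative assumptions.

```python
from datetime import date

REVIEW_INTERVAL_DAYS = 90  # "review runbooks quarterly"

def stale_runbooks(runbooks: list[dict], today: date) -> list[str]:
    """Return names of runbooks whose last review exceeds the interval.

    Each runbook record is assumed to carry a 'name' and a
    'last_reviewed' date (hypothetical schema).
    """
    return [rb["name"] for rb in runbooks
            if (today - rb["last_reviewed"]).days > REVIEW_INTERVAL_DAYS]
```

Wiring a check like this into CI or a weekly report catches stale runbooks before the 3 AM page does.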

Blameless Postmortems

A postmortem is a structured review conducted after every SEV-1 and SEV-2 incident. Purpose: understand what happened, why it happened, and how to prevent recurrence. “Blameless” means focusing on systems and processes, not individual mistakes. Postmortem template: (1) Summary — one paragraph describing the incident. (2) Impact — duration, users affected, business impact (revenue lost, SLA breach). (3) Timeline — detailed chronology from detection to resolution. (4) Root cause — the technical root cause (a missing database index caused a full table scan under load). (5) Contributing factors — process failures that amplified the impact (the monitoring alert threshold was too high, the runbook was outdated). (6) What went well — what worked during the response (fast detection, effective rollback). (7) Action items — specific, assigned, time-bound tasks to prevent recurrence. “Add the missing index (owner: Alice, due: April 25)” not “improve database performance.” Review the postmortem in a team meeting. Discuss openly. The goal is learning, not punishment. If an engineer made a mistake, ask: what about the system allowed this mistake to cause an outage? The fix is in the system (better testing, better guards), not in the person. For example, instead of “Bob deployed a bad config,” write “the deployment process allowed a config change to ship without validation” — with the action item “add schema validation to the config deployment pipeline.”
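The "specific, assigned, time-bound" rule for action items can itself be checked programmatically. A sketch, assuming a simple record shape and an illustrative list of vague phrases to reject:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    due: date

    def is_overdue(self, today: date) -> bool:
        return today > self.due

# Illustrative blocklist of vague phrasing; a real check might be richer.
VAGUE_WORDS = ("improve", "investigate further", "be more careful")

def validate(item: ActionItem) -> None:
    """Reject unowned or vague action items."""
    if not item.owner:
        raise ValueError("action item needs a named owner")
    if any(w in item.description.lower() for w in VAGUE_WORDS):
        raise ValueError("action item is too vague; name a concrete change")
```

Reviewing open action items for overdue entries at each team meeting keeps postmortems from becoming write-only documents.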
