How an engineering manager handles production incidents shapes the team’s reliability culture for years. Incidents are also a reliable interview topic — “tell me about an outage you handled” or “how do you run a post-mortem culture?” Strong answers come from people who have actually done the work, not just read the SRE book.
The incident lifecycle
- Detection: alert fires or user report comes in
- Triage: assess severity, page the right people
- Mitigation: stop the bleeding
- Resolution: permanent fix
- Post-mortem: learn from the incident
- Prevention: changes to reduce recurrence
Severity definitions
A clear sev classification keeps the team aligned:
- Sev 1: Site is down or major feature unusable for many users
- Sev 2: Significant degradation; some users affected
- Sev 3: Minor degradation or partial outage
- Sev 4: No customer impact; needs follow-up
Document these definitions and apply them consistently. Avoid inflating a sev to create social pressure; a Sev 2 that is really a Sev 3 erodes trust in the whole ladder.
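A lightweight way to keep the ladder consistent is to codify it next to your paging tooling rather than only in a wiki. A minimal sketch in Python; the structure and field names are assumptions for illustration, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    level: int
    description: str
    pages_oncall: bool          # does this sev page someone immediately?
    requires_postmortem: bool   # does this sev require a post-mortem?

# Codified severity ladder; descriptions mirror the written policy above.
SEVERITIES = {
    1: Severity(1, "Site down or major feature unusable for many users", True, True),
    2: Severity(2, "Significant degradation; some users affected", True, True),
    3: Severity(3, "Minor degradation or partial outage", False, False),
    4: Severity(4, "No customer impact; needs follow-up", False, False),
}

def classify(level: int) -> Severity:
    """Look up a severity; fail loudly on undefined levels."""
    try:
        return SEVERITIES[level]
    except KeyError:
        raise ValueError(f"Undefined severity: {level}; see the incident policy")
```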
Incident command structure
For Sev 1 / Sev 2 incidents, designate roles:
- Incident Commander (IC): coordinates response, makes decisions. Does not write code during the incident.
- Communications lead: updates stakeholders (status page, internal Slack, support team)
- Subject matter experts: the engineers actually diagnosing and fixing the problem
- Scribe: documents the timeline as it happens
The EM may take any of these roles depending on the team’s incident maturity.
The mitigation-first principle
During an incident, prioritize stopping the bleeding over finding the root cause. Often: roll back the recent deploy, switch traffic to a healthy region, disable the problematic feature. Diagnose afterward.
“Why did this happen?” is a post-incident question. “How do we make it stop now?” is the during-incident question.
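Mitigation-first is easiest when the common moves are pre-built as one-step actions. A minimal sketch of a mitigation ladder, assuming hypothetical `flag_client` and `deploys` objects that stand in for your feature-flag and deploy tooling:

```python
def mitigate(suspect_feature: str, flag_client, deploys) -> str:
    """Apply the fastest known-safe mitigation, in order of preference.

    Every step here is reversible and requires no root-cause
    understanding; diagnosis happens after the bleeding stops.
    """
    # 1. Cheapest and most reversible: turn the suspect feature off.
    if flag_client.exists(suspect_feature):
        flag_client.disable(suspect_feature)
        return f"disabled feature flag {suspect_feature!r}"

    # 2. Next: roll back the most recent deploy.
    last = deploys.latest()
    if last is not None and last.is_rollbackable():
        deploys.rollback(last)
        return f"rolled back deploy {last.id}"

    # 3. No automatic mitigation applies: hand the decision to the IC.
    return "no automatic mitigation available; escalating to IC"
```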
Communication during incidents
- Update stakeholders every 15–30 minutes during active incidents (a reminder sketch follows this list)
- Use a public-facing status page with appropriate detail
- Internal Slack channel dedicated to the incident
- Be honest — “we are still investigating” is better than vague reassurance
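The cadence is easier to hold when something nags the comms lead. A minimal sketch using Python's standard library; `WEBHOOK_URL` is a hypothetical placeholder, and any chat tool with an incoming-webhook HTTP API works the same way:

```python
import json
import time
import urllib.request

WEBHOOK_URL = "https://hooks.example.com/incident-channel"  # placeholder
UPDATE_INTERVAL_SECONDS = 20 * 60  # within the 15-30 minute cadence

def post_reminder(message: str) -> None:
    """Post a reminder into the incident channel via an incoming webhook."""
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

while True:
    post_reminder("Comms lead: time for a stakeholder update "
                  "(status page + internal channel).")
    time.sleep(UPDATE_INTERVAL_SECONDS)
```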
The blameless post-mortem
Within 48 hours of a Sev 1 / Sev 2 incident:
- Schedule a 60-minute post-mortem with the responders + EM + stakeholders
- Write and circulate the document before the meeting, so the hour goes to discussion rather than reading
- Review the timeline, root cause, and contributing factors
- Identify action items with owners and dates
- Publish broadly
Blameless does not mean accountability-free. It means focusing on systemic factors, not individuals. “The runbook was unclear” is constructive. “Sarah misread the runbook” is not.
Action items
Common categories:
- Detection improvements (better alerts)
- Mitigation improvements (faster rollback, better runbooks)
- Architectural improvements (reduce blast radius)
- Process improvements (better deploy gates, better on-call hand-off)
- Training improvements (game days, runbook reviews)
Track action items to completion. The most common failure is identifying them and not following through.
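Following through is easier when overdue items surface automatically instead of living in someone's memory. A minimal sketch; the hard-coded items are illustrative, and in practice they would come from your issue tracker:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open action items past their due date."""
    return [i for i in items if not i.done and i.due < today]

# Illustrative items; real ones would be pulled from the issue tracker.
items = [
    ActionItem("Add rollback runbook for payments service", "alex", date(2024, 5, 1)),
    ActionItem("Demote disk-usage alert to non-paging", "sam", date(2024, 6, 15), done=True),
]

for item in overdue(items, today=date(2024, 6, 1)):
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.due})")
```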
Alert fatigue
If your team is paged 3+ times per shift, you have alert fatigue. Solutions:
- Audit alerts quarterly — which fire and what is the action?
- Demote noisy alerts to non-paging notifications
- Add automatic remediation for common alerts, e.g. auto-restart or auto-scale (see the sketch after this list)
- Set an explicit goal: average page volume of N or fewer per shift
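Auto-remediation can start as a small map from known alert names to safe, idempotent actions. A minimal sketch; `restart_service` and `scale_out` are placeholders for your orchestrator's API:

```python
def restart_service(name: str) -> None:
    print(f"restarting {name}")  # placeholder for your orchestrator's API

def scale_out(name: str, replicas: int) -> None:
    print(f"scaling {name} to {replicas}")  # placeholder

# Known-noisy alerts mapped to safe, idempotent remediations.
# Anything not in the map still pages a human.
REMEDIATIONS = {
    "worker-oom": lambda: restart_service("worker"),
    "queue-depth-high": lambda: scale_out("worker", 10),
}

def handle_alert(alert_name: str) -> bool:
    """Try auto-remediation; return False to fall through to paging."""
    action = REMEDIATIONS.get(alert_name)
    if action is None:
        return False  # unknown alert: page a human
    action()
    return True
```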
Game days
Periodically inject failures and practice the response. Game days build muscle memory, surface broken runbooks, and validate alerting. Budget 1–2 hours per quarter.
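You can start fault injection without a chaos-engineering platform. A minimal sketch of a decorator that injects errors and latency into a call path during a game-day window; the rates and the wrapped function are illustrative:

```python
import functools
import random
import time

def inject_faults(error_rate: float = 0.1, latency_s: float = 2.0,
                  latency_rate: float = 0.2):
    """Decorator that randomly injects failures and latency into a call.

    Enable only during a scheduled game-day window.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < error_rate:
                raise RuntimeError("game-day fault injection")
            if random.random() < latency_rate:
                time.sleep(latency_s)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(error_rate=0.2)
def fetch_user(user_id: int) -> dict:
    return {"id": user_id}  # stand-in for a real dependency call
```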
The EM’s role during an incident
- Be available; do not be in the way
- Help with stakeholder communications so engineers can fix
- Make resource decisions (call in additional people, escalate)
- Take notes so the post-mortem has a real timeline
If you are the only engineer who can fix it, your team has a single-point-of-failure problem to address.
Frequently Asked Questions
Should I be on the on-call rotation as an EM?
It varies. Some companies put EMs in the rotation (Stripe and Meta, for example); others keep managers off the rotation but expect them to stay closely involved, which is common for senior managers at large tech companies. Either way, you should be able to respond to a Sev 1 page personally if needed.
How do I balance new feature work with incident-driven work?
Allocate a fixed % of capacity (10–20%) to reliability work. Track incident-driven action items as first-class backlog items. If incident work crowds out feature work for more than 2 quarters, escalate.
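A concrete illustration of that reservation, with made-up team numbers:

```python
engineers = 6
days_per_sprint = 10
reliability_fraction = 0.15  # within the 10-20% band

capacity = engineers * days_per_sprint      # 60 engineer-days per sprint
reserved = capacity * reliability_fraction  # 9 engineer-days for reliability
print(f"Reserve {reserved:.0f} of {capacity} engineer-days for reliability work")
```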
What if my team’s reliability is bad and the company does not prioritize it?
Document the cost — engineer-hours spent on incidents, customer impact, on-call burnout. Bring data to your leadership. If still not prioritized, decide whether the role is right for you.