How an engineering manager handles production incidents shapes the team’s reliability culture for years. Incidents are also a reliable interview topic — “tell me about an outage you handled” or “how do you run a post-mortem culture?” Strong answers come from people who have actually done the work, not just read the SRE book.
The incident lifecycle
- Detection: alert fires or user report comes in
- Triage: assess severity, page the right people
- Mitigation: stop the bleeding
- Resolution: permanent fix
- Post-mortem: learn from the incident
- Prevention: changes to reduce recurrence
Severity definitions
A clear sev classification keeps the team aligned:
- Sev 1: Site is down or major feature unusable for many users
- Sev 2: Significant degradation; some users affected
- Sev 3: Minor degradation or partial outage
- Sev 4: No customer impact; needs follow-up
Document these definitions and apply them consistently. Avoid inflating a sev to create social pressure; a Sev 2 that is really a Sev 3 erodes trust in the whole ladder.
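A lightweight way to keep the ladder consistent is to codify it next to your paging tooling rather than only in a wiki. A minimal sketch in Python; the structure and field names are assumptions for illustration, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    level: int
    description: str
    pages_oncall: bool          # does this sev page someone immediately?
    requires_postmortem: bool   # does this sev require a post-mortem?

# Codified severity ladder; descriptions mirror the written policy above.
SEVERITIES = {
    1: Severity(1, "Site down or major feature unusable for many users", True, True),
    2: Severity(2, "Significant degradation; some users affected", True, True),
    3: Severity(3, "Minor degradation or partial outage", False, False),
    4: Severity(4, "No customer impact; needs follow-up", False, False),
}

def classify(level: int) -> Severity:
    """Look up a severity; fail loudly on undefined levels."""
    try:
        return SEVERITIES[level]
    except KeyError:
        raise ValueError(f"Undefined severity: {level}; see the incident policy")
```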
Incident command structure
For Sev 1 / Sev 2 incidents, designate roles:
- Incident Commander (IC): coordinates response, makes decisions. Does not write code during the incident.
- Communications lead: updates stakeholders (status page, internal Slack, support team)
- Subject matter experts: the engineers actually diagnosing and fixing the problem
- Scribe: documents the timeline as it happens
The EM may take any of these roles depending on the team’s incident maturity.
The mitigation-first principle
During an incident, prioritize stopping the bleeding over finding the root cause. Often: roll back the recent deploy, switch traffic to a healthy region, disable the problematic feature. Diagnose afterward.
“Why did this happen?” is a post-incident question. “How do we make it stop now?” is the during-incident question.
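Mitigation-first is easiest when the common moves are pre-built as one-step actions. A minimal sketch of a mitigation ladder, assuming hypothetical `flag_client` and `deploys` objects that stand in for your feature-flag and deploy tooling:

```python
def mitigate(suspect_feature: str, flag_client, deploys) -> str:
    """Apply the fastest known-safe mitigation, in order of preference.

    Every step here is reversible and requires no root-cause
    understanding; diagnosis happens after the bleeding stops.
    """
    # 1. Cheapest and most reversible: turn the suspect feature off.
    if flag_client.exists(suspect_feature):
        flag_client.disable(suspect_feature)
        return f"disabled feature flag {suspect_feature!r}"

    # 2. Next: roll back the most recent deploy.
    last = deploys.latest()
    if last is not None and last.is_rollbackable():
        deploys.rollback(last)
        return f"rolled back deploy {last.id}"

    # 3. No automatic mitigation applies: hand the decision to the IC.
    return "no automatic mitigation available; escalating to IC"
```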
Communication during incidents
- Update stakeholders every 15–30 minutes during active incidents (a reminder sketch follows this list)
- Use a public-facing status page with appropriate detail
- Internal Slack channel dedicated to the incident
- Be honest — “we are still investigating” is better than vague reassurance
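The cadence is easier to hold when something nags the comms lead. A minimal sketch using Python's standard library; `WEBHOOK_URL` is a hypothetical placeholder, and any chat tool with an incoming-webhook HTTP API works the same way:

```python
import json
import time
import urllib.request

WEBHOOK_URL = "https://hooks.example.com/incident-channel"  # placeholder
UPDATE_INTERVAL_SECONDS = 20 * 60  # within the 15-30 minute cadence

def post_reminder(message: str) -> None:
    """Post a reminder into the incident channel via an incoming webhook."""
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

while True:
    post_reminder("Comms lead: time for a stakeholder update "
                  "(status page + internal channel).")
    time.sleep(UPDATE_INTERVAL_SECONDS)
```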
The blameless post-mortem
Within 48 hours of a Sev 1 / Sev 2 incident:
- Schedule a 60-minute post-mortem with the responders + EM + stakeholders
- Write and circulate the document before the meeting, so the hour goes to discussion rather than reading
- Review the timeline, root cause, and contributing factors
- Identify action items with owners and dates
- Publish broadly
Blameless does not mean accountability-free. It means focusing on systemic factors, not individuals. “The runbook was unclear” is constructive. “Sarah misread the runbook” is not.
Action items
Common categories:
- Detection improvements (better alerts)
- Mitigation improvements (faster rollback, better runbooks)
- Architectural improvements (reduce blast radius)
- Process improvements (better deploy gates, better on-call hand-off)
- Training improvements (game days, runbook reviews)
Track action items to completion. The most common failure is identifying them and not following through.
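Following through is easier when overdue items surface automatically instead of living in someone's memory. A minimal sketch; the hard-coded items are illustrative, and in practice they would come from your issue tracker:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open action items past their due date."""
    return [i for i in items if not i.done and i.due < today]

# Illustrative items; real ones would be pulled from the issue tracker.
items = [
    ActionItem("Add rollback runbook for payments service", "alex", date(2024, 5, 1)),
    ActionItem("Demote disk-usage alert to non-paging", "sam", date(2024, 6, 15), done=True),
]

for item in overdue(items, today=date(2024, 6, 1)):
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due {item.due})")
```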
Alert fatigue
If your team is paged 3+ times per shift, you have alert fatigue. Solutions:
- Audit alerts quarterly — which fire and what is the action?
- Demote noisy alerts to non-paging notifications
- Add automatic remediation for common alerts, e.g. auto-restart or auto-scale (see the sketch after this list)
- Set an explicit goal: average page volume of N or fewer per shift
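Auto-remediation can start as a small map from known alert names to safe, idempotent actions. A minimal sketch; `restart_service` and `scale_out` are placeholders for your orchestrator's API:

```python
def restart_service(name: str) -> None:
    print(f"restarting {name}")  # placeholder for your orchestrator's API

def scale_out(name: str, replicas: int) -> None:
    print(f"scaling {name} to {replicas}")  # placeholder

# Known-noisy alerts mapped to safe, idempotent remediations.
# Anything not in the map still pages a human.
REMEDIATIONS = {
    "worker-oom": lambda: restart_service("worker"),
    "queue-depth-high": lambda: scale_out("worker", 10),
}

def handle_alert(alert_name: str) -> bool:
    """Try auto-remediation; return False to fall through to paging."""
    action = REMEDIATIONS.get(alert_name)
    if action is None:
        return False  # unknown alert: page a human
    action()
    return True
```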
Game days
Periodically inject failures and practice the response. Game days build muscle memory, surface broken runbooks, and validate alerting. Budget 1–2 hours per quarter.
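You can start fault injection without a chaos-engineering platform. A minimal sketch of a decorator that injects errors and latency into a call path during a game-day window; the rates and the wrapped function are illustrative:

```python
import functools
import random
import time

def inject_faults(error_rate: float = 0.1, latency_s: float = 2.0,
                  latency_rate: float = 0.2):
    """Decorator that randomly injects failures and latency into a call.

    Enable only during a scheduled game-day window.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < error_rate:
                raise RuntimeError("game-day fault injection")
            if random.random() < latency_rate:
                time.sleep(latency_s)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(error_rate=0.2)
def fetch_user(user_id: int) -> dict:
    return {"id": user_id}  # stand-in for a real dependency call
```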
The EM’s role during an incident
- Be available; do not be in the way
- Help with stakeholder communications so engineers can fix
- Make resource decisions (call in additional people, escalate)
- Take notes so the post-mortem has a real timeline
If you are the only engineer who can fix it, your team has a single-point-of-failure problem to address.
Frequently Asked Questions
Should I be on the on-call rotation as an EM?
It varies. Some companies put EMs in the rotation (Stripe and Meta, for example); others keep managers off the rotation but expect them to stay closely involved, which is common for senior managers at large tech companies. Either way, you should be able to respond to a Sev 1 page personally if needed.
How do I balance new feature work with incident-driven work?
Allocate a fixed % of capacity (10–20%) to reliability work. Track incident-driven action items as first-class backlog items. If incident work crowds out feature work for more than 2 quarters, escalate.
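A concrete illustration of that reservation, with made-up team numbers:

```python
engineers = 6
days_per_sprint = 10
reliability_fraction = 0.15  # within the 10-20% band

capacity = engineers * days_per_sprint      # 60 engineer-days per sprint
reserved = capacity * reliability_fraction  # 9 engineer-days for reliability
print(f"Reserve {reserved:.0f} of {capacity} engineer-days for reliability work")
```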
What if my team’s reliability is bad and the company does not prioritize it?
Document the cost — engineer-hours spent on incidents, customer impact, on-call burnout. Bring data to your leadership. If still not prioritized, decide whether the role is right for you.