Production Readiness Reviews: An EM’s Guide

A production readiness review (PRR) is a structured assessment of a service's operational fitness, run before launching a new service or major feature. The discipline catches the boring-but-fatal issues that a launch dashboard does not: monitoring, runbooks, on-call coverage, capacity planning, security review. Senior EM interviews increasingly probe whether you have run real PRRs.

What a PRR covers

The standard categories:

  1. Monitoring and alerting
  2. Logging and tracing
  3. Capacity and scaling
  4. Failure modes and recovery
  5. Security and compliance
  6. Operational ownership
  7. Documentation and runbooks
  8. Rollout strategy

The checklist

Monitoring

  • Latency dashboards (p50, p99, p99.9)
  • Error rate dashboards
  • Saturation and capacity dashboards
  • SLOs defined for the service
  • Alerts paging the right people for SLO violations
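The SLO items above come down to error-budget math: a 99.9% SLO over a window of one million requests allows 1,000 failures, and alerts should fire as that budget burns down. A minimal sketch of the arithmetic (the numbers and function name are illustrative, not from any particular monitoring tool):

```python
# Sketch: the error-budget math behind an SLO alert. Numbers are illustrative.

def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the window's error budget still unspent."""
    budget = (1 - slo_target) * total_requests      # allowed failures this window
    if budget == 0:
        return 0.0
    return max(0.0, 1 - failed_requests / budget)

# A 99.9% SLO over 1M requests allows 1,000 failures; 400 spent leaves 60%.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"{remaining:.0%} of error budget left")
```

Pages should key off the burn rate of this budget, not raw error counts, so that the people on the hook for the SLO are the ones woken up.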

Logging and tracing

  • Structured logs at appropriate levels
  • Request IDs propagated end-to-end
  • Distributed tracing instrumentation
  • Sensitive data scrubbing in logs
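Three of these items (structured logs, request-ID propagation, scrubbing) can be sketched in one helper. The field names and redaction list below are illustrative assumptions, not a standard:

```python
# Sketch: structured log line with a propagated request ID and field scrubbing.
# SENSITIVE_KEYS and the field names are illustrative, not a standard schema.
import json

SENSITIVE_KEYS = {"email", "ssn", "auth_token"}

def log_event(request_id: str, event: str, **fields) -> str:
    """Emit one JSON log line, redacting known-sensitive fields."""
    scrubbed = {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v)
                for k, v in fields.items()}
    record = {"request_id": request_id, "event": event, **scrubbed}
    return json.dumps(record)

line = log_event("req-42", "checkout.failed", user="u123", email="a@b.com")
print(line)  # email appears as "[REDACTED]"; request_id ties the line to a trace
```

The same request ID should appear in every log line and trace span for the request, which is what makes end-to-end debugging possible.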

Capacity

  • Load testing performed
  • Headroom over expected peak (typically 2x)
  • Auto-scaling configured
  • Cost projection
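The headroom check is a single comparison worth making explicit in the readiness doc: did load testing actually demonstrate the target multiple of expected peak? A sketch with made-up traffic numbers:

```python
# Sketch: the 2x-headroom check from the list above. Traffic numbers are made up.

def has_headroom(peak_rps: float, load_tested_rps: float, factor: float = 2.0) -> bool:
    """True if load tests demonstrated `factor`x the expected peak throughput."""
    return load_tested_rps >= factor * peak_rps

print(has_headroom(peak_rps=1_200, load_tested_rps=2_000))  # False: only ~1.7x
print(has_headroom(peak_rps=1_200, load_tested_rps=2_600))  # True
```

The point is that "capacity" in the checklist means a demonstrated number from a load test, not an assumption.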

Failure modes

  • Documented for each dependency: what happens when it fails?
  • Circuit breakers, timeouts, retries set sensibly
  • Dead-letter queues for async work
  • Rollback procedure tested
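"Set sensibly" for timeouts, retries, and circuit breakers is easier to review when the policy is written down. A minimal sketch of one pattern, with thresholds that are illustrative and should be tuned per dependency:

```python
# Sketch: capped retries plus a simple circuit breaker for one dependency.
# Thresholds and backoff values are illustrative; tune them per dependency.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at = None    # half-open: let one probe request through
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

def call_with_retries(breaker: CircuitBreaker, fn, attempts: int = 3):
    for i in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open; failing fast")
        try:
            result = fn()
            breaker.record(True)
            return result
        except Exception:
            breaker.record(False)
            time.sleep(min(2 ** i * 0.1, 1.0))   # capped exponential backoff
    raise RuntimeError("dependency unavailable after retries")
```

The PRR question for each dependency is then concrete: what are the timeout, the retry cap, and the breaker threshold, and what does the caller do when the circuit is open?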

Security

  • Threat model exists
  • Auth and authz reviewed
  • Secrets managed properly
  • Dependencies scanned
  • Compliance requirements (SOC2, HIPAA, etc.) addressed

Ownership

  • On-call rotation defined
  • Escalation policy defined in the paging tool
  • Team understands the service well enough to debug
  • Service has an owner team in your service catalog

Documentation

  • Architecture overview
  • Operational runbook (common alerts and what to do)
  • Incident response playbook
  • Customer-facing docs

Rollout

  • Feature flag controls launch
  • Canary or percentage-based rollout plan
  • Rollback plan
  • Communication plan if user-visible
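A percentage-based rollout needs deterministic bucketing so a user does not flip between the old and new experience on every request. One common sketch, hashing user ID and flag name together (the flag name and ramp schedule below are illustrative):

```python
# Sketch: deterministic percentage rollout via hashing. Flag name is illustrative.
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Stable bucketing: the same user always lands in the same bucket per flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Ramp: 1% -> 10% -> 50% -> 100%, watching dashboards between each step.
print(in_rollout("user-123", "new-checkout", 100))  # True: 100% includes everyone
```

Hashing flag name and user ID together also means different flags ramp to different user populations, so one cohort of users is not always the guinea pig.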

Who runs the PRR

Common patterns:

  • Self-service: the team fills out the checklist; the EM signs off
  • Peer review: another team’s engineer reviews the readiness doc
  • SRE-led: dedicated SRE or platform team reviews
  • Hybrid: all of the above for high-risk services

For genuinely critical launches, peer or SRE review is worth the time. For minor features, self-attestation is fine.

The “graduated PRR” model

Different services need different rigor. Tier them:

  • Tier 1 (critical): full PRR, multiple reviewers, signed off by VP
  • Tier 2 (important): standard PRR, peer review
  • Tier 3 (low risk): self-attestation

Tiering balances ceremony with velocity: the heavyweight process applies only where failure is expensive.
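If your launch tooling gates merges or deploys, the tier policy is worth encoding rather than leaving in a wiki. A sketch mapping the tiers above to their requirements (the requirement names are shorthand for this guide's process, not any tool's API):

```python
# Sketch: encoding the graduated PRR policy above. Requirement names are shorthand.
PRR_REQUIREMENTS = {
    1: {"full_prr", "multiple_reviewers", "vp_signoff"},   # critical
    2: {"standard_prr", "peer_review"},                    # important
    3: {"self_attestation"},                               # low risk
}

def requirements_for(tier: int) -> set:
    """Look up what a launch at this tier must complete before shipping."""
    return PRR_REQUIREMENTS[tier]

print(sorted(requirements_for(2)))  # ['peer_review', 'standard_prr']
```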

Common gaps

  • Monitoring exists but no SLOs are defined
  • Alerts fire but no runbook tells the responder what to do
  • Rollback “plan” is “redeploy the old version” without a tested procedure
  • Canary rollout assumes traffic shaping that does not actually exist
  • Capacity assumed; no load test performed

The post-launch review

Thirty days after launch, revisit the service and ask:

  • How accurate were our capacity estimates?
  • What alerts fired? Were they actionable?
  • Did we have any incidents we did not anticipate?
  • Are runbooks accurate?

Update the PRR template based on what you learn. Each post-launch review improves the next.

Frequently Asked Questions

Does PRR slow launches?

Done well, no — most of the work is what you should be doing anyway. Done poorly, yes — gates without value frustrate teams.

What if leadership wants to skip PRR for speed?

Document the risks being accepted. Sometimes the right answer is “skip and accept the risk.” Sometimes the right answer is “no — launching without monitoring will hurt more than waiting two days.”

How does PRR differ from a design review?

Design review: is this the right approach? PRR: is this ready for production? Different stages, different questions.
