Production Readiness Reviews: An EM’s Guide

A production readiness review (PRR) is a structured assessment of a service's operational fitness, run before launching a new service or major feature. The discipline catches the boring-but-fatal issues that a launch dashboard does not: monitoring, runbooks, on-call coverage, capacity planning, security review. Senior EM interviews increasingly probe whether you have run real PRRs.

What a PRR covers

The standard categories:

  1. Monitoring and alerting
  2. Logging and tracing
  3. Capacity and scaling
  4. Failure modes and recovery
  5. Security and compliance
  6. Operational ownership
  7. Documentation and runbooks
  8. Rollout strategy

The checklist

Monitoring

  • Latency dashboards (p50, p99, p99.9)
  • Error rate dashboards
  • Saturation and capacity dashboards
  • SLOs defined for the service
  • Alerts paging the right people for SLO violations
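The SLO items above come down to error-budget math: a 99.9% SLO over a window of one million requests allows 1,000 failures, and alerts should fire as that budget burns down. A minimal sketch of the arithmetic (the numbers and function name are illustrative, not from any particular monitoring tool):

```python
# Sketch: the error-budget math behind an SLO alert. Numbers are illustrative.

def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the window's error budget still unspent."""
    budget = (1 - slo_target) * total_requests      # allowed failures this window
    if budget == 0:
        return 0.0
    return max(0.0, 1 - failed_requests / budget)

# A 99.9% SLO over 1M requests allows 1,000 failures; 400 spent leaves 60%.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"{remaining:.0%} of error budget left")
```

Pages should key off the burn rate of this budget, not raw error counts, so that the people on the hook for the SLO are the ones woken up.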

Logging and tracing

  • Structured logs at appropriate levels
  • Request IDs propagated end-to-end
  • Distributed tracing instrumentation
  • Sensitive data scrubbing in logs
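Three of these items (structured logs, request-ID propagation, scrubbing) can be sketched in one helper. The field names and redaction list below are illustrative assumptions, not a standard:

```python
# Sketch: structured log line with a propagated request ID and field scrubbing.
# SENSITIVE_KEYS and the field names are illustrative, not a standard schema.
import json

SENSITIVE_KEYS = {"email", "ssn", "auth_token"}

def log_event(request_id: str, event: str, **fields) -> str:
    """Emit one JSON log line, redacting known-sensitive fields."""
    scrubbed = {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v)
                for k, v in fields.items()}
    record = {"request_id": request_id, "event": event, **scrubbed}
    return json.dumps(record)

line = log_event("req-42", "checkout.failed", user="u123", email="a@b.com")
print(line)  # email appears as "[REDACTED]"; request_id ties the line to a trace
```

The same request ID should appear in every log line and trace span for the request, which is what makes end-to-end debugging possible.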

Capacity

  • Load testing performed
  • Headroom over expected peak (typically 2x)
  • Auto-scaling configured
  • Cost projection
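The headroom check is a single comparison worth making explicit in the readiness doc: did load testing actually demonstrate the target multiple of expected peak? A sketch with made-up traffic numbers:

```python
# Sketch: the 2x-headroom check from the list above. Traffic numbers are made up.

def has_headroom(peak_rps: float, load_tested_rps: float, factor: float = 2.0) -> bool:
    """True if load tests demonstrated `factor`x the expected peak throughput."""
    return load_tested_rps >= factor * peak_rps

print(has_headroom(peak_rps=1_200, load_tested_rps=2_000))  # False: only ~1.7x
print(has_headroom(peak_rps=1_200, load_tested_rps=2_600))  # True
```

The point is that "capacity" in the checklist means a demonstrated number from a load test, not an assumption.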

Failure modes

  • Documented for each dependency: what happens when it fails?
  • Circuit breakers, timeouts, retries set sensibly
  • Dead-letter queues for async work
  • Rollback procedure tested
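"Set sensibly" for timeouts, retries, and circuit breakers is easier to review when the policy is written down. A minimal sketch of one pattern, with thresholds that are illustrative and should be tuned per dependency:

```python
# Sketch: capped retries plus a simple circuit breaker for one dependency.
# Thresholds and backoff values are illustrative; tune them per dependency.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at = None    # half-open: let one probe request through
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

def call_with_retries(breaker: CircuitBreaker, fn, attempts: int = 3):
    for i in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open; failing fast")
        try:
            result = fn()
            breaker.record(True)
            return result
        except Exception:
            breaker.record(False)
            time.sleep(min(2 ** i * 0.1, 1.0))   # capped exponential backoff
    raise RuntimeError("dependency unavailable after retries")
```

The PRR question for each dependency is then concrete: what are the timeout, the retry cap, and the breaker threshold, and what does the caller do when the circuit is open?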

Security

  • Threat model exists
  • Auth and authz reviewed
  • Secrets managed properly
  • Dependencies scanned
  • Compliance requirements (SOC2, HIPAA, etc.) addressed

Ownership

  • On-call rotation defined
  • Escalation policy defined in the paging tool
  • Team understands the service well enough to debug
  • Service has an owner team in your service catalog

Documentation

  • Architecture overview
  • Operational runbook (common alerts and what to do)
  • Incident response playbook
  • Customer-facing docs

Rollout

  • Feature flag controls launch
  • Canary or percentage-based rollout plan
  • Rollback plan
  • Communication plan if user-visible
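A percentage-based rollout needs deterministic bucketing so a user does not flip between the old and new experience on every request. One common sketch, hashing user ID and flag name together (the flag name and ramp schedule below are illustrative):

```python
# Sketch: deterministic percentage rollout via hashing. Flag name is illustrative.
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Stable bucketing: the same user always lands in the same bucket per flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Ramp: 1% -> 10% -> 50% -> 100%, watching dashboards between each step.
print(in_rollout("user-123", "new-checkout", 100))  # True: 100% includes everyone
```

Hashing flag name and user ID together also means different flags ramp to different user populations, so one cohort of users is not always the guinea pig.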

Who runs the PRR

Common patterns:

  • Self-service: the team fills out the checklist; the EM signs off
  • Peer review: another team’s engineer reviews the readiness doc
  • SRE-led: dedicated SRE or platform team reviews
  • Hybrid: all of the above for high-risk services

For genuinely critical launches, peer or SRE review is worth the time. For minor features, self-attestation is fine.

The “graduated PRR” model

Different services need different rigor. Tier them:

  • Tier 1 (critical): full PRR, multiple reviewers, signed off by VP
  • Tier 2 (important): standard PRR, peer review
  • Tier 3 (low risk): self-attestation

Tiering balances ceremony with velocity: the heavyweight process applies only where failure is expensive.
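If your launch tooling gates merges or deploys, the tier policy is worth encoding rather than leaving in a wiki. A sketch mapping the tiers above to their requirements (the requirement names are shorthand for this guide's process, not any tool's API):

```python
# Sketch: encoding the graduated PRR policy above. Requirement names are shorthand.
PRR_REQUIREMENTS = {
    1: {"full_prr", "multiple_reviewers", "vp_signoff"},   # critical
    2: {"standard_prr", "peer_review"},                    # important
    3: {"self_attestation"},                               # low risk
}

def requirements_for(tier: int) -> set:
    """Look up what a launch at this tier must complete before shipping."""
    return PRR_REQUIREMENTS[tier]

print(sorted(requirements_for(2)))  # ['peer_review', 'standard_prr']
```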

Common gaps

  • Monitoring exists but no SLOs are defined
  • Alerts fire but no runbook tells the responder what to do
  • Rollback “plan” is “redeploy the old version” without a tested procedure
  • Canary rollout assumes traffic shaping that does not actually exist
  • Capacity assumed; no load test performed

The post-launch review

Thirty days after launch, revisit the service and ask:

  • How accurate were our capacity estimates?
  • What alerts fired? Were they actionable?
  • Did we have any incidents we did not anticipate?
  • Are runbooks accurate?

Update the PRR template based on what you learn. Each post-launch review improves the next.

Frequently Asked Questions

Does PRR slow launches?

Done well, no — most of the work is what you should be doing anyway. Done poorly, yes — gates without value frustrate teams.

What if leadership wants to skip PRR for speed?

Document the risks being accepted. Sometimes the right answer is “skip and accept the risk.” Sometimes the right answer is “no — launching without monitoring will hurt more than waiting two days.”

How does PRR differ from a design review?

Design review: is this the right approach? PRR: is this ready for production? Different stages, different questions.
