Question 1

What is an error budget and how does it balance reliability with feature velocity?

Accepted Answer

An error budget is the maximum amount of unreliability the SLO allows. If the SLO is 99.9% availability over 30 days, the error budget is 0.1% of total requests or, measured as time, approximately 43.2 minutes of full downtime (0.1% of 43,200 minutes). The error budget creates a shared framework between product and SRE teams. When the budget is healthy (more than 50% remaining), the team prioritizes feature development and accepts deployment risk. When the budget is depleted, the team freezes feature deployments and focuses exclusively on reliability improvements. This prevents the common conflict where product teams want to ship fast and SRE teams want to slow down for stability. Both teams agree on the SLO upfront, and the error budget is the objective measure of whether the service can afford more risk. Practical policy: above 75% budget remaining -- deploy freely, run chaos experiments, take risks. Between 25% and 75% -- normal deployment cadence with standard review. Below 25% -- require additional review for deployments, postpone risky changes. At 0% -- only reliability fixes and rollbacks are deployed until the budget recovers. With a rolling 30-day window the budget does not reset all at once; it recovers gradually as older errors age out of the window.
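The arithmetic and the tiered policy above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the tier boundaries are simply the ones quoted in the answer.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO allows over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.

    Goes negative once the SLO has been violated for the window.
    """
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def deployment_policy(remaining: float) -> str:
    """Map the remaining budget fraction to the policy tiers from the answer."""
    if remaining <= 0.0:
        return "reliability fixes and rollbacks only"
    if remaining < 0.25:
        return "additional review, postpone risky changes"
    if remaining <= 0.75:
        return "normal cadence with standard review"
    return "deploy freely"


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))        # 43.2
    remaining = budget_remaining(0.999, 1_000_000, 400)  # 400 of ~1000 allowed failures
    print(deployment_policy(remaining))
```

With a 99.9% SLO and one million requests in the window, roughly 1,000 failures are allowed; 400 failures leaves about 60% of the budget, which falls in the normal-cadence tier.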

Question 2

How does SLO-based alerting reduce alert fatigue compared to threshold-based alerting?

Accepted Answer

Threshold-based alerting fires when a metric crosses a static value: alert if error rate > 1% for 5 minutes. Problem: a brief 2% error spike that lasts 6 minutes and recovers is not a threat to the monthly SLO (at steady traffic it consumes only about 0.3% of a 99.9% SLO's 30-day error budget) but fires the alert and pages the on-call engineer. Over time, frequent non-actionable alerts cause alert fatigue -- engineers ignore or auto-acknowledge alerts, and miss real incidents. SLO-based alerting fires based on error budget burn rate -- how quickly the error budget is being consumed relative to the SLO window. Burn rate = (actual error rate) / (maximum tolerated error rate). A burn rate of 1x means the budget will be exactly consumed by the end of the window -- sustainable. A burn rate of 10x means a 30-day budget will be consumed in 3 days -- urgent. Multi-window approach: page the on-call only for a fast burn rate (14.4x over 1 hour), which burns 2% of the 30-day budget in that hour and would exhaust it in about 2 days. Create a ticket for a slow burn rate (6x over 6 hours), which would exhaust the budget in 5 days. This dramatically reduces pages -- the brief error spike reaches 20x only momentarily; averaged over the 1-hour window its burn rate is about 2x (assuming no other errors), well below the paging threshold, so it does not alert.
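The burn-rate formula and the two alerting windows can be sketched as follows. This is an illustrative sketch under stated assumptions: `BurnRule`, `RULES`, and `evaluate` are hypothetical names, the error rates are assumed to be already averaged over each window by the monitoring system, and the thresholds are the ones quoted in the answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    window_hours: int   # lookback window the error rate is averaged over
    threshold: float    # burn rate at or above which the rule fires
    action: str         # "page" or "ticket"


# Thresholds from the answer: fast burn pages, slow burn opens a ticket.
RULES = [BurnRule(1, 14.4, "page"), BurnRule(6, 6.0, "ticket")]


def burn_rate(error_rate: float, slo: float) -> float:
    """Actual error rate divided by the maximum tolerated error rate."""
    return error_rate / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the whole budget is gone if this burn rate is sustained."""
    return window_days / rate


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget burned at this rate for `hours`."""
    return rate * hours / (window_days * 24)


def evaluate(avg_error_rate_by_window: dict, slo: float) -> str:
    """Return the action of the first rule that fires, or "none"."""
    for rule in RULES:
        rate = burn_rate(avg_error_rate_by_window[rule.window_hours], slo)
        if rate >= rule.threshold:
            return rule.action
    return "none"


if __name__ == "__main__":
    # Average error rate of 0.2% over the last hour and 0.1% over six hours.
    print(evaluate({1: 0.002, 6: 0.001}, slo=0.999))  # none
    print(evaluate({1: 0.02, 6: 0.004}, slo=0.999))   # page
```

Production setups usually pair each long window with a shorter confirmation window so the alert clears quickly after recovery; that refinement is omitted here to keep the sketch short.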

Question 3

How do you choose the right SLO target for a service?

Accepted Answer

Choosing the right SLO requires balancing user expectations, engineering cost, and business requirements. Steps: (1) Measure the current SLI for 30 days before setting a target. If the service currently achieves 99.95% availability, setting a 99.99% SLO creates immediate budget pressure. Start at or slightly below the current performance. (2) Consider user expectations. Internal tools can tolerate lower availability (99%) than customer-facing APIs (99.9%+). Payment processing requires higher reliability than a recommendation engine. (3) Consider dependencies. Your SLO cannot meaningfully exceed the SLO of your weakest hard dependency, meaning one on the critical path with no fallback, cache, or redundancy. If your database provider offers 99.95% availability, a service SLO of 99.99% is aspirational at best. (4) Consider the cost of additional nines. As a rough rule of thumb, each additional nine of availability doubles or triples the engineering effort: 99% to 99.9% requires redundancy and failover. 99.9% to 99.99% requires multi-region deployment. 99.99% to 99.999% requires active-active with automatic failover and extensive chaos testing. (5) Set the SLO conservatively and tighten over time. It is much easier to tighten an SLO (users are happy) than to relax one (users are disappointed). Review and adjust quarterly based on actual performance and user feedback.
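The dependency argument in step (3) can be checked with a short calculation. The sketch below assumes hard, serial dependencies that fail independently, so availabilities multiply; real systems with caching, retries, or redundant providers can do better. The function names and the 99.99% figure for the service's own code are hypothetical.

```python
import math


def serial_availability(dependency_availabilities: list) -> float:
    """Combined availability of hard dependencies that must all be up.

    Assumes failures are independent, so the availabilities multiply.
    """
    return math.prod(dependency_availabilities)


def availability_ceiling(own_availability: float, dependency_availabilities: list) -> float:
    """Best availability the service can reach given its own code and its dependencies."""
    return own_availability * serial_availability(dependency_availabilities)


if __name__ == "__main__":
    database = 0.9995   # provider figure from the answer
    own_code = 0.9999   # hypothetical availability of the service itself
    ceiling = availability_ceiling(own_code, [database])
    target = 0.9999
    print(f"ceiling={ceiling:.6f}, target reachable: {ceiling >= target}")
```

Under these assumptions the ceiling is just under 99.95%, so a 99.99% target cannot be met without removing the database from the critical path or adding redundancy.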

Question 4

What is the difference between an SLO and an SLA?

Accepted Answer

An SLO (Service Level Objective) is an internal reliability target set by the engineering team. It defines the desired level of service quality. There are no contractual consequences for missing an SLO -- the consequence is internal: error budget depletion triggers a reliability-focused response (deployment freeze, incident review, engineering investment). An SLA (Service Level Agreement) is an external contract between a service provider and its customers. It specifies the minimum acceptable service level and the consequences (financial penalties, service credits, contract termination) if the provider fails to meet it. The SLA should always be less strict than the SLO. If your internal SLO is 99.95% availability, your SLA might promise 99.9%. This gap is your safety margin -- you can miss your internal target without breaching the customer contract, giving your team time to detect and fix issues. Not every service has an SLA -- free tiers, internal services, and early-stage products typically have SLOs but no contractual SLAs. SLAs are defined by business and legal teams with engineering input on what is technically achievable. Never agree to an SLA without historical SLI data showing the service consistently exceeds the target.
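To make the safety margin concrete, the gap between the example SLO and SLA can be expressed in minutes of downtime. This is a small sketch using the 99.95% and 99.9% figures from the answer over an assumed 30-day window; the function names and status strings are hypothetical.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of full downtime a target permits over the window."""
    return (1.0 - target) * window_days * 24 * 60


def safety_margin_minutes(slo: float, sla: float, window_days: int = 30) -> float:
    """Extra downtime available after missing the SLO but before breaching the SLA."""
    return allowed_downtime_minutes(sla, window_days) - allowed_downtime_minutes(slo, window_days)


def status(measured: float, slo: float, sla: float) -> str:
    """Classify measured availability against the internal and contractual targets."""
    if measured >= slo:
        return "within SLO"
    if measured >= sla:
        return "SLO missed, SLA intact"
    return "SLA breached"


if __name__ == "__main__":
    print(round(safety_margin_minutes(slo=0.9995, sla=0.999), 1))  # 21.6
    print(status(0.9993, slo=0.9995, sla=0.999))
```

Here the SLO allows about 21.6 minutes of downtime and the SLA about 43.2, so the team has roughly 21.6 further minutes to recover after exhausting its internal budget before any contractual penalty applies.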

System Design: SLI, SLO, and SLA Deep Dive — Error Budgets, Reliability Targets, Monitoring, Alerting, Google SRE

SLI: What You Measure

SLO: Your Reliability Target

SLA: The Business Contract

Error Budgets: Making Reliability Decisions

SLO-Based Alerting

Implementing SLOs in Practice