Low Level Design: SLO, SLA, SLI, and Error Budget

Site Reliability Engineering (SRE) formalizes reliability using three related concepts: Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs). Error budgets quantify how much unreliability is acceptable and drive decisions about feature velocity versus reliability investment. Together, these concepts replace vague “five nines” aspirations with measurable, actionable reliability targets.

Service Level Indicator (SLI)

An SLI is a quantitative measure of a service characteristic, usually expressed as a ratio of good events to total events. Common SLIs: availability (successful HTTP requests / total requests), latency (requests completing under 200ms / total requests), and error rate (errors / total requests, the complement of availability). Throughput (processed messages per second) is also used as an SLI, although it is a rate rather than a good-to-total ratio. SLIs must be measurable from data the service already emits (metrics, logs). Choose SLIs that capture what users actually experience, not just server-side metrics.
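The good-events-over-total-events ratio can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `Request` record, the function names, and the choice to count only 5xx responses as failures are assumptions made for the example, and the 200ms threshold is taken from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are assumptions."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def availability_sli(requests):
    """Fraction of requests that did not fail with a server error (5xx).

    Assumption: 4xx responses count as 'good' because the service
    behaved correctly; a real SLI definition may differ.
    """
    if not requests:
        return None  # no traffic: the SLI is undefined, not 100%
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=200.0):
    """Fraction of requests that completed in under threshold_ms."""
    if not requests:
        return None
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 340.0),
    Request(503, 95.0),
    Request(404, 80.0),
    Request(200, 250.0),
]
availability = availability_sli(sample)  # 4 good out of 5
latency = latency_sli(sample)            # 3 good out of 5
```

Returning `None` for an empty window is a deliberate choice in this sketch: reporting 100% when there was no traffic would hide gaps in measurement.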

Service Level Objective (SLO)

An SLO is a target value for an SLI over a time window: “99.9% of requests complete successfully over a rolling 30-day window.” The SLO defines the reliability target and is the internal commitment the team makes to itself. SLOs should be tight enough to ensure users are happy but not so tight that meeting them requires heroics. Start by measuring current reliability, understand what customers actually need, and set the SLO to match the real customer requirement, not an aspirational number.
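A rolling-window SLO check might look like the sketch below. It assumes the SLI is aggregated into daily (good, total) counts; the `RollingSLO` class and its method names are hypothetical, and the example uses a 3-day window only to keep the numbers short.

```python
from collections import deque


class RollingSLO:
    """Tracks an SLI over the most recent window_days daily buckets."""

    def __init__(self, target, window_days=30):
        self.target = target
        # deque with maxlen drops the oldest day automatically
        self.days = deque(maxlen=window_days)

    def record_day(self, good, total):
        self.days.append((good, total))

    def attainment(self):
        """SLI value over the current window, or None without traffic."""
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return good / total if total else None

    def is_met(self):
        value = self.attainment()
        return value is not None and value >= self.target


slo = RollingSLO(target=0.999, window_days=3)
for good, total in [(9990, 10000), (10000, 10000), (9995, 10000)]:
    slo.record_day(good, total)
met_before = slo.is_met()     # 29985 / 30000 = 0.9995, target met

slo.record_day(9950, 10000)   # a bad day pushes the oldest day out
met_after = slo.is_met()      # 29945 / 30000 is below 0.999
```

Summing counts across the window, rather than averaging daily percentages, weights each request equally, so a low-traffic day cannot dominate the result.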

Service Level Agreement (SLA)

An SLA is a contract with customers that specifies the minimum reliability level and the consequences of missing it (service credits, refunds). SLAs are typically less strict than SLOs: set the SLA at 99.5% if the internal SLO is 99.9%, providing a buffer for measurement disagreements, brief incidents, and planned maintenance. The SLA is the external promise; the SLO is the internal target that provides confidence of meeting the SLA.

Error Budget

An error budget is the amount of unreliability the SLO permits. For a 99.9% SLO over 30 days: error budget = 30 days * (1 – 0.999) = 0.03 days = 43.2 minutes of downtime (or the equivalent share of failed requests, 0.1% of the total). When the error budget is largely unspent, the team can deploy risky changes freely. As the budget is consumed by incidents and deployments, the team slows feature velocity to protect reliability. When the budget is exhausted, no further risky changes deploy until the budget recovers: when a calendar window resets, or, with a rolling window, when old failures age out.
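The arithmetic above can be checked with a short sketch. The function names are illustrative assumptions; the formulas are the ones stated in the text (window length or request count multiplied by 1 minus the SLO).

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime the SLO permits over the window."""
    return window_days * 24 * 60 * (1 - slo)


def error_budget_requests(slo, total_requests):
    """Number of failed requests the SLO permits for a given volume."""
    return total_requests * (1 - slo)


def budget_remaining(slo, good, total):
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    allowed = error_budget_requests(slo, total)
    if allowed <= 0:
        raise ValueError("a 100% SLO (or zero traffic) leaves no budget")
    failed = total - good
    return 1 - failed / allowed


print(round(error_budget_minutes(0.999), 1))  # 43.2, as in the text
remaining = budget_remaining(0.999, good=999_500, total=1_000_000)
# 500 failures against roughly 1000 allowed: about half the budget left
```

The `ValueError` branch makes one consequence explicit: a 100% SLO has an error budget of zero, so any single failure violates it.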

Error Budget Policy

Document explicit policies for error budget consumption: if more than 50% of the budget is consumed in the first half of the window, halt all non-critical deployments; if 100% is consumed, freeze all deployments for the remainder of the window; if the budget is exhausted in three consecutive windows, declare a reliability crisis and dedicate the engineering team's sprint exclusively to reliability improvements. The policy takes ad hoc judgment out of deployment decisions made under budget pressure: teams follow the policy, not manager pressure.
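One way to make such a policy mechanical is to encode it as a pure function. The sketch below is a literal reading of the three rules above; the `Action` enum, the function name, and the parameter names are assumptions for illustration, and a real policy would likely cover cases the text leaves open (for example, heavy consumption in the second half of the window).

```python
from enum import Enum


class Action(Enum):
    NORMAL = "deploy normally"
    HALT_NON_CRITICAL = "halt non-critical deployments"
    FREEZE = "freeze all deployments"
    RELIABILITY_SPRINT = "dedicate the sprint to reliability"


def policy_action(budget_consumed, window_elapsed, prior_exhausted_windows=0):
    """Map the budget state to the action the policy prescribes.

    budget_consumed: fraction of the budget spent so far (1.0 = all of it)
    window_elapsed: fraction of the window that has passed (0.5 = halfway)
    prior_exhausted_windows: how many immediately preceding windows
        in a row ended with the budget exhausted
    """
    exhausted = budget_consumed >= 1.0
    # Current window plus two prior ones makes three consecutive windows.
    if exhausted and prior_exhausted_windows >= 2:
        return Action.RELIABILITY_SPRINT
    if exhausted:
        return Action.FREEZE
    if budget_consumed > 0.5 and window_elapsed <= 0.5:
        return Action.HALT_NON_CRITICAL
    return Action.NORMAL


early_overspend = policy_action(budget_consumed=0.6, window_elapsed=0.4)
spent = policy_action(budget_consumed=1.0, window_elapsed=0.8)
third_miss = policy_action(1.0, 0.9, prior_exhausted_windows=2)
```

Keeping the function free of side effects means the same rules can back a dashboard, a deployment gate, or a unit test without modification.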

SLO Measurement and Alerting

Measure SLIs from production traffic. Alert on error budget burn rate, not on the raw error rate. The sustainable rate is the one that spends exactly 100% of the budget over the full window; burning faster exhausts the budget early. For example, consuming 5% of the budget per hour will exhaust the 30-day budget in 20 hours, which is 36x the sustainable rate (720 hours / 20 hours). Alert when the burn rate exceeds a threshold (e.g., 2x the sustainable rate) for a sustained period (e.g., 5 minutes). This gives teams early warning to investigate before the budget is exhausted. Avoid alerting on instantaneous SLI dips: brief spikes are expected and don't meaningfully impact the window.
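Burn rate can be computed as the observed error rate divided by the error rate the SLO allows (1 minus the SLO). The sketch below is a simplified single-window check under that definition; the function names and the idea of passing one error-rate sample per minute are assumptions, and production alerting often combines several windows and thresholds.

```python
def burn_rate(error_rate, slo):
    """How many times faster than sustainable the budget is being spent.

    A value of 1.0 spends exactly the whole budget over the full window.
    """
    allowed = 1 - slo
    if allowed <= 0:
        raise ValueError("a 100% SLO has no error budget to burn")
    return error_rate / allowed


def hours_to_exhaustion(rate, window_days=30):
    """Hours until a full budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def should_alert(error_rates, slo, threshold=2.0):
    """Alert only if every sample in the sustained period is too high.

    error_rates: recent error-rate samples, e.g. one per minute for
    the last 5 minutes (the sampling scheme is an assumption).
    """
    return bool(error_rates) and all(
        burn_rate(e, slo) > threshold for e in error_rates
    )


# 5% of budget per hour over a 720-hour window is a burn rate of 36.
print(hours_to_exhaustion(36))  # 20.0

sustained = should_alert([0.003] * 5, slo=0.999)
brief_spike = should_alert([0.0005, 0.0005, 0.02, 0.0005, 0.0005], slo=0.999)
```

Requiring every sample to exceed the threshold is what suppresses the brief spike in the second call: a single bad minute does not page anyone, while five bad minutes in a row do.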

Setting Useful SLOs

Bad SLOs: aspirational (99.999% for a service that historically achieves 99.5%), arbitrary (five nines because it sounds good), unmeasured (no data to verify). Good SLOs: derived from user pain (run experiments to find the latency threshold where users abandon; set the SLO slightly below it), achievable with current architecture, measurable with existing data, customer-facing (not internal proxy metrics). Review SLOs quarterly: tighten when the service has headroom, relax when the SLO is causing operational toil without user benefit.
