SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets are the quantitative framework for reliability engineering. An SLI measures a specific aspect of service behavior (latency, availability, error rate). An SLO sets a target for the SLI (99.9% of requests complete in under 200ms). The error budget is the allowed failure margin (0.1% of requests may exceed 200ms). This framework aligns engineering decisions with reliability goals and creates data-driven conversations about risk.
Choosing SLIs
Good SLIs measure what users experience, not internal server metrics. For a request-response service: availability SLI = successful requests / total requests; latency SLI = proportion of requests completed under threshold (e.g., 200ms at p99); error rate SLI = error responses / total requests. For a data pipeline: freshness SLI = proportion of time the output dataset is within N minutes of the input; completeness SLI = rows successfully processed / rows received. Avoid vanity metrics (CPU utilization, memory usage) that don’t directly reflect user experience.
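The request-response SLIs above reduce to simple ratios over a window of request records. A minimal sketch (the `Request` record and field names are illustrative, not from any particular monitoring system):

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed end-to-end latency

def availability_sli(requests):
    """Availability SLI: successful (non-5xx) requests / total requests."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)

def latency_sli(requests, threshold_ms=200.0):
    """Latency SLI: proportion of requests completed under the threshold."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)
```

Note that both SLIs are proportions of good events over total events; this "good/total" shape is what makes error-budget arithmetic straightforward later.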
Setting SLO Targets
SLO targets are set from user expectations and measurement of actual behavior. Start with a retrospective: measure your current SLI over the past 30 days. Set the initial SLO slightly below current performance: if you're achieving 99.95% availability, set the SLO at 99.9%. This gives an error budget without requiring perfection. Avoid aspirational SLOs that the current architecture cannot meet; they burn the error budget immediately and create alert fatigue. SLOs should not be maximized, either: a 100% availability SLO leaves no error budget, prevents any risk-taking, and makes reliability the only engineering priority.
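The "slightly below current performance" rule can be mechanized by picking the tightest standard target your measured SLI already clears. A sketch under that assumption (the candidate list is illustrative):

```python
def suggest_slo(historical_sli, candidates=(0.999, 0.995, 0.99, 0.95)):
    """Pick the tightest standard target strictly below measured performance,
    so the service starts with a nonzero, usable error budget."""
    for target in sorted(candidates, reverse=True):
        if historical_sli > target:
            return target
    # Performance is below every candidate: fix reliability before setting an SLO.
    return min(candidates)
```

For the example in the text, a measured 99.95% availability maps to a 99.9% SLO.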
Error Budget Calculation
Error budget = 1 – SLO. For a 99.9% availability SLO over 30 days: error budget = 0.1% of 30 days = 43.2 minutes of downtime or 0.1% of requests failing. Track budget consumption in real time. When 50% of the error budget is consumed in the first 15 days of the month, the service is on track to exhaust its budget — this is a signal to slow feature releases and focus on reliability. When the budget is exhausted, stop all feature work and treat reliability as the only priority until the budget resets at the next period.
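The arithmetic above is small enough to sketch directly; this reproduces the 43.2-minute figure for a 99.9% SLO over a 30-day window:

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day SLO window = 43,200 minutes

def budget_minutes(slo):
    """Allowed downtime in the window: (1 - SLO) * window length."""
    return (1 - slo) * WINDOW_MINUTES

def budget_consumed(downtime_minutes, slo):
    """Fraction of the error budget already spent (can exceed 1.0)."""
    return downtime_minutes / budget_minutes(slo)
```

Tracking `budget_consumed` against the fraction of the window elapsed gives the "on track to exhaust" signal: 50% consumed at 50% elapsed is exactly on pace.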
Measuring SLIs Correctly
Measure SLIs from the user's perspective, not from internal metrics. Options: server-side metrics (HTTP status codes from the load balancer, which miss client-side failures), synthetic monitoring (scripted user flows that exercise the service end-to-end at regular intervals from external locations), and real user monitoring (RUM, where JavaScript in the browser measures actual user-experienced latency). Load balancer metrics are the most common starting point; add synthetic monitoring for external health checks and RUM for frontend services. The measurement point matters: measure at the API gateway or load balancer, not at the individual service instance, or you will miss failures between the user and your service.
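A synthetic check is just a scripted probe run on a schedule, with its outcome recorded in the same good/total shape as any other SLI. A minimal sketch, with the probe injected as a callable (hypothetical; in production this would be an HTTP request against the real user flow from an external location):

```python
import time

def synthetic_check(probe, timeout_s=2.0):
    """Run one scripted end-to-end check and record an SLI-ready result.

    `probe` is any callable exercising the user flow; it returns True on
    success. Exceptions and slow responses both count as failures, since
    a user would experience either as an outage.
    """
    start = time.monotonic()
    try:
        ok = bool(probe())
    except Exception:
        ok = False
    latency_s = time.monotonic() - start
    return {"success": ok and latency_s <= timeout_s, "latency_s": latency_s}
```

Feeding these results into the availability SLI gives an external view that load balancer metrics alone cannot provide.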
Multi-Window Alerting
Page on burn rate, not on threshold violations. Burn rate = observed error rate / budgeted error rate (1 − SLO). A burn rate of 1 means you're consuming the budget at exactly the rate that would exhaust it at the end of the SLO window. A burn rate of 14.4 means you're consuming budget 14.4x faster, exhausting a 30-day budget in about 2 days. Alert when burn rate is high for a sustained period: page if burn rate > 14.4 for 1 hour (exhausts the budget in ~2 days) AND burn rate > 6 for 6 hours (exhausts it in 5 days). The dual-window approach reduces false positives from short spikes while still catching sustained issues quickly.
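The dual-window condition is a two-line predicate once burn rate is defined. A sketch with the thresholds from the text (window error rates would come from your metrics system):

```python
def burn_rate(error_rate, slo):
    """How many times faster than budgeted the error budget is being spent."""
    return error_rate / (1 - slo)

def should_page(error_rate_1h, error_rate_6h, slo, fast=14.4, slow=6.0):
    """Dual-window alert: both the short and the long window must burn hot.

    The 1h window catches the issue quickly; requiring the 6h window too
    suppresses pages for short spikes that self-resolve.
    """
    return (burn_rate(error_rate_1h, slo) > fast
            and burn_rate(error_rate_6h, slo) > slow)
```

For a 99.9% SLO the budgeted error rate is 0.1%, so a sustained 2% error rate is a burn rate of 20 and pages immediately once both windows confirm it.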
SLOs for Different Service Types
Tier 1 (customer-facing, revenue-impacting): 99.9% availability SLO, p99 latency SLO. Tier 2 (internal tools, developer productivity): 99.5% availability SLO. Tier 3 (batch jobs, analytics): freshness SLO (output available within 1 hour of schedule). Not all services deserve the same investment in reliability. Publishing explicit SLO tiers clarifies what level of reliability each service provides to its callers, so callers can design their own resilience (fallbacks, degraded modes) around their dependency's SLO. A service that claims 99.9% but delivers 95% doesn't just have a reliability problem; its published SLO is wrong, and callers designing against it will under-build their own resilience.
Error Budget Policy
An error budget policy defines how teams respond to budget consumption. Document in writing: when the monthly budget is > 50% consumed in < 50% of the month, the team must hold a reliability review before the next feature release. When the budget is exhausted, all feature work stops and the team focuses exclusively on reliability improvements until the budget resets. When budget is consistently unspent (the service is more reliable than the SLO requires), consider relaxing the SLO or investing the engineering time elsewhere. The policy ensures error budgets drive real decisions, not just dashboards.
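A written policy maps budget state to a required action, which makes it easy to evaluate mechanically. A sketch using the thresholds stated above (both inputs are fractions of the month's budget and the month elapsed):

```python
def policy_action(budget_consumed, month_elapsed):
    """Map error-budget state to the written policy's required response.

    budget_consumed: fraction of the monthly budget spent (may exceed 1.0).
    month_elapsed:   fraction of the month that has passed, in [0, 1].
    """
    if budget_consumed >= 1.0:
        return "freeze feature work; reliability only until the budget resets"
    if budget_consumed > 0.5 and month_elapsed < 0.5:
        return "hold a reliability review before the next feature release"
    return "normal feature velocity"
```

The "consistently unspent budget" case deliberately isn't automated here: relaxing an SLO is a judgment call for the team, not a rule to trigger.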