SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets are the quantitative framework for reliability engineering. An SLI measures a specific aspect of service behavior (latency, availability, error rate). An SLO sets a target for the SLI (99.9% of requests complete in under 200ms). The error budget is the allowed failure margin (0.1% of requests may exceed 200ms). This framework aligns engineering decisions with reliability goals and creates data-driven conversations about risk.
Choosing SLIs
Good SLIs measure what users experience, not internal server metrics. For a request-response service: availability SLI = successful requests / total requests; latency SLI = proportion of requests completed under threshold (e.g., 200ms at p99); error rate SLI = error responses / total requests. For a data pipeline: freshness SLI = proportion of time the output dataset is within N minutes of the input; completeness SLI = rows successfully processed / rows received. Avoid vanity metrics (CPU utilization, memory usage) that don’t directly reflect user experience.
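The request-response SLIs above reduce to simple ratios over a window of request records. A minimal sketch (the `Request` record and field names are illustrative, not from any particular monitoring system):

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed end-to-end latency

def availability_sli(requests):
    """Availability SLI: successful (non-5xx) requests / total requests."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)

def latency_sli(requests, threshold_ms=200.0):
    """Latency SLI: proportion of requests completed under the threshold."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)
```

Note that both SLIs are proportions of good events over total events; this "good/total" shape is what makes error-budget arithmetic straightforward later.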
Setting SLO Targets
SLO targets are set from user expectations and measurement of actual behavior. Start with a retrospective: measure your current SLI over the past 30 days. Set the initial SLO slightly below current performance: if you're achieving 99.95% availability, set the SLO at 99.9%. This gives an error budget without requiring perfection. Avoid aspirational SLOs that the current architecture cannot meet; they burn the error budget immediately and create alert fatigue. SLOs should not be maximized, either: a 100% availability SLO leaves no error budget, prevents any risk-taking, and makes reliability the only engineering priority.
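The "slightly below current performance" rule can be mechanized by picking the tightest standard target your measured SLI already clears. A sketch under that assumption (the candidate list is illustrative):

```python
def suggest_slo(historical_sli, candidates=(0.999, 0.995, 0.99, 0.95)):
    """Pick the tightest standard target strictly below measured performance,
    so the service starts with a nonzero, usable error budget."""
    for target in sorted(candidates, reverse=True):
        if historical_sli > target:
            return target
    # Performance is below every candidate: fix reliability before setting an SLO.
    return min(candidates)
```

For the example in the text, a measured 99.95% availability maps to a 99.9% SLO.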
Error Budget Calculation
Error budget = 1 – SLO. For a 99.9% availability SLO over 30 days: error budget = 0.1% of 30 days = 43.2 minutes of downtime or 0.1% of requests failing. Track budget consumption in real time. When 50% of the error budget is consumed in the first 15 days of the month, the service is on track to exhaust its budget — this is a signal to slow feature releases and focus on reliability. When the budget is exhausted, stop all feature work and treat reliability as the only priority until the budget resets at the next period.
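The arithmetic above is small enough to sketch directly; this reproduces the 43.2-minute figure for a 99.9% SLO over a 30-day window:

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day SLO window = 43,200 minutes

def budget_minutes(slo):
    """Allowed downtime in the window: (1 - SLO) * window length."""
    return (1 - slo) * WINDOW_MINUTES

def budget_consumed(downtime_minutes, slo):
    """Fraction of the error budget already spent (can exceed 1.0)."""
    return downtime_minutes / budget_minutes(slo)
```

Tracking `budget_consumed` against the fraction of the window elapsed gives the "on track to exhaust" signal: 50% consumed at 50% elapsed is exactly on pace.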
Measuring SLIs Correctly
Measure SLIs from the user's perspective, not from internal metrics. Options: server-side metrics (HTTP status codes from the load balancer, which miss client-side failures), synthetic monitoring (scripted user flows that exercise the service end-to-end at regular intervals from external locations), and real user monitoring (RUM, where JavaScript in the browser measures actual user-experienced latency). Load balancer metrics are the most common starting point; add synthetic monitoring for external health checks and RUM for frontend services. The measurement point matters: measure at the API gateway or load balancer, not at the individual service instance, or you will miss failures between the user and your service.
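A synthetic check is just a scripted probe run on a schedule, with its outcome recorded in the same good/total shape as any other SLI. A minimal sketch, with the probe injected as a callable (hypothetical; in production this would be an HTTP request against the real user flow from an external location):

```python
import time

def synthetic_check(probe, timeout_s=2.0):
    """Run one scripted end-to-end check and record an SLI-ready result.

    `probe` is any callable exercising the user flow; it returns True on
    success. Exceptions and slow responses both count as failures, since
    a user would experience either as an outage.
    """
    start = time.monotonic()
    try:
        ok = bool(probe())
    except Exception:
        ok = False
    latency_s = time.monotonic() - start
    return {"success": ok and latency_s <= timeout_s, "latency_s": latency_s}
```

Feeding these results into the availability SLI gives an external view that load balancer metrics alone cannot provide.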
Multi-Window Alerting
Page on burn rate, not on threshold violations. Burn rate = observed error rate / budgeted error rate (1 − SLO). A burn rate of 1 means you're consuming the budget at exactly the rate that would exhaust it at the end of the SLO window. A burn rate of 14.4 means you're consuming budget 14.4x faster, exhausting a 30-day budget in about 2 days. Alert when burn rate is high for a sustained period: page if burn rate > 14.4 for 1 hour (exhausts the budget in ~2 days) AND burn rate > 6 for 6 hours (exhausts it in 5 days). The dual-window approach reduces false positives from short spikes while still catching sustained issues quickly.
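The dual-window condition is a two-line predicate once burn rate is defined. A sketch with the thresholds from the text (window error rates would come from your metrics system):

```python
def burn_rate(error_rate, slo):
    """How many times faster than budgeted the error budget is being spent."""
    return error_rate / (1 - slo)

def should_page(error_rate_1h, error_rate_6h, slo, fast=14.4, slow=6.0):
    """Dual-window alert: both the short and the long window must burn hot.

    The 1h window catches the issue quickly; requiring the 6h window too
    suppresses pages for short spikes that self-resolve.
    """
    return (burn_rate(error_rate_1h, slo) > fast
            and burn_rate(error_rate_6h, slo) > slow)
```

For a 99.9% SLO the budgeted error rate is 0.1%, so a sustained 2% error rate is a burn rate of 20 and pages immediately once both windows confirm it.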
SLOs for Different Service Types
Tier 1 (customer-facing, revenue-impacting): 99.9% availability SLO, p99 latency SLO. Tier 2 (internal tools, developer productivity): 99.5% availability SLO. Tier 3 (batch jobs, analytics): freshness SLO (output available within 1 hour of schedule). Not all services deserve the same investment in reliability. Publishing explicit SLO tiers clarifies what level of reliability each service provides to its callers, so callers can design their own resilience (fallbacks, degraded modes) around their dependency's SLO. A service that claims 99.9% but delivers 95% doesn't just have a reliability problem; its published SLO is wrong, and callers designing against it will under-build their own resilience.
Error Budget Policy
An error budget policy defines how teams respond to budget consumption. Document in writing: when the monthly budget is > 50% consumed in < 50% of the month, the team must hold a reliability review before the next feature release. When the budget is exhausted, all feature work stops and the team focuses exclusively on reliability improvements until the budget resets. When budget is consistently unspent (the service is more reliable than the SLO requires), consider relaxing the SLO or investing the engineering time elsewhere. The policy ensures error budgets drive real decisions, not just dashboards.
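A written policy maps budget state to a required action, which makes it easy to evaluate mechanically. A sketch using the thresholds stated above (both inputs are fractions of the month's budget and the month elapsed):

```python
def policy_action(budget_consumed, month_elapsed):
    """Map error-budget state to the written policy's required response.

    budget_consumed: fraction of the monthly budget spent (may exceed 1.0).
    month_elapsed:   fraction of the month that has passed, in [0, 1].
    """
    if budget_consumed >= 1.0:
        return "freeze feature work; reliability only until the budget resets"
    if budget_consumed > 0.5 and month_elapsed < 0.5:
        return "hold a reliability review before the next feature release"
    return "normal feature velocity"
```

The "consistently unspent budget" case deliberately isn't automated here: relaxing an SLO is a judgment call for the team, not a rule to trigger.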