System Design: SLI, SLO, and SLA Deep Dive — Error Budgets, Reliability Targets, Monitoring, Alerting, Google SRE

Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) form the foundation of modern reliability engineering. Google SRE pioneered this framework, and it has become the standard approach for managing reliability at scale. This guide covers how to define, measure, and operationalize SLIs/SLOs/SLAs — essential knowledge for SRE interviews and production operations.

SLI: What You Measure

A Service Level Indicator is a quantitative measure of a specific aspect of service quality. A good SLI is usually expressed as the ratio of good events to total events, reported as a percentage: (good events / total events) * 100%.

Common SLIs:

(1) Availability: the proportion of requests that succeed. Good events: HTTP responses whose status code is not 5xx. Total events: all HTTP requests. Availability = (total_requests - 5xx_errors) / total_requests.
(2) Latency: the proportion of requests faster than a threshold. Good events: requests with latency < 200ms. Latency SLI = requests_under_200ms / total_requests.
(3) Throughput: requests served per second, measured at the load balancer. Unlike the other indicators here, this is a rate rather than a good/total ratio.
(4) Correctness: the proportion of responses that return the correct result (harder to measure, requires application-specific validation).
(5) Freshness: the proportion of data reads that return data updated within the last N seconds. Important for eventually consistent systems.

Choose SLIs that reflect the user experience, not internal system metrics. CPU utilization is not an SLI because users do not care about CPU; they care about whether the page loaded fast and correctly.
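The availability and latency ratios above can be computed directly from per-request records. The following is a minimal sketch, not a production implementation: the `RequestSample` type, its field names, and the sample data are illustrative assumptions, and a real system would usually aggregate counters at the load balancer instead of keeping every request in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    """One observed request. Type and field names are hypothetical."""
    status_code: int
    latency_ms: float


def availability_sli(samples: list[RequestSample]) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if not samples:
        raise ValueError("no requests observed; the SLI is undefined")
    good = sum(1 for s in samples if not 500 <= s.status_code <= 599)
    return good / len(samples)


def latency_sli(samples: list[RequestSample], threshold_ms: float = 200.0) -> float:
    """Fraction of requests that completed faster than the threshold."""
    if not samples:
        raise ValueError("no requests observed; the SLI is undefined")
    good = sum(1 for s in samples if s.latency_ms < threshold_ms)
    return good / len(samples)


# Made-up sample data for illustration only.
samples = [
    RequestSample(200, 120.0),
    RequestSample(200, 250.0),  # succeeds, but slower than 200 ms
    RequestSample(503, 90.0),   # fast, but a server error
    RequestSample(404, 30.0),   # a client error still counts as "good" here
]
print(f"availability SLI: {availability_sli(samples):.2%}")
print(f"latency SLI:      {latency_sli(samples):.2%}")
```

Note that the two indicators disagree about which requests are bad, which is why most services track both.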

SLO: Your Reliability Target

A Service Level Objective is a target value for an SLI over a time window. Example: 99.9% of requests will have latency under 200ms, measured over a 30-day rolling window.

Choosing the right SLO:

(1) Start with user expectations. An internal tool can tolerate 99% availability (7.2 hours of downtime per 30 days). A payment API needs 99.99% (4.3 minutes per 30 days).
(2) Consider dependencies. Your SLO cannot realistically exceed the SLOs of the dependencies you need on every request. If your database is 99.95% available and every request hits it, your service cannot be 99.99% available unless you add mitigation such as caching or redundancy.
(3) Be conservative initially. Set a lower SLO and tighten it later. It is much harder to relax an SLO (users have already formed expectations) than to tighten one.

SLO time windows: a 30-day rolling window is the most common. Calendar-month windows create end-of-month pressure. Shorter windows (7 days) are more sensitive to brief incidents. Google's SRE guidance suggests a rolling window of about four weeks as a general-purpose interval for operational SLOs, complemented by quarterly summaries for planning and executive reporting.
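The downtime figures and the dependency limit above follow from simple arithmetic. The sketch below is illustrative only: the function names are made up for this example, `allowed_downtime` treats every failure as total downtime, and `serial_availability` assumes each dependency is required on every request and fails independently, which real systems only approximate.

```python
from datetime import timedelta
from math import prod


def allowed_downtime(slo: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Total downtime an availability SLO tolerates over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain of hard dependencies.

    Assumes every component is needed for each request and that
    failures are independent (a simplification).
    """
    return prod(availabilities)


for slo in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime(slo).total_seconds() / 60
    print(f"SLO {slo:.2%}: about {minutes:.1f} minutes per 30 days")

# A service that is itself 99.99% available, calling a 99.95% database.
combined = serial_availability(0.9999, 0.9995)
print(f"service + database: {combined:.4%}")
```

The combined figure lands below both inputs, which is the reason a 99.99% target is out of reach behind a 99.95% hard dependency.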

SLA: The Business Contract

A Service Level Agreement is a contract between a service provider and a customer that specifies consequences for failing to meet the agreed service levels. The SLA should always be less strict than the internal SLO, because you need a buffer. If your internal SLO is 99.95%, your SLA might promise 99.9%. This gives your team time to detect and fix issues before breaching the contractual commitment.

SLA consequences: service credits, contractual penalties, or contract termination rights. Credits are usually tiered: the Amazon S3 SLA, for example, has offered a 10% credit when monthly uptime falls below 99.9% and 25% when it falls below 99.0%. Exact tiers differ by service and change over time, so check the current SLA text. SLA exclusions: planned maintenance windows, force majeure events, customer-caused issues.

Not every service needs an SLA. Internal services and free tiers typically have SLOs but no contractual SLAs. SLAs are negotiated by business teams with engineering input on what is achievable. Never commit to an SLA without data showing the service can consistently exceed the target.

Error Budgets: Making Reliability Decisions

An error budget is the complement of the SLO: if the SLO is 99.9% availability over 30 days, the error budget is the remaining 0.1%, which equals 43.2 minutes of downtime per 30-day window. The error budget is a budget to spend on risk-taking: deployments, experiments, and infrastructure changes.

Error budget policy:

(1) When the error budget is healthy (>50% remaining), prioritize feature velocity. Ship fast, take risks, deploy frequently.
(2) When the error budget is low (<25% remaining), slow down. Require extra review for deployments, postpone risky changes, focus on reliability improvements.
(3) When the error budget is exhausted (0% remaining), freeze all non-reliability deployments until the budget recovers. Only bug fixes and reliability improvements are deployed.

This framework aligns incentives: product teams want to ship features (spend error budget), SRE teams want to maintain reliability (conserve error budget). The error budget is the shared currency. A team that deploys a buggy release and burns 50% of the error budget has measurably reduced its own capacity for future risk-taking.
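The policy thresholds above are easy to encode once the remaining budget is expressed as a fraction. This is a rough sketch under stated assumptions: the function names and return labels are invented for illustration, the budget is counted in request events instead of minutes, and the text gives no guidance for the band between 25% and 50%, so the "monitor" label for that band is an assumption.

```python
def error_budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the window's error budget still unspent.

    1.0 means nothing spent, 0.0 means exhausted, negative means overspent.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the policy tiers in the text."""
    if remaining <= 0.0:
        return "freeze"               # only reliability work ships
    if remaining < 0.25:
        return "slow down"            # extra review, postpone risky changes
    if remaining > 0.50:
        return "prioritize features"  # healthy budget
    return "monitor"                  # 25-50% band: assumed, not in the policy


# 1,000,000 requests at a 99.9% SLO allow roughly 1,000 bad requests.
for bad in (200, 900, 1100):
    remaining = error_budget_remaining(0.999, 1_000_000 - bad, 1_000_000)
    print(f"{bad} bad requests: {remaining:.0%} remaining -> {policy_action(remaining)}")
```

Keeping the thresholds in one function makes the agreed policy explicit and easy to review when the team revisits it.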

SLO-Based Alerting

Traditional alerting fires on static thresholds (CPU > 90%, error rate > 1%). SLO-based alerting fires on error budget burn rate: the rate at which the error budget is being consumed. Burn rate = actual error rate / tolerated error rate. If the SLO allows 0.1% errors over 30 days, the tolerated error rate is 0.1%. If the current error rate is 1%, the burn rate is 10x, and the entire 30-day error budget will be consumed in 3 days.

Multi-window alerting (thresholds adapted from Google's SRE Workbook, for a 30-day SLO):

(1) Page the on-call if the 1-hour burn rate > 14.4x. At this rate 2% of the budget is spent in one hour and the whole budget is gone in about 2 days: an acute incident.
(2) Create a ticket if the 6-hour burn rate > 6x. At this rate 5% of the budget is spent in six hours and the whole budget is gone in 5 days: a slower degradation.
(3) Flag for weekly review if the 3-day burn rate > 1x (budget being consumed faster than it replenishes).

The Workbook itself pages on both the 1-hour and 6-hour conditions and opens a ticket for the 3-day one; adjust the actions to your own on-call tolerance.

Benefits: fewer false alerts (a brief error spike that does not threaten the monthly budget does not page anyone), and alerts that are directly tied to user impact (the error budget represents the acceptable impact on users).
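The burn-rate arithmetic above can be checked with a few lines of code. This is a minimal sketch: the function names and the `THRESHOLDS` table are assumptions made for illustration, the measured burn rates would normally come from a monitoring system, and the example deliberately leaves the choice of action (page, ticket, review) to the alerting policy.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Actual error rate divided by the error rate the SLO tolerates."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return error_rate / (1.0 - slo)


def days_to_exhaustion(rate: float, slo_window_days: float = 30.0) -> float:
    """Days until the whole budget is spent if the burn rate stays constant."""
    return slo_window_days / rate


def budget_consumed(rate: float, alert_window_hours: float,
                    slo_window_days: float = 30.0) -> float:
    """Fraction of the budget spent by burning at `rate` for the alert window."""
    return rate * alert_window_hours / (slo_window_days * 24.0)


# Burn-rate thresholds per alert window, taken from the text.
THRESHOLDS = {"1h": 14.4, "6h": 6.0, "3d": 1.0}


def exceeded_thresholds(measured: dict[str, float]) -> list[str]:
    """Return the alert windows whose measured burn rate is over threshold."""
    return [w for w, limit in THRESHOLDS.items() if measured.get(w, 0.0) > limit]


rate = burn_rate(error_rate=0.01, slo=0.999)
print(f"burn rate: {rate:.1f}x, budget gone in {days_to_exhaustion(rate):.1f} days")
print(f"14.4x for 1 hour spends {budget_consumed(14.4, 1):.0%} of the budget")
print(f"6x for 6 hours spends {budget_consumed(6.0, 6):.0%} of the budget")

# Hypothetical measurements: a sharp spike in the last hour only.
print(exceeded_thresholds({"1h": 16.0, "6h": 4.0, "3d": 1.5}))
```

Expressing each threshold as a fraction of budget spent makes it easier to reason about why a given window and multiplier were paired.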

Implementing SLOs in Practice

Step-by-step implementation:

(1) Instrument your service to emit SLI data. For availability: count total requests and 5xx errors at the load balancer or service mesh (Istio, Envoy). For latency: record request duration histograms in Prometheus.
(2) Define SLOs in configuration (not code). Use a tool like Sloth (generates Prometheus recording rules from SLO definitions) or the SLO monitoring built into Google Cloud Monitoring.
(3) Build an SLO dashboard: current SLI value, error budget remaining (absolute and percentage), error budget burn rate, and a graph of SLI over the last 30 days.
(4) Set up burn-rate alerts as described above.
(5) Establish an error budget policy document that the team agrees to follow.
(6) Review SLOs quarterly: are they too tight (team is constantly in budget freeze), too loose (users are complaining despite meeting the SLO), or just right? Adjust based on data and user feedback.

Common mistake: setting SLOs without measuring first. Measure the current SLI for 30 days before setting a target, because you need a baseline.
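To make steps (2) and (3) concrete, here is a rough sketch of an SLO kept as data and the dashboard numbers derived from it. Everything named here is hypothetical: the JSON schema, the `checkout-availability` name, and the `SloDefinition` and `DailyCount` types are illustrative and do not follow the format of Sloth or any other tool. The budget is computed against the traffic observed in the window so far.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SloDefinition:
    """An SLO loaded from configuration. Schema is hypothetical."""
    name: str
    objective: float  # e.g. 0.999
    window_days: int  # e.g. 30


@dataclass(frozen=True)
class DailyCount:
    """Good and total events for one day, e.g. exported from metrics."""
    good: int
    total: int


def dashboard_summary(slo: SloDefinition, days: list[DailyCount]) -> dict[str, float]:
    """Compute the headline dashboard values over the SLO window."""
    if not 0.0 < slo.objective < 1.0:
        raise ValueError("objective must be strictly between 0 and 1")
    window = days[-slo.window_days:]  # most recent N days
    good = sum(d.good for d in window)
    total = sum(d.total for d in window)
    if total == 0:
        raise ValueError("no events in window; the SLI is undefined")
    bad = total - good
    allowed_bad = (1.0 - slo.objective) * total
    return {
        "sli": good / total,
        "budget_remaining_events": allowed_bad - bad,
        "budget_remaining_fraction": 1.0 - bad / allowed_bad,
        "burn_rate": (bad / total) / (1.0 - slo.objective),
    }


# The SLO lives in configuration, not in application code.
SLO_CONFIG = '{"name": "checkout-availability", "objective": 0.999, "window_days": 30}'
slo = SloDefinition(**json.loads(SLO_CONFIG))

# Made-up baseline: 30 days of 100,000 requests with 50 failures each.
history = [DailyCount(good=99_950, total=100_000) for _ in range(30)]
summary = dashboard_summary(slo, history)
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

Running the same function over the 30-day baseline before choosing an objective shows what the service already achieves, which is the measurement step the common-mistake note calls for.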
