Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) form the foundation of modern reliability engineering. Google SRE pioneered this framework, and it has become the standard approach for managing reliability at scale. This guide covers how to define, measure, and operationalize SLIs/SLOs/SLAs — essential knowledge for SRE interviews and production operations.
SLI: What You Measure
A Service Level Indicator is a quantitative measure of a specific aspect of service quality. Good SLIs are expressed as a ratio: (good events / total events) * 100%. Common SLIs: (1) Availability — the proportion of requests that succeed. Good events: HTTP responses with status codes outside the 5xx range. Total events: all HTTP requests. Availability = (total_requests - 5xx_errors) / total_requests. (2) Latency — the proportion of requests faster than a threshold. Good events: requests with latency < 200ms. Latency SLI = requests_under_200ms / total_requests. (3) Throughput — requests served per second, measured at the load balancer. (4) Correctness — the proportion of responses that return the correct result (harder to measure, requires application-specific validation). (5) Freshness — the proportion of data reads that return data updated within the last N seconds. Important for eventual consistency systems. Choose SLIs that reflect the user experience, not internal system metrics. CPU utilization is not an SLI because users do not care about CPU — they care about whether the page loaded fast and correctly.
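The ratio SLIs above take only a few lines to compute. A minimal sketch with made-up request counts (the counter values are hypothetical, stand-ins for whatever your load balancer or metrics system exports):

```python
# Hypothetical counters, e.g. scraped from a load balancer's metrics endpoint.
total_requests = 1_000_000
errors_5xx = 800
requests_under_200ms = 995_000

def ratio_sli(good: int, total: int) -> float:
    """Generic SLI: good events / total events, as a percentage."""
    return 100.0 * good / total

# Availability: non-5xx responses over all requests.
availability = ratio_sli(total_requests - errors_5xx, total_requests)
# Latency SLI: requests under the 200ms threshold over all requests.
latency_sli = ratio_sli(requests_under_200ms, total_requests)

print(f"Availability SLI: {availability:.3f}%")  # 99.920%
print(f"Latency SLI: {latency_sli:.3f}%")        # 99.500%
```

Note that both SLIs share the same good/total shape, which is what makes the error-budget arithmetic later in this guide uniform across SLI types.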
SLO: Your Reliability Target
A Service Level Objective is a target value for an SLI over a time window. Example: 99.9% of requests will have latency under 200ms, measured over a 30-day rolling window. Choosing the right SLO: (1) Start with user expectations. An internal tool can tolerate 99% availability (7.2 hours of downtime per month). A payment API needs 99.99% (4.3 minutes per month). (2) Consider dependencies. Your SLO cannot exceed your dependencies' SLOs. If your database is 99.95% available, your service cannot be 99.99% available. (3) Be conservative initially. Set a lower SLO and tighten it later. It is much harder to relax an SLO (users have already formed expectations) than to tighten one. SLO time windows: a 30-day rolling window is the most common. Calendar month windows create end-of-month pressure. Shorter windows (7 days) are more sensitive to brief incidents. Google recommendation: use a 30-day rolling window for operational SLOs and a calendar quarter for executive reporting.
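The downtime figures quoted above (7.2 hours for 99%, 4.3 minutes for 99.99%) fall out of a one-line calculation. A quick sketch:

```python
def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60  # 43,200 for a 30-day window
    return window_minutes * (1 - slo_percent / 100)

for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% over 30 days -> {downtime_budget_minutes(slo):.1f} min/month")
# 99.0%  -> 432.0 min (7.2 hours)
# 99.9%  -> 43.2 min
# 99.95% -> 21.6 min
# 99.99% -> 4.3 min
```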
SLA: The Business Contract
A Service Level Agreement is a contract between a service provider and customer that specifies consequences for missing SLOs. The SLA is always less strict than the internal SLO — you need a buffer. If your internal SLO is 99.95%, your SLA might promise 99.9%. This gives your team time to detect and fix issues before breaching the contractual commitment. SLA consequences: service credits (AWS EC2, for example, issues tiered monthly credits — 10% of the bill when availability drops below 99.99%, larger credits at lower tiers), contractual penalties, or contract termination rights. SLA exclusions: planned maintenance windows, force majeure events, customer-caused issues. Not every service needs an SLA — internal services and free tiers typically have SLOs but no contractual SLAs. SLAs are negotiated by business teams with engineering input on what is achievable. Never commit to an SLA without data showing the service can consistently exceed the target.
Error Budgets: Making Reliability Decisions
An error budget is the inverse of the SLO: if the SLO is 99.9% availability over 30 days, the error budget is 0.1% = 43.2 minutes of downtime per month. The error budget is a budget to spend on risk-taking: deployments, experiments, and infrastructure changes. Error budget policy: (1) When the error budget is healthy (>50% remaining), prioritize feature velocity. Ship fast, take risks, deploy frequently. (2) When the error budget is low (<25% remaining), slow down. Require extra review for deployments, postpone risky changes, focus on reliability improvements. (3) When the error budget is exhausted (0% remaining), freeze all non-reliability deployments until the budget recovers. Only bug fixes and reliability improvements are deployed. This framework aligns incentives: product teams want to ship features (spend error budget), SRE teams want to maintain reliability (conserve error budget). The error budget is the shared currency. A team that deploys a buggy release and burns 50% of the error budget has objectively reduced the team's capacity for future risk-taking.
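The policy tiers above can be expressed as a small decision function. A sketch with the tier boundaries taken from the text (the 25-50% band, not named explicitly above, is treated here as normal cadence; the function names are illustrative):

```python
def budget_remaining_fraction(slo: float, observed_error_rate: float) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    allowed = 1 - slo / 100          # e.g. 0.001 for a 99.9% SLO
    return 1 - observed_error_rate / allowed

def policy(remaining: float) -> str:
    """Map budget remaining to the policy tiers described in the text."""
    if remaining > 0.50:
        return "ship freely"
    if remaining > 0.25:
        return "normal cadence"
    if remaining > 0.0:
        return "slow down: extra review, postpone risky changes"
    return "freeze: reliability work only"

# 0.04% observed errors against a 0.1% budget -> 60% of the budget remains.
remaining = budget_remaining_fraction(slo=99.9, observed_error_rate=0.0004)
print(policy(remaining))  # ship freely
```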
SLO-Based Alerting
Traditional alerting fires on symptoms (CPU > 90%, error rate > 1%). SLO-based alerting fires on error budget burn rate — the rate at which the error budget is being consumed. Burn rate = actual error rate / tolerated error rate. If the SLO allows 0.1% errors over 30 days, the tolerated error rate is 0.1%. If the current error rate is 1%, the burn rate is 10x — the entire monthly error budget will be consumed in 3 days. Multi-window alerting (Google recommendation): (1) Page the on-call if the 1-hour burn rate > 14.4x (2% of the monthly budget gone in a single hour, and the entire budget exhausted in about two days at this rate — an acute incident). (2) Create a ticket if the 6-hour burn rate > 6x (budget consumed in 5 days — a slower degradation). (3) Weekly review if the 3-day burn rate > 1x (budget being consumed faster than it replenishes). Benefits: fewer false alerts (a brief error spike that does not threaten the monthly budget does not page anyone), and alerts are directly tied to user impact (the error budget represents the acceptable impact on users).
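The burn-rate arithmetic and the two urgent thresholds can be sketched as follows. This is a simplification: Google's full recipe also checks a shorter control window alongside each long window to confirm the burn is still ongoing before alerting.

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """Burn rate = observed error rate / error rate the SLO tolerates."""
    return observed_error_rate / (1 - slo / 100)

def alert_decision(burn_1h: float, burn_6h: float) -> str:
    """Multi-window thresholds from the text: 14.4x pages, 6x files a ticket."""
    if burn_1h > 14.4:
        return "page"
    if burn_6h > 6.0:
        return "ticket"
    return "ok"

# 2% errors against a 99.9% SLO is a 20x burn rate -> page the on-call.
br = burn_rate(0.02, slo=99.9)
print(f"burn rate {br:.0f}x -> {alert_decision(burn_1h=br, burn_6h=br)}")
```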
Implementing SLOs in Practice
Step-by-step implementation: (1) Instrument your service to emit SLI data. For availability: count total requests and 5xx errors at the load balancer or service mesh (Istio, Envoy). For latency: record request duration histograms in Prometheus. (2) Define SLOs in configuration (not code). Use a tool like Sloth (generates Prometheus recording rules from SLO definitions) or Google Cloud SLO Monitoring. (3) Build an SLO dashboard: current SLI value, error budget remaining (absolute and percentage), error budget burn rate, and a graph of SLI over the last 30 days. (4) Set up burn-rate alerts as described above. (5) Establish an error budget policy document that the team agrees to follow. (6) Review SLOs quarterly: are they too tight (team is constantly in budget freeze), too loose (users are complaining despite meeting the SLO), or just right? Adjust based on data and user feedback. Common mistake: setting SLOs without measuring first. Measure the current SLI for 30 days before setting a target — you need a baseline.
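As a rough sketch of step (3), the core dashboard numbers — the window SLI and the error budget remaining — can be computed from per-day good/total request counts. The daily figures here are invented for illustration:

```python
SLO = 99.9  # percent, over a 30-day window

# Hypothetical per-day (good_requests, total_requests) pairs: 29 healthy days
# plus one bad day that burned most of the budget.
daily = [(999_500, 1_000_000)] * 29 + [(990_000, 1_000_000)]

good = sum(g for g, _ in daily)
total = sum(t for _, t in daily)
sli = 100.0 * good / total

allowed_bad = total * (1 - SLO / 100)   # failures the budget permits
actual_bad = total - good
budget_left = 100.0 * (1 - actual_bad / allowed_bad)

print(f"30-day SLI: {sli:.4f}%  budget remaining: {budget_left:.1f}%")
```

In production these sums would come from Prometheus recording rules over a counter pair rather than an in-memory list, but the arithmetic on the dashboard is the same.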
{ "@context": "https://schema.org", "@type": "FAQPage", "mainEntity": [
{ "@type": "Question", "name": "What is an error budget and how does it balance reliability with feature velocity?", "acceptedAnswer": { "@type": "Answer", "text": "An error budget is the maximum amount of unreliability allowed by the SLO. If the SLO is 99.9% availability over 30 days, the error budget is 0.1% of total requests, or approximately 43.2 minutes of downtime. The error budget creates a shared framework between product and SRE teams. When the budget is healthy (more than 50% remaining), the team prioritizes feature development and accepts deployment risk. When the budget is depleted, the team freezes feature deployments and focuses exclusively on reliability improvements. This prevents the common conflict where product teams want to ship fast and SRE teams want to slow down for stability. Both teams agree on the SLO upfront, and the error budget is the objective measure of whether the service can afford more risk. Practical policy: above 75% budget remaining — deploy freely, run chaos experiments, take risks. Between 25-75% — normal deployment cadence with standard review. Below 25% — require additional review for deployments, postpone risky changes. At 0% — only reliability fixes and rollbacks are deployed until the budget recovers. The budget resets with the SLO window (rolling 30 days)." } },
{ "@type": "Question", "name": "How does SLO-based alerting reduce alert fatigue compared to threshold-based alerting?", "acceptedAnswer": { "@type": "Answer", "text": "Threshold-based alerting fires when a metric crosses a static value: alert if error rate > 1% for 5 minutes. Problem: a brief 2% error spike that lasts 3 minutes and recovers is not a threat to the monthly SLO (it consumed only about 0.14% of the error budget) but fires the alert and pages the on-call engineer. Over time, frequent non-actionable alerts cause alert fatigue — engineers ignore or auto-acknowledge alerts, and miss real incidents. SLO-based alerting fires based on error budget burn rate — how quickly the error budget is being consumed relative to the SLO window. Burn rate = (actual error rate) / (maximum tolerated error rate). A burn rate of 1x means the budget will be exactly consumed by the end of the window — sustainable. A burn rate of 10x means the budget will be consumed in 3 days — urgent. Multi-window approach: page the on-call only when a fast burn rate (14.4x over 1 hour) threatens to exhaust the budget within about two days. Create a ticket for a slow burn rate (6x over 6 hours) that threatens the budget within days. This dramatically reduces pages — the brief error spike that recovered has a negligible burn rate and does not alert." } },
{ "@type": "Question", "name": "How do you choose the right SLO target for a service?", "acceptedAnswer": { "@type": "Answer", "text": "Choosing the right SLO requires balancing user expectations, engineering cost, and business requirements. Steps: (1) Measure the current SLI for 30 days before setting a target. If the service currently achieves 99.95% availability, setting a 99.99% SLO creates immediate budget pressure. Start at or slightly below the current performance. (2) Consider user expectations. Internal tools can tolerate lower availability (99%) than customer-facing APIs (99.9%+). Payment processing requires higher reliability than a recommendation engine. (3) Consider dependencies. Your SLO cannot meaningfully exceed your weakest dependency's SLO. If your database provider offers 99.95% availability, your service SLO of 99.99% is aspirational at best. (4) Consider the cost of additional nines. Each additional nine of availability roughly doubles or triples the engineering effort: 99% to 99.9% requires redundancy and failover. 99.9% to 99.99% requires multi-region deployment. 99.99% to 99.999% requires active-active with automatic failover and extensive chaos testing. (5) Set the SLO conservatively and tighten over time. It is much easier to tighten an SLO (users are happy) than to relax one (users are disappointed). Review and adjust quarterly based on actual performance and user feedback." } },
{ "@type": "Question", "name": "What is the difference between an SLO and an SLA?", "acceptedAnswer": { "@type": "Answer", "text": "An SLO (Service Level Objective) is an internal reliability target set by the engineering team. It defines the desired level of service quality. There are no contractual consequences for missing an SLO — the consequence is internal: error budget depletion triggers a reliability-focused response (deployment freeze, incident review, engineering investment). An SLA (Service Level Agreement) is an external contract between a service provider and its customers. It specifies the minimum acceptable service level and the consequences (financial penalties, service credits, contract termination) if the provider fails to meet it. The SLA is always less strict than the SLO. If your internal SLO is 99.95% availability, your SLA might promise 99.9%. This gap is your safety margin — you can miss your internal target without breaching the customer contract, giving your team time to detect and fix issues. Not every service has an SLA — free tiers, internal services, and early-stage products typically have SLOs but no contractual SLAs. SLAs are defined by business and legal teams with engineering input on what is technically achievable. Never agree to an SLA without historical SLI data showing the service consistently exceeds the target." } }
] }