Question 1

What is the difference between SLI, SLO, and SLA?

Accepted Answer

SLI (Service Level Indicator): a measurement of service behavior, usually the ratio of good events to total events, e.g., 99.95% of requests succeed. SLO (Service Level Objective): the target value for an SLI over a time window, e.g., 99.9% availability over 30 days. It is an internal commitment. SLA (Service Level Agreement): a contractual commitment to customers, typically less strict than the SLO (e.g., 99.5% in the contract) so that there is a buffer between missing the internal target and breaching the contract. Violating the SLA has financial consequences; violating the SLO triggers internal reliability work.
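The relationship between the three terms can be sketched in a few lines of Python. This is only an illustration: the names `ServiceLevels`, `compute_sli`, and `classify` are hypothetical, and the 99.9% / 99.5% targets are the example values from the answer above, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevels:
    """Hypothetical container for the two targets discussed above."""
    slo_target: float  # internal objective, e.g. 0.999
    sla_target: float  # contractual commitment, e.g. 0.995


def compute_sli(good_events: int, total_events: int) -> float:
    """Return the SLI as the ratio of good events to total events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def classify(sli: float, levels: ServiceLevels) -> str:
    """Report which commitment, if any, the measured SLI falls below."""
    if sli < levels.sla_target:
        return "SLA violated"  # contractual breach, financial consequences
    if sli < levels.slo_target:
        return "SLO violated"  # internal target missed, SLA buffer still intact
    return "within SLO"


levels = ServiceLevels(slo_target=0.999, sla_target=0.995)
sli = compute_sli(good_events=9_995, total_events=10_000)
print(classify(sli, levels))
```

Because the SLA target sits below the SLO target, a measured SLI between the two values misses the internal objective without breaching the contract.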

Question 2

What is an error budget and how is it calculated?

Accepted Answer

An error budget is the amount of unreliability an SLO permits. For a 99.9% SLO over 30 days: error budget = 30 days * (1 - 0.999) = 0.03 days = 43.2 minutes of allowable downtime. For a request-based SLI, the same budget means 0.1% of total requests may fail. While most of the budget remains, teams can deploy freely. As incidents consume budget, teams reduce deployment risk. When the budget is exhausted, non-critical deployments freeze until budget is available again: at the start of the next window for a calendar-aligned SLO, or as older failures age out of a rolling 30-day window.
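The arithmetic can be checked with a short Python sketch. The function names here are hypothetical, and the request counts in the usage lines are made-up illustration values rather than measurements.

```python
from datetime import timedelta


def error_budget_duration(slo_target: float, window: timedelta) -> timedelta:
    """Time-based budget: the share of the window that may be 'bad'."""
    return window * (1 - slo_target)


def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Request-based budget: fraction of the budget still unspent.

    Returns 1.0 when nothing has failed, 0.0 when the budget is exactly
    used up, and a negative value when the SLO has been missed.
    """
    allowed_failures = total_requests * (1 - slo_target)
    return 1 - failed_requests / allowed_failures


budget = error_budget_duration(0.999, timedelta(days=30))
print(budget.total_seconds() / 60)  # minutes of allowable downtime

# Illustrative numbers: 1,000,000 requests so far, 250 of them failed.
print(error_budget_remaining(0.999, 1_000_000, 250))
```

A deployment policy could then key off the remaining fraction, for example freezing non-critical releases once it reaches zero; where exactly to put such thresholds is a team decision the answer above leaves open.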

Question 3

How does burn rate alerting work for error budgets?

Accepted Answer

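Burn rate is how fast you are consuming the error budget relative to the sustainable rate: the observed error rate divided by the error rate the SLO allows. A burn rate of 1x consumes the budget at exactly the SLO's allowed pace, so it lasts the whole window. Alert when burn rate exceeds a threshold (e.g., 14.4x) for a sustained window. At 14.4x you consume 2% of a 30-day budget every hour and would exhaust it in about 50 hours (720 hours / 14.4), so a common setup evaluates that threshold over a 1-hour window and uses a 5-minute window to confirm the burn is still ongoing. This gives early warning before the budget is gone. Burn rate alerting is more actionable than raw error rate alerts because it ties the alert to how much of the SLO window's budget is at risk.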
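A minimal sketch of the calculation, assuming a request-based SLI. The function names and the two-window check are illustrative assumptions, not a specific monitoring product's API; the time to exhaustion is simply the SLO window divided by the burn rate.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """How long a full budget lasts if the given burn rate is sustained."""
    return window_days * 24 / rate


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window (e.g. 1 hour) shows the burn is significant; the short
    window (e.g. 5 minutes) shows it is still happening right now.
    """
    return (burn_rate(long_window_error_ratio, slo_target) > threshold
            and burn_rate(short_window_error_ratio, slo_target) > threshold)


# With a 99.9% SLO, a 1.44% error ratio corresponds to a 14.4x burn rate.
print(burn_rate(0.0144, 0.999))
print(hours_to_exhaustion(14.4))         # 720 hours / 14.4
print(should_page(0.02, 0.02, 0.999))    # both windows burning fast
print(should_page(0.02, 0.0005, 0.999))  # burn already stopped
```

Requiring both windows to exceed the threshold keeps the alert from firing on an incident that has already recovered, at the cost of a slightly later page for a burn that has only just started.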

Question 4

How should you set realistic SLOs for a service?

Accepted Answer

Start by measuring current reliability as a baseline. Survey users or run experiments to find the latency/availability threshold where users are actually unhappy. Set the SLO to match real user needs — not aspirational five-nines. The SLO should be achievable with the current architecture but require intentional effort to maintain. Review quarterly: tighten when you have consistent headroom, relax when meeting the SLO causes disproportionate operational toil relative to user benefit.

Low Level Design: SLO, SLA, SLI, and Error Budget

Service Level Indicator (SLI)

Service Level Objective (SLO)

Service Level Agreement (SLA)

Error Budget

Error Budget Policy

SLO Measurement and Alerting

Setting Useful SLOs