Question 1

What is the difference between availability, reliability, and durability in SLA definitions?

Accepted Answer

Availability: the percentage of time the service is accessible and responsive. "99.9% uptime" allows about 8.76 hours of downtime per year (0.1% of 8,760 hours). Measured by health checks succeeding. Reliability: the probability that a specific operation succeeds when the service is available, for example "99.95% of API calls succeed." A service can be available (health checks pass) but unreliable (50% of requests return 500). Measured by error rate. Durability: the probability that stored data is not lost. "11 nines durability" (S3's design target of 99.999999999%) corresponds to an expected loss of 0.000000001% of stored objects per year. Measured by data recovery success rate after failure scenarios. In an SLA monitoring system: track a separate SlaDefinition for each dimension. A payment service might have: availability SLA (99.99%), reliability SLA (error_rate < 0.01%), latency SLA (p99 < 200ms), and durability SLA (no transaction loss). Violating one does not necessarily violate the others.
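To make the distinction between availability and reliability concrete, here is a minimal Python sketch. The function names and the rule that any 5xx status counts as a failed operation are assumptions made for illustration; they are not part of the system described above.

```python
from typing import Sequence

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours_per_year(availability_pct: float) -> float:
    """Downtime allowance implied by an availability target."""
    return HOURS_PER_YEAR * (100.0 - availability_pct) / 100.0


def availability_pct(health_checks: Sequence[bool]) -> float:
    """Share of health checks that succeeded (True = check passed)."""
    return 100.0 * sum(health_checks) / len(health_checks)


def reliability_pct(status_codes: Sequence[int]) -> float:
    """Share of requests that did not fail with a server error.

    Assumption: any 5xx status counts as a failed operation.
    """
    ok = sum(1 for code in status_codes if code < 500)
    return 100.0 * ok / len(status_codes)


# A service that passes every health check but fails half its requests:
checks = [True, True, True, True]
requests = [200, 500, 200, 500]
print(availability_pct(checks))   # available
print(reliability_pct(requests))  # but unreliable
```

Keeping the two calculations separate mirrors the advice to track one SlaDefinition per dimension: the inputs differ (health checks versus request outcomes), so one number cannot stand in for the other.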

Question 2

How do you set SLA thresholds that are achievable but meaningful?

Accepted Answer

Thresholds set too tight generate false alerts and alert fatigue; thresholds set too loose let real degradations go unnoticed. Setting process: (1) measure baseline: collect 30 days of historical metrics (p99 latency, error rate, availability) at the SlaDefinition's evaluation granularity; (2) identify natural variance: what is the normal range? If p99 latency ranges from 80ms to 150ms on normal days, a 500ms threshold leaves clear headroom and still signals a real problem, while a 200ms threshold sits close enough to the normal range that ordinary spikes may trigger false alerts; (3) set thresholds at 2–3 standard deviations above the mean, or at the 99th percentile of historical values (so that 99% of historical evaluations pass); (4) calibrate severity: warning at 1 standard deviation above normal, critical at 2, page at 3; (5) re-evaluate quarterly as traffic patterns change. For new services with no baseline: use industry benchmarks (error rate < 0.1%, API p99 < 500ms) and tighten once baseline data is available.
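Steps (3) and (4) can be sketched in a few lines of Python using only the standard library. This is an illustration under assumptions: the function names are hypothetical, the baseline is treated as a flat list of per-evaluation p99 latencies, and the population standard deviation is used.

```python
import statistics
from typing import Dict, Sequence


def sigma_thresholds(samples: Sequence[float]) -> Dict[str, float]:
    """Severity thresholds at 1, 2 and 3 standard deviations above the mean."""
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)  # population std dev of the baseline
    return {
        "warning": mean + 1 * std,
        "critical": mean + 2 * std,
        "page": mean + 3 * std,
    }


def p99_threshold(samples: Sequence[float]) -> float:
    """Alternative: the 99th percentile of historical values."""
    # quantiles(n=100) returns 99 cut points; the last is the 99th percentile.
    return statistics.quantiles(samples, n=100)[98]


# Hypothetical baseline: five evaluations of p99 latency in milliseconds.
baseline_ms = [100, 110, 90, 120, 80]
print(sigma_thresholds(baseline_ms))
```

With a real 30-day baseline the two methods can give noticeably different values when the metric is skewed, which is common for latency; comparing both before committing to a threshold is a reasonable sanity check.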

Question 3

How do you calculate the exact downtime for an SLA credit claim?

Accepted Answer

Downtime for credit purposes must be measured precisely and documented. Method: sum the duration of all SlaViolation records where metric_type='availability' and the violation overlaps the billing period. For violations spanning the period boundary (for example, a violation that started Nov 30 and ended Dec 1, with December as the billing period): count only the portion within the billing period (Dec 1 00:00 to the resolved_at time). In SQL, COALESCE(resolved_at, NOW()) treats a still-open violation as ending at the current time. Round to the nearest minute. Document the downtime calculation methodology in your SLA agreement, not just the uptime percentage target, because customers will challenge it. Credits are typically calculated as credit_amount = monthly_fee × credit_pct. A 10% credit on a $1,000/month plan is a $100 credit. Issue it as account credit, not a cash refund.
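The clipping logic is easy to get wrong at the boundaries, so here is a hedged Python sketch of it. The tuple layout (started_at, resolved_at), the function names, and the choice to round the total rather than each violation are assumptions; the source only specifies clipping to the period, treating open violations as ending now, and rounding to the nearest minute.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal
from typing import Iterable, Optional, Tuple

Violation = Tuple[datetime, Optional[datetime]]  # (started_at, resolved_at)


def downtime_minutes(
    violations: Iterable[Violation],
    period_start: datetime,
    period_end: datetime,
    now: datetime,
) -> int:
    """Sum violation time that falls inside [period_start, period_end)."""
    total = timedelta()
    for started_at, resolved_at in violations:
        # Equivalent of COALESCE(resolved_at, NOW()) for open violations.
        end = resolved_at if resolved_at is not None else now
        clipped_start = max(started_at, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            total += clipped_end - clipped_start
    # Assumption: round the total once, not each violation separately.
    return round(total.total_seconds() / 60)


def credit_amount(monthly_fee: Decimal, credit_pct: Decimal) -> Decimal:
    """credit_amount = monthly_fee x credit_pct (credit_pct as a fraction)."""
    return monthly_fee * credit_pct


utc = timezone.utc
december = (datetime(2024, 12, 1, tzinfo=utc), datetime(2025, 1, 1, tzinfo=utc))
violations = [
    # Spans the boundary: only the 12 minutes after Dec 1 00:00 count.
    (datetime(2024, 11, 30, 23, 50, tzinfo=utc), datetime(2024, 12, 1, 0, 12, tzinfo=utc)),
    # Still open: counted up to "now".
    (datetime(2024, 12, 10, 10, 0, tzinfo=utc), None),
]
now = datetime(2024, 12, 10, 10, 30, tzinfo=utc)
print(downtime_minutes(violations, *december, now=now))
print(credit_amount(Decimal("1000"), Decimal("0.10")))
```

Using Decimal for money avoids floating-point rounding in the credit figure, and passing `now` explicitly keeps the calculation reproducible when a claim is audited later.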

Question 4

How do you monitor SLAs for a service you don't control (third-party dependency)?

Accepted Answer

Your SLA to customers covers your service's uptime, but if Stripe goes down and your payment processing fails, that contributes to your effective downtime even though it was Stripe that breached its SLA. Two approaches: (1) External monitoring: set up your own health checks against the third party's API endpoints (not just their status page, which may lag). Track third_party_availability in MetricDatapoint with service_name='stripe'. If Stripe's API returns 503, open a violation in SlaViolation. (2) Cascading SLA attribution: if your API fails because Stripe is down, attribute the violation to the third-party dependency in SlaViolation.failure_cause. This allows you to present customers with a breakdown: "Total downtime: 47 minutes. Of which, 42 minutes due to payment processor (Stripe) outage." Some SLAs explicitly exclude third-party-caused downtime; document this in your SLA agreement.
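The attribution breakdown in approach (2) amounts to grouping violation durations by failure_cause. The sketch below assumes a simplified in-memory record with a duration in minutes and a free-text cause; the class, the field duration_minutes, and the cause values "stripe" and "internal" are hypothetical stand-ins for whatever SlaViolation actually stores.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Violation:
    duration_minutes: int
    failure_cause: str  # assumed values, e.g. "internal" or "stripe"


def downtime_by_cause(violations: Iterable[Violation]) -> Dict[str, int]:
    """Total downtime in minutes, grouped by attributed cause."""
    totals: Dict[str, int] = defaultdict(int)
    for v in violations:
        totals[v.failure_cause] += v.duration_minutes
    return dict(totals)


def summarize(violations: Iterable[Violation], dependency: str) -> str:
    """Customer-facing breakdown for one third-party dependency."""
    totals = downtime_by_cause(violations)
    total = sum(totals.values())
    from_dependency = totals.get(dependency, 0)
    return (
        f"Total downtime: {total} minutes. "
        f"Of which, {from_dependency} minutes due to {dependency} outage."
    )


month = [Violation(42, "stripe"), Violation(5, "internal")]
print(summarize(month, "stripe"))
```

If the SLA excludes third-party-caused downtime, the same grouping gives the figure to subtract before computing the uptime percentage.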

Question 5

What is error budget and how does it change engineering team behavior?

Accepted Answer

An error budget is the allowed amount of downtime or errors under an SLA. A 99.9% availability SLA has an error budget of 0.1% of time per month, which is about 43.8 minutes for an average-length month. If the team deploys frequently and each deploy causes 5 minutes of elevated errors, they can afford ~8 deploys per month before the budget is exhausted. When the budget is near zero: the team should focus on reliability work (reducing deploy-induced errors, adding retries) rather than features. When plenty of budget remains: the team can take more deployment risk for faster feature delivery. Error budgets align incentives between product (ships features fast) and SRE (keeps things reliable): both teams manage the same budget. Track budget consumption in the monitoring system. Note that 100.0 - uptime_pct is the downtime percentage, not the share of the budget used; divide it by the budget (0.1 for a 99.9% target) to get consumption: SELECT (100.0 - uptime_pct) / 0.1 * 100 AS budget_consumed_pct FROM UptimeWindow WHERE service_name='api' AND window_start=date_trunc('month', NOW()). Alert when 50% and 75% of the monthly budget have been consumed.
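The arithmetic behind these figures can be checked with a short Python sketch. The function names are hypothetical, a month is taken as one twelfth of a 365.25-day year, and budget consumption is computed as downtime divided by the allowed downtime, so that 99.95% uptime against a 99.9% target counts as half the budget used.

```python
from typing import Optional

MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # 43,830 minutes in an average month


def error_budget_minutes(target_pct: float) -> float:
    """Allowed downtime per month for an availability target."""
    return MINUTES_PER_MONTH * (100.0 - target_pct) / 100.0


def deploys_affordable(budget_minutes: float, minutes_per_deploy: float) -> int:
    """Whole deploys that fit in the budget at a fixed cost per deploy."""
    return int(budget_minutes // minutes_per_deploy)


def budget_consumed_pct(uptime_pct: float, target_pct: float) -> float:
    """Downtime so far as a percentage of the allowed downtime."""
    return (100.0 - uptime_pct) / (100.0 - target_pct) * 100.0


def alert_level(consumed_pct: float) -> Optional[str]:
    """Alert tiers at 50% and 75% of the monthly budget."""
    if consumed_pct >= 75.0:
        return "75% consumed"
    if consumed_pct >= 50.0:
        return "50% consumed"
    return None


budget = error_budget_minutes(99.9)
print(round(budget, 1))                 # monthly budget in minutes
print(deploys_affordable(budget, 5.0))  # deploys before exhaustion
print(alert_level(budget_consumed_pct(99.93, 99.9)))
```

Expressing consumption relative to the budget rather than as raw downtime keeps the alert tiers meaningful across services with different targets.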

SLA Monitoring System Low-Level Design: Metric Ingestion, Threshold Evaluation, Alert Deduplication, and Uptime Calculation

SLA Monitoring System: Low-Level Design

Core Data Model

Metric Ingestion and Evaluation

Rolling Uptime Calculation

Key Design Decisions