SLA Monitoring System: Low-Level Design
An SLA monitoring system tracks whether a service is meeting its availability, latency, and error rate commitments. It ingests metrics, evaluates them against defined thresholds, fires alerts on violations, and computes rolling uptime percentages for customer-facing status pages and internal reporting. This design covers the metric ingestion pipeline, threshold evaluation engine, alert deduplication, and the rolling uptime calculation used in SLA credits.
Core Data Model
CREATE TABLE SlaDefinition (
    sla_id SERIAL PRIMARY KEY,
    service_name VARCHAR(100) NOT NULL,
    metric_type VARCHAR(50) NOT NULL,                 -- availability, p99_latency_ms, error_rate_pct
    threshold NUMERIC(10,4) NOT NULL,                 -- 99.9 for availability, 500 for p99 ms
    comparison VARCHAR(10) NOT NULL,                  -- gte, lte, gt, lt
    window_minutes INT NOT NULL DEFAULT 5,            -- evaluation window
    severity VARCHAR(20) NOT NULL DEFAULT 'warning',  -- warning, critical, page
    is_active BOOLEAN NOT NULL DEFAULT TRUE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE MetricDatapoint (
    datapoint_id BIGSERIAL PRIMARY KEY,
    service_name VARCHAR(100) NOT NULL,
    metric_type VARCHAR(50) NOT NULL,
    value NUMERIC(12,4) NOT NULL,
    sampled_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (sampled_at);

CREATE TABLE SlaViolation (
    violation_id BIGSERIAL PRIMARY KEY,
    sla_id INT NOT NULL REFERENCES SlaDefinition(sla_id),
    service_name VARCHAR(100) NOT NULL,
    metric_type VARCHAR(50) NOT NULL,
    observed_value NUMERIC(12,4) NOT NULL,
    threshold NUMERIC(10,4) NOT NULL,
    started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    resolved_at TIMESTAMPTZ,
    alert_sent BOOLEAN NOT NULL DEFAULT FALSE
);

CREATE TABLE UptimeWindow (
    window_id BIGSERIAL PRIMARY KEY,
    service_name VARCHAR(100) NOT NULL,
    window_start TIMESTAMPTZ NOT NULL,
    window_end TIMESTAMPTZ NOT NULL,
    total_minutes INT NOT NULL,
    downtime_minutes NUMERIC(8,2) NOT NULL DEFAULT 0,
    uptime_pct NUMERIC(6,4) GENERATED ALWAYS AS
        (100.0 * (total_minutes - downtime_minutes) / total_minutes) STORED,
    UNIQUE (service_name, window_start)
);

CREATE INDEX ON MetricDatapoint(service_name, metric_type, sampled_at DESC);
CREATE INDEX ON SlaViolation(sla_id, resolved_at) WHERE resolved_at IS NULL;
Metric Ingestion and Evaluation
import datetime
import statistics

def ingest_metric(service_name: str, metric_type: str, value: float):
    """Called by services on each health check (e.g., every 30 seconds)."""
    db.execute("""
        INSERT INTO MetricDatapoint (service_name, metric_type, value, sampled_at)
        VALUES (%s, %s, %s, NOW())
    """, (service_name, metric_type, value))
    # Prune old datapoints (keep 7 days)
    db.execute("""
        DELETE FROM MetricDatapoint
        WHERE service_name=%s AND metric_type=%s
          AND sampled_at < NOW() - INTERVAL '7 days'
    """, (service_name, metric_type))

def _compute_window_value(service_name: str, metric_type: str,
                          window_minutes: int) -> float | None:
    # Note: the interval is parameterized as (%s * INTERVAL '1 minute') because
    # a placeholder inside a quoted INTERVAL literal does not substitute correctly.
    rows = db.fetchall("""
        SELECT value FROM MetricDatapoint
        WHERE service_name=%s AND metric_type=%s
          AND sampled_at >= NOW() - (%s * INTERVAL '1 minute')
        ORDER BY sampled_at DESC
    """, (service_name, metric_type, window_minutes))
    if not rows:
        return None
    values = [float(r['value']) for r in rows]
    if metric_type == 'p99_latency_ms':
        # Nearest-rank p99 over the window, not the mean
        values.sort()
        idx = int(len(values) * 0.99)
        return values[min(idx, len(values) - 1)]
    # availability and error_rate_pct are both averaged over the window
    return statistics.mean(values)
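The window aggregation is worth checking in isolation, because the mean and the p99 can disagree sharply on the same data. A standalone sketch of the same logic (`window_value` is a reimplementation for illustration, not part of the pipeline):

```python
import statistics

def window_value(values: list[float], metric_type: str) -> float:
    """Aggregate one evaluation window: nearest-rank p99 for latency,
    mean for availability and error rate."""
    if metric_type == 'p99_latency_ms':
        ordered = sorted(values)
        idx = int(len(ordered) * 0.99)            # nearest-rank index
        return ordered[min(idx, len(ordered) - 1)]
    return statistics.mean(values)

# 99 requests at 100ms and one at 5000ms: the mean looks healthy
# against a 500ms threshold, but the p99 exposes the tail.
latencies = [100.0] * 99 + [5000.0]
print(window_value(latencies, 'error_rate_pct'))   # mean: 149.0
print(window_value(latencies, 'p99_latency_ms'))   # p99: 5000.0
```

This is exactly the failure mode the design notes warn about: a latency SLA evaluated on the mean would never fire here.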
def _evaluate_threshold(value: float, threshold: float, comparison: str) -> bool:
    ops = {
        'gte': value >= threshold,
        'lte': value <= threshold,
        'gt': value > threshold,
        'lt': value < threshold,
    }
    return ops.get(comparison, False)

def _open_violation(sla: dict, value: float):
    db.execute("""
        INSERT INTO SlaViolation
            (sla_id, service_name, metric_type, observed_value, threshold)
        VALUES (%s, %s, %s, %s, %s)
    """, (sla['sla_id'], sla['service_name'], sla['metric_type'],
          value, sla['threshold']))
    if sla['severity'] in ('critical', 'page'):
        _send_alert(sla, value)  # dispatch to the paging/notification channel (not shown)

def _close_violation(violation_id: int):
    db.execute("""
        UPDATE SlaViolation SET resolved_at=NOW() WHERE violation_id=%s
    """, (violation_id,))
Rolling Uptime Calculation
def compute_monthly_uptime(service_name: str, year: int, month: int) -> dict:
    """
    Compute availability percentage for SLA credit calculations.
    Downtime = sum of violation durations where metric_type='availability',
    clipped to the billing window.
    """
    # TIMESTAMPTZ columns: use tz-aware boundaries to avoid ambiguity
    window_start = datetime.datetime(year, month, 1, tzinfo=datetime.timezone.utc)
    # Jump past day 28 into the next month, then snap back to day 1
    next_month = (window_start.replace(day=28) + datetime.timedelta(days=4)).replace(day=1)
    window_end = next_month
    total_minutes = int((window_end - window_start).total_seconds() / 60)
    # Sum violation durations overlapping the month. GREATEST/LEAST clip
    # violations that span the window boundary; open violations (resolved_at
    # IS NULL) are clipped at NOW().
    downtime = db.fetchone("""
        SELECT COALESCE(SUM(
            EXTRACT(EPOCH FROM (
                LEAST(COALESCE(v.resolved_at, NOW()), %s)
                - GREATEST(v.started_at, %s)
            )) / 60
        ), 0) AS downtime_minutes
        FROM SlaViolation v
        JOIN SlaDefinition s USING (sla_id)
        WHERE v.service_name=%s
          AND s.metric_type='availability'
          AND v.started_at < %s
          AND COALESCE(v.resolved_at, NOW()) > %s
    """, (window_end, window_start, service_name, window_end, window_start))
    downtime_minutes = float(downtime['downtime_minutes'])
    uptime_pct = 100.0 * (total_minutes - downtime_minutes) / total_minutes
    db.execute("""
        INSERT INTO UptimeWindow
            (service_name, window_start, window_end, total_minutes, downtime_minutes)
        VALUES (%s, %s, %s, %s, %s)
        ON CONFLICT (service_name, window_start) DO UPDATE
            SET downtime_minutes=EXCLUDED.downtime_minutes
    """, (service_name, window_start, window_end, total_minutes, downtime_minutes))
    # SLA credit tiers
    if uptime_pct < 95.0:
        credit_pct = 25
    elif uptime_pct < 99.0:
        credit_pct = 10
    elif uptime_pct < 99.9:
        credit_pct = 5
    else:
        credit_pct = 0
    return {
        'uptime_pct': round(uptime_pct, 4),
        'downtime_minutes': round(downtime_minutes, 2),
        'credit_pct': credit_pct,
    }
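For credit claims, a violation that spans a billing-period boundary should only count the portion that falls inside the window. A standalone sketch of that clipping arithmetic (names here are illustrative):

```python
import datetime as dt

def clipped_minutes(started_at, resolved_at, window_start, window_end):
    """Minutes of a violation that fall inside a billing window.
    Open violations (resolved_at=None) are clipped at window_end."""
    start = max(started_at, window_start)
    end = min(resolved_at or window_end, window_end)
    return max((end - start).total_seconds() / 60.0, 0.0)

december = (dt.datetime(2024, 12, 1), dt.datetime(2025, 1, 1))
# Violation from Nov 30 23:00 to Dec 1 01:30: only 90 minutes bill to December.
print(clipped_minutes(dt.datetime(2024, 11, 30, 23, 0),
                      dt.datetime(2024, 12, 1, 1, 30), *december))  # 90.0
```

The `max(..., 0.0)` guard means a violation entirely outside the window contributes zero rather than a negative duration.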
Key Design Decisions
- Violation state machine (open/closed) deduplicates alerts: without state tracking, every evaluation cycle that finds a threshold breach would fire a new alert — flooding on-call with hundreds of duplicate pages during a 2-hour outage. The open violation record acts as a dedup key: only one alert is sent when the violation opens, and a recovery notification is sent when it closes.
- Windowed evaluation (5-minute window): a single bad datapoint (one 503 response) should not open a violation. Averaging over 5 minutes smooths transient spikes. For latency SLAs, compute the true p99 over the window, not the average — average latency can look fine while 1% of requests are timing out.
- Generated column for uptime_pct: the GENERATED ALWAYS AS expression computes uptime_pct automatically from total_minutes and downtime_minutes on every insert/update — no application logic needed. It’s always consistent and queryable efficiently.
- Partition MetricDatapoint by day/week: high-frequency metric ingestion (30-second intervals × 100 services × 10 metric types ≈ 2.9M rows/day) requires partitioning. Use daily partitions and drop partitions older than 7 days; the violation and uptime calculations read from SlaViolation, not raw datapoints, so the short retention is safe.
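A small scheduled job can create tomorrow's partition and drop week-old ones. A sketch of the DDL generation (the partition naming scheme is an assumption; adapt it to your migration tooling):

```python
import datetime as dt

def partition_ddl(day: dt.date) -> str:
    """CREATE statement for one daily partition of MetricDatapoint."""
    nxt = day + dt.timedelta(days=1)
    return (
        f"CREATE TABLE IF NOT EXISTS metricdatapoint_{day:%Y%m%d} "
        f"PARTITION OF MetricDatapoint "
        f"FOR VALUES FROM ('{day}') TO ('{nxt}');"
    )

def drop_ddl(day: dt.date) -> str:
    """DROP statement for a partition past the 7-day retention window."""
    return f"DROP TABLE IF EXISTS metricdatapoint_{day:%Y%m%d};"

print(partition_ddl(dt.date(2024, 12, 1)))
```

Dropping a whole partition is effectively instant, whereas the row-by-row `DELETE` in `ingest_metric` churns the table on every health check; with partitioning in place, that inline pruning can be removed.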
Frequently Asked Questions

What is the difference between availability, reliability, and durability in SLA definitions?
Availability: the percentage of time the service is accessible and responsive. "99.9% uptime" means at most about 8.7 hours of downtime per year, measured by health checks succeeding. Reliability: the probability that a specific operation succeeds when the service is available ("99.95% of API calls succeed"). A service can be available (health checks pass) but unreliable (50% of requests return 500); this is measured by error rate. Durability: the probability that stored data is not lost. "11 nines durability" (S3's claim) means a data-loss probability of 0.000000001% per year, measured by data recovery success after failure scenarios. In an SLA monitoring system, track separate SlaDefinitions for each dimension. A payment service might have an availability SLA (99.99%), a reliability SLA (error_rate < 0.01%), a latency SLA (p99 < 200ms), and a durability SLA (no transaction loss). Violating one does not necessarily violate the others.

How do you set SLA thresholds that are achievable but meaningful?
Thresholds set too tight generate false alerts and alert fatigue; too loose, and real degradations go unnoticed. Setting process: (1) measure a baseline: collect 30 days of historical metrics (p99 latency, error rate, availability) at the SlaDefinition's evaluation granularity; (2) identify natural variance: if p99 latency ranges from 80ms to 150ms on normal days, a 500ms threshold is meaningful but 200ms would create false alerts; (3) set thresholds at 2-3 standard deviations above the mean, or at the 99th percentile of historical values (99% of historical evaluations should pass); (4) calibrate severity: warning at 1 standard deviation above normal, critical at 2, page at 3; (5) re-evaluate quarterly as traffic patterns change. For new services with no baseline, use industry benchmarks (error rate < 0.1%, API p99 < 500ms) and tighten once baseline data is available.

How do you calculate the exact downtime for an SLA credit claim?
Downtime for credit purposes must be measured precisely and documented. Method: sum the duration of all SlaViolation records where metric_type='availability' and the violation falls within the billing period. For violations spanning the period boundary (started Nov 30, ended Dec 1), count only the portion within the billing period (Dec 1 00:00 to the resolved_at time). In SQL, COALESCE(resolved_at, NOW()) clips open violations at NOW(). Round to the nearest minute. Document the downtime calculation methodology in your SLA agreement, not just the uptime percentage target: customers will challenge it. Credits are typically calculated as credit_amount = monthly_fee × credit_pct; a 10% credit on a $1,000/month plan is a $100 credit. Issue it as an account credit, not a cash refund.

How do you monitor SLAs for a service you don't control (third-party dependency)?
Your SLA to customers includes your service's uptime, but if Stripe goes down and your payment processing fails, that contributes to your effective downtime even though Stripe violated their SLA. Two approaches: (1) external monitoring: set up your own health checks against the third party's API endpoints (not just their status page, which may lag). Track third-party availability in MetricDatapoint with service_name='stripe'; if Stripe's API returns 503, open a violation in SlaViolation. (2) Cascading SLA attribution: if your API fails because Stripe is down, attribute the violation to the third-party dependency (e.g., via a failure_cause column added to SlaViolation). This lets you present customers with a breakdown: "Total downtime: 47 minutes, of which 42 minutes were due to a payment processor (Stripe) outage." Some SLAs explicitly exclude third-party-caused downtime; document this in your SLA agreement.

What is an error budget and how does it change engineering team behavior?
An error budget is the allowed amount of downtime or errors under an SLA. A 99.9% availability SLA has an error budget of 0.1% of time per month, roughly 43.8 minutes. If the team deploys frequently and each deploy causes 5 minutes of elevated errors, they can afford about 8 deploys per month before the budget is exhausted. When the budget is near zero, the team should focus on reliability work (reducing deploy-induced errors, adding retries) rather than features; when plenty of budget remains, the team can take more deployment risk for faster feature delivery. Error budgets align incentives between product (ships features fast) and SRE (keeps things reliable): both teams manage the same budget. Track budget consumption in the monitoring system: SELECT 100.0 - uptime_pct AS budget_consumed_pct FROM UptimeWindow WHERE service_name='api' AND window_start=date_trunc('month', NOW()). Alert when 50% and 75% of the monthly budget is consumed.
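The error-budget arithmetic is easy to encode against UptimeWindow data. A sketch, assuming an average-length month of 43,830 minutes (365.25 days / 12):

```python
def error_budget_minutes(sla_pct: float, period_minutes: float = 43830.0) -> float:
    """Allowed downtime per period under an availability SLA.
    Default period is an average-length month (365.25 days / 12)."""
    return period_minutes * (100.0 - sla_pct) / 100.0

budget = error_budget_minutes(99.9)
print(round(budget, 1))   # 43.8 minutes per month
# If each deploy costs ~5 minutes of elevated errors, the budget funds:
print(int(budget // 5))   # 8 deploys
```

Consumption against this budget is just `100.0 - uptime_pct` scaled by the period length, so the 50% and 75% alert thresholds fall out directly.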