SLA Monitoring System: Low-Level Design
An SLA monitoring system tracks whether a service is meeting its availability, latency, and error rate commitments. It ingests metrics, evaluates them against defined thresholds, fires alerts on violations, and computes rolling uptime percentages for customer-facing status pages and internal reporting. This design covers the metric ingestion pipeline, threshold evaluation engine, alert deduplication, and the rolling uptime calculation used in SLA credits.
Core Data Model
CREATE TABLE SlaDefinition (
sla_id SERIAL PRIMARY KEY,
service_name VARCHAR(100) NOT NULL,
metric_type VARCHAR(50) NOT NULL, -- availability, p99_latency_ms, error_rate_pct
threshold NUMERIC(10,4) NOT NULL, -- 99.9 for availability, 500 for p99 ms
comparison VARCHAR(10) NOT NULL, -- gte, lte, gt, lt
window_minutes INT NOT NULL DEFAULT 5, -- evaluation window
severity VARCHAR(20) NOT NULL DEFAULT 'warning', -- warning, critical, page
is_active BOOLEAN NOT NULL DEFAULT TRUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE MetricDatapoint (
datapoint_id BIGSERIAL PRIMARY KEY,
service_name VARCHAR(100) NOT NULL,
metric_type VARCHAR(50) NOT NULL,
value NUMERIC(12,4) NOT NULL,
sampled_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (sampled_at);
CREATE TABLE SlaViolation (
violation_id BIGSERIAL PRIMARY KEY,
sla_id INT NOT NULL REFERENCES SlaDefinition(sla_id),
service_name VARCHAR(100) NOT NULL,
metric_type VARCHAR(50) NOT NULL,
observed_value NUMERIC(12,4) NOT NULL,
threshold NUMERIC(10,4) NOT NULL,
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
resolved_at TIMESTAMPTZ,
alert_sent BOOLEAN NOT NULL DEFAULT FALSE
);
CREATE TABLE UptimeWindow (
window_id BIGSERIAL PRIMARY KEY,
service_name VARCHAR(100) NOT NULL,
window_start TIMESTAMPTZ NOT NULL,
window_end TIMESTAMPTZ NOT NULL,
total_minutes INT NOT NULL,
downtime_minutes NUMERIC(8,2) NOT NULL DEFAULT 0,
uptime_pct NUMERIC(6,4) GENERATED ALWAYS AS
(100.0 * (total_minutes - downtime_minutes) / total_minutes) STORED,
UNIQUE (service_name, window_start)
);
CREATE INDEX ON MetricDatapoint(service_name, metric_type, sampled_at DESC);
CREATE INDEX ON SlaViolation(sla_id, resolved_at) WHERE resolved_at IS NULL;
Metric Ingestion and Evaluation
import datetime, statistics
def ingest_metric(service_name: str, metric_type: str, value: float):
"""Called by services on each health check (e.g., every 30 seconds)."""
db.execute("""
INSERT INTO MetricDatapoint (service_name, metric_type, value, sampled_at)
VALUES (%s,%s,%s,NOW())
""", (service_name, metric_type, value))
# Prune old datapoints (keep 7 days)
db.execute("""
DELETE FROM MetricDatapoint
WHERE service_name=%s AND metric_type=%s AND sampled_at float:
rows = db.fetchall("""
SELECT value FROM MetricDatapoint
WHERE service_name=%s AND metric_type=%s
AND sampled_at >= NOW() - INTERVAL '%s minutes'
ORDER BY sampled_at DESC
""", (service_name, metric_type, window_minutes))
if not rows:
return None
values = [float(r['value']) for r in rows]
if metric_type == 'p99_latency_ms':
values.sort()
idx = int(len(values) * 0.99)
return values[min(idx, len(values) - 1)]
elif metric_type == 'availability':
return statistics.mean(values)
elif metric_type == 'error_rate_pct':
return statistics.mean(values)
return statistics.mean(values)
def _evaluate_threshold(value: float, threshold: float, comparison: str) -> bool:
ops = {'gte': value >= threshold, 'lte': value threshold, 'lt': value < threshold}
return ops.get(comparison, False)
def _open_violation(sla: dict, value: float):
db.execute("""
INSERT INTO SlaViolation
(sla_id, service_name, metric_type, observed_value, threshold)
VALUES (%s,%s,%s,%s,%s)
""", (sla['sla_id'], sla['service_name'], sla['metric_type'],
value, sla['threshold']))
if sla['severity'] in ('critical', 'page'):
_send_alert(sla, value)
def _close_violation(violation_id: int):
db.execute("""
UPDATE SlaViolation SET resolved_at=NOW() WHERE violation_id=%s
""", (violation_id,))
Rolling Uptime Calculation
def compute_monthly_uptime(service_name: str, year: int, month: int) -> dict:
"""
Compute availability percentage for SLA credit calculations.
Downtime = sum of violation durations where metric_type='availability'.
"""
window_start = datetime.datetime(year, month, 1)
next_month = (window_start.replace(day=28) + datetime.timedelta(days=4)).replace(day=1)
window_end = next_month
total_minutes = int((window_end - window_start).total_seconds() / 60)
# Sum violation durations in the month
downtime = db.fetchone("""
SELECT COALESCE(SUM(
EXTRACT(EPOCH FROM (COALESCE(resolved_at, NOW()) - started_at)) / 60
), 0) AS downtime_minutes
FROM SlaViolation v
JOIN SlaDefinition s USING(sla_id)
WHERE v.service_name=%s
AND s.metric_type='availability'
AND v.started_at %s)
""", (service_name, window_end, window_start))
downtime_minutes = float(downtime['downtime_minutes'])
uptime_pct = 100.0 * (total_minutes - downtime_minutes) / total_minutes
db.execute("""
INSERT INTO UptimeWindow
(service_name, window_start, window_end, total_minutes, downtime_minutes)
VALUES (%s,%s,%s,%s,%s)
ON CONFLICT (service_name, window_start) DO UPDATE
SET downtime_minutes=EXCLUDED.downtime_minutes
""", (service_name, window_start, window_end, total_minutes, downtime_minutes))
# SLA credit tiers
credit_pct = 0
if uptime_pct < 95.0: credit_pct = 25
elif uptime_pct < 99.0: credit_pct = 10
elif uptime_pct < 99.9: credit_pct = 5
return {
'uptime_pct': round(uptime_pct, 4),
'downtime_minutes': round(downtime_minutes, 2),
'credit_pct': credit_pct,
}
Key Design Decisions
- Violation state machine (open/closed) deduplicates alerts: without state tracking, every evaluation cycle that finds a threshold breach would fire a new alert — flooding on-call with hundreds of duplicate pages during a 2-hour outage. The open violation record acts as a dedup key: only one alert is sent when the violation opens, and a recovery notification is sent when it closes.
- Windowed evaluation (5-minute window): a single bad datapoint (one 503 response) should not open a violation. Averaging over 5 minutes smooths transient spikes. For latency SLAs, compute the true p99 over the window, not the average — average latency can look fine while 1% of requests are timing out.
- Generated column for uptime_pct: the GENERATED ALWAYS AS expression computes uptime_pct automatically from total_minutes and downtime_minutes on every insert/update — no application logic needed. It’s always consistent and queryable efficiently.
- Partition MetricDatapoint by day/week: high-frequency metric ingestion (30-second intervals × 100 services × 10 metric types = 2M rows/day) requires partitioning. Use daily partitions and drop partitions older than 7 days. The SLA violation and uptime calculations only need the last 30 days.
SLA monitoring and service reliability design is discussed in Google system design interview questions.
SLA monitoring and uptime tracking system design is covered in Amazon system design interview preparation.
SLA monitoring and data pipeline reliability design is discussed in Databricks system design interview guide.