Health Check Service Low-Level Design: Active Probing, Dependency Graph, and Alerting

A health check service continuously probes every service in a platform, aggregates results through a dependency graph, classifies health states, routes alerts on transitions, and exposes a public status endpoint. This guide covers the low-level design: probe types, scheduling, state machine, dependency rollup, degraded classification, alert routing, and the status page feed.

1. Check Types

  • HTTP: send a GET (or configured method) to the health endpoint. A 2xx response within the timeout is HEALTHY. Any non-2xx or timeout is a failure.
  • TCP: open a TCP connection to host:port. Successful connection within timeout = HEALTHY. Connection refused or timeout = failure.
  • Custom script: execute a shell command or call an internal RPC. Exit code 0 = HEALTHY; non-zero = failure. Used for checks that require application-level logic (e.g., queue depth < threshold).

Response time is always recorded regardless of outcome; it feeds the DEGRADED classification even when the status code is 2xx.
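The reference implementation in Section 8 shows the HTTP probe; a TCP probe returns the same result-dict shape. A minimal sketch, using plain-string statuses for self-containment (the `endpoint` field is assumed to hold `host:port` as in the HealthCheck schema):

```python
import socket
import time

def run_tcp_check(check: dict) -> dict:
    """Open a TCP connection; success within the timeout means HEALTHY."""
    host, _, port = check["endpoint"].rpartition(":")
    start = time.monotonic()
    try:
        with socket.create_connection((host, int(port)),
                                      timeout=check["timeout_ms"] / 1000.0):
            elapsed_ms = int((time.monotonic() - start) * 1000)
        return {"status": "HEALTHY", "response_time_ms": elapsed_ms,
                "error_message": None}
    except socket.timeout:
        return {"status": "UNHEALTHY", "response_time_ms": check["timeout_ms"],
                "error_message": "Timeout"}
    except OSError as e:  # connection refused, DNS failure, unreachable host
        return {"status": "UNHEALTHY", "response_time_ms": None,
                "error_message": str(e)}
```

Connection refused and timeout both map to UNHEALTHY, but the distinct error_message preserves the difference for debugging.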

2. Check Scheduling

Each HealthCheck row defines its own interval_seconds (10s to 60s) and timeout_ms. A scheduler service uses a min-heap (priority queue) ordered by next_run_at. On each tick it pops all checks due for execution, dispatches them to a worker pool, and re-enqueues each check at now + interval_seconds after the result is recorded.

Workers run checks concurrently. A semaphore limits total concurrent checks to prevent thundering-herd against upstream services. Check execution is stateless: the worker reads config from the DB, performs the probe, and writes a CheckResult row, then calls evaluate_state.
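The scheduling loop can be sketched with Python's heapq. The worker pool and DB access are stubbed into a `dispatch` callable, and the `iterations` bound exists only so the sketch terminates; both are illustrative, not part of the design:

```python
import heapq
import time

def scheduler_loop(checks, dispatch, iterations, now=time.monotonic):
    """Pop due checks from a min-heap ordered by next_run_at, dispatch,
    and re-enqueue each at now + interval_seconds.

    checks: list of dicts with 'id' and 'interval_seconds'.
    dispatch: callable that runs one check (a worker pool in production).
    """
    heap = [(now(), c["id"], c) for c in checks]  # id breaks ties in tuple compare
    heapq.heapify(heap)
    done = 0
    while heap and done < iterations:
        next_run_at, _, check = heapq.heappop(heap)
        delay = next_run_at - now()
        if delay > 0:
            time.sleep(delay)                    # wait until the check is due
        dispatch(check)                          # probe + evaluate_state happen here
        heapq.heappush(heap, (now() + check["interval_seconds"],
                              check["id"], check))
        done += 1
```

Re-enqueueing after the result is recorded (rather than before dispatch) means a slow check cannot overlap with its own next run.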

CREATE TABLE HealthCheck (
    id                  BIGSERIAL PRIMARY KEY,
    service_name        TEXT NOT NULL,
    check_type          TEXT NOT NULL,        -- HTTP | TCP | SCRIPT
    endpoint            TEXT NOT NULL,        -- URL, host:port, or script path
    interval_seconds    INT NOT NULL DEFAULT 30,
    timeout_ms          INT NOT NULL DEFAULT 5000,
    failure_threshold   INT NOT NULL DEFAULT 3,   -- consecutive fails -> UNHEALTHY
    recovery_threshold  INT NOT NULL DEFAULT 2,   -- consecutive successes -> HEALTHY
    warning_threshold_ms INT,                     -- response time -> DEGRADED
    enabled             BOOLEAN NOT NULL DEFAULT TRUE
);

CREATE TABLE CheckResult (
    id              BIGSERIAL PRIMARY KEY,
    check_id        BIGINT NOT NULL REFERENCES HealthCheck(id),
    status          TEXT NOT NULL,            -- HEALTHY | DEGRADED | UNHEALTHY
    response_time_ms INT,
    error_message   TEXT,
    checked_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE ServiceHealth (
    service_name    TEXT PRIMARY KEY,
    current_status  TEXT NOT NULL DEFAULT 'UNKNOWN',
    last_changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    consecutive_failures INT NOT NULL DEFAULT 0,
    consecutive_successes INT NOT NULL DEFAULT 0
);

CREATE TABLE ServiceDependency (
    service_name    TEXT NOT NULL,
    depends_on      TEXT NOT NULL,
    PRIMARY KEY (service_name, depends_on)
);

3. State Machine

Health state transitions use consecutive counts, not a simple last-result flag. This prevents flapping from a single transient failure:

UNKNOWN
  -> HEALTHY  (first successful probe)

HEALTHY
  -> DEGRADED (1+ failure OR response_time > warning_threshold_ms)
  -> UNHEALTHY (consecutive_failures >= failure_threshold)

DEGRADED
  -> HEALTHY  (consecutive_successes >= recovery_threshold AND response_time ok)
  -> UNHEALTHY (consecutive_failures >= failure_threshold)

UNHEALTHY
  -> DEGRADED (1 success)
  -> HEALTHY  (consecutive_successes >= recovery_threshold)

The state is stored in ServiceHealth. State transitions trigger an event published to an alert routing queue.
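The transition rules can be isolated as a pure function. This is a sketch mirroring the logic evaluate_state in Section 8 applies against the DB; plain-string statuses and the default thresholds are illustrative:

```python
def next_state(current, is_failure, consec_fail, consec_succ,
               failure_threshold=3, recovery_threshold=2):
    """One step of the health state machine.

    consec_fail / consec_succ are the consecutive counts *including*
    this probe result.
    """
    if is_failure:
        if consec_fail >= failure_threshold:
            return "UNHEALTHY"
        if current == "HEALTHY":
            return "DEGRADED"       # first failure demotes a healthy service
        return current              # below threshold: DEGRADED/UNKNOWN stay put
    # success path
    if current == "UNKNOWN":
        return "HEALTHY"            # first successful probe
    if consec_succ >= recovery_threshold:
        return "HEALTHY"
    if current == "UNHEALTHY":
        return "DEGRADED"           # a single success lifts to DEGRADED
    return current
```

Keeping the rules pure makes the state machine trivially unit-testable, independent of the DB plumbing around it.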

4. Dependency Graph and Health Rollup

Services declare upstream dependencies in ServiceDependency. Aggregate health is computed as the worst (highest-severity) status across a service and all its transitive dependencies:

health_order = {HEALTHY: 0, DEGRADED: 1, UNHEALTHY: 2, UNKNOWN: 3}

aggregate_health(service) = max(
    own_health(service),
    max(aggregate_health(dep) for dep in direct_dependencies(service))
)

A cycle in the dependency graph would cause infinite recursion; detect cycles during dependency registration (topological sort) and reject circular declarations.
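Registration-time cycle detection can be sketched as a reachability check on the proposed edge. The in-memory adjacency dict stands in for ServiceDependency rows:

```python
def would_create_cycle(deps: dict, service: str, new_dep: str) -> bool:
    """Return True if adding the edge service -> new_dep creates a cycle.

    deps maps service_name -> set of direct dependencies. A cycle exists
    iff `service` is already reachable from `new_dep`.
    """
    stack, seen = [new_dep], set()
    while stack:
        node = stack.pop()
        if node == service:
            return True             # found a path back to the proposing service
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, ()))
    return False
```

Checking at registration keeps compute_aggregate_health's recursive traversal safe; its visited-set guard then only has to cover data drift, not routine input.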

A payment service with a dependency on a DEGRADED database is itself reported as DEGRADED on the status page, even if its own probes pass. This prevents misleading green status when a critical dependency is struggling.

5. Degraded Classification

DEGRADED is a distinct state between HEALTHY and UNHEALTHY:

  • HTTP probe returns 2xx but response_time_ms > warning_threshold_ms → DEGRADED
  • First failure (before failure_threshold is reached) → DEGRADED
  • A dependency is DEGRADED or UNHEALTHY → service is DEGRADED at minimum

DEGRADED allows differentiated alerting: page on-call for UNHEALTHY, send a Slack warning for DEGRADED. It also feeds SLO tracking: time spent DEGRADED counts toward error budget consumption at a configurable partial rate.
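The partial burn rate can be illustrated as follows; the 0.5 default rate and the (status, duration) interval shape are assumptions for the sketch, not fixed by the design:

```python
def error_budget_consumed_seconds(intervals, degraded_rate=0.5):
    """Sum error-budget burn over (status, duration_seconds) intervals.

    UNHEALTHY burns at the full rate, DEGRADED at a configurable partial
    rate, HEALTHY and UNKNOWN at zero.
    """
    rates = {"UNHEALTHY": 1.0, "DEGRADED": degraded_rate}
    return sum(duration * rates.get(status, 0.0)
               for status, duration in intervals)
```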

6. Alert Routing

On any ServiceHealth.current_status transition, publish a state-change event to an internal message bus. The alert router subscribes and applies routing rules:

  • HEALTHY → UNHEALTHY: create PagerDuty incident (high urgency)
  • HEALTHY → DEGRADED: post Slack message to #alerts-{team}
  • UNHEALTHY → HEALTHY: resolve PagerDuty incident; post recovery message
  • DEGRADED → HEALTHY: post Slack recovery message

Alert routing rules are stored in a configuration table keyed by service name and target state, allowing per-service customization without code changes.
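The rule lookup can be sketched as a dict keyed by (service, from_state, to_state), with a wildcard fallback; the rule contents and notifier names are illustrative:

```python
def route_alert(rules, service, old_status, new_status, default=None):
    """Resolve the notification target for a state transition.

    Checks a service-specific rule first, then a '*' wildcard rule,
    then falls back to the optional default.
    """
    for key in ((service, old_status, new_status),
                ("*", old_status, new_status)):
        if key in rules:
            return rules[key]
    return default
```

In production the rules dict would be loaded (and cached) from the configuration table, so per-service overrides take effect without a deploy.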

7. Public Status Page

A read-only endpoint aggregates ServiceHealth for all public-facing services and returns a JSON response. The status page frontend polls this endpoint every 30 seconds. The endpoint is served from a CDN with a short TTL (15s) so a single spike in status page traffic does not hit the database.

{
  "updated_at": "2025-04-17T10:00:00Z",
  "overall": "DEGRADED",
  "services": [
    {"name": "API Gateway", "status": "HEALTHY", "last_changed_at": "2025-04-16T08:00:00Z"},
    {"name": "Payment Service", "status": "DEGRADED", "last_changed_at": "2025-04-17T09:55:00Z"},
    {"name": "Auth Service", "status": "HEALTHY", "last_changed_at": "2025-04-15T12:00:00Z"}
  ]
}

Overall status = worst status across all listed services. A separate incident feed lists current open incidents with timestamps and update messages managed by the on-call team.
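Assembling the feed from ServiceHealth rows can be sketched as below; the row shape is assumed, and overall status uses the same severity ordering as the rollup:

```python
from datetime import datetime, timezone

SEVERITY = {"HEALTHY": 0, "DEGRADED": 1, "UNHEALTHY": 2, "UNKNOWN": 3}

def build_status_feed(rows):
    """rows: iterable of dicts with name, status, last_changed_at (ISO strings).

    Returns the status-page JSON payload; overall is the worst status.
    """
    services = [{"name": r["name"], "status": r["status"],
                 "last_changed_at": r["last_changed_at"]} for r in rows]
    overall = max((s["status"] for s in services),
                  key=SEVERITY.__getitem__, default="UNKNOWN")
    return {"updated_at": datetime.now(timezone.utc).isoformat(),
            "overall": overall,
            "services": services}
```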

8. Python Reference Implementation

import time, socket, requests
from enum import Enum
from typing import Optional

class HealthStatus(str, Enum):
    HEALTHY   = "HEALTHY"
    DEGRADED  = "DEGRADED"
    UNHEALTHY = "UNHEALTHY"
    UNKNOWN   = "UNKNOWN"

STATUS_ORDER = {HealthStatus.HEALTHY: 0, HealthStatus.DEGRADED: 1,
                HealthStatus.UNHEALTHY: 2, HealthStatus.UNKNOWN: 3}

def run_http_check(check: dict) -> dict:
    """Execute an HTTP health probe and return a result dict."""
    start = time.monotonic()
    try:
        resp = requests.get(
            check["endpoint"],
            timeout=check["timeout_ms"] / 1000.0,
            allow_redirects=True
        )
        elapsed_ms = int((time.monotonic() - start) * 1000)
        if 200 <= resp.status_code < 300:
            warning_ms = check.get("warning_threshold_ms")
            status = (HealthStatus.DEGRADED
                      if warning_ms and elapsed_ms > warning_ms
                      else HealthStatus.HEALTHY)
            return {"status": status, "response_time_ms": elapsed_ms, "error_message": None}
        return {"status": HealthStatus.UNHEALTHY,
                "response_time_ms": elapsed_ms,
                "error_message": f"HTTP {resp.status_code}"}
    except requests.Timeout:
        return {"status": HealthStatus.UNHEALTHY,
                "response_time_ms": check["timeout_ms"],
                "error_message": "Timeout"}
    except Exception as e:
        return {"status": HealthStatus.UNHEALTHY,
                "response_time_ms": None,
                "error_message": str(e)}

def evaluate_state(check_id: int, new_result: dict) -> Optional[HealthStatus]:
    """Update ServiceHealth state machine; return new status if it changed."""
    check = db.query_one("SELECT * FROM HealthCheck WHERE id=%s", [check_id])
    health = db.query_one(
        "SELECT * FROM ServiceHealth WHERE service_name=%s FOR UPDATE",
        [check.service_name]
    )
    if not health:
        db.execute(
            "INSERT INTO ServiceHealth (service_name, current_status) VALUES (%s,%s)",
            [check.service_name, new_result["status"]]
        )
        return new_result["status"]

    result_status = new_result["status"]
    is_failure = result_status in (HealthStatus.UNHEALTHY, HealthStatus.DEGRADED)

    new_consec_fail = (health.consecutive_failures + 1) if is_failure else 0
    new_consec_succ = (health.consecutive_successes + 1) if not is_failure else 0

    current = HealthStatus(health.current_status)
    new_status = current

    if is_failure:
        if new_consec_fail >= check.failure_threshold:
            new_status = HealthStatus.UNHEALTHY
        elif current == HealthStatus.HEALTHY:
            new_status = HealthStatus.DEGRADED
    else:
        if new_consec_succ >= check.recovery_threshold:
            new_status = HealthStatus.HEALTHY
        elif current == HealthStatus.UNHEALTHY:
            new_status = HealthStatus.DEGRADED

    changed = (new_status != current)
    db.execute(
        """UPDATE ServiceHealth SET current_status=%s, consecutive_failures=%s,
           consecutive_successes=%s, last_changed_at=CASE WHEN %s THEN NOW() ELSE last_changed_at END
           WHERE service_name=%s""",
        [new_status, new_consec_fail, new_consec_succ, changed, check.service_name]
    )
    db.execute(
        "INSERT INTO CheckResult (check_id, status, response_time_ms, error_message) VALUES (%s,%s,%s,%s)",
        [check_id, result_status, new_result.get("response_time_ms"), new_result.get("error_message")]
    )
    if changed:
        publish_state_change_event(check.service_name, current, new_status)
        return new_status
    return None

def compute_aggregate_health(service_name: str, visited: Optional[set] = None) -> HealthStatus:
    """Recursively compute aggregate health including transitive dependencies."""
    if visited is None:
        visited = set()
    if service_name in visited:
        return HealthStatus.UNKNOWN   # cycle guard
    visited.add(service_name)

    own = db.query_one(
        "SELECT current_status FROM ServiceHealth WHERE service_name=%s",
        [service_name]
    )
    own_status = HealthStatus(own.current_status) if own else HealthStatus.UNKNOWN

    deps = db.query(
        "SELECT depends_on FROM ServiceDependency WHERE service_name=%s",
        [service_name]
    )
    dep_statuses = [compute_aggregate_health(d.depends_on, visited) for d in deps]
    all_statuses = [own_status] + dep_statuses
    return max(all_statuses, key=lambda s: STATUS_ORDER[s])

9. Scalability Notes

  • Scheduler sharding: partition services across multiple scheduler instances by consistent hash of service_name; a distributed lock (Redis) prevents duplicate scheduling.
  • CheckResult retention: store only the last N results per check in a circular buffer table; archive older results to cold storage for SLO trend analysis.
  • Synthetic checks: checks that exercise a full user flow (login, add to cart, checkout) run less frequently (every 5 minutes) from multiple geographic regions; results feed the same state machine.
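Shard assignment by consistent hash can be sketched as a hash ring with virtual nodes; instance names and the vnode count are illustrative:

```python
import bisect
import hashlib

def build_ring(instances, vnodes=64):
    """Place vnodes points per scheduler instance on a hash ring."""
    return sorted((int(hashlib.md5(f"{inst}-{v}".encode()).hexdigest(), 16), inst)
                  for inst in instances for v in range(vnodes))

def owner(ring, service_name):
    """Return the scheduler instance responsible for a service: the first
    ring point clockwise from the service's hash."""
    h = int(hashlib.md5(service_name.encode()).hexdigest(), 16)
    keys = [k for k, _ in ring]
    idx = bisect.bisect(keys, h) % len(ring)
    return ring[idx][1]
```

With virtual nodes, adding or removing a scheduler instance reassigns only the services that hashed near that instance's points, so most checks keep their scheduler across membership changes. The Redis lock then guards the window where two instances briefly disagree about ring membership.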

