Service Registry Low-Level Design: Registration, Health Monitoring, and Client-Side Discovery

A service registry is the backbone of dynamic service discovery in microservice architectures. Rather than hardcoding service endpoints, every instance registers itself on startup and clients query the registry at runtime. This post covers the full low-level design: registration protocol, health monitoring, client-side discovery, and watch-based change notification.

Registration Protocol

Every service instance registers on startup by sending its service_name, host, port, and a metadata blob (version, region, tags). Registration uses a two-phase approach: the instance first registers with status STARTING, then transitions to HEALTHY only after passing its own internal health checks. This prevents client traffic from reaching instances that are still warming up.

Leases use TTL-based renewal: each registered instance holds a lease expiring at lease_expires_at. The instance sends periodic heartbeats (e.g., every 5 seconds) to renew the lease. If the registry receives no heartbeat before expiry, the instance is automatically deregistered. TTL is typically 3x the heartbeat interval to tolerate transient network delays.
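The lease mechanics above can be sketched as a small registry-side table. This is a minimal illustration, assuming an in-memory store keyed by instance id; the names (`LeaseTable`, `expire_stale`) are illustrative, not from any specific registry implementation.

```python
import time

HEARTBEAT_INTERVAL = 5               # seconds between client heartbeats
LEASE_TTL = 3 * HEARTBEAT_INTERVAL   # 3x heartbeat tolerates transient delays

class LeaseTable:
    def __init__(self):
        self._leases = {}  # instance_id -> lease_expires_at (epoch seconds)

    def renew(self, instance_id, now=None):
        """Heartbeat handler: push the expiry out by one TTL from now."""
        now = time.time() if now is None else now
        self._leases[instance_id] = now + LEASE_TTL

    def expire_stale(self, now=None):
        """Sweep run by a background worker; returns the ids it deregistered."""
        now = time.time() if now is None else now
        expired = [i for i, exp in self._leases.items() if exp <= now]
        for instance_id in expired:
            del self._leases[instance_id]
        return expired
```

A background sweep calling `expire_stale` every few seconds is what turns a missed heartbeat into an automatic deregistration.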

Health Probing

Beyond passive TTL-based expiry, the registry actively probes each registered instance. Probes can be HTTP GET to a /health endpoint or TCP connect. The registry tracks consecutive failures per instance. After N consecutive failures (e.g., 3), the instance is deregistered and clients are notified. This handles cases where an instance is alive (sending heartbeats) but not serving traffic correctly.

The HealthCheck table records every probe outcome so failure trends are observable. A probe scheduler runs as a background worker, distributing probes evenly across time to avoid thundering herd on the probe targets.
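The consecutive-failure rule can be captured in a few lines. This sketch assumes the actual probe (HTTP GET or TCP connect) is performed elsewhere and only its boolean outcome is recorded; `FAILURE_THRESHOLD` mirrors the N=3 example above.

```python
FAILURE_THRESHOLD = 3  # N consecutive failures before deregistration

class ProbeTracker:
    def __init__(self, threshold=FAILURE_THRESHOLD):
        self.threshold = threshold
        self._failures = {}  # instance_id -> consecutive failure count

    def record(self, instance_id, ok):
        """Record one probe outcome; returns True when the instance
        has crossed the threshold and should be deregistered."""
        if ok:
            self._failures[instance_id] = 0  # any success resets the streak
            return False
        count = self._failures.get(instance_id, 0) + 1
        self._failures[instance_id] = count
        return count >= self.threshold
```

Resetting the counter on any success is what distinguishes a flapping instance from a dead one: only an unbroken run of failures triggers removal.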

Client-Side Discovery

Clients query the registry for all healthy instances of a target service. The response includes host, port, and metadata for each instance. Clients cache this list with a short TTL (e.g., 30 seconds) to avoid registry overload. On cache expiry or on error connecting to a returned instance, the client re-queries.

Load balancing is performed client-side using the cached instance list:

  • Round-robin: cycle through instances in order
  • Random: pick uniformly at random — avoids synchronized waves
  • Weighted: prefer instances with higher capacity metadata
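The weighted strategy is the least obvious of the three, so here is a sketch. It assumes, by convention, that each instance carries a numeric `weight` in its metadata blob; that key is an assumption of this example, not part of the registry API described above.

```python
import random

def pick_weighted(instances, rng=random):
    """Pick one instance from the cached list, biased by metadata weight.
    Instances without a weight default to 1; returns None on an empty list."""
    if not instances:
        return None
    weights = [inst.get("metadata", {}).get("weight", 1) for inst in instances]
    return rng.choices(instances, weights=weights, k=1)[0]
```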

Watch / Subscribe

Polling for changes adds latency between a registry update and client awareness. Instead, clients subscribe to changes for a given service name. The registry maintains a ServiceWatcher table of callback URLs. When instance health status changes (new registration, deregistration, or status flip), the registry fans out delta updates to all registered watchers via HTTP POST or long-poll response.

This enables near-real-time convergence: clients stop routing to a deregistered instance within seconds rather than after the next poll cycle.
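The fan-out path can be sketched as follows. This assumes an in-memory watcher table and a `notify(callback_url, delta)` transport supplied by the caller (in practice an HTTP POST to the callback URL); the names are illustrative.

```python
class WatchHub:
    def __init__(self, notify):
        self._watchers = {}   # service_name -> set of callback URLs
        self._notify = notify  # transport, e.g. an HTTP POST

    def subscribe(self, service_name, callback_url):
        self._watchers.setdefault(service_name, set()).add(callback_url)

    def publish(self, service_name, delta):
        """Fan a delta update (e.g. {'event': 'DEREGISTERED', ...})
        out to every watcher of this service."""
        for url in self._watchers.get(service_name, ()):
            self._notify(url, delta)
```

In a production registry the `publish` loop would be asynchronous with per-watcher retry, so one slow callback cannot delay the others.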

SQL Schema

-- Service instances and their lease state
CREATE TABLE ServiceInstance (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    service_name    TEXT NOT NULL,
    host            TEXT NOT NULL,
    port            INTEGER NOT NULL,
    metadata        JSONB NOT NULL DEFAULT '{}',
    status          TEXT NOT NULL DEFAULT 'STARTING',  -- STARTING, HEALTHY, UNHEALTHY
    lease_expires_at TIMESTAMPTZ NOT NULL,
    registered_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT chk_status CHECK (status IN ('STARTING','HEALTHY','UNHEALTHY'))
);

CREATE INDEX idx_si_service_status ON ServiceInstance(service_name, status);
CREATE INDEX idx_si_lease ON ServiceInstance(lease_expires_at);

-- Health probe results
CREATE TABLE HealthCheck (
    id                   BIGSERIAL PRIMARY KEY,
    instance_id          UUID NOT NULL REFERENCES ServiceInstance(id) ON DELETE CASCADE,
    status               TEXT NOT NULL,  -- OK, FAIL
    last_checked_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    consecutive_failures INTEGER NOT NULL DEFAULT 0
);

CREATE INDEX idx_hc_instance ON HealthCheck(instance_id, last_checked_at DESC);

-- Watch subscriptions for change notification
CREATE TABLE ServiceWatcher (
    id               BIGSERIAL PRIMARY KEY,
    service_name     TEXT NOT NULL,
    callback_url     TEXT NOT NULL,
    last_notified_at TIMESTAMPTZ,
    UNIQUE(service_name, callback_url)
);

Python Interface

import json
import urllib.request
from typing import Optional

class ServiceRegistryClient:
    def __init__(self, registry_url: str, ttl_seconds: int = 15):
        self.registry_url = registry_url
        self.ttl_seconds = ttl_seconds

    def register_instance(
        self,
        service_name: str,
        host: str,
        port: int,
        metadata: dict
    ) -> str:
        """Register a service instance; returns instance_id. Status starts as STARTING."""
        payload = {
            "service_name": service_name,
            "host": host,
            "port": port,
            "metadata": metadata,
            "ttl_seconds": self.ttl_seconds,
        }
        # POST /instances -> {"instance_id": "..."}
        response = self._post("/instances", payload)
        return response["instance_id"]

    def renew_lease(self, instance_id: str) -> None:
        """Heartbeat: extend lease_expires_at by TTL from now."""
        self._put(f"/instances/{instance_id}/heartbeat", {})

    def mark_healthy(self, instance_id: str) -> None:
        """Transition instance from STARTING to HEALTHY after self-check passes."""
        self._put(f"/instances/{instance_id}/status", {"status": "HEALTHY"})

    def deregister_instance(self, instance_id: str) -> None:
        """Explicit deregistration on graceful shutdown."""
        self._delete(f"/instances/{instance_id}")

    def discover_instances(self, service_name: str) -> list[dict]:
        """Return list of healthy instances with host/port/metadata."""
        return self._get(f"/services/{service_name}/instances")

    def subscribe(self, service_name: str, callback_url: str) -> None:
        """Register a watch callback for service change events."""
        self._post("/watchers", {
            "service_name": service_name,
            "callback_url": callback_url
        })

    def _get(self, path: str) -> dict:
        with urllib.request.urlopen(self.registry_url + path) as r:
            return json.loads(r.read())

    def _request(self, path: str, payload: dict, method: str) -> dict:
        data = json.dumps(payload).encode()
        req = urllib.request.Request(self.registry_url + path, data=data,
                                     headers={"Content-Type": "application/json"},
                                     method=method)
        with urllib.request.urlopen(req) as r:
            return json.loads(r.read())

    def _post(self, path: str, payload: dict) -> dict:
        return self._request(path, payload, "POST")

    def _put(self, path: str, payload: dict) -> dict:
        return self._request(path, payload, "PUT")

    def _delete(self, path: str) -> None:
        req = urllib.request.Request(self.registry_url + path, method="DELETE")
        with urllib.request.urlopen(req):
            pass


class RoundRobinBalancer:
    def __init__(self):
        self._index = 0

    def pick(self, instances: list[dict]) -> Optional[dict]:
        if not instances:
            return None
        instance = instances[self._index % len(instances)]
        self._index += 1
        return instance

Design Trade-offs and Failure Modes

Split-brain scenarios: If the registry cluster partitions, different clients may see different instance sets. Prefer CP registries (etcd, Zookeeper) for consistency or AP (Eureka) for availability — the choice depends on whether stale endpoints or missing endpoints are more harmful for your traffic.

Client cache TTL: Short TTL (5–10 seconds) means faster convergence after instance changes but more registry load. Long TTL (60+ seconds) reduces load but delays detection of dead instances. A reasonable default is 30 seconds with watch-based invalidation.
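That combination (a short TTL plus watch-driven invalidation) can be sketched as a small client-side cache. `fetch_fn` stands in for a registry lookup such as `discover_instances` and is an assumption of this sketch.

```python
import time

class DiscoveryCache:
    def __init__(self, fetch_fn, ttl_seconds=30):
        self._fetch = fetch_fn          # e.g. a registry lookup per service
        self._ttl = ttl_seconds
        self._entries = {}              # service_name -> (instances, fetched_at)

    def get(self, service_name, now=None):
        """Serve from cache while fresh; re-fetch once the TTL lapses."""
        now = time.time() if now is None else now
        entry = self._entries.get(service_name)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]
        instances = self._fetch(service_name)
        self._entries[service_name] = (instances, now)
        return instances

    def invalidate(self, service_name):
        """Hook for a watch notification: force a refresh on next lookup."""
        self._entries.pop(service_name, None)
```

With this shape, the TTL bounds staleness in the worst case while the watch callback keeps the common case near real time.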


