A service registry is the backbone of dynamic service discovery in microservice architectures. Rather than hardcoding service endpoints, every instance registers itself on startup and clients query the registry at runtime. This post covers the full low-level design: registration protocol, health monitoring, client-side discovery, and watch-based change notification.
Registration Protocol
Every service instance registers on startup by sending its service_name, host, port, and a metadata blob (version, region, tags). Registration uses a two-phase approach: the instance first registers with status STARTING, then transitions to HEALTHY only after passing its own internal health checks. This prevents client traffic from reaching instances that are still warming up.
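The two-phase flow can be read as a small state machine. The sketch below is one possible encoding, not a prescribed one: the status names come from the schema later in this post, while `InstanceStatus`, `ALLOWED_TRANSITIONS`, `transition`, and `is_routable` are hypothetical names introduced for illustration.

```python
from enum import Enum


class InstanceStatus(str, Enum):
    STARTING = "STARTING"
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"


# Assumed transition table: an instance never goes back to STARTING.
ALLOWED_TRANSITIONS = {
    InstanceStatus.STARTING: {InstanceStatus.HEALTHY, InstanceStatus.UNHEALTHY},
    InstanceStatus.HEALTHY: {InstanceStatus.UNHEALTHY},
    InstanceStatus.UNHEALTHY: {InstanceStatus.HEALTHY},
}


def transition(current: InstanceStatus, target: InstanceStatus) -> InstanceStatus:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def is_routable(status: InstanceStatus) -> bool:
    """Only HEALTHY instances should be handed to discovering clients."""
    return status is InstanceStatus.HEALTHY
```

With this table, an instance that is still warming up is registered and visible to the registry, yet `is_routable` keeps it out of discovery results until it reports HEALTHY.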
Leases use TTL-based renewal: each registered instance holds a lease expiring at lease_expires_at. The instance sends periodic heartbeats (e.g., every 5 seconds) to renew the lease. If the registry receives no heartbeat before expiry, the instance is automatically deregistered. TTL is typically 3x the heartbeat interval to tolerate transient network delays.
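As a minimal in-memory sketch of the lease mechanics, assuming the 5-second heartbeat from the example and a TTL of three intervals: `Lease` and `reap_expired` are hypothetical names, and timestamps are plain floats on the registry's clock rather than real `TIMESTAMPTZ` values.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL = 5.0            # seconds, the example value above
LEASE_TTL = 3 * HEARTBEAT_INTERVAL  # 15 seconds, tolerates two missed heartbeats


@dataclass
class Lease:
    instance_id: str
    lease_expires_at: float  # seconds on the registry's clock

    def renew(self, now: float, ttl: float = LEASE_TTL) -> None:
        """A heartbeat pushes the expiry to now + ttl."""
        self.lease_expires_at = now + ttl

    def is_expired(self, now: float) -> bool:
        return now >= self.lease_expires_at


def reap_expired(leases: dict[str, Lease], now: float) -> list[str]:
    """Remove expired leases and return the ids that were deregistered."""
    expired = [iid for iid, lease in leases.items() if lease.is_expired(now)]
    for iid in expired:
        del leases[iid]
    return expired
```

For example, if instances `a` and `b` both hold leases expiring at t=15 and only `a` heartbeats at t=10, a sweep at t=16 removes `b` and leaves `a` with an expiry of t=25.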
Health Probing
Beyond passive TTL-based expiry, the registry actively probes each registered instance. Probes can be HTTP GET to a /health endpoint or TCP connect. The registry tracks consecutive failures per instance. After N consecutive failures (e.g., 3), the instance is deregistered and clients are notified. This handles cases where an instance is alive (sending heartbeats) but not serving traffic correctly.
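The consecutive-failure rule could be tracked with a per-instance counter that resets on any successful probe. This is a sketch under that assumption; `FailureTracker` and `record` are hypothetical names, and the threshold of 3 is the example value from the text.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 3  # the "N" above; 3 is only the example value


class FailureTracker:
    def __init__(self, threshold: int = FAILURE_THRESHOLD):
        self.threshold = threshold
        self._failures: dict[str, int] = defaultdict(int)

    def record(self, instance_id: str, ok: bool) -> bool:
        """Record one probe result.

        Returns True once the instance has reached the failure threshold
        and should be deregistered.
        """
        if ok:
            self._failures[instance_id] = 0  # any success resets the streak
            return False
        self._failures[instance_id] += 1
        return self._failures[instance_id] >= self.threshold
```

Because the counter resets on success, a single flaky probe between two good ones never removes an instance; only an unbroken run of failures does.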
The HealthCheck table records every probe outcome so failure trends are observable. A probe scheduler runs as a background worker and spreads probes evenly across each probe interval to avoid a thundering herd on the probe targets.
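One simple way to spread probes is to give each instance a fixed offset inside the probe interval. The function below is a sketch of that idea only; `probe_offsets` is a hypothetical helper, and a real scheduler would also have to handle instances joining and leaving mid-cycle.

```python
def probe_offsets(instance_ids: list[str], interval_seconds: float) -> dict[str, float]:
    """Spread instances evenly over one probe interval.

    Instance k of n (in sorted id order) is probed k * interval / n seconds
    into each cycle, so probes never fire all at once.
    """
    n = len(instance_ids)
    if n == 0:
        return {}
    step = interval_seconds / n
    return {iid: k * step for k, iid in enumerate(sorted(instance_ids))}


offsets = probe_offsets(["c", "a", "b", "d"], interval_seconds=10.0)
# offsets == {"a": 0.0, "b": 2.5, "c": 5.0, "d": 7.5}
```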
Client-Side Discovery
Clients query the registry for all healthy instances of a target service. The response includes host, port, and metadata for each instance. Clients cache this list with a short TTL (e.g., 30 seconds) to avoid registry overload. On cache expiry or on error connecting to a returned instance, the client re-queries.
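A client-side cache along these lines might look like the following. It is a sketch, not a finished client: `InstanceCache` is a hypothetical name, `fetch` stands in for whatever call queries the registry, and the injectable `clock` exists only to make the expiry logic easy to test.

```python
import time
from typing import Callable


class InstanceCache:
    def __init__(
        self,
        fetch: Callable[[str], list[dict]],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # service_name -> (expires_at, instances)
        self._entries: dict[str, tuple[float, list[dict]]] = {}

    def get(self, service_name: str) -> list[dict]:
        """Return the cached list, re-querying only when the entry has expired."""
        entry = self._entries.get(service_name)
        if entry is not None and self._clock() < entry[0]:
            return entry[1]
        return self.refresh(service_name)

    def refresh(self, service_name: str) -> list[dict]:
        """Force a re-query, e.g. after failing to connect to a cached instance."""
        instances = self._fetch(service_name)
        self._entries[service_name] = (self._clock() + self._ttl, instances)
        return instances
```

Callers use `get` on the hot path and call `refresh` from their connection-error handler, which covers both re-query triggers described above.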
Load balancing is performed client-side using the cached instance list:
- Round-robin: cycle through instances in order
- Random: pick uniformly at random, which avoids synchronized waves of traffic from clients cycling in lockstep
- Weighted: prefer instances with higher capacity metadata
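The random and weighted strategies can be sketched with the standard `random` module. The `capacity` metadata key is an assumption made for this example (the post does not name one), and instances without it default to a weight of 1.

```python
import random
from typing import Optional


def pick_random(instances: list[dict]) -> Optional[dict]:
    """Uniform random choice over the cached instance list."""
    return random.choice(instances) if instances else None


def pick_weighted(instances: list[dict], weight_key: str = "capacity") -> Optional[dict]:
    """Choose an instance with probability proportional to metadata[weight_key].

    Assumes weights are non-negative and at least one is positive.
    """
    if not instances:
        return None
    weights = [float(inst.get("metadata", {}).get(weight_key, 1)) for inst in instances]
    return random.choices(instances, weights=weights, k=1)[0]
```

An instance advertising `capacity` 4 receives roughly four times the traffic of one advertising 1, and an instance with weight 0 is never picked.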
Watch / Subscribe
Polling for changes adds latency between a registry update and client awareness. Instead, clients subscribe to changes for a given service name. The registry maintains a ServiceWatcher table of callback URLs. When an instance's state changes (new registration, deregistration, or status flip), the registry fans out delta updates to all registered watchers via HTTP POST or long-poll response.
This enables near-real-time convergence: clients stop routing to a deregistered instance within seconds rather than after the next poll cycle.
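On the client side, handling a delta could be as small as the function below. The event shape (`type` plus an `instance` object carrying `id` and `status`) is an assumption made for this sketch; the post does not define the wire format.

```python
def apply_delta(instances: dict[str, dict], event: dict) -> None:
    """Apply one delta event to a local {instance_id: instance} view.

    Assumed event shape:
    {"type": "REGISTERED" | "DEREGISTERED" | "STATUS_CHANGED",
     "instance": {"id": ..., "status": ..., ...}}
    """
    instance = event["instance"]
    instance_id = instance["id"]
    if event["type"] == "DEREGISTERED" or instance.get("status") != "HEALTHY":
        instances.pop(instance_id, None)   # stop routing to it right away
    else:
        instances[instance_id] = instance  # new or recovered instance
```

Keeping the view keyed by instance id makes each update idempotent, so a watcher that receives the same event twice ends up in the same state.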
SQL Schema
-- Service instances and their lease state
CREATE TABLE ServiceInstance (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    service_name     TEXT NOT NULL,
    host             TEXT NOT NULL,
    port             INTEGER NOT NULL,
    metadata         JSONB NOT NULL DEFAULT '{}',
    status           TEXT NOT NULL DEFAULT 'STARTING', -- STARTING, HEALTHY, UNHEALTHY
    lease_expires_at TIMESTAMPTZ NOT NULL,
    registered_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT chk_status CHECK (status IN ('STARTING','HEALTHY','UNHEALTHY'))
);
CREATE INDEX idx_si_service_status ON ServiceInstance(service_name, status);
CREATE INDEX idx_si_lease ON ServiceInstance(lease_expires_at);

-- Health probe results
CREATE TABLE HealthCheck (
    id                   BIGSERIAL PRIMARY KEY,
    instance_id          UUID NOT NULL REFERENCES ServiceInstance(id) ON DELETE CASCADE,
    status               TEXT NOT NULL, -- OK, FAIL
    last_checked_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    consecutive_failures INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX idx_hc_instance ON HealthCheck(instance_id, last_checked_at DESC);

-- Watch subscriptions for change notification
CREATE TABLE ServiceWatcher (
    id               BIGSERIAL PRIMARY KEY,
    service_name     TEXT NOT NULL,
    callback_url     TEXT NOT NULL,
    last_notified_at TIMESTAMPTZ,
    UNIQUE(service_name, callback_url)
);
Python Interface
import json
import urllib.request
from typing import Any, Optional


class ServiceRegistryClient:
    def __init__(self, registry_url: str, ttl_seconds: int = 15):
        self.registry_url = registry_url
        self.ttl_seconds = ttl_seconds

    def register_instance(
        self,
        service_name: str,
        host: str,
        port: int,
        metadata: dict,
    ) -> str:
        """Register a service instance; returns instance_id. Status starts as STARTING."""
        payload = {
            "service_name": service_name,
            "host": host,
            "port": port,
            "metadata": metadata,
            "ttl_seconds": self.ttl_seconds,
        }
        # POST /instances -> {"instance_id": "..."}
        response = self._post("/instances", payload)
        return response["instance_id"]

    def renew_lease(self, instance_id: str) -> None:
        """Heartbeat: extend lease_expires_at by TTL from now."""
        self._put(f"/instances/{instance_id}/heartbeat", {})

    def mark_healthy(self, instance_id: str) -> None:
        """Transition instance from STARTING to HEALTHY after self-check passes."""
        self._put(f"/instances/{instance_id}/status", {"status": "HEALTHY"})

    def deregister_instance(self, instance_id: str) -> None:
        """Explicit deregistration on graceful shutdown."""
        self._delete(f"/instances/{instance_id}")

    def discover_instances(self, service_name: str) -> list[dict]:
        """Return list of healthy instances with host/port/metadata."""
        return self._get(f"/services/{service_name}/instances")

    def subscribe(self, service_name: str, callback_url: str) -> None:
        """Register a watch callback for service change events."""
        self._post("/watchers", {
            "service_name": service_name,
            "callback_url": callback_url,
        })

    def _request(self, method: str, path: str, payload: Optional[dict] = None) -> Any:
        """Send a JSON request; return the decoded body, or None if it is empty."""
        data = json.dumps(payload).encode() if payload is not None else None
        headers = {"Content-Type": "application/json"} if data is not None else {}
        req = urllib.request.Request(
            self.registry_url + path, data=data, headers=headers, method=method
        )
        # The context manager closes the connection even if decoding fails.
        with urllib.request.urlopen(req) as r:
            body = r.read()
        return json.loads(body) if body else None

    def _get(self, path: str) -> Any:
        return self._request("GET", path)

    def _post(self, path: str, payload: dict) -> Any:
        return self._request("POST", path, payload)

    def _put(self, path: str, payload: dict) -> Any:
        return self._request("PUT", path, payload)

    def _delete(self, path: str) -> None:
        self._request("DELETE", path)


class RoundRobinBalancer:
    def __init__(self):
        self._index = 0

    def pick(self, instances: list[dict]) -> Optional[dict]:
        """Cycle through the instance list; returns None when it is empty."""
        if not instances:
            return None
        instance = instances[self._index % len(instances)]
        self._index += 1
        return instance
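To show how the client methods fit together over an instance's lifetime, here is a hedged lifecycle sketch. `run_with_registration` is a hypothetical helper; it accepts any object exposing `register_instance`, `mark_healthy`, `renew_lease`, and `deregister_instance`, and the `self_check` callable stands in for whatever internal health checks the service runs.

```python
import threading
from typing import Callable


def run_with_registration(
    client,
    service_name: str,
    host: str,
    port: int,
    metadata: dict,
    stop: threading.Event,
    heartbeat_interval: float = 5.0,
    self_check: Callable[[], bool] = lambda: True,
) -> None:
    """Register, flip to HEALTHY, heartbeat until `stop` is set, then deregister."""
    instance_id = client.register_instance(service_name, host, port, metadata)
    try:
        # If the self-check fails, the instance simply stays in STARTING.
        if self_check():
            client.mark_healthy(instance_id)
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(heartbeat_interval):
            client.renew_lease(instance_id)
    finally:
        # Graceful shutdown: do not make clients wait for the lease to expire.
        client.deregister_instance(instance_id)
```

Running this in a background thread and setting `stop` from a shutdown handler gives explicit deregistration on clean exits, while the lease TTL still covers crashes where the `finally` block never runs.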
Design Trade-offs and Failure Modes
Split-brain scenarios: If the registry cluster partitions, different clients may see different instance sets. Prefer CP registries (etcd, ZooKeeper) for consistency or AP registries (Eureka) for availability. The choice depends on whether stale endpoints or missing endpoints are more harmful for your traffic.
Client cache TTL: Short TTL (5–10 seconds) means faster convergence after instance changes but more registry load. Long TTL (60+ seconds) reduces load but delays detection of dead instances. A reasonable default is 30 seconds with watch-based invalidation.
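The load side of this trade-off can be estimated with back-of-the-envelope arithmetic. The sketch assumes every client re-queries once per TTL for each service it depends on and ignores watch traffic; `registry_qps` is a hypothetical helper, and the fleet size and dependency count are made-up numbers.

```python
def registry_qps(num_clients: int, services_per_client: int, cache_ttl_seconds: float) -> float:
    """Steady-state discovery queries per second hitting the registry."""
    return num_clients * services_per_client / cache_ttl_seconds


# Hypothetical fleet: 3000 clients, each depending on 4 services.
short_ttl = registry_qps(3000, 4, 5)     # 2400.0 queries/s
default_ttl = registry_qps(3000, 4, 30)  # 400.0 queries/s
long_ttl = registry_qps(3000, 4, 60)     # 200.0 queries/s
```

Under these assumptions, going from a 30-second to a 5-second TTL multiplies registry load by six, which is the cost that watch-based invalidation is meant to avoid.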