Service discovery lets microservices find each other without hardcoded IP addresses. In dynamic environments where containers start, stop, and move, the network location of any service instance changes constantly. Service discovery solves this with a registry that maps service names to live instance addresses, updated in real time as instances come and go. The two patterns are client-side discovery (the caller queries the registry) and server-side discovery (a load balancer queries the registry on the caller’s behalf).
Core Data Model (Registry)
-- Service registry table (simplified; production uses Consul or etcd)
CREATE TABLE ServiceInstance (
    instance_id    VARCHAR(100) PRIMARY KEY,                -- 'payment-service-pod-a3f9'
    service_name   VARCHAR(100) NOT NULL,                   -- 'payment-service'
    host           VARCHAR(255) NOT NULL,
    port           INT NOT NULL,
    status         VARCHAR(20) NOT NULL DEFAULT 'healthy',  -- healthy, unhealthy, starting
    metadata       JSONB NOT NULL DEFAULT '{}',             -- version, region, tags
    last_heartbeat TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    registered_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_si_service_status ON ServiceInstance(service_name, status);
-- Deregister stale instances: health check daemon removes instances
-- where last_heartbeat < NOW() - INTERVAL '30 seconds'
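The stale-instance rule in the comment above can be sketched in memory. This is a toy stand-in for the table, not a production registry: the names `Instance`, `InMemoryRegistry`, and `sweep` are hypothetical, and timestamps are plain floats so that expiry can be exercised without waiting.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instance:
    instance_id: str
    service_name: str
    host: str
    port: int
    last_heartbeat: float = field(default_factory=time.monotonic)


class InMemoryRegistry:
    """Toy registry keyed by instance_id (illustrative only)."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl_seconds = ttl_seconds
        self.instances: dict[str, Instance] = {}

    def register(self, inst: Instance) -> None:
        self.instances[inst.instance_id] = inst

    def heartbeat(self, instance_id: str, now: Optional[float] = None) -> None:
        # Refresh the timestamp; `now` is injectable for testing
        self.instances[instance_id].last_heartbeat = (
            time.monotonic() if now is None else now
        )

    def sweep(self, now: Optional[float] = None) -> list[str]:
        """Remove instances whose last heartbeat is older than the TTL."""
        now = time.monotonic() if now is None else now
        stale = [iid for iid, inst in self.instances.items()
                 if inst.last_heartbeat < now - self.ttl_seconds]
        for iid in stale:
            del self.instances[iid]
        return stale

    def lookup(self, service_name: str) -> list[Instance]:
        return [i for i in self.instances.values()
                if i.service_name == service_name]
```

A health check daemon would call `sweep()` on a timer; instances that keep sending heartbeats survive each pass.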
Registration, Heartbeat, and Deregistration
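import atexit
import threading
import time
from typing import Optional
from uuid import uuid4

import requests


class ServiceRegistry:
    def __init__(self, registry_url: str, service_name: str, host: str, port: int):
        self.registry_url = registry_url
        # Random suffix keeps IDs unique if an instance restarts on the same host and port
        self.instance_id = f"{service_name}-{host}-{port}-{uuid4().hex[:8]}"
        self.service_name = service_name
        self.host = host
        self.port = port

    def register(self, metadata: Optional[dict] = None):
        resp = requests.post(f"{self.registry_url}/register", json={
            'instance_id': self.instance_id,
            'service_name': self.service_name,
            'host': self.host,
            'port': self.port,
            'metadata': metadata or {}
        }, timeout=5)
        resp.raise_for_status()  # fail fast if the registry rejects the registration
        # Start background heartbeat thread
        threading.Thread(target=self._heartbeat_loop, daemon=True).start()
        # Deregister on normal interpreter exit. atexit handlers do not run when the
        # process is killed by an unhandled signal, so SIGTERM needs its own handler.
        atexit.register(self.deregister)

    def _heartbeat_loop(self, interval_seconds: int = 10):
        while True:
            try:
                requests.put(f"{self.registry_url}/heartbeat/{self.instance_id}", timeout=5)
            except requests.RequestException:
                pass  # heartbeat failure is non-fatal; the registry will expire us
            time.sleep(interval_seconds)

    def deregister(self):
        try:
            requests.delete(f"{self.registry_url}/instances/{self.instance_id}", timeout=5)
        except requests.RequestException:
            pass  # best effort; the heartbeat TTL removes us if this call fails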
Client-Side Discovery with Local Cache
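import threading
import time

import requests


class ServiceClient:
    def __init__(self, registry_url: str, cache_ttl_seconds: int = 30):
        self.registry_url = registry_url
        self.cache_ttl_seconds = cache_ttl_seconds
        self.cache: dict[str, tuple[list, float]] = {}  # service_name -> (instances, expiry)
        self._counters: dict[str, int] = {}  # service_name -> next round-robin index
        self.lock = threading.Lock()

    def get_instances(self, service_name: str) -> list[dict]:
        with self.lock:
            cached = self.cache.get(service_name)
            if cached and time.monotonic() < cached[1]:
                return cached[0]
        # The original listing is truncated from here to the round-robin method.
        # This lookup is a reconstruction; the endpoint and query parameter are assumptions.
        resp = requests.get(f"{self.registry_url}/instances",
                            params={'service_name': service_name}, timeout=5)
        instances = resp.json()
        with self.lock:
            self.cache[service_name] = (instances, time.monotonic() + self.cache_ttl_seconds)
        return instances

    def pick_instance(self, service_name: str) -> dict:  # method name is an assumption
        instances = self.get_instances(service_name)
        with self.lock:
            idx = self._counters.get(service_name, 0) % len(instances)
            self._counters[service_name] = (idx + 1) % len(instances)
        return instances[idx]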
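The caching and rotation behavior in this section can be exercised without a live registry. The sketch below is not the article's client: `CachedRoundRobin`, its `fetch` callback, and the injectable `clock` are hypothetical names introduced so that cache expiry can be shown without sleeping. It is not thread-safe; a real client would guard the cache and counters with a lock.

```python
import time
from typing import Callable


class CachedRoundRobin:
    """TTL-cached instance list with round-robin selection (illustrative only)."""

    def __init__(self, fetch: Callable[[str], list],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.fetch = fetch            # hypothetical: returns live instances for a name
        self.ttl_seconds = ttl_seconds
        self.clock = clock            # injectable so expiry needs no real waiting
        self.cache: dict[str, tuple[list, float]] = {}
        self.counters: dict[str, int] = {}

    def get_instances(self, service_name: str) -> list:
        cached = self.cache.get(service_name)
        if cached and self.clock() < cached[1]:
            return cached[0]          # cache hit: no registry call
        instances = self.fetch(service_name)
        self.cache[service_name] = (instances, self.clock() + self.ttl_seconds)
        return instances

    def pick(self, service_name: str) -> dict:
        instances = self.get_instances(service_name)
        if not instances:
            raise LookupError(f"no instances for {service_name}")
        idx = self.counters.get(service_name, 0) % len(instances)
        self.counters[service_name] = idx + 1
        return instances[idx]


# Usage with a fake registry and a fake clock
calls = []

def fake_fetch(name: str) -> list:
    calls.append(name)
    return [{"host": "10.0.0.1", "port": 8080}, {"host": "10.0.0.2", "port": 8080}]

now = [0.0]
picker = CachedRoundRobin(fake_fetch, ttl_seconds=30, clock=lambda: now[0])
hosts = [picker.pick("payment-service")["host"] for _ in range(3)]
now[0] = 31.0                         # past the TTL: the next pick refetches
picker.pick("payment-service")
```

Three picks inside the TTL hit the fake registry once and rotate across both hosts. The pick after the clock passes 30 seconds triggers a second fetch.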
Server-Side Discovery (Kubernetes Pattern)
In Kubernetes, service discovery is built into the platform. A Service object provides a stable DNS name (payment-service.default.svc.cluster.local) and a virtual IP. In iptables mode, the long-standing default on Linux, kube-proxy maintains rules that spread TCP connections across the pods backing the Service, and only pods that pass their readiness probe receive traffic. The caller needs no registry client: DNS resolution returns the virtual IP, and the kernel handles load balancing transparently. This is why Kubernetes-native services rarely need client-side registry libraries.
# Kubernetes Service manifest creates DNS-based discovery
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment-service  # matches all pods with this label
  ports:
    - port: 8080
      targetPort: 8080
  type: ClusterIP  # stable virtual IP within the cluster
---
# Caller simply uses the DNS name -- no registry client needed
PAYMENT_SERVICE_URL = "http://payment-service:8080"
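The short name in the URL above resolves only from within the same namespace. Callers elsewhere in the cluster use the fully qualified form. The sketch below builds that name and resolves it with the standard resolver. The helper names `cluster_dns_name` and `resolve` are hypothetical, and `cluster.local` is the usual default cluster domain, which an operator may have changed.

```python
import socket


def cluster_dns_name(service: str, namespace: str = "default",
                     cluster_domain: str = "cluster.local") -> str:
    """Build the fully qualified in-cluster DNS name for a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve(hostname: str, port: int) -> list:
    """Return the distinct IP addresses the system resolver reports."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
    return sorted({info[4][0] for info in infos})


fqdn = cluster_dns_name("payment-service")
# Inside a cluster, resolve(fqdn, 8080) would typically return the Service's
# virtual IP; outside a cluster the lookup fails with socket.gaierror.
```

For a ClusterIP Service the resolver returns a single virtual IP, so no client-side load balancing is involved.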
Health Check Patterns
The registry must distinguish instances that are running but unhealthy (returning 500s or timing out) from instances that are still starting up. There are three common check types: (1) HTTP health endpoint: GET /health returns 200 if healthy. The registry polls every 10 seconds; three consecutive failures mark the instance unhealthy and remove it from rotation. (2) TTL heartbeat: the instance pushes a heartbeat to the registry. If no heartbeat arrives within 30 seconds, the instance is considered stale and is deregistered. (3) TCP check: the registry opens a TCP connection to the service port. A refused connection means the instance is unhealthy. Use HTTP checks for application-level health (DB connected, dependencies up); use TTL heartbeats for services behind NAT or firewalls, where the registry cannot open connections to the instance.
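The "three consecutive failures" rule for HTTP checks can be sketched as a small state tracker. `HealthTracker` is a hypothetical name. Two behaviors here are assumptions rather than something the text specifies: an instance stays in `starting` until its first passing check, and a single passing check restores `healthy`.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    failure_threshold: int = 3      # consecutive failures before marking unhealthy
    consecutive_failures: int = 0
    status: str = "starting"        # starting -> healthy <-> unhealthy

    def record(self, check_passed: bool) -> str:
        """Fold one health check result into the instance status."""
        if check_passed:
            self.consecutive_failures = 0
            self.status = "healthy"  # assumption: one success is enough to recover
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.status = "unhealthy"
        return self.status


tracker = HealthTracker()
history = [tracker.record(ok) for ok in (True, False, False, False, True)]
```

Requiring consecutive failures keeps a single slow response from ejecting an otherwise healthy instance. The cost is slower detection: roughly the threshold multiplied by the poll interval.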
Key Interview Points
- Always cache registry lookups locally (30-60 second TTL) — querying the registry on every request adds latency and creates a single point of failure.
- The local cache makes brief inconsistency unavoidable, and usually acceptable: an instance may still appear healthy in the cache for up to one cache TTL (30 seconds with the default above) after it fails. Client-side circuit breakers catch these stale entries faster.
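- Consul, etcd, and ZooKeeper provide production-grade registries built on consensus protocols (Raft in Consul and etcd, Zab in ZooKeeper), with watch notifications instead of polling. Consul also ships built-in health checks; etcd and ZooKeeper rely on lease or session expiry (TTL leases, ephemeral nodes) to drop dead instances.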
- In Kubernetes, prefer built-in DNS service discovery over external registries — it is simpler, integrates with liveness/readiness probes, and handles rollout/rollback automatically.
- Graceful deregistration matters: on SIGTERM, deregister first, then drain in-flight requests (10-30 second drain period), then exit. Without this, the registry keeps routing traffic to a shutting-down instance until health checks or heartbeat expiry detect the failure, which with the settings above can take around 30 seconds.
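The shutdown order in the last point (deregister, drain, exit) can be sketched as a single function. `graceful_shutdown` and its callback parameters are hypothetical; how a service counts in-flight requests depends on its server framework, so that is passed in as a callable.

```python
import time
from typing import Callable


def graceful_shutdown(deregister: Callable[[], None],
                      in_flight: Callable[[], int],
                      drain_timeout: float = 30.0,
                      poll_interval: float = 0.1,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Deregister first, then wait for in-flight requests to finish.

    Returns True if everything drained before the timeout, False otherwise.
    """
    try:
        deregister()  # stop new traffic before draining existing requests
    except Exception:
        pass  # registry unreachable: heartbeat expiry removes us eventually
    deadline = clock() + drain_timeout
    while in_flight() > 0:
        if clock() >= deadline:
            return False  # give up; remaining requests are cut off on exit
        sleep(poll_interval)
    return True


# Wiring it to SIGTERM might look like this (names on the right are assumptions):
#   import signal, sys
#   def on_sigterm(signum, frame):
#       graceful_shutdown(registry.deregister, server.active_requests)
#       sys.exit(0)
#   signal.signal(signal.SIGTERM, on_sigterm)
```

Deregistering before draining matters because callers with a warm local cache may keep sending requests for up to one cache TTL after the registry entry is gone.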