Service discovery lets microservices find each other without hardcoded IP addresses. In dynamic environments where containers start, stop, and move, the network location of any service instance changes constantly. Service discovery solves this with a registry that maps service names to live instance addresses, updated in real time as instances come and go. The two patterns are client-side discovery (the caller queries the registry) and server-side discovery (a load balancer queries the registry on the caller’s behalf).
Core Data Model (Registry)
-- Service registry table (simplified; production uses Consul or etcd)
CREATE TABLE ServiceInstance (
    instance_id    VARCHAR(100) PRIMARY KEY,               -- 'payment-service-pod-a3f9'
    service_name   VARCHAR(100) NOT NULL,                  -- 'payment-service'
    host           VARCHAR(255) NOT NULL,
    port           INT NOT NULL,
    status         VARCHAR(20) NOT NULL DEFAULT 'healthy', -- healthy, unhealthy, starting
    metadata       JSONB NOT NULL DEFAULT '{}',            -- version, region, tags
    last_heartbeat TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    registered_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_si_service_status ON ServiceInstance(service_name, status);

-- Deregister stale instances: health check daemon removes instances
-- where last_heartbeat < NOW() - INTERVAL '30 seconds'
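The expiry rule in the comment above can be sketched as a small reaper loop. This is a minimal in-memory sketch: the field names and 30-second TTL mirror the table, but the `expire_stale_instances` function and its `instances` dict are hypothetical (a real daemon would run the DELETE against the registry store on a timer).

```python
import time

STALE_AFTER_SECONDS = 30  # matches the registry's heartbeat TTL

def expire_stale_instances(instances: dict[str, dict], now: float) -> list[str]:
    """Remove instances whose last_heartbeat is older than the TTL.

    Returns the ids that were deregistered so callers can log them.
    """
    stale = [
        instance_id
        for instance_id, record in instances.items()
        if now - record["last_heartbeat"] > STALE_AFTER_SECONDS
    ]
    for instance_id in stale:
        del instances[instance_id]
    return stale

# Example: one fresh instance, one that stopped heartbeating 45 seconds ago.
now = time.time()
registry = {
    "payment-a": {"last_heartbeat": now - 5},
    "payment-b": {"last_heartbeat": now - 45},
}
removed = expire_stale_instances(registry, now)
# removed == ["payment-b"]; only "payment-a" remains registered
```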
Registration, Heartbeat, and Deregistration
import atexit
import threading
import time
from uuid import uuid4

import requests

class ServiceRegistry:
    def __init__(self, registry_url: str, service_name: str, host: str, port: int):
        self.registry_url = registry_url
        self.instance_id = f"{service_name}-{host}-{port}-{uuid4().hex[:8]}"
        self.service_name = service_name
        self.host = host
        self.port = port

    def register(self, metadata: dict = None):
        requests.post(f"{self.registry_url}/register", json={
            'instance_id': self.instance_id,
            'service_name': self.service_name,
            'host': self.host,
            'port': self.port,
            'metadata': metadata or {}
        })
        # Start background heartbeat thread
        threading.Thread(target=self._heartbeat_loop, daemon=True).start()
        # Deregister on shutdown
        atexit.register(self.deregister)

    def _heartbeat_loop(self, interval_seconds: int = 10):
        while True:
            try:
                requests.put(f"{self.registry_url}/heartbeat/{self.instance_id}")
            except Exception:
                pass  # heartbeat failure is non-fatal -- registry will expire us
            time.sleep(interval_seconds)

    def deregister(self):
        requests.delete(f"{self.registry_url}/instances/{self.instance_id}")
Client-Side Discovery with Local Cache
import threading
import time

import requests

class ServiceClient:
    def __init__(self, registry_url: str, cache_ttl_seconds: int = 30):
        self.registry_url = registry_url
        self.cache_ttl_seconds = cache_ttl_seconds
        self.cache: dict[str, tuple[list, float]] = {}  # service_name -> (instances, expiry)
        self._counters: dict[str, int] = {}             # service_name -> round-robin index
        self.lock = threading.Lock()

    def get_instances(self, service_name: str) -> list[dict]:
        with self.lock:
            cached = self.cache.get(service_name)
            if cached and time.monotonic() < cached[1]:
                return cached[0]
        # Cache miss or expired: query the registry and refresh the cache
        resp = requests.get(f"{self.registry_url}/instances/{service_name}")
        instances = resp.json()
        with self.lock:
            self.cache[service_name] = (instances, time.monotonic() + self.cache_ttl_seconds)
        return instances

    def pick_instance(self, service_name: str) -> dict:
        # Round-robin over the cached healthy instances
        instances = self.get_instances(service_name)
        with self.lock:
            idx = self._counters.get(service_name, 0)
            self._counters[service_name] = (idx + 1) % len(instances)
        return instances[idx]
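A quick sketch of the round-robin selection in isolation, with the registry call stubbed out. The class name and instance addresses here are made up for illustration; only the counter logic matches the client above.

```python
import threading

class RoundRobinPicker:
    """Minimal stand-in for the client's round-robin step over a fixed instance list."""

    def __init__(self, instances: list[dict]):
        self.instances = instances
        self._idx = 0
        self._lock = threading.Lock()

    def pick(self) -> dict:
        with self._lock:
            idx = self._idx
            self._idx = (self._idx + 1) % len(self.instances)
        return self.instances[idx]

picker = RoundRobinPicker([
    {"host": "10.0.0.1", "port": 8080},
    {"host": "10.0.0.2", "port": 8080},
])
hosts = [picker.pick()["host"] for _ in range(4)]
# Alternates: 10.0.0.1, 10.0.0.2, 10.0.0.1, 10.0.0.2
```

The lock matters even in this toy: without it, two threads can read the same index and double-route to one instance while skipping another.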
Server-Side Discovery (Kubernetes Pattern)
In Kubernetes, service discovery is built into the platform. A Service object provides a stable DNS name (payment-service.default.svc.cluster.local) and a virtual IP. kube-proxy maintains iptables rules that load-balance TCP connections to healthy pods. The caller needs no registry client — DNS resolution returns the virtual IP, and the kernel handles load balancing transparently. This is why Kubernetes-native services rarely need client-side registry libraries.
# Kubernetes Service manifest creates DNS-based discovery
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment-service   # matches all pods with this label
  ports:
    - port: 8080
      targetPort: 8080
  type: ClusterIP          # stable virtual IP within the cluster
---
# Caller simply uses the DNS name -- no registry client needed
PAYMENT_SERVICE_URL = "http://payment-service:8080"
Health Check Patterns
The registry must distinguish instances that are running but unhealthy (returning 500s or timing out) from instances that are still starting up. Three check types:
- HTTP health endpoint: GET /health returns 200 if healthy. The registry polls every 10 seconds; three consecutive failures mark the instance unhealthy and remove it from the rotation.
- TTL heartbeat: the instance pushes a heartbeat to the registry. No heartbeat within 30 seconds means the instance is stale and gets deregistered.
- TCP check: the registry opens a TCP connection to the service port. Connection refused means unhealthy.
Use HTTP checks for application health (DB connected, dependencies up); use TTL heartbeats for services behind NAT or firewalls.
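A minimal HTTP health endpoint can be sketched with the standard library alone. The `check_dependencies` function is a placeholder for real probes (DB ping, downstream reachability); the demo starts the server on an ephemeral port and probes it once, as a registry poller would.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_dependencies() -> dict[str, bool]:
    # Placeholder: a real service would ping its DB, caches, and downstreams here.
    return {"database": True, "cache": True}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        checks = check_dependencies()
        healthy = all(checks.values())
        body = json.dumps({"status": "healthy" if healthy else "unhealthy",
                           "checks": checks}).encode()
        # 200 only when every dependency passes; registries treat non-200 as unhealthy
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging in the demo
        pass

# Serve on an ephemeral port and issue one probe, like a registry poller.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/health") as resp:
    status_code = resp.status
    payload = json.loads(resp.read())
server.shutdown()
# status_code == 200, payload["status"] == "healthy"
```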
Key Interview Points
- Always cache registry lookups locally (30-60 second TTL) — querying the registry on every request adds latency and creates a single point of failure.
- The local cache means brief inconsistency is acceptable: an instance may appear healthy in cache for up to 30 seconds after it fails. Circuit breakers on the client side catch these stale entries faster.
- Consul, etcd, and ZooKeeper provide production-grade service registries with Raft-based consistency, watch notifications (instead of polling), and health check integration.
- In Kubernetes, prefer built-in DNS service discovery over external registries — it is simpler, integrates with liveness/readiness probes, and handles rollout/rollback automatically.
- Graceful deregistration matters: on SIGTERM, deregister first, then drain in-flight requests (10-30 second drain period), then exit. Without this, the registry routes traffic to a shutting-down instance for up to one health check interval.
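The deregister-then-drain sequence in the last point can be sketched as a SIGTERM handler. The `deregister` and `wait_for_inflight` hooks here are hypothetical stand-ins for the real registry call and in-flight request tracking; the demo invokes the handler directly instead of delivering a signal.

```python
import signal
import time
from typing import Callable

def install_graceful_shutdown(deregister: Callable[[], None],
                              wait_for_inflight: Callable[[], bool],
                              drain_seconds: float = 15.0) -> Callable[[], None]:
    """Install (and return) a SIGTERM handler: deregister first, then drain, then exit."""
    def handle_sigterm(signum=None, frame=None):
        deregister()  # stop new traffic immediately
        deadline = time.monotonic() + drain_seconds
        while time.monotonic() < deadline:
            if wait_for_inflight():  # True once no requests are in flight
                break
            time.sleep(0.1)
        # a real service would sys.exit(0) here, after the drain completes
    signal.signal(signal.SIGTERM, handle_sigterm)
    return handle_sigterm

# Demo with stub hooks: deregistration is ordered before the drain wait.
events: list[str] = []
handler = install_graceful_shutdown(
    deregister=lambda: events.append("deregistered"),
    wait_for_inflight=lambda: (events.append("drained") or True),
    drain_seconds=1.0,
)
handler()  # simulate SIGTERM delivery
# events == ["deregistered", "drained"]
```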
Frequently Asked Questions

What is the difference between client-side and server-side service discovery?
Client-side discovery: the calling service queries the service registry directly, receives a list of healthy instances, and chooses one using its own load-balancing logic (round-robin, least-connections). The client owns the selection. Pros: one less hop; the client can apply sophisticated routing (prefer same-region instances, circuit-break unhealthy ones). Cons: every client needs a registry client library, and discovery logic is duplicated across all services. Server-side discovery: the client sends requests to a stable virtual IP or DNS name; a load balancer or proxy (Nginx, HAProxy, Kubernetes Service) queries the registry and forwards to a healthy instance. The client is completely decoupled from registry mechanics. Pros: language-agnostic, no client library needed. Cons: extra network hop through the proxy. Kubernetes uses server-side discovery (kube-proxy handles it transparently).

How does Kubernetes service discovery work under the hood?
When you create a Kubernetes Service (type: ClusterIP), the control plane: (1) assigns a stable virtual IP (ClusterIP) from the service IP range; (2) creates a DNS record, so <service-name>.<namespace>.svc.cluster.local resolves to the ClusterIP; (3) programs kube-proxy (running on every node) with iptables or IPVS rules that intercept traffic to the ClusterIP and DNAT it to a randomly selected healthy pod IP. Pods are selected by the Service's label selector; Endpoints are updated automatically as pods start and stop. The DNS TTL is very short (5-30 seconds) so clients pick up endpoint changes quickly. There is no separate registry process: the API server is the registry, etcd is the store, and kube-proxy is the load balancer.

How do you handle the case where a cached service instance is stale?
The client-side cache has a TTL (typically 30-60 seconds) to reduce registry load. During that window, a cached instance may have died. Two mitigations: (1) Circuit breaker: track consecutive failures per instance. After N failures, mark that instance as degraded and stop routing to it, independently of cache expiry. This catches failures within seconds rather than waiting for the cache TTL. (2) Retry with cache invalidation: on connection failure, immediately evict the stale instance from the local cache, re-fetch the registry, and retry the request on a different instance. Combined, these make the effective failover time seconds rather than the full cache TTL.

How does graceful deregistration prevent routing to a shutting-down instance?
Without graceful deregistration: (1) SIGTERM is sent to the instance; (2) the instance stops accepting connections; (3) the registry's health check fails after one or two intervals (10-20 seconds); (4) the registry removes the instance. During steps 2-4, the load balancer or client still routes to the instance, causing connection refused errors. With graceful deregistration: (1) SIGTERM received; (2) the instance immediately calls DELETE /registry/instances/{id}; (3) the instance waits for in-flight requests to complete (drain period, e.g., 15 seconds); (4) the instance exits. New requests stop routing to the instance immediately after deregistration, and the drain period ensures in-flight requests complete before shutdown.

When should you use Consul vs etcd vs Kubernetes for service discovery?
Consul: best for heterogeneous environments (VMs, containers, bare metal, multiple clouds). Built-in health checking, DNS interface, service mesh (Consul Connect), and ACL-based access control. Use it when you have non-Kubernetes workloads or need cross-cloud service discovery. etcd: the low-level key-value store used by Kubernetes internally; not a service registry by itself, since it requires application code to implement registration and watch semantics. Use it only if building a custom orchestration system. Kubernetes Services: best for workloads already running in Kubernetes. Zero setup, automatic health check integration (readiness probes control endpoint membership), and no additional infrastructure. Use them for greenfield microservices in a Kubernetes cluster. The general rule: if you're running in Kubernetes, use Kubernetes Services; if you have mixed infrastructure, use Consul.