Service Discovery System: Low-Level Design

Service discovery allows microservices to find each other’s network addresses without hardcoded configuration. In a dynamic environment where services scale up and down, deploy to new IPs, or fail, hardcoded addresses become stale. Service discovery provides a registry where services register themselves and clients look up addresses dynamically. Consul, etcd, Eureka (Netflix), and Kubernetes DNS are common implementations.

Client-Side vs. Server-Side Discovery

Client-side discovery: the client queries the service registry directly, receives a list of available instances for the target service, and applies its own load-balancing algorithm (round-robin, least-connections). Netflix Ribbon uses client-side discovery. Advantages: no extra discovery hop, since the client routes directly to the chosen instance, and load-balancing logic lives in the client library. Disadvantages: the client library must implement load balancing and health-check awareness, and different clients (Java, Go, Python) must all implement the same logic, which is complex in polyglot environments.

Server-side discovery: the client sends requests to a load balancer (Nginx, AWS ALB, Envoy). The load balancer queries the service registry and routes to a healthy instance. The client knows only the load balancer's address and is completely unaware of individual instances. Advantages: clients are simple, and load-balancing logic is centralized. Disadvantages: an extra network hop (client → load balancer → instance), and the load balancer can become a bottleneck and a single point of failure.

Service mesh (sidecar proxy): each service instance has a sidecar proxy (Envoy, Linkerd) that handles discovery, load balancing, retries, and observability. The application talks to localhost; the sidecar handles all service-to-service networking. Istio uses this architecture; it is powerful but operationally complex.
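The client-side pattern above can be sketched in a few lines. Here `fetch_instances` is a hypothetical stand-in for a registry query (a real client would call Consul or Eureka and refresh the list periodically); all names are illustrative:

```python
import itertools

# Hypothetical stand-in for a registry lookup: a real client would query
# Consul or Eureka here and refresh the list periodically.
def fetch_instances(service_name):
    return ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]

class RoundRobinClient:
    """Client-side discovery sketch: fetch the instance list once,
    then rotate through it round-robin."""

    def __init__(self, service_name):
        self.instances = fetch_instances(service_name)
        self._cycle = itertools.cycle(self.instances)

    def next_instance(self):
        # Each call returns the next instance, wrapping around the list.
        return next(self._cycle)

client = RoundRobinClient("user-service")
for _ in range(4):
    print(client.next_instance())  # cycles .1, .2, .3, then .1 again
```

A least-connections variant would track in-flight requests per instance and pick the minimum instead of cycling.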

Health Checks and Registration

Service registration: when a service instance starts, it registers itself in the registry with: service name (user-service), instance ID (UUID), host, port, health-check URL (/health), and metadata (version, datacenter, tags).

Registration methods: (1) Self-registration: the service registers itself on startup via an API call to the registry and deregisters on graceful shutdown. Problem: if the process crashes without deregistering, the stale entry remains until TTL expiry. (2) External registration: a sidecar or orchestration platform (Kubernetes) registers and deregisters services on behalf of the application, so the service has no registry dependency.

Health checks: the registry periodically probes each registered instance's health-check endpoint. Types: HTTP check (GET /health → 200 OK), TCP check (can connect to the port), and TTL check (the service must send a heartbeat every N seconds or be marked unhealthy). Consul supports all three. Unhealthy instances are removed from the pool, so clients only receive healthy instances. A health-check interval of 10-30 seconds is typical; a failed instance is detected within one interval plus the time for the registry to update clients (usually under 30 seconds total).
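A minimal sketch of self-registration with a TTL-style check, using an in-process `Registry` class rather than a real Consul agent; the class, field layout, and 30-second TTL are illustrative assumptions:

```python
import time
import uuid

class Registry:
    """TTL-based registry sketch: an instance must heartbeat within
    `ttl` seconds or it stops appearing in healthy_instances()."""

    def __init__(self, ttl=30):
        self.ttl = ttl
        # instance_id -> [service_name, host, port, last_heartbeat]
        self.entries = {}

    def register(self, service, host, port):
        instance_id = str(uuid.uuid4())
        self.entries[instance_id] = [service, host, port, time.monotonic()]
        return instance_id

    def heartbeat(self, instance_id):
        # Called by the instance every few seconds to stay "alive".
        self.entries[instance_id][3] = time.monotonic()

    def deregister(self, instance_id):
        # Graceful shutdown path; a crash skips this, and the entry
        # simply ages out after `ttl` seconds.
        self.entries.pop(instance_id, None)

    def healthy_instances(self, service):
        now = time.monotonic()
        return [(e[1], e[2]) for e in self.entries.values()
                if e[0] == service and now - e[3] < self.ttl]
```

Note how a crashed process is handled implicitly: with no heartbeat, the entry falls out of `healthy_instances()` once the TTL elapses, which is exactly the stale-entry window described above.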

Consensus and Consistency

The service registry is critical infrastructure: if it is down, services cannot discover each other. The registry must be highly available (replicated across multiple nodes), consistently readable (a just-registered instance should be immediately discoverable), and fault-tolerant (a minority of registry nodes can fail without affecting service).

Consul uses the Raft consensus algorithm with a quorum of nodes. A 3-node Consul cluster tolerates 1 node failure; a 5-node cluster tolerates 2. Writes go through the leader and are replicated to followers; by default, reads are also served by the leader. Strongly consistent (quorum) reads return the most recent registry state, with no stale entries.

Performance-optimized reads: Consul also supports stale reads, served from any node without a quorum round-trip; these are slightly stale but much faster. Clients can tolerate brief staleness for service discovery: a 100ms-stale list of instances is fine when health checks run every 10 seconds.

DNS-based discovery: Consul also exposes a DNS interface (user-service.service.consul). DNS caching in the client OS can cause issues, so use a short TTL (5-10 seconds). Kubernetes CoreDNS provides service DNS natively: my-service.namespace.svc.cluster.local resolves to the service's ClusterIP.
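The fault-tolerance numbers follow directly from the quorum arithmetic: a Raft cluster of n nodes needs a majority of ⌊n/2⌋ + 1 votes, so it tolerates n minus that many failures.

```python
def fault_tolerance(n):
    """Failures a Raft cluster of n nodes survives while keeping quorum."""
    quorum = n // 2 + 1   # majority of n voters
    return n - quorum

for n in (3, 5, 7):
    print(f"{n}-node cluster tolerates {fault_tolerance(n)} failure(s)")
```

This is also why even-sized clusters are avoided: a 4-node cluster still tolerates only 1 failure, the same as 3 nodes, while adding coordination cost.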

Client-Side Caching and Circuit Breaking

Querying the service registry on every request is too slow and puts unnecessary load on the registry, so clients cache the instance list and refresh it periodically. A cache TTL of 5-30 seconds is typical: with a 30-second TTL, a new instance becomes visible to clients within 30 seconds of registration, and a failed instance is removed within 30 seconds of the health check detecting the failure.

Watch mechanism: Consul and etcd support long-polling watches. The client registers a watch on a service key and receives a push notification when the list changes, enabling near-instant cache invalidation without polling: clients update their instance list within milliseconds of a registration change.

Circuit breaker integration: if a specific instance consistently fails (connection refused, 5xx), the client marks it unhealthy locally and stops routing to it without waiting for the registry's health check to catch up. Netflix Hystrix and Resilience4j implement client-side circuit breakers. The local circuit state is temporary: on the next cache refresh, the registry's health-check result overrides it.
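A sketch combining the TTL cache with a simple local failure counter (a minimal stand-in for a circuit breaker); the `fetch` callable, field names, and the threshold of 3 failures are illustrative assumptions, not any particular library's API:

```python
import time

class CachedDiscoveryClient:
    """Cache the instance list for `ttl` seconds; locally eject instances
    that fail `failure_threshold` times before the registry notices."""

    def __init__(self, fetch, ttl=30, failure_threshold=3):
        self.fetch = fetch                      # callable returning fresh instance list
        self.ttl = ttl
        self.failure_threshold = failure_threshold
        self.instances = []
        self.failures = {}                      # instance -> consecutive failure count
        self.fetched_at = float("-inf")         # force a refresh on first use

    def _refresh_if_stale(self):
        if time.monotonic() - self.fetched_at >= self.ttl:
            self.instances = self.fetch()
            self.failures = {}                  # registry state overrides local circuit state
            self.fetched_at = time.monotonic()

    def healthy(self):
        self._refresh_if_stale()
        return [i for i in self.instances
                if self.failures.get(i, 0) < self.failure_threshold]

    def report_failure(self, instance):
        # Caller invokes this on connection refused / 5xx responses.
        self.failures[instance] = self.failures.get(instance, 0) + 1
```

Clearing `failures` on each refresh implements the "local state is temporary" rule from above: the registry's health-check verdict wins on the next TTL expiry. A production breaker (Resilience4j-style) would also add half-open probing rather than a hard reset.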

Kubernetes Service Discovery

Kubernetes provides built-in service discovery via CoreDNS (the successor to kube-dns) and Services. A Kubernetes Service is a stable virtual IP (ClusterIP) that load-balances across healthy pod instances.

DNS: pods can reach other services by name (my-service.my-namespace.svc.cluster.local); CoreDNS resolves this to the Service's ClusterIP. kube-proxy, running on each node, uses iptables or IPVS rules to NAT connections to the ClusterIP to one of the healthy pod IPs. This is server-side discovery: pods need no registry client library; they just connect to the service hostname.

Headless services: set clusterIP: None to bypass the virtual IP. DNS then returns the actual pod IPs directly, enabling client-side load balancing (useful for stateful services like databases).

Endpoints and EndpointSlices: Kubernetes maintains an Endpoints object (and, in current versions, EndpointSlices) per Service listing the current pod IPs. When a pod becomes unhealthy (its readiness probe fails), the control plane removes it from the Endpoints, and traffic stops reaching the unhealthy pod within seconds. This is the equivalent of health-check deregistration in Consul.
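A headless Service is declared by setting `clusterIP: None`. A minimal manifest might look like this (the service name, selector label, and port are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-database
spec:
  clusterIP: None          # headless: DNS returns pod IPs, not a virtual IP
  selector:
    app: my-database       # pods with this label become endpoints
  ports:
    - port: 5432
```

With this in place, a DNS lookup of my-database.my-namespace.svc.cluster.local returns one A record per ready pod, and clients can pick among them directly.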
