System Design: Service Discovery — Consul, DNS, etcd, Eureka, Health Checking, Load Balancing, Kubernetes Services

In a microservices architecture, services must find and communicate with each other despite dynamic scaling, deployments, and failures. Service discovery is the mechanism that maps service names to network addresses. This guide covers service discovery patterns, tools (Consul, etcd, Kubernetes DNS), health checking, and how modern service meshes handle discovery — essential for system design and infrastructure interviews.

The Service Discovery Problem

In traditional infrastructure, services run on known, fixed IP addresses configured manually. In cloud-native environments, instances scale up and down (auto-scaling), containers restart on different hosts (Kubernetes pod rescheduling), deployments replace instances with new ones (rolling updates), and failures remove instances unpredictably. The IP address of a service instance changes constantly, so hardcoding IPs is impractical. Service discovery solves this: a service registers itself (name, IP, port, health status) when it starts. Other services query the registry by service name to get a list of healthy instances. There are two patterns: (1) Client-side discovery — the client queries the service registry and selects an instance itself (with client-side load balancing). Netflix Eureka + Ribbon use this pattern. (2) Server-side discovery — the client sends the request to a load balancer or proxy, which queries the registry and forwards to a healthy instance. AWS ELB, Kubernetes Services, and Consul Connect use this pattern. Server-side is simpler for the client (it just calls a stable endpoint) but adds a network hop.
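The client-side pattern can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical in-memory registry in place of a real Eureka or Consul client; the service name and addresses are invented:

```python
# Hypothetical in-memory registry standing in for a real Eureka/Consul
# client: service name -> list of healthy (host, port) instances.
REGISTRY = {
    "payment-service": [
        ("10.0.1.12", 8080),
        ("10.0.2.7", 8080),
        ("10.0.3.3", 8080),
    ],
}

class ClientSideDiscovery:
    """Client-side pattern: the caller queries the registry and picks an
    instance itself, here with simple round-robin load balancing."""

    def __init__(self, registry):
        self.registry = registry
        self._next = {}  # service name -> index of next instance to use

    def resolve(self, service_name):
        instances = self.registry.get(service_name, [])
        if not instances:
            raise LookupError(f"no healthy instances for {service_name!r}")
        i = self._next.get(service_name, 0) % len(instances)
        self._next[service_name] = i + 1
        return instances[i]

disco = ClientSideDiscovery(REGISTRY)
host, port = disco.resolve("payment-service")  # first instance; next call rotates
```

In the server-side pattern, all of this logic moves out of the client and into the load balancer or proxy, which is why the client only ever needs one stable endpoint.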

DNS-Based Service Discovery

The simplest service discovery mechanism: DNS. Register service instances as DNS records. The service name resolves to the IP addresses of healthy instances. Kubernetes DNS: every Kubernetes Service gets a DNS name: service-name.namespace.svc.cluster.local. CoreDNS (running in the cluster) resolves this to the Service ClusterIP. The ClusterIP load-balances to backend pod IPs via iptables/IPVS rules. Pods discover services by DNS name without any configuration. Limitations of DNS-based discovery: (1) TTL propagation — DNS clients cache records for the TTL duration. If an instance dies, clients may still send traffic to the dead IP until the cached record expires. Short TTLs (5-30 seconds) mitigate this but increase DNS query volume. (2) No health checking — basic DNS does not health-check instances. An unhealthy instance remains in the DNS record until explicitly removed. (3) Limited load balancing — DNS round-robin distributes evenly but does not consider instance load or health. (4) No metadata — DNS returns only IP addresses. Service discovery often needs metadata (version, region, capabilities). Despite limitations, DNS-based discovery is the simplest approach and sufficient for many architectures, especially with Kubernetes handling health checks and endpoint updates.
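From the client's point of view, DNS-based discovery is just a name lookup followed by instance selection. A minimal sketch: in a Kubernetes cluster the hostname would be something like payment-service.default.svc.cluster.local, but "localhost" is used in the demo call only so the example resolves anywhere:

```python
import random
import socket

def resolve_service(hostname, port):
    """DNS-based discovery: resolve the service name to the current set of
    instance IPs, then pick one at random (a crude stand-in for DNS
    round-robin). The hostname would normally be a cluster-internal name
    such as payment-service.default.svc.cluster.local."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    ips = sorted({info[4][0] for info in infos})  # deduplicate addresses
    return random.choice(ips)

ip = resolve_service("localhost", 8080)
```

Note that this resolves on every call; real clients cache results and must respect DNS TTLs, which is exactly the staleness trade-off described above.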

Consul: Full-Featured Service Discovery

HashiCorp Consul provides service discovery, health checking, and a key-value store in one platform. Architecture: Consul agents run on every node. Agents register local services and perform health checks. Consul servers (3 or 5, using Raft consensus) maintain the service catalog. Service registration: when a service starts, its Consul agent registers it: name (payment-service), IP, port, tags (v2, production), and health check configuration (HTTP GET /health every 10 seconds). Health checking: the local agent periodically checks each registered service. HTTP check (GET /health, expect 200), TCP check (connect to port), script check (run a custom script), or gRPC health check. If the check fails N times, the service is marked unhealthy and removed from query results. Discovery: clients query Consul DNS (payment-service.service.consul resolves to healthy instance IPs) or the HTTP API (returns full metadata including tags, health status, and custom metadata). Consul Connect: a service mesh feature that provides mutual TLS (mTLS) between services. Consul manages certificate provisioning and rotation. Sidecar proxies (Envoy) handle encryption transparently. This adds zero-trust security to service-to-service communication.
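A Consul service definition for the registration described above might look like the following. The service name, port, tags, and /health path are illustrative; the field names are Consul's standard service-definition format:

```json
{
  "service": {
    "name": "payment-service",
    "port": 8080,
    "tags": ["v2", "production"],
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s",
      "timeout": "2s",
      "deregister_critical_service_after": "1m"
    }
  }
}
```

The local agent loads this from its configuration directory (or via `consul services register`), runs the HTTP check every 10 seconds, and deregisters the service if it remains critical past the configured window.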

Kubernetes Service Discovery

Kubernetes has built-in service discovery through Services and DNS. A Kubernetes Service is an abstraction that defines a logical set of pods (selected by label) and a policy for accessing them. Service types: ClusterIP (default) creates a virtual IP internal to the cluster. CoreDNS maps the Service name to this IP. kube-proxy programs iptables/IPVS rules to load-balance traffic from the ClusterIP to healthy pod IPs. The Endpoints controller watches for pods matching the Service selector and updates the endpoint list as pods start and stop. Readiness probes: a pod is only added to the Service endpoints when its readiness probe passes. This prevents traffic from reaching pods that are still starting or are unhealthy. Headless Services (clusterIP: None): return the individual pod IPs directly via DNS (DNS A records for each pod). Use for: stateful services where the client needs to connect to a specific pod (databases, Kafka brokers), and client-side load balancing. Environment variables: Kubernetes also injects Service IPs as environment variables into pods (PAYMENT_SERVICE_HOST, PAYMENT_SERVICE_PORT). Less flexible than DNS but does not require DNS lookup.
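A minimal manifest pair showing a standard ClusterIP Service and its headless variant (names and labels are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment        # matches pods labeled app=payment
  ports:
    - port: 80          # port clients call on the ClusterIP
      targetPort: 8080  # container port on the pods
---
# Headless variant: no ClusterIP; DNS returns an A record per pod,
# so clients can address individual pods directly.
apiVersion: v1
kind: Service
metadata:
  name: payment-headless
spec:
  clusterIP: None
  selector:
    app: payment
  ports:
    - port: 8080
```

The first Service resolves to a single virtual IP that kube-proxy load-balances; the second resolves to every ready pod IP, which is what stateful clients and client-side load balancers need.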

Health Checking Best Practices

Health checks determine whether a service instance can receive traffic. Two types: (1) Liveness check — “is the process alive?” Answers: is the HTTP server responding? Is the process stuck in a deadlock? Failure action: restart the instance. Implement as: GET /health/live returns 200 if the process is running. Do not check dependencies here — a database outage should not restart the application (the restart will not fix the database). (2) Readiness check — “is the instance ready to serve traffic?” Answers: is the database connection pool established? Is the cache warmed? Are required configuration files loaded? Failure action: stop routing traffic to this instance (but do not restart). Implement as: GET /health/ready returns 200 if all dependencies are connected and the application is fully initialized. Health check parameters: interval (how often to check, 5-10 seconds), timeout (how long to wait for a response, 2-3 seconds), unhealthy threshold (consecutive failures before marking unhealthy, 3), healthy threshold (consecutive successes before marking healthy, 2). The healthy threshold prevents flapping: an instance that passes one check after failing should not immediately receive full traffic. Dependency checks: the readiness endpoint should verify critical dependencies (database, cache), but with its own short timeout: when the database is slow, the check should fail after 2-3 seconds rather than hang for 30 seconds and stall the prober.
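The threshold behavior can be sketched as a small state machine. This is a simplified illustration, with defaults mirroring the parameters above (3 consecutive failures to mark unhealthy, 2 consecutive successes to mark healthy):

```python
class HealthTracker:
    """Applies unhealthy/healthy thresholds to raw check results so that a
    single flaky probe does not flip routing state (anti-flapping)."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.consecutive_failures = 0
        self.consecutive_successes = 0
        self.healthy = True  # optimistic start; real systems often start unhealthy

    def record(self, check_passed):
        """Record one probe result and return the (possibly updated) state."""
        if check_passed:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if not self.healthy and self.consecutive_successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
            if self.healthy and self.consecutive_failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy

tracker = HealthTracker()
for ok in [False, False, False]:       # three consecutive failures -> unhealthy
    tracker.record(ok)
state_after_failures = tracker.healthy
tracker.record(True)                    # one success is not enough to recover
state_after_one_success = tracker.healthy
tracker.record(True)                    # second consecutive success -> healthy
state_after_two = tracker.healthy
```

Note that any failure resets the success counter (and vice versa), which is what makes the thresholds require *consecutive* results rather than a running total.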

Frequently Asked Questions

What is service discovery and why is it needed in microservices?

In microservices, service instances have dynamic IP addresses. Instances scale up and down (auto-scaling), containers restart on different hosts (Kubernetes rescheduling), deployments replace instances (rolling updates), and failures remove instances. Hardcoding IPs is impractical. Service discovery maps service names to network addresses dynamically. A service registers (name, IP, port, health) on startup, and clients query the registry by name to get healthy instance addresses. Two patterns: client-side discovery (the client queries the registry and selects an instance — Netflix Eureka) and server-side discovery (the client calls a stable endpoint, and a proxy/LB forwards to a healthy instance — Kubernetes Services, AWS ELB). Server-side is simpler for clients (just one DNS name) but adds a network hop.

How does Kubernetes service discovery work?

Kubernetes provides built-in service discovery via Services and DNS. A Service selects pods by label and creates a stable endpoint. CoreDNS maps service-name.namespace.svc.cluster.local to the Service ClusterIP, and kube-proxy programs iptables/IPVS rules to load-balance traffic from the ClusterIP to healthy pod IPs. The Endpoints controller automatically updates the backend pod list as pods start and stop. Readiness probes: pods are only added to Service endpoints when their readiness probe passes, preventing traffic to starting or unhealthy pods. Headless Services (clusterIP: None): DNS returns individual pod IPs directly; use these for stateful services (databases, Kafka) where clients need to connect to specific pods. No additional service discovery tool (Consul, Eureka) is needed in Kubernetes — the platform handles registration, health checking, and DNS resolution natively.

What is the difference between liveness and readiness health checks?

A liveness check answers: is the process alive and not stuck? Check: GET /health/live returns 200 if the HTTP server responds. Do NOT check external dependencies (database, cache). Failure action: restart the instance (kill and recreate the pod). A database outage should not cause all application pods to restart — that makes things worse. A readiness check answers: is the instance ready to serve user traffic? Check: GET /health/ready returns 200 if the database connection pool is established, the cache is connected, and initialization is complete. Failure action: stop routing traffic to this instance (remove it from Service endpoints) but do NOT restart. A readiness failure during a database outage correctly stops traffic without unnecessary restarts, and the instance becomes ready again when the database recovers. Critical mistake: using the same endpoint for both. If the readiness check includes database connectivity and is used as a liveness check, a database outage causes all pods to restart in a loop.

When should you use Consul versus Kubernetes built-in service discovery?

Use Kubernetes built-in discovery when all services run in the same Kubernetes cluster, the built-in DNS and Service abstraction meet your needs, and you want zero additional infrastructure. This covers most Kubernetes-native applications. Use Consul when: (1) Services span multiple environments — some in Kubernetes, some on VMs, some on bare metal. Consul provides a unified service registry across all platforms. (2) You need a service mesh without Istio — Consul Connect provides mTLS and service-to-service authorization. (3) You need a distributed key-value store for configuration alongside service discovery. (4) You operate across multiple Kubernetes clusters or cloud providers — Consul federation connects service registries across clusters and regions. (5) You need richer health checking than Kubernetes provides (custom script checks, multi-step health verification). For a single Kubernetes cluster with only containerized services, Kubernetes DNS is sufficient and simpler. Add Consul when your architecture grows beyond a single cluster or includes non-Kubernetes services.