In a microservices architecture, services must find and communicate with each other despite dynamic scaling, deployments, and failures. Service discovery is the mechanism that maps service names to network addresses. This guide covers service discovery patterns, tools (Consul, etcd, Kubernetes DNS), health checking, and how modern service meshes handle discovery — essential for system design and infrastructure interviews.
The Service Discovery Problem
In traditional infrastructure, services run on known, fixed IP addresses that are configured manually. In cloud-native environments that assumption breaks: instances scale up and down (auto-scaling), containers restart on different hosts (Kubernetes pod rescheduling), deployments replace instances with new ones (rolling updates), and failures remove instances unpredictably. Instance IP addresses change constantly, so hardcoding them is impractical.

Service discovery solves this. When a service instance starts, it registers itself (name, IP, port, health status) with a registry. Other services query the registry by service name to get a list of healthy instances. There are two patterns:

(1) Client-side discovery: the client queries the service registry and selects an instance itself, applying its own load-balancing policy. Netflix Eureka with Ribbon uses this pattern.

(2) Server-side discovery: the client sends the request to a load balancer or proxy, which queries the registry and forwards the request to a healthy instance. AWS ELB, Kubernetes Services, and Consul Connect use this pattern.

Server-side discovery is simpler for the client, which only calls a stable endpoint, but it usually adds a network hop. Client-side discovery avoids that hop but requires discovery and load-balancing logic in every client.
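To make the client-side pattern concrete, here is a minimal sketch of a registry and a round-robin client. It is an in-memory toy, not how Eureka or Ribbon are implemented; the names Instance, Registry, and RoundRobinClient are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Instance:
    host: str
    port: int


class Registry:
    """Toy in-memory registry: service name -> {instance: is_healthy}."""

    def __init__(self):
        self._services = {}

    def register(self, name, instance, healthy=True):
        self._services.setdefault(name, {})[instance] = healthy

    def set_health(self, name, instance, healthy):
        self._services[name][instance] = healthy

    def deregister(self, name, instance):
        self._services.get(name, {}).pop(instance, None)

    def healthy_instances(self, name):
        # Dicts preserve insertion order, so the returned list is stable.
        return [i for i, ok in self._services.get(name, {}).items() if ok]


class RoundRobinClient:
    """Client-side discovery: query the registry, then pick an instance locally."""

    def __init__(self, registry):
        self._registry = registry
        self._counter = count()

    def pick(self, name):
        instances = self._registry.healthy_instances(name)
        if not instances:
            raise LookupError(f"no healthy instances for {name!r}")
        return instances[next(self._counter) % len(instances)]
```

In a server-side setup, the same pick logic would live in the load balancer or proxy instead of in each client.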
DNS-Based Service Discovery
DNS is the simplest service discovery mechanism: register service instances as DNS records, and the service name resolves to the IP addresses of those instances.

Kubernetes DNS: every Kubernetes Service gets a DNS name of the form service-name.namespace.svc.cluster.local (cluster.local is the default cluster domain). CoreDNS, running in the cluster, resolves this name to the Service ClusterIP, and iptables/IPVS rules load-balance traffic sent to the ClusterIP across the backend pod IPs. Pods discover services by DNS name without any extra configuration.

Limitations of DNS-based discovery:

(1) TTL caching: DNS clients cache records for the TTL duration. If an instance dies, clients may keep sending traffic to the dead IP until the cached record expires. Short TTLs (5-30 seconds) mitigate this but increase DNS query volume.

(2) No health checking: basic DNS does not health-check instances. An unhealthy instance remains in the DNS record until something explicitly removes it.

(3) Limited load balancing: DNS round-robin rotates through the records but does not consider instance load or health, and client-side caching can make the actual distribution uneven.

(4) Little metadata: A and AAAA records return only IP addresses. SRV records add a port, but service discovery often needs richer metadata (version, region, capabilities).

Despite these limitations, DNS-based discovery is sufficient for many architectures. It works especially well in Kubernetes, where the platform handles health checks and endpoint updates and the ClusterIP behind the DNS name stays stable while pods come and go.
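The naming scheme and the TTL caching problem can be sketched in a few lines. This is an illustration under assumptions: service_fqdn, resolve, and TtlCache are hypothetical helpers, the cache is a simplified stand-in for a real DNS client cache, and cluster.local is assumed as the cluster domain.

```python
import socket
import time


def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build the in-cluster DNS name for a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve(hostname, port):
    """Return the unique IP addresses the system resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class TtlCache:
    """Caches lookups for ttl seconds, roughly as a DNS client caches records."""

    def __init__(self, lookup, ttl, clock=time.monotonic):
        self._lookup = lookup
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # hostname -> (expires_at, addresses)

    def get(self, hostname):
        now = self._clock()
        entry = self._entries.get(hostname)
        if entry is not None and now < entry[0]:
            # Possibly stale: the backing records may have changed since caching.
            return entry[1]
        addresses = self._lookup(hostname)
        self._entries[hostname] = (now + self._ttl, addresses)
        return addresses
```

Inside a cluster, something like resolve(service_fqdn("payment", "prod"), 80) would typically return the Service ClusterIP. The cache shows why a removed instance can still receive traffic: until the entry expires, get keeps returning the old address list.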
Consul: Full-Featured Service Discovery
HashiCorp Consul provides service discovery, health checking, and a key-value store in one platform.

Architecture: Consul agents run on every node. Agents register local services and perform health checks. Consul servers (typically 3 or 5, using Raft consensus) maintain the service catalog.

Service registration: when a service starts, it is registered with the local Consul agent with a name (payment-service), IP, port, tags (v2, production), and a health check configuration (for example, HTTP GET /health every 10 seconds).

Health checking: the local agent periodically checks each registered service. Supported check types include HTTP (GET /health, expect a 2xx response), TCP (connect to a port), script (run a custom script), and gRPC health checks. When a check fails, optionally only after a configured number of consecutive failures, the instance is marked critical and is excluded from DNS answers and from API queries that filter on passing health.

Discovery: clients query Consul DNS (payment-service.service.consul resolves to the IPs of healthy instances) or the HTTP API, which returns full metadata including tags, health status, and custom metadata.

Consul Connect: a service mesh feature that provides mutual TLS (mTLS) between services. Consul manages certificate provisioning and rotation, and sidecar proxies (Envoy) handle encryption transparently. This supports a zero-trust model for service-to-service communication.
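As a rough sketch of registration and discovery against Consul's HTTP API, the snippet below builds a registration body for the agent endpoint /v1/agent/service/register and reads healthy instances from /v1/health/service/NAME. The agent address (a local agent on port 8500), the service ID format, and the 2-second check timeout are assumptions made for illustration; check field names against the Consul documentation for the version in use.

```python
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"  # assumption: a local Consul agent on its default port


def registration_payload(name, address, port, tags):
    """Build the JSON body for PUT /v1/agent/service/register."""
    return {
        "Name": name,
        "ID": f"{name}-{address}-{port}",  # hypothetical ID scheme
        "Address": address,
        "Port": port,
        "Tags": list(tags),
        "Check": {
            "HTTP": f"http://{address}:{port}/health",
            "Interval": "10s",
            "Timeout": "2s",
        },
    }


def register(payload, base_url=CONSUL):
    """Send the registration to the agent and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/v1/agent/service/register",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


def healthy_endpoints(entries):
    """Extract (address, port) pairs from a /v1/health/service response body."""
    endpoints = []
    for entry in entries:
        service = entry["Service"]
        # An empty service address means the node's address applies.
        address = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints


def discover(name, base_url=CONSUL):
    """Return endpoints of instances whose health checks are passing."""
    url = f"{base_url}/v1/health/service/{name}?passing=true"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return healthy_endpoints(json.load(resp))
```

A service would call register(registration_payload("payment-service", "10.0.0.5", 8080, ["v2", "production"])) at startup, and a caller would use discover("payment-service") to get the current healthy endpoints.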
Kubernetes Service Discovery
Kubernetes has built-in service discovery through Services and DNS. A Kubernetes Service is an abstraction that defines a logical set of pods (selected by label) and a policy for accessing them.

ClusterIP: the default Service type, ClusterIP, creates a virtual IP that is internal to the cluster. CoreDNS maps the Service name to this IP, and kube-proxy programs iptables/IPVS rules that load-balance traffic from the ClusterIP to healthy pod IPs. The Endpoints controller watches for pods matching the Service selector and updates the endpoint list as pods start and stop.

Readiness probes: a pod is added to the Service endpoints only while its readiness probe passes. This keeps traffic away from pods that are still starting or are unhealthy.

Headless Services (clusterIP: None): DNS returns the individual pod IPs directly, one A record per ready pod, instead of a single virtual IP. Use them for stateful services where the client needs to connect to a specific pod (databases, Kafka brokers) and for client-side load balancing.

Environment variables: Kubernetes also injects Service addresses into pods as environment variables. For a Service named payment, these are PAYMENT_SERVICE_HOST and PAYMENT_SERVICE_PORT (the Service name is upper-cased and dashes become underscores). This is less flexible than DNS, because only Services that already existed when the pod started are injected, but it requires no DNS lookup.
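The selector and readiness rules, and the environment variable naming, can be modeled in a short sketch. This is a toy model, not Kubernetes code: Pod, endpoints_for, and service_from_env are hypothetical names, and the real controller handles many more cases (for example, Services without a selector).

```python
import os
from dataclasses import dataclass, field


@dataclass
class Pod:
    ip: str
    labels: dict = field(default_factory=dict)
    ready: bool = False


def endpoints_for(selector, pods):
    """Return IPs of ready pods whose labels include every selector pair."""
    return [
        pod.ip
        for pod in pods
        if pod.ready and all(pod.labels.get(k) == v for k, v in selector.items())
    ]


def service_from_env(service_name, env=None):
    """Look up a Service address from the variables Kubernetes injects into pods."""
    env = os.environ if env is None else env
    # Service name is upper-cased and dashes become underscores.
    prefix = service_name.upper().replace("-", "_")
    host = env.get(f"{prefix}_SERVICE_HOST")
    port = env.get(f"{prefix}_SERVICE_PORT")
    if host is None or port is None:
        return None  # Service did not exist when the pod started, or wrong name
    return host, int(port)
```

A pod that matches the selector but has not passed its readiness probe is left out of endpoints_for, which mirrors why traffic does not reach pods that are still starting.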
Health Checking Best Practices
Health checks determine whether a service instance can receive traffic. There are two types:

(1) Liveness check: “is the process alive?” It answers questions such as: is the HTTP server responding? Is the process stuck in a deadlock? Failure action: restart the instance. Implement as: GET /health/live returns 200 if the process is running. Do not check dependencies here, because a database outage should not restart the application (the restart will not fix the database).

(2) Readiness check: “is the instance ready to serve traffic?” It answers questions such as: is the database connection pool established? Is the cache warmed? Are required configuration files loaded? Failure action: stop routing traffic to this instance, but do not restart it. Implement as: GET /health/ready returns 200 if all dependencies are connected and the application is fully initialized.

Health check parameters: interval (how often to check, 5-10 seconds), timeout (how long to wait for a response, 2-3 seconds, shorter than the interval), unhealthy threshold (consecutive failures before marking unhealthy, 3), and healthy threshold (consecutive successes before marking healthy, 2). The healthy threshold prevents flapping: an instance that passes a single check after failing should not immediately receive full traffic.

Dependency checks: the readiness endpoint should verify critical dependencies (database, cache), each with a short timeout. If the database is slow, the check should fail after its 3-second timeout rather than hang for 30 seconds.
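The threshold logic and the liveness/readiness split can be sketched as follows. This is a minimal illustration under assumptions: HealthTracker, liveness, and readiness are hypothetical names, the defaults (3 failures, 2 successes) come from the parameters above, and returning 503 for "not ready" is a common convention assumed here rather than something the text prescribes.

```python
class HealthTracker:
    """Tracks one instance's health using consecutive-result thresholds."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2, healthy=True):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = healthy
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, success):
        """Record one check result and return the resulting health state."""
        if success == self.healthy:
            self._streak = 0  # result agrees with current state: reset the streak
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if success else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = success
            self._streak = 0
        return self.healthy


def liveness():
    """Liveness: the process can answer at all. No dependency checks here."""
    return 200


def readiness(dependency_checks):
    """Readiness: every dependency check must pass.

    dependency_checks maps a name to a zero-argument callable returning a bool.
    Each callable is expected to enforce its own short timeout.
    """
    failed = []
    for name, check in dependency_checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # an error talking to a dependency counts as not ready
        if not ok:
            failed.append(name)
    return (200 if not failed else 503), failed
```

With the default thresholds, a single failed check between successes leaves the instance healthy, and an instance marked unhealthy needs two consecutive passes before it is considered healthy again.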