DNS Load Balancer Low-Level Design: Round-Robin DNS, Health-Aware Record Updates, and TTL Trade-offs

What Is DNS Load Balancing?

DNS load balancing distributes client traffic across multiple servers by returning different IP addresses for the same hostname. When a client resolves api.example.com, the DNS response contains one or more A records pointing to backend servers. The simplest form is round-robin: the DNS server cycles through a list of IPs, returning them in rotation so successive clients land on different backends.

Round-Robin DNS and Its Limits

Round-robin works by maintaining multiple A records for one hostname. Each DNS response shuffles or rotates the list. Clients typically connect to the first IP returned. This distributes new connections roughly equally but has no awareness of server load, response time, or health. A server that is down still appears in the rotation until DNS is updated.
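The rotation above can be sketched in a few lines of Python. This is a minimal illustration, not a real nameserver: the IPs are documentation-range placeholders, and `rotated_response` stands in for the logic an authoritative server applies when building each answer.

```python
# Hypothetical record set for api.example.com (placeholder IPs from
# the 203.0.113.0/24 documentation range).
RECORDS = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]

def rotated_response(rotation: int, records: list[str]) -> list[str]:
    """Return the full record set, rotated so that each successive
    response leads with a different IP. Clients that take the first
    answer therefore land on different backends."""
    offset = rotation % len(records)
    return records[offset:] + records[:offset]

# Successive responses lead with .10, .11, .12, then .10 again.
for i in range(4):
    print(rotated_response(i, RECORDS))
```

Note that this only rotates the *order*; every response still contains all IPs, so a client whose first choice is unreachable can fall back to the others if its resolver library supports it.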

DNS round-robin also cannot provide session stickiness. Because resolvers and client stacks cache responses independently, a returning client may resolve to a different IP on its next lookup, breaking any server-side session state. For stateless APIs this is acceptable; for stateful workloads it is a fundamental mismatch.

TTL Trade-offs: Failover Speed vs Cache Churn

Time-to-live (TTL) on DNS records controls how long resolvers and clients cache an answer. Short TTLs (30–60 seconds) allow rapid failover: when a server goes down, you update DNS and clients re-query within seconds. The cost is high query volume against authoritative nameservers and increased latency on every cache miss. Long TTLs (300–3600 seconds) reduce query load but mean clients keep stale IP addresses after a failure. A server removed from DNS is still reachable by clients whose cache has not expired.

DNS-based failover latency is therefore roughly the health-check detection time plus the TTL remaining in resolver and client caches. In the worst case, a client that just cached a now-dead IP keeps sending traffic there for the full TTL window. This makes DNS a poor choice when you need sub-second failover.
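A quick worked example makes the failover window concrete. The probe settings below are illustrative assumptions, not values from any particular provider:

```python
# Worst-case time a client keeps sending traffic to a dead backend,
# under assumed (illustrative) health-check settings.
probe_interval = 10      # seconds between health-check probes
failures_to_trip = 3     # consecutive failures before removal
ttl = 60                 # record TTL in seconds

detection = probe_interval * failures_to_trip  # up to 30 s to notice the failure
worst_case = detection + ttl                   # plus a full TTL of stale cache

print(worst_case)  # 90 seconds before the last cached client re-queries
```

Cutting the TTL to 10 seconds shrinks the window to about 40 seconds, but multiplies authoritative query volume roughly sixfold for the same client population.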

Health-Check-Driven DNS Updates

Health-aware DNS removes unhealthy IP addresses from responses when a server fails. A health checker (running inside the DNS provider or your own poller) continuously probes each backend via HTTP, TCP, or ICMP. On failure detection it removes that IP from the DNS record set and publishes an updated response. On recovery it re-adds the IP.
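A minimal sketch of such a poller, assuming a TCP probe and an in-memory record set (a real deployment would push the result to the DNS provider's API instead of returning a list). The `rebuild_record_set` name and the fail-open behavior are this sketch's choices, not a standard:

```python
import socket

def tcp_healthy(ip: str, port: int = 443, timeout: float = 2.0) -> bool:
    """TCP probe: the backend counts as healthy if the port accepts
    a connection within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def rebuild_record_set(backends: list[str], probe=tcp_healthy) -> list[str]:
    """Keep only healthy IPs. If every probe fails, fail open and
    publish the full pool rather than an empty record set, so the
    hostname never resolves to nothing."""
    healthy = [ip for ip in backends if probe(ip)]
    return healthy or backends
```

Production health checkers typically require several consecutive failures before removal (to avoid flapping on a single dropped probe) and several consecutive successes before re-adding an IP.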

The propagation delay problem remains: even after the DNS record is updated, resolvers that already cached the old answer will not re-query until their cached TTL expires. Negative TTL (the SOA minimum field) controls how long NXDOMAIN responses are cached, but it does not help with stale positive records. The only mitigation is a low TTL — which brings its own costs.

Weighted Routing

Weighted DNS routing returns a given IP more frequently in proportion to its assigned weight. If server A has weight 3 and server B has weight 1, roughly 75% of DNS responses include server A's IP first. This is useful for canary deploys (send 5% of new clients to the new version) or for shifting traffic away from an underpowered host. Implementation: the authoritative DNS server uses a weighted random selection across the IP set when building each response.
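The weighted random selection can be sketched as follows; the pool and weights are hypothetical, matching the 3:1 example above:

```python
import random
from collections import Counter

# Hypothetical pool: server A (weight 3) and server B (weight 1).
POOL = {"203.0.113.10": 3, "203.0.113.20": 1}

def pick_first_ip(pool: dict[str, int]) -> str:
    """Weighted random choice of which IP leads the response."""
    ips, weights = zip(*pool.items())
    return random.choices(ips, weights=weights, k=1)[0]

# Over many responses, server A leads roughly 75% of the time.
counts = Counter(pick_first_ip(POOL) for _ in range(10_000))
print(counts)
```

For a 5% canary, the weights would simply be 19:1. Note the distribution holds across *DNS responses*, not requests: a single cached answer can represent many clients behind one resolver, so realized traffic shares are coarser than the configured weights.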

GeoDNS and Anycast Routing

GeoDNS returns different IP addresses based on the geographic location of the DNS resolver making the request. A client in Europe resolves api.example.com to the Frankfurt datacenter IP; a client in California gets the US-West IP. This reduces latency by routing each user to the nearest region. Accuracy depends on the resolver's location matching the client's location, which is imperfect when users use public resolvers (8.8.8.8, 1.1.1.1) far from their actual location. EDNS Client Subnet (ECS) partially addresses this: the resolver forwards a prefix of the client's IP to the authoritative server, allowing more accurate geolocation.
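The selection logic can be sketched with a toy prefix-to-region table. Real GeoDNS uses a continuously updated geo-IP database; the prefixes, region names, and IPs below are placeholders:

```python
import ipaddress

# Illustrative prefix-to-region table (documentation-range prefixes).
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "eu-frankfurt"),
    (ipaddress.ip_network("198.51.100.0/24"), "us-west"),
]
REGION_IPS = {"eu-frankfurt": "203.0.113.1", "us-west": "203.0.113.2"}
DEFAULT_REGION = "us-west"

def answer_for(client_subnet: str) -> str:
    """Pick the record set by the ECS prefix when present (or the
    resolver's IP as a /32 otherwise), falling back to a default."""
    net = ipaddress.ip_network(client_subnet)
    for prefix, region in GEO_TABLE:
        if net.subnet_of(prefix):
            return REGION_IPS[region]
    return REGION_IPS[DEFAULT_REGION]
```

With ECS, the resolver forwards a truncated client prefix (commonly /24 for IPv4), so `answer_for("192.0.2.128/25")` matches the Frankfurt entry even though the authoritative server never sees the full client address.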

Anycast routing is a complementary technique: multiple servers in different locations announce the same IP prefix via BGP. Routing infrastructure automatically delivers packets to the topologically nearest instance. Anycast operates at the network layer rather than the DNS layer, so it has no TTL-related failover lag. It is commonly used for authoritative DNS servers themselves and for CDN edge nodes.

DNS Load Balancing vs L4/L7 Load Balancers

DNS load balancing sits entirely outside the data path. It influences which server a client initially connects to, but once the connection is established, all traffic flows directly between client and server. L4 load balancers (TCP/UDP) sit in the data path, terminating or proxying connections — they can enforce connection limits, perform real-time health checks per connection, and provide instant failover without TTL lag. L7 load balancers additionally inspect HTTP headers, URLs, and cookies, enabling session stickiness, request routing by path, and SSL termination.

Use Cases and Limitations

DNS load balancing is well-suited for global traffic distribution across multiple regions in an active-active architecture. It scales to any number of clients without adding infrastructure, since the DNS system itself is the distributor. It works well when TTL-based failover latency is acceptable and stateless services are involved.

Key limitations to communicate in an interview:

  • No session stickiness — a client can be directed to a different server on each DNS lookup
  • No real-time health propagation — failover speed is bounded by TTL
  • No connection-level load awareness — DNS cannot see how many active connections each server has
  • GeoDNS accuracy — resolver location may not match client location without ECS
  • Client caching behavior — OS, browser, and runtime DNS caches may impose their own minimum cache times, overriding low TTLs

For multi-region active-active deployments where latency-based routing and broad geographic distribution matter more than per-connection precision, DNS load balancing combined with GeoDNS and health-check-driven record updates is a practical and low-cost solution. Pair it with an L7 load balancer within each region for connection-level control.


