What Does a DNS Resolver Do?
The Domain Name System (DNS) translates human-readable domain names (such as www.example.com) into IP addresses (such as 93.184.216.34). A DNS resolver is the client-side component that performs this translation by querying a hierarchy of DNS servers. Understanding DNS is essential for system design: almost every network connection starts with a lookup, resolution time adds directly to request latency, and DNS is a common mechanism behind load balancing, CDN routing, and failover.
DNS Resolution Hierarchy
Client query: "What is the IP of www.example.com?"
1. Check the local DNS cache (browser, then OS) → HIT: return immediately
2. On a miss, query the recursive resolver (run by the ISP, or a public resolver such as 8.8.8.8)
   a. Recursive resolver checks its cache → MISS
   b. Query a root name server ("Who handles .com?") → returns the .com TLD name server addresses
   c. Query the .com TLD name server ("Who handles example.com?") → returns the authoritative name servers
   d. Query the authoritative name server for example.com → returns the A record: 93.184.216.34
   e. Recursive resolver caches the result for the record's TTL
3. Return the IP to the client; the client caches it for the remaining TTL
Full resolution (cache miss): 4 round trips (one from the client to the recursive resolver, then three from the resolver to the root, TLD, and authoritative servers), typically 50-200ms. Cached resolution: <1ms.
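The walk above can be sketched with a toy, in-memory hierarchy. Nothing here touches the network: the `ROOT`, `TLD`, and `AUTH` tables, the server names in them, and the `resolve` function are all hypothetical stand-ins for real name servers, and TTL handling is left out to keep the sketch short.

```python
# Toy model of a recursive resolver's walk; all tables are illustrative, not real DNS data.
ROOT = {"com": "tld-com"}                                     # root: TLD -> TLD server
TLD = {"tld-com": {"example.com": "ns-example"}}              # TLD server: domain -> authoritative server
AUTH = {"ns-example": {"www.example.com": "93.184.216.34"}}   # authoritative: host -> A record


def resolve(name, cache):
    """Return (ip, upstream_queries) for name, consulting the cache first."""
    if name in cache:
        return cache[name], 0                               # cache hit: no upstream queries
    labels = name.split(".")
    tld_server = ROOT[labels[-1]]                           # step b: ask the root for the TLD server
    auth_server = TLD[tld_server][".".join(labels[-2:])]    # step c: ask the TLD for the authoritative server
    ip = AUTH[auth_server][name]                            # step d: ask the authoritative server for the A record
    cache[name] = ip                                        # step e: cache the result (TTL omitted here)
    return ip, 3


cache = {}
first = resolve("www.example.com", cache)    # cold cache: three upstream queries
second = resolve("www.example.com", cache)   # warm cache: answered locally
```

The three upstream queries on a cold cache, plus the client's own query to the recursive resolver, account for the four round trips mentioned above.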
DNS Record Types
- A: maps hostname to IPv4 address. api.example.com → 1.2.3.4
- AAAA: maps hostname to IPv6 address
- CNAME: canonical name alias. www.example.com → example.com (follow the chain)
- MX: mail exchange — which servers handle email for the domain
- TXT: arbitrary text — used for SPF, DKIM, domain verification
- NS: authoritative name servers for the domain
- SOA: Start of Authority — primary NS, admin email, serial number, refresh intervals
- SRV: service location — host + port for a specific service (used by SIP, XMPP)
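To illustrate how A and CNAME records interact, here is a minimal sketch of CNAME chasing over a hypothetical in-memory zone. The `ZONE` table, the `lookup_a` function, and the `max_hops` limit are assumptions made for this example, not part of any real DNS library.

```python
# Hypothetical zone data: (name, record type) -> list of values.
ZONE = {
    ("www.example.com", "CNAME"): ["example.com"],
    ("example.com", "A"): ["93.184.216.34"],
    ("api.example.com", "A"): ["1.2.3.4"],
}


def lookup_a(name, max_hops=8):
    """Follow CNAME records until an A record is found; return [] if none exists."""
    for _ in range(max_hops):                 # bound the chain so a CNAME loop cannot spin forever
        if (name, "A") in ZONE:
            return ZONE[(name, "A")]
        target = ZONE.get((name, "CNAME"))
        if target is None:
            return []                         # neither an A record nor an alias
        name = target[0]                      # a name has at most one CNAME target
    return []
```

With this data, `lookup_a("www.example.com")` follows the alias to example.com and returns its A record, while `lookup_a("api.example.com")` is answered directly.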
TTL and Caching Strategy
TTL (Time-to-Live) controls how long a DNS record is cached. Tradeoffs: short TTL (30s–5min) enables fast failover and DNS-based load balancing updates. Long TTL (1h–24h) reduces DNS lookup latency and load on name servers. Production guidelines:
- Stable records (CDN origins, mail servers): TTL=3600 (1 hour)
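- Before planned changes: lower TTL to 60s at least one full old-TTL period in advance, so caches holding the long TTL expire first (a day ahead is a safe margin)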
- After changes propagate: restore TTL to 3600
- Internal service discovery: TTL=30s for fast failover
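The timing behind these guidelines is simple arithmetic, sketched below. The function name `ttl_change_plan` and the example date are hypothetical, and the result assumes resolvers honor the published TTL; resolvers that clamp or ignore TTLs can take longer.

```python
from datetime import datetime, timedelta


def ttl_change_plan(cutover, old_ttl_s, low_ttl_s=60):
    """Return (latest time to lower the TTL, time by which caches pick up the change)."""
    lower_by = cutover - timedelta(seconds=old_ttl_s)       # caches holding the old TTL expire before cutover
    converged_by = cutover + timedelta(seconds=low_ttl_s)   # caches holding the low TTL expire after cutover
    return lower_by, converged_by


cutover = datetime(2024, 1, 10, 12, 0, 0)   # arbitrary example date
lower_by, converged_by = ttl_change_plan(cutover, old_ttl_s=3600)
```

For a record currently at TTL=3600 and a 12:00 cutover, the TTL has to be lowered by 11:00 at the latest, and well-behaved caches see the new address by 12:01.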
DNS-Based Load Balancing
Publish multiple A records for the same hostname and let resolution rotate among them: api.example.com → [1.2.3.4, 5.6.7.8]. Name servers typically return all the records in a rotated order, so successive queries tend to land on different IPs. Simple but limited: no health checks (plain DNS cannot detect a dead server), no session affinity, and changes take effect only as cached TTLs expire. Suitable for broad traffic distribution. In production, combine it with a load balancer: DNS points to the LB VIP, and the LB does health checking and routing.
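A minimal sketch of round-robin answers, assuming the server returns every address and rotates the order on each query (a common behavior, though not every server or client honors the order). The class name `RoundRobinRecordSet` is hypothetical.

```python
from collections import deque


class RoundRobinRecordSet:
    """Holds the A records for one hostname and rotates them per query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        result = list(self._addresses)   # every address is returned each time
        self._addresses.rotate(-1)       # the next query starts with the next address
        return result


rr = RoundRobinRecordSet(["1.2.3.4", "5.6.7.8"])
answers = [rr.answer() for _ in range(3)]
```

Clients that connect to the first address in the answer spread across both servers, but nothing in this scheme notices when one of them is down.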
GeoDNS
Return different DNS answers based on client location. Client in US → 1.2.3.4 (US datacenter). Client in EU → 5.6.7.8 (EU datacenter). Implemented at the authoritative name server, which infers the client's location from the EDNS0 Client Subnet extension when the query carries it, or otherwise from the recursive resolver's IP address. Used by CDNs and multi-region architectures. TTL should be short (around 60s) so traffic can be re-routed quickly on failover.
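The selection step can be sketched as a subnet lookup. The subnet-to-region table below is made up (it uses documentation address ranges), and `geo_answer` and `DEFAULT_REGION` are hypothetical names; real GeoDNS providers rely on much larger geolocation databases.

```python
import ipaddress

# Hypothetical mapping from client subnets to regions (documentation ranges, not real geodata).
REGION_BY_SUBNET = [
    (ipaddress.ip_network("198.51.100.0/24"), "us"),
    (ipaddress.ip_network("203.0.113.0/24"), "eu"),
]
ANSWER_BY_REGION = {"us": "1.2.3.4", "eu": "5.6.7.8"}
DEFAULT_REGION = "us"   # assumption: fall back to one region when the location is unknown


def geo_answer(client_ip):
    """Pick the A record for the region whose subnet contains client_ip."""
    addr = ipaddress.ip_address(client_ip)
    for network, region in REGION_BY_SUBNET:
        if addr in network:
            return ANSWER_BY_REGION[region]
    return ANSWER_BY_REGION[DEFAULT_REGION]
```

Here `client_ip` stands for whichever address the authoritative server can see: the client subnet from the query if present, otherwise the recursive resolver's own address.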
DNS Caching Design (Recursive Resolver)
import time

class DNSCache:
    def __init__(self):
        self.cache = {}  # {(name, type): (records, expiry)}

    def get(self, name, rtype):
        key = (name, rtype)
        if key in self.cache:
            records, expiry = self.cache[key]
            if time.time() < expiry:
                return records  # cache hit
            del self.cache[key]  # expired: drop the stale entry
        return None  # cache miss

    def put(self, name, rtype, records, ttl):
        # Store the records with an absolute expiry time derived from the TTL
        self.cache[(name, rtype)] = (records, time.time() + ttl)
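Production DNS caches typically use a hash table with LRU eviction to bound memory. Negative caching: cache an NXDOMAIN (name does not exist) response as well, for a TTL taken from the zone's SOA record (RFC 2308 uses the smaller of the SOA record's own TTL and its MINIMUM field). This prevents repeated upstream queries for non-existent domains.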
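One way to combine LRU eviction with negative caching is sketched below, using `collections.OrderedDict`. The class name `LRUDNSCache`, the `NXDOMAIN` sentinel, and the injectable `clock` parameter are assumptions for this example and not how any particular resolver implements it.

```python
import time
from collections import OrderedDict

NXDOMAIN = object()   # sentinel stored for names known not to exist


class LRUDNSCache:
    def __init__(self, capacity, clock=time.monotonic):
        self.capacity = capacity
        self.clock = clock                  # injectable so expiry can be exercised in tests
        self.entries = OrderedDict()        # (name, rtype) -> (value, expiry)

    def get(self, name, rtype):
        key = (name, rtype)
        entry = self.entries.get(key)
        if entry is None:
            return None                     # miss: the caller must resolve upstream
        value, expiry = entry
        if self.clock() >= expiry:
            del self.entries[key]           # expired: treat as a miss
            return None
        self.entries.move_to_end(key)       # mark as most recently used
        return value                        # records, or NXDOMAIN for a negative hit

    def put(self, name, rtype, value, ttl):
        key = (name, rtype)
        self.entries[key] = (value, self.clock() + ttl)
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
```

A caller would store `NXDOMAIN` with the negative TTL when the authoritative server reports that a name does not exist, and treat a returned `NXDOMAIN` as a final answer rather than a miss.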
Key Design Decisions
- Hierarchical resolution with caching at each level: a warm cache answers with a single O(1) average-case hash lookup
- Short TTL before planned changes: clients pick up the new address quickly at cutover or failover
- GeoDNS for latency-based routing: users reach their nearest datacenter
- DNS is not for session affinity: use a load balancer for sticky sessions
- Negative caching: prevents applications from repeatedly triggering upstream queries for names that do not exist