System Design Interview: How DNS Works and How to Design It
Understanding DNS (Domain Name System) is fundamental to backend engineering and system design. DNS translates human-readable domain names into IP addresses. The topic comes up in interviews at Cloudflare, Google, and Amazon, and for any role involving infrastructure or distributed systems.
What DNS Does
DNS is a hierarchical, distributed database that maps domain names to IP addresses (and other record types). When you type techinterview.org, your computer uses DNS to find the IP address of the server hosting that site.
DNS Hierarchy
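Root DNS Servers (13 logical servers, a.root-servers.net through m.root-servers.net, run by 12 independent operators; the root zone is coordinated by ICANN through IANA)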
|
TLD DNS Servers (Top-Level Domain)
- .com, .org, .net, .io, .gov, etc.
- Operated by registries (Verisign for .com)
|
Authoritative DNS Servers
- Holds the actual DNS records for a domain
- Operated by domain owner or DNS provider (Cloudflare, Route53)
|
Recursive Resolver (DNS Resolver; not a tier of the hierarchy, it queries the tiers above on the client's behalf)
- Operated by ISPs, Google (8.8.8.8), Cloudflare (1.1.1.1)
- Caches results, queries hierarchy on cache miss
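To make the hierarchy concrete, here is a minimal sketch that lists the zones a resolver could pass through for a name, from the root down. The function name delegation_chain is hypothetical, and the sketch assumes every label boundary is a zone cut, which is not always true in practice (a domain owner may keep www inside the parent zone).

```python
def delegation_chain(domain: str) -> list[str]:
    """Return candidate zones from the root down to the full name."""
    labels = domain.rstrip(".").split(".")
    chain = ["."]  # the root zone
    # Build suffixes right to left: "org.", then "techinterview.org."
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


print(delegation_chain("techinterview.org"))
# ['.', 'org.', 'techinterview.org.']
```

Reading the list left to right mirrors the diagram: root, then TLD, then the authoritative zone for the domain.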
DNS Resolution: Full Lookup
User types techinterview.org in browser:
1. Browser checks its own DNS cache
2. OS checks the hosts file (/etc/hosts) and its own resolver cache
3. Query sent to Recursive Resolver (e.g., 8.8.8.8)
4. Resolver checks its cache - if miss:
5. Resolver queries a Root Server: "who handles .org?"
Root responds: "TLD server at 199.19.56.1"
6. Resolver queries .org TLD server: "who handles techinterview.org?"
TLD responds: "Authoritative server at ns1.cloudflare.com"
7. Resolver queries Authoritative Server: "What is the IP for techinterview.org?"
Authoritative responds: "67.207.83.166, TTL=300"
8. Resolver caches the answer (TTL=300 seconds)
9. Browser connects to 67.207.83.166
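The lookup steps above can be sketched as a toy simulation. Each "server" here is just a dictionary filled with the example values from the walkthrough; the names ROOT, TLD, AUTHORITATIVE, and resolve are illustrative, and no real network traffic is involved.

```python
# Toy data: each tier only knows how to refer the resolver one level down.
ROOT = {"org.": "199.19.56.1"}
TLD = {"199.19.56.1": {"techinterview.org.": "ns1.cloudflare.com"}}
AUTHORITATIVE = {
    "ns1.cloudflare.com": {"techinterview.org.": ("67.207.83.166", 300)}
}


def resolve(name, cache, now):
    """Walk root -> TLD -> authoritative, caching the answer until its TTL expires.

    `now` is the current time in seconds, passed in so the example is deterministic.
    """
    fqdn = name.rstrip(".") + "."
    hit = cache.get(fqdn)
    if hit and hit[1] > now:
        return hit[0], "cache"  # step 4: resolver cache hit

    tld_label = fqdn.split(".")[-2] + "."       # "org."
    tld_server = ROOT[tld_label]                # step 5: root referral
    auth_server = TLD[tld_server][fqdn]         # step 6: TLD referral
    ip, ttl = AUTHORITATIVE[auth_server][fqdn]  # step 7: final answer
    cache[fqdn] = (ip, now + ttl)               # step 8: cache with expiry
    return ip, "authoritative"


cache = {}
print(resolve("techinterview.org", cache, now=0))    # full walk
print(resolve("techinterview.org", cache, now=120))  # served from cache
```

A real resolver also caches the referrals themselves (the NS records for .org, for example), so most lookups skip the root entirely.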
DNS Record Types
A - IPv4 address mapping techinterview.org -> 67.207.83.166
AAAA - IPv6 address mapping
CNAME - Canonical name (alias) www -> techinterview.org
MX - Mail server techinterview.org -> mail.google.com (priority 10)
TXT - Text data (SPF, DKIM, verification)
NS - Name servers for the domain techinterview.org NS ns1.cloudflare.com
SOA - Start of Authority (zone metadata)
PTR - Reverse DNS (IP -> domain) 67.207.83.166 -> techinterview.org
SRV - Service location _http._tcp.example.com -> server:port
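PTR records are looked up under a special reversed name rather than the IP itself. As a small illustration, Python's standard ipaddress module can produce that name; the helper ptr_name is a hypothetical wrapper added for this example.

```python
import ipaddress


def ptr_name(ip: str) -> str:
    """Return the domain name a resolver queries for a reverse (PTR) lookup."""
    return ipaddress.ip_address(ip).reverse_pointer


print(ptr_name("67.207.83.166"))  # 166.83.207.67.in-addr.arpa
```

The octets are reversed so that delegation follows the same right-to-left hierarchy as ordinary names. IPv6 addresses use the same idea under ip6.arpa, one hex digit per label.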
TTL and Caching
Every DNS record has a TTL (Time to Live) in seconds. Resolvers cache a record for the duration of its TTL. Once the TTL expires, the resolver re-queries the authoritative server on the next lookup.
- Low TTL (60s): faster propagation for frequent changes, more resolver load
- High TTL (3600s, 86400s): fewer resolver queries, slower propagation
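- Before a migration: lower the TTL to 60s in advance, wait for the old (higher) TTL to expire so every cache holds the short-TTL record, then change the IP. Raise the TTL again once the migration is stable.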
- DNS propagation: time for TTL-cached old records to expire globally (up to 48h for high-TTL records)
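The migration timing is simple arithmetic, sketched below. The function migration_plan and its field names are hypothetical, and the sketch assumes resolvers honor TTLs exactly, which some do not (a few clamp very low or very high values).

```python
def migration_plan(old_ttl: int, new_ttl: int, lowered_at: int) -> dict:
    """Compute key times (in seconds) for a TTL-lowering migration.

    lowered_at: when the TTL was reduced from old_ttl to new_ttl.
    """
    # Any cache may still hold the record with the old TTL until this point.
    earliest_cutover = lowered_at + old_ttl
    # After the IP change, stale answers survive at most one short TTL.
    fully_switched = earliest_cutover + new_ttl
    return {
        "earliest_cutover": earliest_cutover,
        "fully_switched": fully_switched,
        "max_stale_seconds": new_ttl,
    }


plan = migration_plan(old_ttl=86400, new_ttl=60, lowered_at=0)
print(plan)
# {'earliest_cutover': 86400, 'fully_switched': 86460, 'max_stale_seconds': 60}
```

Without lowering the TTL first, clients could keep hitting the old IP for up to 86400 seconds after the change instead of 60.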
Designing a DNS Server
Authoritative DNS
DNS Query (UDP port 53; TCP port 53 for zone transfers and for responses too large for UDP):
Query: {domain: "techinterview.org", type: "A"}
DNS Server:
1. Parse UDP packet (RFC 1035 format)
2. Look up zone file in memory:
   zones = {
       "techinterview.org": {
           "A": ["67.207.83.166"],
           "MX": [{"priority": 10, "host": "mail.google.com"}],
           "TTL": 300
       }
   }
3. Return DNS response packet with answer
Zone storage: BIND zone files or database (PostgreSQL, Cassandra)
Cache zone data in memory for fast lookup (microsecond response)
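As a rough sketch of the parse, look up, respond loop, the code below builds a query in RFC 1035 wire format and answers it from an in-memory zone table. It is a simplification under several assumptions: only A records, one question per packet, no name compression in incoming queries, and no socket handling. The function names are illustrative.

```python
import socket
import struct

ZONES = {"techinterview.org": {"A": ["67.207.83.166"], "TTL": 300}}
TYPE_A, CLASS_IN = 1, 1


def encode_name(name: str) -> bytes:
    """Encode a name as length-prefixed labels terminated by a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    return header + encode_name(name) + struct.pack("!HH", TYPE_A, CLASS_IN)


def parse_question(packet: bytes):
    """Return (query_id, name, qtype, end_offset) for a one-question query."""
    query_id = struct.unpack("!H", packet[:2])[0]
    labels, pos = [], 12  # the question follows the 12-byte header
    while packet[pos] != 0:
        length = packet[pos]
        labels.append(packet[pos + 1 : pos + 1 + length].decode("ascii"))
        pos += 1 + length
    qtype, _qclass = struct.unpack("!HH", packet[pos + 1 : pos + 5])
    return query_id, ".".join(labels), qtype, pos + 5


def answer(packet: bytes) -> bytes:
    query_id, name, qtype, end = parse_question(packet)
    known = name.lower() in ZONES
    zone = ZONES.get(name.lower(), {})
    addrs = zone.get("A", []) if qtype == TYPE_A else []
    # QR=1 (response) and AA=1 (authoritative); RCODE 3 = NXDOMAIN
    flags = 0x8400 | (0 if known else 3)
    header = struct.pack("!HHHHHH", query_id, flags, 1, len(addrs), 0, 0)
    body = packet[12:end]  # echo the question section back
    for addr in addrs:
        # 0xC00C is a compression pointer to the name at offset 12
        body += struct.pack("!HHHIH", 0xC00C, TYPE_A, CLASS_IN, zone["TTL"], 4)
        body += socket.inet_aton(addr)
    return header + body


response = answer(build_query("techinterview.org"))
print(socket.inet_ntoa(response[-4:]))  # 67.207.83.166
```

A production server would wrap answer() in a UDP receive loop, handle truncation and TCP fallback, and support the other record types.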
Recursive Resolver
On cache miss:
1. Query a root server (addresses come from a built-in root hints list covering the 13 root servers, each reached via anycast)
2. Query TLD server (from root response)
3. Query authoritative server (from TLD response)
4. Cache result with TTL
5. Return to client
Cache: LRU with TTL expiry
Scale: horizontal (stateless resolvers + shared Redis cache)
DNSSEC: validate cryptographic signatures at each level
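The "LRU with TTL expiry" cache can be sketched with an OrderedDict, as below. The class name DnsCache and its methods are hypothetical, and the clock is injectable only so the behavior can be demonstrated without waiting. This is a single-process sketch, not a shared or thread-safe cache.

```python
import time
from collections import OrderedDict


class DnsCache:
    """Bounded cache: entries expire by TTL and are evicted in LRU order."""

    def __init__(self, capacity: int, clock=time.monotonic):
        self.capacity = capacity
        self.clock = clock
        self._entries = OrderedDict()  # (name, rtype) -> (value, expires_at)

    def get(self, name: str, rtype: str):
        key = (name, rtype)
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # TTL elapsed: treat as a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, name: str, rtype: str, value, ttl: int) -> None:
        key = (name, rtype)
        self._entries[key] = (value, self.clock() + ttl)
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used


cache = DnsCache(capacity=10_000)
cache.put("techinterview.org", "A", "67.207.83.166", ttl=300)
print(cache.get("techinterview.org", "A"))  # 67.207.83.166
```

Keying on (name, record type) matters because an A lookup and an MX lookup for the same name are separate cache entries with separate TTLs.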
DNS Load Balancing
- Round-robin DNS: return multiple A records, clients pick one. Simple but no health checking.
- Geo DNS: return different IPs based on client location (Route53 Geolocation, Cloudflare). Route US users to US servers.
- Latency-based routing: return IP of lowest-latency region for that client (Route53 Latency). Requires latency measurements per region.
- Weighted routing: split traffic by weight (90% prod, 10% canary). Useful for gradual deployments.
- Failover: primary + health check; automatically switch to secondary on primary failure.
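Two of these policies, weighted routing and failover, are small enough to sketch directly. The function names and the 192.0.2.x addresses (a documentation range) are illustrative, and the health check is passed in as a plain callable rather than modeling a real probe.

```python
import random


def pick_weighted(targets: dict, rng=random) -> str:
    """targets maps IP -> weight; pick one IP proportionally to its weight."""
    ips = list(targets)
    weights = [targets[ip] for ip in ips]
    return rng.choices(ips, weights=weights, k=1)[0]


def pick_failover(primary: str, secondary: str, is_healthy) -> str:
    """Answer with the primary unless its health check fails."""
    return primary if is_healthy(primary) else secondary


# 90% of answers point at prod, 10% at the canary.
canary_split = {"192.0.2.10": 90, "192.0.2.20": 10}
print(pick_weighted(canary_split))

# The health check reports the primary as down, so the secondary is returned.
print(pick_failover("192.0.2.10", "192.0.2.20", is_healthy=lambda ip: False))
```

Because resolvers cache each answer for its TTL, the split applies per resolver, not per end user, so the observed traffic ratio only approximates the configured weights.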
DNS at Scale (Cloudflare/Google Scale)
- Anycast: same IP announced from hundreds of PoPs globally. Client connects to nearest PoP by BGP routing. Provides both geo-distribution and DDoS resilience.
- 100B+ DNS queries/day at Cloudflare: in-memory cache with LRU eviction, microsecond latency
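- DDoS protection: DNS amplification attacks send small queries with a spoofed source IP so that large responses flood the victim. Mitigate with Response Rate Limiting (RRL) and source validation (for example DNS cookies, or forcing suspicious clients to retry over TCP).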
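The rate-limiting idea can be sketched as a token bucket per source prefix. This is a deliberate simplification: real RRL implementations also key on the response being sent and can return truncated replies instead of dropping. The class name, the /24 grouping, and the parameters are assumptions for illustration, and the sketch handles IPv4 sources only.

```python
import ipaddress
import time


class ResponseRateLimiter:
    """Token bucket per /24: `rate` responses per second, up to `burst`."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self._buckets = {}  # prefix -> (tokens, last_seen)

    def allow(self, source_ip: str) -> bool:
        # Group by /24 so spoofed addresses in one subnet share a budget.
        prefix = ipaddress.ip_network(f"{source_ip}/24", strict=False)
        now = self.clock()
        tokens, last = self._buckets.get(prefix, (self.burst, now))
        # Refill for the time elapsed since the last query, capped at burst.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self._buckets[prefix] = (tokens, now)
            return False  # over budget: drop or truncate the response
        self._buckets[prefix] = (tokens - 1, now)
        return True


limiter = ResponseRateLimiter(rate=5, burst=10)
print(limiter.allow("198.51.100.7"))  # True: the bucket starts full
```

Limiting responses rather than queries is the point: the victim never sent the queries, so capping what is sent toward its prefix bounds the amplification.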
Interview Tips
- Walk through the full resolution chain: browser cache → OS cache → recursive resolver → root → TLD → authoritative
- Explain TTL and why you lower it before a migration
- Know the common record types (A, CNAME, MX, TXT, NS)
- Discuss DNS load balancing strategies (geo, latency, weighted, failover)
- Mention anycast for global distribution of DNS infrastructure