System Design: Load Balancer — How to Design Traffic Distribution at Scale
Load balancers are one of the most fundamental infrastructure components you’ll design in system design interviews. They distribute incoming traffic across multiple servers to maximize throughput, minimize latency, and eliminate single points of failure.
What a Load Balancer Does
- Traffic distribution — spreads requests across healthy backend servers
- Health checking — detects and routes around failed servers
- SSL termination — decrypts HTTPS at the LB, passes HTTP to backends
- Session persistence — sticky sessions route same client to same server
- DDoS protection — absorbs traffic spikes, rate-limits abusive IPs
Layer 4 vs Layer 7 Load Balancing
| Dimension | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| Operates on | TCP/UDP packets | HTTP/HTTPS requests |
| Content awareness | No — sees IP/port only | Yes — sees URL, headers, cookies |
| Performance | Faster (no parsing) | Slower (full HTTP parsing) |
| Routing flexibility | Low | High (path-based, header-based) |
| Example tools | AWS NLB, HAProxy TCP | AWS ALB, NGINX, Envoy |
Load Balancing Algorithms
Round Robin
Requests rotate sequentially across servers. Simple, works well when servers are homogeneous. Weighted round robin assigns more requests to higher-capacity servers.
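As a sketch, weighted round robin can be expressed as cycling through an expanded server list (server names and weights below are illustrative; production balancers such as NGINX use a smooth variant that interleaves picks rather than bunching them):

```python
from itertools import cycle

def weighted_round_robin(servers):
    """Yield servers in proportion to their weights.

    `servers` is a list of (name, weight) pairs. Expanding the list is the
    simplest possible sketch; it bunches a server's turns together.
    """
    expanded = [name for name, weight in servers for _ in range(weight)]
    return cycle(expanded)

rr = weighted_round_robin([("app1", 3), ("app2", 1)])
# app1 receives 3 of every 4 requests
print([next(rr) for _ in range(8)])
# ['app1', 'app1', 'app1', 'app2', 'app1', 'app1', 'app1', 'app2']
```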
Least Connections
Routes to the server with fewest active connections. Better than round robin for long-lived connections (WebSockets, file uploads). Least Response Time adds latency measurement.
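The selection rule itself is a one-liner once the balancer tracks per-backend connection counts (the counts below are made-up examples of that bookkeeping):

```python
def least_connections(active):
    """Pick the backend with the fewest active connections.

    `active` maps server name -> current connection count, state a real LB
    would update as connections open and close.
    """
    return min(active, key=active.get)

conns = {"app1": 12, "app2": 4, "app3": 9}
print(least_connections(conns))  # app2
```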
Consistent Hashing
Maps both requests and servers onto a hash ring. The same client (keyed by IP or session ID) always routes to the same server unless that server fails. When servers are added or removed, only ~1/n of keys are remapped (versus nearly all keys with modular hashing), which minimizes cache invalidation. Used by distributed caches (Memcached clusters) and partitioned stores (DynamoDB).
```python
import hashlib
from bisect import bisect_left

class ConsistentHashRing:
    def __init__(self, servers, virtual_nodes=150):
        self.ring = {}         # ring position -> server
        self.sorted_keys = []  # sorted ring positions
        for server in servers:
            for i in range(virtual_nodes):
                key = self._hash(f"{server}-{i}")
                self.ring[key] = server
                self.sorted_keys.append(key)
        self.sorted_keys.sort()

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def get_server(self, request_key):
        h = self._hash(request_key)
        # first ring position at or after h (binary search)
        idx = bisect_left(self.sorted_keys, h)
        if idx == len(self.sorted_keys):
            idx = 0  # wrap around to the start of the ring
        return self.ring[self.sorted_keys[idx]]
```
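A small self-contained demo makes the "only ~1/n remapped" property concrete: build a ring of four servers, remove one, and count how many of 1,000 synthetic request keys change destination (server names and key format are arbitrary):

```python
import hashlib

def ring(servers, vnodes=150):
    """Build {ring position: server} for each server's virtual nodes."""
    return {
        int(hashlib.md5(f"{s}-{i}".encode()).hexdigest(), 16): s
        for s in servers
        for i in range(vnodes)
    }

def lookup(positions, key):
    """Route key to the first ring position at or after its hash, wrapping."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    keys = sorted(positions)
    for k in keys:
        if h <= k:
            return positions[k]
    return positions[keys[0]]  # wrap around

before = ring(["s1", "s2", "s3", "s4"])
after = ring(["s1", "s2", "s3"])  # s4 removed
requests = [f"client-{n}" for n in range(1000)]
moved = sum(lookup(before, r) != lookup(after, r) for r in requests)
print(f"{moved / 1000:.0%} of requests remapped")  # roughly 1/4, not 100%
```

Only the keys that previously landed on the removed server move; everything else keeps its assignment, which is exactly why caches stay warm through membership changes.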
Health Checks
Load balancers detect unhealthy backends via:
- TCP health check — can the server accept a connection?
- HTTP health check — does GET /health return 200?
- Application health check — does /health verify DB connectivity, cache, dependencies?
Typical config: check every 5s, mark unhealthy after 2 failures, restore after 3 successes. Circuit breaker pattern adds retry budgets and exponential backoff.
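The threshold logic above is a small state machine. A sketch, using the assumed values from the paragraph (unhealthy after 2 consecutive failures, healthy after 3 consecutive successes); the actual TCP/HTTP probe is left out:

```python
class HealthChecker:
    """Track one backend's health from consecutive probe results."""

    def __init__(self, unhealthy_after=2, healthy_after=3):
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self.healthy = True
        self.fails = 0
        self.successes = 0

    def record(self, probe_ok):
        """Record one probe result; return current health."""
        if probe_ok:
            self.fails = 0
            self.successes += 1
            if not self.healthy and self.successes >= self.healthy_after:
                self.healthy = True  # restored after enough successes
        else:
            self.successes = 0
            self.fails += 1
            if self.healthy and self.fails >= self.unhealthy_after:
                self.healthy = False  # ejected from rotation
        return self.healthy

hc = HealthChecker()
hc.record(False)         # 1st failure -> still healthy
print(hc.record(False))  # 2nd failure -> False (marked unhealthy)
```

Requiring consecutive successes before restoring a backend prevents a flapping server from oscillating in and out of rotation.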
High Availability Architecture
```
Internet
    │
┌───▼───────────────────────────────┐
│ DNS (Route 53 / Cloudflare)       │
│ GeoDNS → nearest PoP              │
└───┬───────────────────────────────┘
    │
┌───▼───────────────────────────────┐
│ Edge / CDN layer (static, caching)│
└───┬───────────────────────────────┘
    │
┌───▼──────────────┐   ┌────────────────────┐
│ LB Primary       │──▶│ LB Standby         │
│ (active)         │   │ (heartbeat/VRRP)   │
└───┬──────────────┘   └────────────────────┘
    │
    ├──▶ App Server 1
    ├──▶ App Server 2
    └──▶ App Server N
```
Active-passive: the standby takes over the virtual IP if the primary fails (typically a few seconds with VRRP heartbeat; DNS-based failover can take 30-60s due to TTLs). Active-active: both LBs handle traffic simultaneously (no failover gap, but requires session/state sync).
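The standby's takeover decision reduces to a timeout rule on heartbeats. A minimal sketch (the 1s interval and 3-miss threshold are illustrative; real VRRP has its own advertisement timers and priority election):

```python
def standby_should_takeover(last_heartbeat, now, interval=1.0, missed=3):
    """Claim the virtual IP after `missed` heartbeat intervals of silence
    from the primary."""
    return (now - last_heartbeat) > interval * missed

print(standby_should_takeover(last_heartbeat=99.5, now=100.0))  # False: primary alive
print(standby_should_takeover(last_heartbeat=95.0, now=100.0))  # True: 5s of silence
```

The threshold is a trade-off: too low and network jitter triggers spurious failovers (and split-brain risk); too high and clients see a longer outage.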
Global Load Balancing
- GeoDNS — routes clients to nearest data center based on IP geolocation
- Anycast — same IP advertised from multiple PoPs; BGP routes to nearest (used by Cloudflare, AWS Global Accelerator)
- Latency-based routing — measure actual latency to each region, route to lowest
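Latency-based routing can be sketched as averaging recent probe RTTs per region and picking the minimum (region names and measurements below are made-up examples):

```python
from statistics import mean

def pick_region(probes_ms):
    """Route to the region with the lowest average recent probe latency.

    `probes_ms` maps region -> list of recent RTT samples in milliseconds.
    """
    return min(probes_ms, key=lambda region: mean(probes_ms[region]))

probes = {
    "us-east-1": [40.1, 43.8, 41.5],
    "eu-west-1": [115.0, 121.2, 119.9],
}
print(pick_region(probes))  # us-east-1
```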
Interview Design Questions
- “Design a load balancer that handles 1M RPS” — focus on horizontal scaling, consistent hashing, health checks
- “How do you handle sticky sessions without a centralized session store?” — consistent hashing by session ID
- “How does AWS ALB differ from NLB?” — Layer 7 vs Layer 4, use cases
- “What happens when a backend goes down mid-request?” — connection draining (graceful shutdown), circuit breaker
Key Metrics to Monitor
- Requests per second (RPS) per backend
- Active connections per backend
- P50/P95/P99 latency
- Error rate (5xx responses)
- Health check failure rate
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is the difference between Layer 4 and Layer 7 load balancing?",
      "acceptedAnswer": { "@type": "Answer", "text": "Layer 4 load balancers operate at the TCP/UDP level—they see IP addresses and ports but not HTTP content. They are faster because they do no content parsing. Layer 7 load balancers operate at the HTTP/HTTPS level and can route based on URL path, headers, cookies, and body content, enabling advanced patterns like path-based routing, A/B testing, and SSL termination. Use L4 (AWS NLB) for raw throughput; use L7 (AWS ALB, NGINX) when content-aware routing is needed." }
    },
    {
      "@type": "Question",
      "name": "How does consistent hashing work in load balancing?",
      "acceptedAnswer": { "@type": "Answer", "text": "Consistent hashing maps both servers and requests to positions on a virtual ring using a hash function. Each request routes to the nearest server clockwise on the ring. When a server is added or removed, only ~1/n of requests are remapped (vs all requests with modular hashing). Virtual nodes (each server mapped to 150+ ring positions) ensure uniform distribution even with few servers." }
    },
    {
      "@type": "Question",
      "name": "How do you achieve high availability for a load balancer itself?",
      "acceptedAnswer": { "@type": "Answer", "text": "Run load balancers in active-passive or active-active pairs with a floating virtual IP (VIP). In active-passive mode, the standby monitors the primary via heartbeat (VRRP) and takes over the VIP within seconds if the primary fails. In active-active mode, both handle traffic simultaneously with shared state sync, eliminating failover delay. At the DNS level, use health-check-based GeoDNS to route around entire datacenter failures." }
    }
  ]
}