System Design Interview: Design a Load Balancer


Load balancers are one of the most fundamental infrastructure components you’ll design in system design interviews. They distribute incoming traffic across multiple servers to maximize throughput, minimize latency, and eliminate single points of failure.

What a Load Balancer Does

  • Traffic distribution — spreads requests across healthy backend servers
  • Health checking — detects and routes around failed servers
  • SSL termination — decrypts HTTPS at the LB, passes HTTP to backends
  • Session persistence — sticky sessions route same client to same server
  • DDoS protection — absorbs traffic spikes, rate-limits abusive IPs

Layer 4 vs Layer 7 Load Balancing

| Dimension           | Layer 4 (Transport)        | Layer 7 (Application)            |
|---------------------|----------------------------|----------------------------------|
| Operates on         | TCP/UDP packets            | HTTP/HTTPS requests              |
| Content awareness   | No — sees IP/port only     | Yes — sees URL, headers, cookies |
| Performance         | Faster (no parsing)        | Slower (full HTTP parsing)       |
| Routing flexibility | Low                        | High (path-based, header-based)  |
| Example tools       | AWS NLB, HAProxy (TCP mode)| AWS ALB, NGINX, Envoy            |

Load Balancing Algorithms

Round Robin

Requests rotate sequentially across servers. Simple, works well when servers are homogeneous. Weighted round robin assigns more requests to higher-capacity servers.
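Weighted round robin can be sketched by expanding each server into the rotation in proportion to its weight. This is a minimal illustration (server names and weights are made up); production implementations such as NGINX use a smoother interleaving that avoids sending bursts to one server.

```python
from itertools import cycle

class WeightedRoundRobin:
    """Minimal weighted round robin: each server appears in the
    rotation in proportion to its integer weight."""
    def __init__(self, weights):
        # weights: {"server-name": weight}; expand and rotate forever.
        pool = [s for server, w in weights.items() for s in [server] * w]
        self._cycle = cycle(pool)

    def next_server(self):
        return next(self._cycle)

# s1 has 3x the capacity of s2, so it receives 3 of every 4 requests.
wrr = WeightedRoundRobin({"s1": 3, "s2": 1})
picks = [wrr.next_server() for _ in range(4)]
```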

Least Connections

Routes to the server with fewest active connections. Better than round robin for long-lived connections (WebSockets, file uploads). Least Response Time adds latency measurement.
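The selection rule is simple to state in code: pick the minimum of a per-server connection counter. The sketch below is illustrative; a real load balancer updates these counters on connection open/close events.

```python
class LeastConnections:
    """Route each new request to the backend with the fewest
    active connections (illustrative sketch)."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def acquire(self):
        # min() breaks ties by iteration order of the dict.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Called when the connection closes.
        self.active[server] -= 1

lb = LeastConnections(["a", "b"])
first = lb.acquire()   # one backend gets the first connection
second = lb.acquire()  # the other now has fewer, so it gets the next
```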

Consistent Hashing

Maps requests to servers via a hash ring. The same client (IP or session ID) always routes to the same server unless that server fails. When servers are added or removed, only ~1/n of keys are remapped, which minimizes cache invalidation. Used by distributed caches (Memcached clusters) and partitioned datastores (DynamoDB).

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, virtual_nodes=150):
        # Each physical server gets many virtual nodes on the ring
        # so load stays roughly uniform even with few servers.
        self.ring = {}
        self.sorted_keys = []
        for server in servers:
            for i in range(virtual_nodes):
                key = self._hash(f"{server}-{i}")
                self.ring[key] = server
                self.sorted_keys.append(key)
        self.sorted_keys.sort()

    def _hash(self, s):
        # MD5 is fine here: we need uniformity, not cryptographic strength.
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def get_server(self, request_key):
        # Binary search for the first ring position >= the request's hash
        # (the nearest server clockwise), wrapping past the end of the ring.
        h = self._hash(request_key)
        idx = bisect.bisect_left(self.sorted_keys, h)
        if idx == len(self.sorted_keys):
            idx = 0  # wrap around to the start of the ring
        return self.ring[self.sorted_keys[idx]]

Health Checks

Load balancers detect unhealthy backends via:

  • TCP health check — can the server accept a connection?
  • HTTP health check — does GET /health return 200?
  • Application health check — does /health verify DB connectivity, cache, dependencies?

Typical config: check every 5s, mark unhealthy after 2 failures, restore after 3 successes. Circuit breaker pattern adds retry budgets and exponential backoff.
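The threshold logic above can be sketched as a small state tracker. The default thresholds are the example values from the text (2 consecutive failures to eject, 3 consecutive successes to restore); real load balancers run this per backend on a timer.

```python
class HealthTracker:
    """Per-backend health state with consecutive-failure ejection
    and consecutive-success restoration."""
    def __init__(self, unhealthy_after=2, healthy_after=3):
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self.healthy = True
        self._fails = 0
        self._oks = 0

    def record(self, check_passed):
        if check_passed:
            self._fails = 0
            self._oks += 1
            if not self.healthy and self._oks >= self.healthy_after:
                self.healthy = True   # restored after enough successes
        else:
            self._oks = 0
            self._fails += 1
            if self.healthy and self._fails >= self.unhealthy_after:
                self.healthy = False  # ejected from the pool
        return self.healthy

t = HealthTracker()
t.record(False)        # 1 failure: still healthy
t.record(False)        # 2 failures: marked unhealthy
for _ in range(3):
    t.record(True)     # 3 consecutive successes: restored
```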

High Availability Architecture

Internet
    │
┌───▼───────────────────────────────────────┐
│  DNS (Route 53 / Cloudflare)              │
│  GeoDNS → nearest PoP                     │
└───┬───────────────────────────────────────┘
    │
┌───▼───────────────────────────────────────┐
│  Edge / CDN layer (static, caching)       │
└───┬───────────────────────────────────────┘
    │
┌───▼──────────────┐   ┌────────────────────┐
│  LB Primary      │──▶│  LB Standby        │
│  (active)        │   │  (heartbeat/VRRP)  │
└───┬──────────────┘   └────────────────────┘
    │
    ├──▶ App Server 1
    ├──▶ App Server 2
    └──▶ App Server N

Active-passive: the standby takes over the virtual IP when the primary fails (seconds with VRRP heartbeat; 30-60s if failover relies on DNS). Active-active: both LBs handle traffic simultaneously (no failover gap, but requires session sync).
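The active-passive handoff reduces to a heartbeat timeout check. This is a deliberately simplified sketch (the timeout value is arbitrary); real deployments use keepalived/VRRP at the network layer, not application code.

```python
import time

class StandbyMonitor:
    """Standby LB promotes itself to active if the primary's
    heartbeat is older than `timeout` seconds."""
    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_heartbeat = time.monotonic()
        self.is_active = False

    def on_heartbeat(self):
        # Called whenever a heartbeat arrives from the primary.
        self.last_heartbeat = time.monotonic()

    def check(self, now=None):
        # `now` parameter allows deterministic testing.
        now = time.monotonic() if now is None else now
        if now - self.last_heartbeat > self.timeout:
            self.is_active = True  # take over the virtual IP
        return self.is_active

m = StandbyMonitor(timeout=3.0)
m.on_heartbeat()  # primary is alive; standby stays passive
```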

Global Load Balancing

  • GeoDNS — routes clients to nearest data center based on IP geolocation
  • Anycast — same IP advertised from multiple PoPs; BGP routes to nearest (used by Cloudflare, AWS Global Accelerator)
  • Latency-based routing — measure actual latency to each region, route to lowest
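Latency-based routing boils down to comparing recent probe measurements per region and picking the minimum. A minimal sketch (region names and sample values are illustrative):

```python
import statistics

def pick_region(latency_samples):
    """Choose the region with the lowest median probe latency.
    Median resists outliers better than a single measurement."""
    medians = {region: statistics.median(samples)
               for region, samples in latency_samples.items()}
    return min(medians, key=medians.get)

best = pick_region({
    "us-east-1": [42, 45, 44],      # recent probe RTTs in ms
    "eu-west-1": [118, 120, 115],
    "ap-south-1": [210, 205, 208],
})
```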

Interview Design Questions

  • “Design a load balancer that handles 1M RPS” — focus on horizontal scaling, consistent hashing, health checks
  • “How do you handle sticky sessions without a centralized session store?” — consistent hashing by session ID
  • “How does AWS ALB differ from NLB?” — Layer 7 vs Layer 4, use cases
  • “What happens when a backend goes down mid-request?” — connection draining (graceful shutdown), circuit breaker
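Connection draining, mentioned in the last question, can be sketched as a two-phase shutdown: refuse new requests, then wait (up to a deadline) for in-flight ones to finish. The class and parameter names here are hypothetical.

```python
import threading
import time

class DrainableBackend:
    """Backend being removed from the pool: stops accepting new
    requests but lets in-flight requests complete."""
    def __init__(self):
        self.draining = False
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_accept(self):
        with self._lock:
            if self.draining:
                return False        # LB should route elsewhere
            self.in_flight += 1
            return True

    def finish(self):
        with self._lock:
            self.in_flight -= 1    # a request completed

    def drain(self, deadline=30.0, poll=0.01):
        self.draining = True
        start = time.monotonic()
        while self.in_flight > 0 and time.monotonic() - start < deadline:
            time.sleep(poll)        # wait for in-flight requests
        return self.in_flight == 0  # True = clean shutdown

b = DrainableBackend()
b.try_accept()     # request arrives before draining starts
b.finish()         # ...and completes
ok = b.drain(deadline=0.1)   # nothing in flight: drains immediately
```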

Key Metrics to Monitor

  • Requests per second (RPS) per backend
  • Active connections per backend
  • P50/P95/P99 latency
  • Error rate (5xx responses)
  • Health check failure rate
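Percentile latency (P50/P95/P99) can be computed from a window of samples with the nearest-rank convention, sketched below. Production monitoring usually relies on streaming sketches (HDRHistogram, t-digest) rather than sorting raw samples.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at 1-based rank
    ceil(p/100 * n) in the sorted samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 14, 100, 13, 16, 14, 15, 13, 250]
p50 = percentile(latencies_ms, 50)  # typical request
p99 = percentile(latencies_ms, 99)  # tail latency dominated by outliers
```

Note how a handful of slow outliers barely move P50 but dominate P99, which is why tail percentiles are monitored separately.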
