A load balancer sits in front of a pool of servers and distributes incoming requests across them. It is one of the first components you add when a single server can no longer handle your traffic, and it appears in virtually every system design interview. The question is rarely “do you need a load balancer?” — it’s “which algorithm, which layer, and how do you handle failures?”
Strategy
Before describing a load balancer, make sure you’re solving the right problem:
- Throughput: One server is at 100% CPU — distribute requests across more servers.
- Availability: If one server crashes, traffic must automatically route to healthy ones.
- Geographic distribution: Route users to the nearest data center (DNS-based load balancing, Anycast).
L4 vs. L7 Load Balancers
This is the most important distinction to get right.
L4 (Transport Layer)
Operates at the TCP/UDP level. Routes traffic based on IP address and port number. It doesn’t inspect the content of packets — it just forwards TCP connections.
- Fast and low-overhead — minimal processing per packet.
- Can’t make routing decisions based on content (URL path, headers, cookies).
- One TCP connection from client goes to one backend for its lifetime (connection-level routing).
Use when: Raw TCP throughput, non-HTTP protocols (databases, game servers, streaming), or when you need minimal latency overhead.
L7 (Application Layer)
Operates at the HTTP/HTTPS level. Inspects request content — URL, headers, cookies, body. Can make intelligent routing decisions.
- Route `/api/*` to API servers and `/static/*` to a CDN or file servers.
- Route based on the `Host` header (virtual hosting).
- Terminate TLS — decrypt HTTPS once at the load balancer, forward plain HTTP to backends.
- Set sticky sessions via cookies.
- Higher overhead than L4 (must parse HTTP).
Use when: Web applications, APIs, microservices, anything where you need to route by URL or header. nginx, HAProxy (in HTTP mode), AWS ALB, and Cloudflare are L7 load balancers.
```nginx
# nginx L7 routing example
upstream api_servers {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

upstream static_servers {
    server 10.0.0.3:80;
}

server {
    location /api/ {
        proxy_pass http://api_servers;
    }
    location /static/ {
        proxy_pass http://static_servers;
    }
}
```
Load Balancing Algorithms
Round Robin
Requests are distributed sequentially across servers. Request 1 → Server A, Request 2 → Server B, Request 3 → Server C, Request 4 → Server A, and so on.
Pros: Simple. Works well when all servers have equal capacity and requests have similar cost.
Cons: Doesn’t account for server load. If one server is processing a slow request, it still receives the next one in rotation. Weighted round-robin fixes this: assign more capacity to more powerful servers.
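As a minimal sketch (server names and weights are placeholders), weighted round robin can be implemented by expanding each server into the rotation proportionally to its weight:

```python
from itertools import cycle

class RoundRobin:
    """Cycle through servers; weights expand a server's share of the rotation."""
    def __init__(self, servers, weights=None):
        weights = weights or {s: 1 for s in servers}
        # Weighted round robin: repeat each server in proportion to its weight.
        expanded = [s for s in servers for _ in range(weights[s])]
        self._iter = cycle(expanded)

    def next_server(self):
        return next(self._iter)

lb = RoundRobin(["a", "b"], weights={"a": 2, "b": 1})
# Rotation: a, a, b, a, a, b, ...
```

Real implementations (e.g., nginx's smooth weighted round robin) interleave servers more evenly, but the proportional share is the same idea.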
Least Connections
Route each new request to the server with the fewest active connections.
Pros: Adapts to varying request durations. If one server is processing many long-running requests, it gets fewer new ones.
Cons: Requires the load balancer to track active connections — more state, slightly more overhead.
When to use: Workloads with variable request duration (long-polling, streaming, WebSockets).
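The bookkeeping is simple: count active connections per backend and pick the minimum. A sketch (server names are placeholders):

```python
class LeastConnections:
    """Pick the backend with the fewest active connections (illustrative sketch)."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def acquire(self):
        # min() breaks ties by iteration order of the dict
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

lb = LeastConnections(["a", "b"])
s1 = lb.acquire()   # "a"
s2 = lb.acquire()   # "b"
lb.release(s1)      # "a" finishes its long-running request
s3 = lb.acquire()   # "a" again — it now has fewer active connections
```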
IP Hash (Sticky by IP)
Hash the client’s IP address to consistently route them to the same server.
Pros: Natural sticky sessions — the same user always hits the same server (useful if in-memory session state lives on the server).
Cons: Uneven distribution if many users share an IP (corporate NAT, CDN). Doesn’t adapt when a server is overloaded. Poor choice for applications behind a proxy.
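The core of IP hash is a stable hash of the client address modulo the pool size. A sketch (IPs are placeholders; `hashlib` rather than Python's `hash()` keeps the mapping stable across processes):

```python
import hashlib

def pick_server(client_ip, servers):
    """Deterministically map a client IP to one backend."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
# The same client always maps to the same backend:
pick_server("203.0.113.7", servers) == pick_server("203.0.113.7", servers)  # True
```

Note the modulo: if the pool size changes, most clients remap to different servers. Consistent hashing (covered later) avoids that mass reshuffle.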
Least Response Time
Route to the server with the lowest combination of average response time and active connections. The most adaptive option; offered by more advanced load balancers (HAProxy, F5).
Random
Pick a random server. Surprisingly effective at scale — the law of large numbers produces even distribution. Used by Netflix’s Ribbon client-side load balancer for inter-service calls.
Health Checks
A load balancer must know when a backend is unhealthy and stop sending traffic to it. Two types:
Passive health checks: The load balancer watches real traffic. If a backend returns 5xx errors or times out repeatedly, it’s marked unhealthy. Low overhead but slow to detect failures.
Active health checks: The load balancer periodically sends synthetic requests to a health endpoint.
```nginx
# nginx active health check
upstream api_servers {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    # requires nginx Plus (commercial) or the upstream_check module
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx;
}
```
Health endpoint design: GET /health should check that the service can actually handle requests — DB connection alive, dependencies reachable — not just that the HTTP server is up.
```python
# Good health endpoint (Flask; db and cache are assumed app-level clients)
from flask import Flask

app = Flask(__name__)

@app.route("/health")
def health():
    try:
        db.execute("SELECT 1")   # verify DB connection
        cache.ping()             # verify cache connection
        return {"status": "ok"}, 200
    except Exception as e:
        return {"status": "error", "detail": str(e)}, 503
```
Sticky Sessions
Some applications store session state on the server (in-memory). The load balancer must route a user’s requests to the same server every time.
Cookie-based stickiness: The load balancer sets a cookie identifying the backend server. On subsequent requests, it reads the cookie and routes accordingly. This is the L7 approach — AWS ALB calls this “sticky sessions.”
Problems with sticky sessions:
- Uneven load — one server may accumulate many heavy users while others are idle.
- Server failure breaks all sticky sessions on that server.
Better approach: Don’t store session state on the server. Use a shared session store (Redis) so any backend can handle any request. Then sticky sessions are unnecessary.
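A sketch of the shared-store pattern — a plain dict stands in for Redis here so the example is self-contained; in production you would use `redis.Redis` with the same get/set calls:

```python
import json
import uuid

store = {}  # stand-in for a shared Redis instance reachable by all backends

def create_session(user_id):
    """Any backend can create a session; the ID travels in a cookie."""
    session_id = str(uuid.uuid4())
    store[session_id] = json.dumps({"user_id": user_id})
    return session_id

def load_session(session_id):
    """Any backend can load it — no stickiness required."""
    raw = store.get(session_id)
    return json.loads(raw) if raw else None

sid = create_session(user_id=42)
load_session(sid)  # {"user_id": 42}, regardless of which server handles the request
```

With Redis you would also set a TTL on each session key so abandoned sessions expire.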
The Load Balancer as a Single Point of Failure
Interviewers will ask: “What if the load balancer itself goes down?” The answer: run two load balancers in active-passive or active-active mode.
- Active-passive: One load balancer handles traffic; the other is on standby. keepalived (implementing the VRRP protocol) detects the failure and promotes the standby; the virtual IP floats to the new active node.
- Active-active: Both load balancers handle traffic. DNS round-robin points to both IPs. More complex but no idle capacity.
- Cloud managed: AWS ELB, GCP Cloud Load Balancing, and Cloudflare are managed services with redundancy built in — no SPOF to worry about.
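For the active-passive setup, the failover is typically wired up with keepalived. A minimal configuration sketch — interface name, router ID, priority, and virtual IP are all placeholders:

```
# keepalived VRRP sketch (values are placeholders)
vrrp_instance VI_1 {
    state MASTER            # the passive node uses state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100            # the backup node gets a lower priority
    advert_int 1
    virtual_ipaddress {
        10.0.0.100          # the floating virtual IP clients connect to
    }
}
```

Clients connect only to the virtual IP; when the master stops sending VRRP advertisements, the backup claims the IP and traffic continues.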
DNS Load Balancing
Return multiple A records for a domain. The client picks one (usually the first). Simple but crude — DNS TTLs are long, so failover is slow and you can’t control which record clients use.
Used for geographic routing (Route 53 latency-based routing, Cloudflare Anycast) and as an outer load balancing layer that routes to regional clusters, each of which has its own L7 load balancer.
Summary
Load balancers distribute traffic for throughput and availability. L4 balancers route by IP/port with minimal overhead; L7 balancers route by HTTP content and terminate TLS. Round-robin works for homogeneous workloads; least-connections adapts to variable request duration. Health checks detect failed backends quickly — design a real health endpoint, not just a ping. Eliminate sticky sessions by moving session state to Redis. Avoid making the load balancer itself a single point of failure. In a cloud environment, use managed load balancers (ELB, ALB, GCP LB) and get redundancy for free.
Related System Design Topics
Load balancing sits at the front of most distributed architectures:
- Consistent Hashing — some load balancers use consistent hashing on a key (user ID, session ID) to route requests to the same backend, achieving sticky sessions without cookies.
- Caching Strategies — reverse proxy caches (nginx, Varnish) often sit at the same layer as load balancers and serve cached responses before requests ever reach application servers.
- Message Queues — when load spikes overwhelm application servers, queuing requests is an alternative to throwing more servers behind a load balancer.
- Database Sharding — a shard router performs the same conceptual role as a load balancer, but at the database tier instead of the application tier.
Also see: API Design (REST vs GraphQL vs gRPC) and SQL vs NoSQL — the remaining two system design foundations.
See also: Design a Ride-sharing App — sticky WebSocket routing for real-time driver tracking, and Design a Notification System — load-balancing push/email/SMS workers.
See also: Design an LLM Inference API — prefix-aware load balancing routes requests with shared system prompts to the same GPU pod for KV cache reuse.