Why Distributed Systems Need Explicit Patterns
In a monolith, a function call either succeeds or throws an exception — the contract is clear. In a distributed system, calls cross network boundaries where partial failure is the norm: the network can drop packets, the remote service can be slow, and you can’t tell the difference between "the request never arrived" and "the response got lost." These patterns are the vocabulary for handling that uncertainty reliably and safely at scale.
Retry with Exponential Backoff and Jitter
Transient failures (brief network blip, temporary overload) are expected and retriable. Naive immediate retry floods an already-struggling service. Exponential backoff doubles the delay between each attempt: 100ms → 200ms → 400ms → 800ms → cap at 30s. This gives the downstream service time to recover.
Jitter is critical: without it, many clients that all started retrying at the same time will all retry again simultaneously at each backoff interval — the thundering herd problem. Add a random component: delay = random(0, min(cap, base * 2^attempt)). This spreads retry load over time, preventing the synchronized wave from re-overwhelming the recovering service.
// Full jitter (AWS recommended)
sleep = random_between(0, min(cap, base * 2 ** attempt))
Only retry on transient errors (5xx, timeouts, connection refused). Don't retry on client errors (4xx): those indicate a problem with the request itself that retrying won't fix. The one exception is 429 Too Many Requests, which is explicitly an invitation to retry after backing off.
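The full-jitter formula and the retry loop around it can be sketched in Python. This is a minimal illustration, not a library implementation; the constants and the `is_transient` predicate are assumptions you would tune per service:

```python
import random
import time

BASE = 0.1   # 100 ms initial backoff (illustrative)
CAP = 30.0   # 30 s ceiling (illustrative)

def full_jitter_delay(attempt, base=BASE, cap=CAP):
    """AWS 'full jitter': uniform over [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def retry(call, is_transient, max_attempts=5):
    """Retry `call` on transient errors only, sleeping with full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if not is_transient(exc) or attempt == max_attempts - 1:
                raise  # client error, or out of attempts: fail fast
            time.sleep(full_jitter_delay(attempt))
```

Note that the jitter range starts at zero, not at the previous delay: spreading over the whole interval is what desynchronizes the herd.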
Idempotency Pattern
If a caller retries after a timeout, the server may have already processed the first request. Without idempotency, retrying a payment or an email send causes double charges or duplicate messages.
The pattern: the caller generates a unique idempotency key (UUID) per logical operation and sends it in the request header. The server stores a mapping of idempotency_key → result in a durable store (database or Redis with long TTL). On receiving a request, it checks for an existing result:
- If found: return the cached result immediately, no processing.
- If not found: process the request, store the result keyed by idempotency key, return the result.
The key must be stored atomically with the result (same DB transaction) to prevent the race condition where two concurrent requests with the same key both see "not found" and both process. Stripe, Braintree, and most payment APIs require and document this pattern explicitly.
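The check-then-process flow can be sketched in memory. This is a toy stand-in: the dict plays the role of the durable store, and the lock stands in for the database transaction that makes the check and the store atomic; in production you would rely on a unique constraint on the key column instead:

```python
import threading
import uuid

class IdempotentHandler:
    """In-memory sketch of idempotency-key handling (names illustrative)."""

    def __init__(self, process):
        self._process = process        # the real business logic
        self._results = {}             # idempotency_key -> cached result
        self._lock = threading.Lock()  # stands in for the DB transaction

    def handle(self, idempotency_key, request):
        with self._lock:
            if idempotency_key in self._results:
                return self._results[idempotency_key]  # found: no re-processing
            result = self._process(request)
            self._results[idempotency_key] = result   # stored atomically with processing
            return result

# Caller side: one fresh key per logical operation, reused across retries.
key = str(uuid.uuid4())
```

Holding the lock across `process` is a simplification; a real server would insert the key first (claiming it) and fill in the result when processing completes.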
Bulkhead Pattern
Named after ship compartments that prevent flooding from spreading: isolate failure domains so a problem in one area doesn’t take down the whole system. The practical application is separate thread pools or connection pools per downstream service.
Without bulkheads: if Service A calls Services B, C, and D using a shared thread pool of 100 threads, and Service D becomes slow (threads pile up waiting), all 100 threads are eventually tied up waiting for D — calls to B and C also start failing even though B and C are healthy.
With bulkheads: allocate 30 threads for calls to B, 30 for C, 40 for D. Service D’s slowness can only exhaust its own 40 threads. B and C remain unaffected. Netflix Hystrix popularized this pattern. In Kubernetes, resource limits per pod serve a similar bulkhead function at the infrastructure level.
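A per-dependency semaphore is the simplest way to sketch this: it caps concurrent calls to each downstream exactly like a dedicated thread pool would. The class and pool sizes below are illustrative, not a specific library's API:

```python
import threading

class Bulkhead:
    """Caps concurrent calls to one downstream; rejects rather than queues."""

    def __init__(self, max_concurrent):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args):
        if not self._sem.acquire(blocking=False):
            # Pool exhausted: fail immediately instead of tying up a caller thread.
            raise RuntimeError("bulkhead full")
        try:
            return fn(*args)
        finally:
            self._sem.release()

# One bulkhead per downstream, mirroring the 30/30/40 split above.
bulkheads = {"B": Bulkhead(30), "C": Bulkhead(30), "D": Bulkhead(40)}
```

Rejecting immediately when the compartment is full is the point: a slow Service D exhausts only its own bulkhead, and callers find out fast.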
Timeout Pattern
Every remote call must have an explicit timeout. Without one, threads accumulate waiting for a response that will never come and eventually exhaust the thread pool: the same pile-up described under bulkheads, and it can happen even with bulkheads in place, just contained to one pool.
The design principle: a short timeout plus retry is almost always better than a long timeout. A 500ms timeout with 3 retries (total worst case ~1.5s plus backoff) gives better P99 latency than a single 10s timeout, and fails fast enough that the caller can try an alternative or degrade gracefully. Set timeouts at the connection level (TCP connect), the read level (waiting for response bytes), and the overall request level (end-to-end deadline).
Deadline propagation: in a call chain (A → B → C → D), each hop should respect the remaining time budget from the upstream deadline. gRPC deadline propagation does this automatically. Without it, B may retry C and D for 30 seconds even though A will time out in 2 seconds — wasted work that still results in failure.
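Explicit deadline propagation (what gRPC does automatically) can be sketched as passing an absolute deadline down the chain and giving each hop the smaller of its default timeout and the remaining budget. The 0.5 s default echoes the 500 ms example above; all names here are illustrative:

```python
import time

def remaining(deadline):
    """Time budget left before an absolute monotonic deadline."""
    return deadline - time.monotonic()

def call_downstream(fn, deadline, default_timeout=0.5):
    """Call `fn` with a timeout clipped to the propagated deadline."""
    budget = remaining(deadline)
    if budget <= 0:
        # Upstream has already given up: don't do wasted work.
        raise TimeoutError("deadline expired before call")
    return fn(timeout=min(default_timeout, budget))
```

The early `TimeoutError` is the key saving: B never retries C and D on behalf of an A that has already timed out.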
Circuit Breaker Pattern
Retrying against a completely down service wastes resources and delays the caller’s failure response. The circuit breaker pattern wraps calls with a state machine:
CLOSED → OPEN (failure rate exceeds threshold over a window)
OPEN → HALF-OPEN (after a cooldown period, try one probe request)
HALF-OPEN → CLOSED (probe succeeded — resume normal traffic)
HALF-OPEN → OPEN (probe failed — stay open, reset cooldown)
While OPEN, calls fail immediately without hitting the downstream service — fast failure for the caller, zero load on the struggling service. This is complementary to retry: retry handles transient blips; circuit breaker handles sustained outages. Resilience4j (Java) and Polly (.NET) are popular implementations.
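The state machine above can be sketched in a few dozen lines. For brevity this version trips on a consecutive-failure count rather than a failure rate over a window, and the thresholds are illustrative:

```python
import time

class CircuitBreaker:
    """Minimal CLOSED / OPEN / HALF_OPEN breaker (count-based, illustrative)."""

    def __init__(self, failure_threshold=5, cooldown=30.0):
        self.state = "CLOSED"
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF_OPEN"      # cooldown elapsed: allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"       # probe failed, or threshold crossed
                self.opened_at = time.monotonic()
            raise
        self.state = "CLOSED"             # success (including probe): resume
        self.failures = 0
        return result
```

While OPEN, `call` raises without invoking `fn` at all, which is exactly the "zero load on the struggling service" property.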
Outbox Pattern
A common distributed systems problem: you need to update a database record AND publish an event to a message queue atomically. You cannot use a 2-phase commit across DB and message broker in practice (slow, fragile). If you write to DB first and then publish, a crash between the two steps loses the event. If you publish first and then write, a crash leaves a ghost event with no matching DB state.
The outbox pattern solves this: write the event to an outbox table in the same database transaction as the business logic change. A separate relay process reads from the outbox table and publishes to the message broker, then marks the row as published.
Transaction:
UPDATE orders SET status = 'confirmed' WHERE order_id = ?;
INSERT INTO outbox (event_type, payload, created_at)
VALUES ('order.confirmed', '{"order_id": 123}', NOW());
Relay process (separate):
SELECT * FROM outbox WHERE published = false ORDER BY created_at LIMIT 100;
-- publish each to Kafka/RabbitMQ
UPDATE outbox SET published = true WHERE id IN (...);
This guarantees the event is never lost: the outbox row and the business change commit atomically. Delivery to the broker, however, is at-least-once, not exactly-once: the relay may publish duplicates if it crashes after publishing but before marking the row as published, so consumers must be idempotent. Debezium (change data capture) can serve as the relay by reading the DB write-ahead log directly.
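The relay loop can be sketched against a sqlite stand-in for the application database; `publish` is a placeholder for the real Kafka/RabbitMQ producer call, and the schema mirrors the SQL above:

```python
import sqlite3

def relay_once(db, publish, batch=100):
    """One pass of the outbox relay: publish unsent rows, then mark them."""
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY created_at LIMIT ?", (batch,)).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, payload)   # a crash after this line but...
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        db.commit()                    # ...before this commit => duplicate publish
```

The comment marks the crash window that makes delivery at-least-once; a production relay running as multiple instances would also need row locking (e.g. `SELECT ... FOR UPDATE SKIP LOCKED`) to avoid double-reading.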
Saga Pattern
Distributed transactions across microservices can’t use ACID transactions. The Saga pattern replaces a distributed transaction with a sequence of local transactions, each publishing an event that triggers the next step. If a step fails, previously completed steps are undone by executing compensating transactions.
Two coordination styles: Choreography (each service reacts to events and emits its own — decentralized, harder to track) and Orchestration (a central saga orchestrator sends commands to each service and handles failures — more explicit, easier to reason about). For complex flows with many failure modes, orchestration is preferred. The orchestrator is typically implemented as a state machine persisted in a database.
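The core of an orchestrated saga, stripped of persistence and messaging, is just "run steps in order; on failure, compensate completed steps in reverse." A minimal sketch, with each step a pair of callables standing in for local transactions:

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables.

    Runs each action in order; if one fails, runs the compensations of
    all previously completed steps in reverse order, then re-raises.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for _, comp in reversed(completed):
                comp()   # undo in reverse: last completed step first
            raise
        completed.append((action, compensation))
```

A real orchestrator would persist `completed` to a database after every step so a crashed orchestrator can resume or compensate on restart; that persistence is exactly the "state machine in a database" mentioned above.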
CQRS: Command Query Responsibility Segregation
CQRS separates the write model (commands that change state) from the read model (queries that return data). This allows each to be optimized independently:
- Write side: normalized relational schema, optimized for write throughput and consistency.
- Read side: denormalized, pre-joined views (read replicas, ElasticSearch indexes, materialized views) optimized for specific query patterns.
Updates propagate from write to read model asynchronously (via events), introducing eventual consistency. CQRS is powerful but adds complexity — justify it only when read and write workloads have fundamentally different scaling or structure requirements. Often combined with event sourcing, where the write model stores a log of events rather than current state.
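A toy sketch of the split: commands append events on the write side, and a projector applies them to a denormalized read view. All names are illustrative, and the `apply` call here is synchronous where a real system would deliver events asynchronously (hence the eventual consistency):

```python
class OrderWriteModel:
    """Write side: commands produce events (here, just an append-only list)."""

    def __init__(self):
        self.events = []

    def confirm_order(self, order_id):
        # Command: validate, change state, emit an event.
        self.events.append(("order.confirmed", order_id))

class OrderReadModel:
    """Read side: a denormalized view built by applying events."""

    def __init__(self):
        self.confirmed = set()

    def apply(self, event):
        kind, order_id = event
        if kind == "order.confirmed":
            self.confirmed.add(order_id)
```

Between `confirm_order` returning and `apply` running, a query against the read model returns stale data; that window is the eventual consistency the pattern accepts.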
Sidecar and Ambassador Patterns
Both patterns attach auxiliary functionality to an application via a separate co-located process (typically a container in a Kubernetes pod).
Sidecar: the auxiliary container provides supplementary capabilities — log shipping, metrics collection, service mesh proxy (Envoy/Linkerd), certificate rotation. The app remains unaware of these concerns. The sidecar shares the network namespace with the app container, so it can intercept traffic transparently.
Ambassador: a specific sidecar that acts as a local proxy for outbound calls. The app calls localhost:8080/service-b/endpoint and the ambassador handles retry logic, circuit breaking, authentication, TLS, and service discovery. The app code stays simple — it only knows "call localhost." This pattern is used by service mesh data planes (Envoy as ambassador) and simplifies polyglot environments where implementing retry/circuit breaker in every language is impractical.
Strangler Fig Pattern
Migrating a large legacy system to a new architecture via a big-bang rewrite is extremely risky — the new system may not be ready, behavior may diverge, and rollback is difficult. The Strangler Fig pattern (named after a vine that gradually envelops a tree) provides an incremental migration path:
- Put a routing proxy in front of the legacy system.
- Implement a new feature or migrated module in the new system.
- Route traffic for that specific feature/module to the new system; all other traffic still goes to legacy.
- Repeat until all traffic is routed to the new system, then decommission legacy.
The legacy system is "strangled" gradually. At any point during migration, the system is fully functional: if the new module has issues, traffic for it can be re-routed back to legacy. Martin Fowler coined the name, and the approach has been central to many large-scale legacy modernizations, including incremental monolith-to-microservices migrations at companies like Amazon and Netflix.
Key Design Decisions Summary
- Always retry with jitter — prevents thundering herd on recovery.
- Idempotency keys on all mutating operations — safe to retry without side effects.
- Bulkheads per downstream dependency — contain blast radius of failures.
- Short timeouts + retry over long timeouts — faster failure detection.
- Circuit breaker complements retry — don’t hammer a down service.
- Outbox pattern for DB + event atomicity — no 2PC, no lost events.
- Saga for distributed workflows — explicit compensating transactions over implicit locking.
- Strangler fig for legacy migration — incremental, low-risk modernization.