System Design: Circuit Breaker, Retry, and Bulkhead — Resilience Patterns for Microservices

Why Resilience Patterns?

In a microservices architecture, service A calls service B, which calls service C. If C is slow or failing, B’s threads block waiting for C, B’s thread pool fills up, and A’s calls to B start failing too. The cascade continues up the chain: one failing downstream service takes down the entire system. Resilience patterns prevent this cascade. The three key patterns are Circuit Breaker (stop calling a failing service), Retry with backoff (handle transient failures), and Bulkhead (isolate failures to a dedicated resource pool).

Circuit Breaker

A circuit breaker wraps calls to an external service. Three states: CLOSED (normal operation — calls pass through), OPEN (failing — calls are rejected immediately without attempting the downstream call), HALF-OPEN (testing recovery — a limited number of probe requests are allowed through). State transitions: CLOSED → OPEN: when the failure rate exceeds a threshold (e.g., 50% failures in the last 60 seconds, with a minimum of 20 requests). OPEN → HALF-OPEN: after a timeout (e.g., 30 seconds). HALF-OPEN → CLOSED: if probe requests succeed. HALF-OPEN → OPEN: if probe requests fail again.

import time


class CircuitOpenError(Exception):
    """Raised when the circuit is OPEN and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, timeout=30, min_requests=20):
        self.state = "CLOSED"
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.min_requests = min_requests

    def call(self, fn, *args):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "HALF_OPEN"  # timeout elapsed: let a probe request through
            else:
                raise CircuitOpenError("Circuit is open")
        try:
            result = fn(*args)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        if self.state == "HALF_OPEN":
            # Probe succeeded: the downstream service has recovered.
            self.state = "CLOSED"
            self.failure_count = 0
            self.success_count = 0
        self.success_count += 1

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == "HALF_OPEN":
            # Probe failed: trip again and restart the recovery timeout.
            self.state = "OPEN"
            return
        total = self.failure_count + self.success_count
        if total >= self.min_requests and self.failure_count / total >= self.failure_threshold:
            self.state = "OPEN"

Retry with Exponential Backoff and Jitter

Retry handles transient failures (a momentary network blip, a brief service restart). Naive retry: immediately retry on failure. Problem: if the service is overloaded, immediate retries add more load — a thundering herd effect. Exponential backoff: wait 1s, then 2s, then 4s, then 8s (base * 2^attempt). This reduces load on the recovering service. Jitter: add random variance to the backoff — actual_wait = backoff * (0.5 + random() * 0.5). Without jitter, all clients back off by the same amount and retry simultaneously (a synchronized thundering herd); jitter spreads retries out. Retry only on retriable errors (5xx, timeouts). Never retry non-idempotent operations (POST payment) without idempotency keys. Max retries: 3-5. After max retries: fail fast and return an error.
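As a sketch, the backoff loop above might look like this in Python, using the "full jitter" variant (the retriable exception types and default parameters are illustrative choices, not fixed by any library):

```python
import random
import time


def retry_with_backoff(fn, max_retries=4, base=1.0, cap=30.0):
    """Call fn, retrying transient errors with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            # Only retriable errors reach here; anything else propagates immediately.
            if attempt == max_retries:
                raise  # out of retries: fail fast
            # Full jitter: sleep a random duration in [0, min(cap, base * 2^attempt)].
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Note that only the listed transient exceptions are retried; a 4xx-style client error should be allowed to propagate on the first attempt, since retrying it cannot succeed.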

Bulkhead Pattern

A bulkhead isolates failures to a limited resource pool, preventing a failing call from consuming all shared resources. Thread pool bulkhead: assign a dedicated thread pool for each downstream service. Service A’s calls to Service B use Pool B (10 threads). Calls to Service C use Pool C (5 threads). If Service B hangs and all 10 threads are blocked, Service C calls still work (using Pool C). Without bulkheads, a single hung downstream service consumes all threads in the shared pool, blocking all other calls. Semaphore bulkhead: limit concurrent calls to a downstream service to N. If N are already in flight, reject new calls with a fallback response. Hystrix (Netflix) popularized bulkheads; modern alternatives: Resilience4j (Java), Polly (.NET), Envoy proxy (service mesh level).
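A minimal semaphore-bulkhead sketch, assuming a thread-based caller; the class name and the reject-with-fallback behavior are illustrative:

```python
import threading


class SemaphoreBulkhead:
    """Cap concurrent in-flight calls to one downstream service at max_concurrent."""

    def __init__(self, max_concurrent):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args, fallback=None):
        # Non-blocking acquire: if the pool is saturated, reject immediately with
        # the fallback instead of queuing and blocking the caller's thread.
        if not self._sem.acquire(blocking=False):
            return fallback
        try:
            return fn(*args)
        finally:
            self._sem.release()
```

Each downstream dependency would get its own bulkhead instance, so a hung dependency can only exhaust its own permit count.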

Timeout

Every external call must have a timeout. Without a timeout: a hung downstream call blocks a thread indefinitely. Timeout values: P99 response time of the downstream service * 1.5. Too short: too many false timeouts. Too long: resources blocked for too long on real failures. Cascading timeouts: if A calls B which calls C, set A→B timeout > B→C timeout. This ensures B can respond to A with an error before A’s timer fires. Async timeouts: for async operations (message queue consumers), use a deadline propagated via message headers. If the message is processed after the deadline, discard the result.
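One way to enforce a client-side timeout in Python is to run the call in a worker thread and bound the wait on its result. This is a sketch of that idea, not a full deadline-propagation implementation; note that a timed-out call's thread keeps running until the underlying operation finishes, which is exactly why timeouts pair with circuit breakers and bulkheads:

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout, *args):
    """Run fn in a worker thread and give up after `timeout` seconds."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; an already-running thread is not interrupted
        raise TimeoutError(f"call exceeded {timeout}s")
```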

Fallback Strategies

When a circuit is open or a call times out, a fallback provides a degraded but functional response: return cached data (last successful response for this request), return a default value (empty list, zero count), serve a static response (pre-rendered HTML fallback page), or queue the request for later processing. Fallback selection depends on the use case: a product recommendation fallback might return bestsellers instead of personalized recommendations. A payment service fallback should never be a silent no-op — fail loudly. Design for partial functionality: if the review service is down, show the product page without reviews rather than showing an error page for the entire site.
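A sketch of the cached-data fallback described above, using an in-memory cache; the class name, TTL, and key scheme are illustrative assumptions:

```python
import time


class CachingFallback:
    """Serve the last successful response when the primary call fails."""

    def __init__(self, ttl=300):
        self._cache = {}  # key -> (value, stored_at)
        self._ttl = ttl

    def call(self, key, fn, default=None):
        try:
            value = fn()
            self._cache[key] = (value, time.time())  # refresh cache on success
            return value
        except Exception:
            # Primary failed: serve the cached value if still fresh,
            # otherwise fall back to the caller-supplied default.
            cached = self._cache.get(key)
            if cached and time.time() - cached[1] < self._ttl:
                return cached[0]
            return default
```

In production the failure would also be logged and the fallback rate tracked as a metric, rather than swallowed silently as the bare `except` here suggests.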

FAQ

Q: What are the three states of a circuit breaker, and when does it transition between them?

A: CLOSED (normal): all requests pass through to the downstream service and failures are counted. When the failure rate exceeds the threshold (e.g., 50% failures over the last 60 seconds with a minimum of 20 requests), the breaker transitions to OPEN. OPEN (tripped): all requests are rejected immediately without calling the downstream service, and a fallback is returned. After a configured timeout (e.g., 30 seconds), the breaker transitions to HALF-OPEN. HALF-OPEN (probe): a limited number of test requests are allowed through. If they succeed, the breaker transitions back to CLOSED (service recovered); if they fail, it transitions back to OPEN (still failing) and the timeout resets. The minimum request threshold prevents the circuit from opening on the first few failures when traffic has just started.

Q: Why do you need jitter in exponential backoff?

A: Without jitter, when a service fails all clients back off for the same duration (1s, then 2s, then 4s). When the backoff timer expires, all clients retry simultaneously — a synchronized thundering herd that may overwhelm the recovering service just as it comes back up. With jitter (randomized backoff), each client waits a slightly different duration — actual_wait = base_backoff * random(0.5, 1.5). Retries are spread out over time, reducing the peak load on the recovering service. Full jitter: sleep = random(0, min(cap, base * 2^attempt)). Equal jitter: sleep = min(cap, base * 2^attempt) / 2 + random(0, min(cap, base * 2^attempt) / 2). AWS recommends full jitter for most use cases. The goal is to spread client retries across the recovery window rather than concentrating them at fixed intervals.

Q: How does the bulkhead pattern prevent cascading failures?

A: Without bulkheads, all microservice calls share a single thread pool. If service B becomes slow, threads in the shared pool block waiting for B, the pool fills up, and service C calls also block (no threads available) even though C is healthy — the failure cascades. With thread pool bulkheads, service B calls use a dedicated pool of 10 threads and service C calls use a separate pool of 5 threads. When B is slow and all 10 B-threads are blocked, C calls still have their 5 threads and continue normally; the failure is contained to the B pool. Each downstream dependency gets its own isolated resource pool. Size each pool based on the expected concurrency for that dependency (target calls per second * response time). Semaphore bulkheads are lighter-weight — they limit concurrent in-flight calls without separate thread pools (suitable when the calls are non-blocking/async).

Q: What is the difference between a timeout and a circuit breaker?

A: Timeouts are defensive: they prevent a single call from blocking indefinitely. A timeout fires after a fixed duration and releases the thread. But timeouts alone don't prevent the next call from also blocking for that duration: if 100 requests/second are hitting a hung downstream service, 100 threads are blocked for the timeout duration and the pool exhausts. Circuit breakers are offensive: once the failure rate is high, they stop making calls at all. The circuit opens, and subsequent calls fail immediately (sub-millisecond) without consuming a thread or waiting. Timeouts protect individual calls; circuit breakers protect the system from repeatedly hitting a failing dependency. Use both together: timeouts feed the circuit breaker's failure metrics (a timeout counts as a failure for circuit breaker purposes).

Q: How do you implement a fallback strategy when a circuit is open?

A: The fallback is the response returned when the circuit is open or a call fails after retries. The right fallback depends on the downstream service: a cached response (the last successful response for this request, stored in Redis or an in-memory cache with a TTL), a default value (empty recommendations list, zero inventory count), a degraded response (a simplified version of the data from an alternative source), or fail fast (re-throw a specific error that the caller handles gracefully — show a friendly error page rather than a spinner). Design for partial functionality: the product page can render without reviews (show "reviews temporarily unavailable"), but the checkout page cannot render without inventory data — fail clearly. Never silently swallow errors in a fallback — log the circuit-open event and track the fallback rate as a metric.
