Low Level Design: Circuit Breaker Pattern

State Machine

The circuit breaker is a three-state machine wrapped around each outbound call to a downstream dependency. In the CLOSED state the circuit operates normally: every call passes through, outcomes are recorded, and the failure detector monitors the error rate. When the failure rate crosses the configured threshold, the circuit transitions to OPEN. In the OPEN state every call fails immediately without invoking the downstream service, returning a fallback response; this fail-fast behavior prevents the caller’s threads from blocking on a service that is known to be degraded and stops amplifying load on an already-struggling dependency. After a configurable cooldown period (e.g., 30 seconds), the circuit transitions to HALF-OPEN and allows a limited probe of traffic through. If the probe succeeds, the circuit returns to CLOSED and normal operation resumes. If the probe fails, the circuit returns to OPEN and the cooldown timer resets. This state machine is per-dependency: a service that calls three downstream systems has three independent circuit breakers, so a failure in one dependency does not affect the others.
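The state machine above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration (class and method names are ours, not from any particular library); a production implementation would add locking, metrics, and a richer failure detector.

```python
# Minimal sketch of the three-state circuit breaker described above.
import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.state = CLOSED
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == OPEN:
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = HALF_OPEN   # cooldown expired: allow a probe
            else:
                return fallback()        # fail fast while OPEN
        try:
            result = fn()
        except Exception:
            self._on_failure()
            return fallback()
        self._on_success()
        return result

    def _on_success(self):
        if self.state == HALF_OPEN:
            self.state = CLOSED          # probe succeeded: resume normal flow
        self.failure_count = 0

    def _on_failure(self):
        if self.state == HALF_OPEN:
            self._trip()                 # probe failed: reopen, reset timer
            return
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = OPEN
        self.opened_at = time.monotonic()
```

Because the breaker is per-dependency, a service calling three downstream systems would hold three independent `CircuitBreaker` instances.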

Failure Detection

Accurate failure detection requires distinguishing genuine downstream degradation from transient noise. Count-based windows trip the circuit when N failures occur within the last M requests (e.g., 5 failures in 20 requests = 25% error rate); this is simple but sensitive to traffic volume — at low traffic, a single failure is a high percentage. Time-based sliding windows accumulate outcomes over the last N seconds (e.g., last 60 seconds) and compute error rate over that window, smoothing over traffic spikes. The failure counter tracks distinct failure types separately: exception count (connection refused, connection reset), timeout count (calls that exceeded the configured timeout), and slow call count (calls that returned a response but exceeded a latency threshold, indicating partial degradation). Each counter has its own configurable threshold; any single threshold breach can trip the circuit, or a combined formula can be used. Minimum call volume requirements prevent tripping during low-traffic periods when a single error would produce a misleading 100% error rate.
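A time-based sliding window with a minimum-call-volume guard can be sketched as follows (the class name, outcome kinds, and parameter defaults are illustrative assumptions; separate thresholds per failure type are omitted for brevity):

```python
# Sketch of a time-based sliding-window failure detector with a
# minimum call volume requirement, as described above.
import time
from collections import deque

class SlidingWindowDetector:
    def __init__(self, window_seconds=60.0, error_rate_threshold=0.5,
                 min_calls=10):
        self.window_seconds = window_seconds
        self.error_rate_threshold = error_rate_threshold
        self.min_calls = min_calls
        # (timestamp, kind) where kind is "ok", "exception", "timeout", or "slow"
        self.outcomes = deque()

    def record(self, kind, now=None):
        now = time.monotonic() if now is None else now
        self.outcomes.append((now, kind))
        self._evict(now)

    def _evict(self, now):
        # Drop outcomes that have aged out of the window.
        while self.outcomes and now - self.outcomes[0][0] > self.window_seconds:
            self.outcomes.popleft()

    def should_trip(self, now=None):
        now = time.monotonic() if now is None else now
        self._evict(now)
        total = len(self.outcomes)
        if total < self.min_calls:
            return False   # low traffic: a single error would be misleading
        bad = sum(1 for _, kind in self.outcomes if kind != "ok")
        return bad / total >= self.error_rate_threshold
```

A real detector would additionally keep per-type counters (exception, timeout, slow call) with their own thresholds.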

Timeout Handling

Every call wrapped by the circuit breaker is subject to a timeout enforced at the circuit breaker layer, independent of any timeout configured in the HTTP client or RPC framework. The circuit breaker timeout should be shorter than the upstream caller’s SLA budget for this dependency, ensuring the circuit detects degradation before cascading timeouts propagate up the call chain. A downstream service that begins responding slowly (e.g., 2-second p99 instead of 50 ms) will accumulate timeout failures in the circuit breaker’s sliding window and trip the circuit before the caller begins missing its own SLA. Timeout budget tracking across a distributed request chain uses deadline propagation: each downstream call receives a deadline derived from the remaining budget in the incoming request, preventing a slow leaf service from consuming the entire request timeout. Calls that exceed the timeout are counted as failures regardless of whether the downstream eventually responds; the circuit breaker cancels the call and records the timeout.
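The deadline-propagation idea can be illustrated with a small helper (function names and the `do_call(timeout=...)` shape are assumptions for the sketch): each hop caps its downstream timeout at the smaller of its configured per-call timeout and the remaining request-wide budget.

```python
# Sketch of deadline propagation: each downstream call is capped at
# min(configured per-call timeout, remaining request budget).
import time

def remaining_budget(deadline):
    """Seconds left before the request-wide deadline; <= 0 means expired."""
    return deadline - time.monotonic()

def call_downstream(deadline, per_call_timeout, do_call):
    budget = remaining_budget(deadline)
    if budget <= 0:
        # Fail fast rather than issuing a call that cannot complete in time.
        raise TimeoutError("request deadline already exceeded")
    timeout = min(per_call_timeout, budget)
    return do_call(timeout=timeout)
```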

Fallback Responses

When the circuit is OPEN, the circuit breaker executes a fallback function instead of calling the downstream service. The fallback must be fast, must not itself call the failing service, and must provide a response that allows the system to degrade gracefully rather than return an error to the user. Common fallback strategies: return the last cached successful response for the same request parameters (stale data is often acceptable for read-heavy paths); return a static default value configured at deploy time (e.g., an empty recommendations list, a default price, a cached homepage); return a structured error response that includes a retry-after hint so the client can back off intelligently. The fallback function is registered at circuit breaker initialization and can itself be versioned and hot-reloaded. Complex fallbacks may call a secondary "degraded mode" service (e.g., a simpler recommendation engine that does not rely on the failing personalization service). Fallback invocation count is tracked as a separate metric so engineers can observe how often the system is operating in degraded mode.
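A stale-if-error fallback, the first strategy listed above, might look like this (class and attribute names are illustrative; a real cache would have TTLs and bounded size):

```python
# Sketch of a stale-if-error fallback: serve the last cached successful
# response for the same request key, else a static default.
class CachedFallback:
    def __init__(self):
        self._cache = {}                # request key -> last good response
        self.fallback_invocations = 0   # tracked as a separate metric

    def on_success(self, key, response):
        # Called on every successful downstream response to refresh the cache.
        self._cache[key] = response

    def __call__(self, key, default=None):
        # Invoked by the breaker when the circuit is OPEN; must be fast
        # and must not call the failing service itself.
        self.fallback_invocations += 1
        return self._cache.get(key, default)
```

Tracking `fallback_invocations` separately lets engineers observe how often the system is running in degraded mode.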

Half-Open Probing

The HALF-OPEN state implements a controlled recovery probe to determine whether the downstream service has recovered without allowing a full flood of traffic to hit it simultaneously. After the cooldown duration expires, the circuit allows exactly one request through; all other concurrent requests during this probe period still receive the fallback response. If the probe request succeeds within the timeout, the circuit transitions to CLOSED, resets all failure counters, and resumes normal traffic flow. If the probe fails, the circuit transitions back to OPEN and the cooldown timer resets. This prevents the thundering herd problem: when a downstream service recovers, a circuit that transitioned straight to CLOSED would release the backlog of waiting callers as a burst of requests that could re-overwhelm the just-recovered service. Some implementations allow a configurable probe volume (e.g., 5 probe requests in HALF-OPEN) and require all probes to succeed before closing, providing more confidence in recovery before full traffic resumes.
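The configurable-probe-volume variant can be sketched as a small permit counter (names are illustrative; a thread-safe version would guard the counters with a lock):

```python
# Sketch of HALF-OPEN probing with a configurable probe volume;
# all probes must succeed before the circuit closes.
class HalfOpenProbe:
    def __init__(self, probe_volume=5):
        self.probe_volume = probe_volume
        self.permits = probe_volume   # probe requests still allowed through
        self.successes = 0

    def try_acquire(self):
        """True if this request may probe; excess concurrent callers
        get the fallback, preventing a thundering herd."""
        if self.permits > 0:
            self.permits -= 1
            return True
        return False

    def on_result(self, success):
        """Returns the next state ('CLOSED' or 'OPEN') or None while probing."""
        if not success:
            return "OPEN"             # any probe failure reopens immediately
        self.successes += 1
        if self.successes >= self.probe_volume:
            return "CLOSED"           # all probes succeeded: full traffic resumes
        return None
```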

Bulkhead Pattern

The bulkhead pattern complements the circuit breaker by isolating the resource pools used to call each downstream dependency, preventing a slow or failing dependency from exhausting shared resources and impacting calls to healthy dependencies. Thread pool bulkheads allocate a dedicated thread pool for each downstream service; if calls to service A are all blocking on slow responses, the thread pool for A fills up and new calls are rejected immediately, but the thread pools for services B and C are unaffected. Semaphore bulkheads limit the maximum concurrency of in-flight calls to a dependency without creating a dedicated thread pool, which is more efficient for non-blocking I/O frameworks. Pool size should be tuned to match the downstream service’s concurrency limit: a service that can handle 100 concurrent requests should be called through a pool of 100 or fewer threads or permits. When the pool or semaphore is saturated, the rejection triggers the fallback immediately rather than queuing, since queuing would only delay the failure. Bulkhead rejection count is a leading indicator of approaching circuit breaker trips.
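A semaphore bulkhead with reject-instead-of-queue semantics can be sketched as follows (class name and metric attribute are illustrative):

```python
# Sketch of a semaphore bulkhead: bounded concurrency per dependency,
# rejecting immediately (not queuing) when saturated.
import threading

class SemaphoreBulkhead:
    def __init__(self, max_concurrent=100):
        self._sem = threading.Semaphore(max_concurrent)
        self.rejections = 0   # leading indicator of approaching circuit trips

    def execute(self, fn, fallback):
        # Non-blocking acquire: a saturated bulkhead triggers the
        # fallback immediately rather than queuing the caller.
        if not self._sem.acquire(blocking=False):
            self.rejections += 1
            return fallback()
        try:
            return fn()
        finally:
            self._sem.release()
```

Each downstream dependency gets its own `SemaphoreBulkhead` instance, so saturation on one dependency never consumes permits belonging to another.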

Metrics and Observability

Circuit breaker health is exposed through a rich set of metrics that feed dashboards and alerting. Current state (CLOSED=0, HALF-OPEN=1, OPEN=2) is exported as a gauge per dependency; alerts fire when any critical dependency circuit opens. State transition events are emitted as structured log lines and as discrete metric increments, enabling timeline reconstruction during incident review. Failure rate and slow call rate histograms show the distribution of outcomes over time. Latency percentiles (p50, p95, p99) for calls in CLOSED state reveal degradation before the circuit trips. Fallback invocation rate indicates how often callers are receiving degraded responses even when the circuit is CLOSED (due to individual call failures below the trip threshold). A dependency health dashboard aggregates circuit state, error rate, and latency for all downstream calls in a service, providing a single-pane view for on-call engineers. Circuit breaker state transitions generate PagerDuty alerts for high-severity dependencies and Slack notifications for lower-severity ones, with the dependency name and failure reason in the alert payload.
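The state gauge and transition events described above can be sketched with a plain in-process registry (a real system would export these through a Prometheus client or similar; the class and field names here are assumptions):

```python
# Sketch of circuit breaker metrics: state exported as a numeric gauge
# (CLOSED=0, HALF-OPEN=1, OPEN=2) plus structured transition events.
STATE_VALUES = {"CLOSED": 0, "HALF_OPEN": 1, "OPEN": 2}

class BreakerMetrics:
    def __init__(self, dependency):
        self.dependency = dependency
        self.state_gauge = STATE_VALUES["CLOSED"]
        self.transitions = []           # structured events for timeline reconstruction
        self.fallback_invocations = 0   # degraded-mode indicator, even while CLOSED

    def on_transition(self, old_state, new_state, reason):
        self.state_gauge = STATE_VALUES[new_state]
        # Emitted both as a log line and a metric increment in a real system;
        # the dependency name and failure reason go into the alert payload.
        self.transitions.append({
            "dependency": self.dependency,
            "from": old_state,
            "to": new_state,
            "reason": reason,
        })

    def on_fallback(self):
        self.fallback_invocations += 1
```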

Service Mesh Integration

Envoy proxy implements outlier detection, a passive circuit breaker that operates at the load balancing layer rather than the application layer. Envoy tracks 5xx response rates and request latency per upstream host; when a host exceeds the consecutive 5xx threshold (configurable, default 5), it is ejected from the load balancing set for a base ejection interval (default 30 seconds, doubling on repeated ejections). The ejection percentage cap (default 10%) ensures that at most 10% of hosts in a cluster can be ejected simultaneously, preventing a correlated failure from ejecting all hosts and making the service unreachable. Envoy’s outlier detection is passive (it observes real traffic outcomes) and complements active application-level circuit breakers that can trip on timeout or custom exception types. Using both layers provides defense in depth: Envoy ejection handles bad hosts within a healthy cluster, while the application circuit breaker handles whole-service failures. Envoy reports ejection events to Prometheus via /stats, and these metrics feed the same observability stack as application-level circuit breaker metrics.
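As a rough illustration, the outlier detection settings described above map onto an Envoy cluster configuration fragment like this (values mirror the defaults mentioned; surrounding cluster fields are omitted):

```yaml
# Illustrative Envoy cluster fragment for passive outlier detection.
clusters:
  - name: downstream_service
    outlier_detection:
      consecutive_5xx: 5          # eject a host after 5 consecutive 5xx responses
      base_ejection_time: 30s     # base ejection interval; doubles on repeats
      max_ejection_percent: 10    # at most 10% of hosts ejected simultaneously
      interval: 10s               # how often ejection analysis sweeps run
```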
