Low Level Design: Circuit Breaker Service

A circuit breaker is a stability pattern that stops a system from hammering a failing dependency, giving it time to recover. The naive version is a single boolean in memory; the production version tracks error rates across a sliding window, coordinates state across multiple service instances, exposes metrics, and supports configurable fallback strategies. This post designs it end-to-end.

Requirements

  • Wrap outbound calls to any downstream dependency (HTTP, gRPC, DB, cache).
  • Three states: CLOSED (normal), OPEN (short-circuit, fail fast), HALF-OPEN (probe for recovery).
  • Error rate threshold: open the breaker when error rate exceeds X% over a rolling window of N calls or T seconds.
  • Automatic transition: OPEN -> HALF-OPEN after a cooldown duration.
  • In HALF-OPEN, allow a probe request; on success close the breaker, on failure re-open.
  • Pluggable fallback: return cached response, default value, or propagate error.
  • Distributed state: all instances of a service share breaker state to avoid thundering herds.

Data Model

circuit_breakers

CREATE TABLE circuit_breakers (
  id              BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  service_id      VARCHAR(128) NOT NULL,
  dependency_name VARCHAR(128) NOT NULL,
  state           ENUM('CLOSED','OPEN','HALF_OPEN') NOT NULL DEFAULT 'CLOSED',
  error_threshold TINYINT UNSIGNED NOT NULL DEFAULT 50 COMMENT 'percent',
  volume_threshold SMALLINT UNSIGNED NOT NULL DEFAULT 20 COMMENT 'min calls in window',
  window_seconds  SMALLINT UNSIGNED NOT NULL DEFAULT 60,
  cooldown_seconds SMALLINT UNSIGNED NOT NULL DEFAULT 30,
  half_open_probes TINYINT UNSIGNED NOT NULL DEFAULT 1,
  fallback_strategy ENUM('CACHE','DEFAULT','ERROR') NOT NULL DEFAULT 'ERROR',
  opened_at       DATETIME NULL,
  updated_at      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  UNIQUE KEY uq_service_dep (service_id, dependency_name)
);

call_records (time-series, partitioned by day)

CREATE TABLE call_records (
  id            BIGINT UNSIGNED AUTO_INCREMENT,
  breaker_id    BIGINT UNSIGNED NOT NULL,
  recorded_at   DATETIME(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
  success       TINYINT(1) NOT NULL,
  latency_ms    SMALLINT UNSIGNED NOT NULL,
  PRIMARY KEY (id, recorded_at),
  INDEX idx_breaker_time (breaker_id, recorded_at)
) PARTITION BY RANGE (TO_DAYS(recorded_at)) (
  -- MySQL requires constant partition bounds; NOW() is not allowed in
  -- VALUES LESS THAN. A daily maintenance job adds tomorrow's partition
  -- and drops expired ones (dates below are illustrative).
  PARTITION p20250101 VALUES LESS THAN (TO_DAYS('2025-01-02')),
  PARTITION p20250102 VALUES LESS THAN (TO_DAYS('2025-01-03'))
);

state_transitions

CREATE TABLE state_transitions (
  id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  breaker_id  BIGINT UNSIGNED NOT NULL,
  from_state  ENUM('CLOSED','OPEN','HALF_OPEN') NOT NULL,
  to_state    ENUM('CLOSED','OPEN','HALF_OPEN') NOT NULL,
  reason      VARCHAR(255),
  transitioned_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_breaker (breaker_id, transitioned_at)
);

fallback_cache

CREATE TABLE fallback_cache (
  breaker_id  BIGINT UNSIGNED NOT NULL,
  cache_key   VARCHAR(255) NOT NULL,
  response    MEDIUMBLOB NOT NULL,
  cached_at   DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  expires_at  DATETIME NOT NULL,
  PRIMARY KEY (breaker_id, cache_key)
);

State Machine

          volume & error rate > threshold
CLOSED  ----------------------------------------> OPEN
  ^                                                  |
  |  probe success                                   | cooldown elapsed
  |                                                  v
  +----------------------------------------------- HALF_OPEN
          probe failure: HALF_OPEN -> OPEN again
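
The diagram reduces to a small transition table. A minimal Python sketch (the event names here are illustrative, not part of the design):

```python
from enum import Enum

class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"

def next_state(state: State, event: str) -> State:
    """Return the next breaker state for a given event, per the diagram."""
    transitions = {
        (State.CLOSED, "threshold_exceeded"): State.OPEN,
        (State.OPEN, "cooldown_elapsed"): State.HALF_OPEN,
        (State.HALF_OPEN, "probe_success"): State.CLOSED,
        (State.HALF_OPEN, "probe_failure"): State.OPEN,
    }
    # Events that do not apply to the current state leave it unchanged.
    return transitions.get((state, event), state)
```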

Transition Logic

CLOSED -> OPEN: A background evaluator runs every second per breaker. It queries call_records for the last window_seconds seconds. If total_calls >= volume_threshold AND (error_calls / total_calls) >= error_threshold / 100.0, it acquires a distributed lock (Redis SET with NX and EX on a per-breaker key, so the lock expires if the holder dies) and updates the state to OPEN, recording opened_at.
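
The open decision is a pure function of the window counts. A sketch of the check described above:

```python
def should_open(total_calls: int, error_calls: int,
                volume_threshold: int, error_threshold_pct: int) -> bool:
    """Both conditions from the transition rule must hold:
    enough traffic in the window AND a high enough error rate."""
    if total_calls < volume_threshold:
        return False  # too few calls to judge; stay CLOSED
    return (error_calls / total_calls) >= error_threshold_pct / 100.0
```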

OPEN -> HALF_OPEN: Any call that arrives while OPEN checks whether NOW() - opened_at >= cooldown_seconds. If true, one instance wins an atomic compare-and-swap in Redis (typically a small Lua script, since plain string commands offer no CAS) to transition to HALF_OPEN. Other instances continue to short-circuit until the probe resolves.
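
A pure-Python stand-in for that compare-and-swap, where the dict plays the role of Redis; in production the read-compare-write would be one Lua script run via EVAL so it executes atomically:

```python
def try_transition(store: dict, key: str, expected: str, new: str) -> bool:
    """Simulated compare-and-swap: write `new` only if the current
    value is `expected`. Returns True for the single winning instance."""
    if store.get(key) == expected:
        store[key] = new
        return True
    return False
```

With two instances racing, only the first caller succeeds; the loser keeps short-circuiting.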

HALF_OPEN -> CLOSED: The probe call is made. On success, state transitions to CLOSED and the error counters reset.

HALF_OPEN -> OPEN: On probe failure, state transitions back to OPEN with a refreshed opened_at. Cooldown timer restarts.

Error Rate Tracking

Two strategies are common:

  • Count-based sliding window: Keep the last N call outcomes in a circular buffer in Redis (LPUSH + LTRIM). Compute error rate on each call. Low latency, no DB writes on the hot path.
  • Time-based sliding window: Store call outcomes with timestamps. Query the last T seconds. More accurate for bursty traffic patterns. The call_records table supports this; for high throughput, write to Redis first and flush to DB asynchronously.

Production recommendation: use a count-based window in Redis for real-time evaluation and replicate to the DB for audit and dashboards.

Redis data structure for count-based window:

Key: cb:{breaker_id}:window
Type: List
Format: each element is '1' (success) or '0' (error)
On each call: LPUSH key result; LTRIM key 0 N-1; then LLEN key and LRANGE key 0 -1 to count errors.
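
An in-process sketch of this window; each operation is commented with the Redis command it mirrors:

```python
from collections import deque

class CountWindow:
    """Count-based sliding window over the last N call outcomes.
    A deque with maxlen stands in for the Redis list kept by LPUSH + LTRIM."""

    def __init__(self, n: int):
        self.buf = deque(maxlen=n)        # LTRIM key 0 N-1 (bounded list)

    def record(self, success: bool) -> None:
        self.buf.appendleft(1 if success else 0)  # LPUSH key '1'/'0'

    def error_rate(self) -> float:
        if not self.buf:                  # LLEN key == 0
            return 0.0
        errors = sum(1 for x in self.buf if x == 0)  # LRANGE key 0 -1, count '0's
        return errors / len(self.buf)
```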

Distributed State Sharing

Without coordination, instance A may close the breaker while instance B still sees it as OPEN. The pattern:

  1. Redis as the source of truth for state: Store cb:{breaker_id}:state as a string (CLOSED/OPEN/HALF_OPEN) with an optional expiry on OPEN equal to cooldown_seconds. When the key expires, the breaker becomes eligible to transition to HALF_OPEN.
  2. Local cache with TTL: Each instance caches the state for 500 ms to avoid a Redis round-trip on every call. Acceptable inconsistency window: sub-second.
  3. State change pub/sub: When any instance changes state, it publishes to a Redis channel cb:state-change:{breaker_id}. Other instances invalidate their local cache immediately.
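
Item 2 can be sketched as a tiny TTL cache; the pub/sub handler from item 3 would call invalidate() on a state-change message (the fetch callback stands in for the Redis GET):

```python
import time

class LocalStateCache:
    """Per-instance cache of breaker state with a short TTL.
    Avoids a Redis round-trip on every call; bounded staleness = ttl."""

    def __init__(self, ttl: float = 0.5):
        self.ttl = ttl
        self.value = None
        self.fetched_at = 0.0

    def get(self, fetch):
        now = time.monotonic()
        if self.value is None or now - self.fetched_at > self.ttl:
            self.value = fetch()      # on miss: GET cb:{breaker_id}:state
            self.fetched_at = now
        return self.value

    def invalidate(self):
        """Called by the pub/sub handler on a state-change message."""
        self.value = None
```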

Fallback Strategies

  • CACHE: Return the last successful response stored in fallback_cache for the same logical request key. Stale but functional. Expire cache entries after a configurable TTL (default 5 minutes).
  • DEFAULT: Return a hardcoded default value defined at registration time (e.g., empty list, zero balance, feature flag = off). Simple and predictable.
  • ERROR: Propagate a CircuitOpenException to the caller immediately without waiting. Callers must handle this explicitly. Forces upstream services to implement their own fallback.
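
A sketch of the dispatch on fallback_strategy; the dict stands in for the fallback_cache table and the names are illustrative:

```python
class CircuitOpenException(Exception):
    """Raised when the breaker is open and no fallback applies."""

def apply_fallback(strategy: str, cache: dict, cache_key: str, default=None):
    """Dispatch on the configured strategy; values mirror the ENUM above."""
    if strategy == "CACHE":
        if cache_key in cache:        # last successful response, possibly stale
            return cache[cache_key]
        raise CircuitOpenException("circuit open, no cached response")
    if strategy == "DEFAULT":
        return default                # registered default value
    raise CircuitOpenException("circuit open")  # ERROR: fail fast
```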

Key Design Decisions and Trade-offs

  • Per-instance vs. distributed breaker: A per-instance breaker opens faster (no coordination latency) but causes thundering herd when 100 instances simultaneously transition to HALF_OPEN and all send probe requests. A distributed breaker serializes the probe but requires Redis. Use distributed for shared dependencies with rate limits.
  • Volume threshold: Without a minimum call volume, a single failed call on a cold system opens the breaker. The volume_threshold prevents this. Set it based on expected RPS * window_seconds * 0.1 to require at least 10% of expected traffic before evaluating.
  • Latency-based opening: Errors alone miss slow dependencies. Add a percentile latency threshold (P99 > 2 s = error equivalent). Record latency in call_records and include it in the evaluator.
  • Bulkhead integration: Combine the circuit breaker with a thread pool or semaphore bulkhead. Cap concurrent calls to the dependency at max_concurrent. Rejections from the bulkhead count as errors toward the breaker threshold.
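
The bulkhead half of the last bullet can be sketched with a non-blocking semaphore; a rejected acquisition would then be recorded as an error in the breaker's window:

```python
import threading

class Bulkhead:
    """Cap concurrent calls to the dependency at max_concurrent.
    try_acquire() returning False is a rejection that counts toward
    the breaker threshold, per the bullet above."""

    def __init__(self, max_concurrent: int):
        self.sem = threading.Semaphore(max_concurrent)

    def try_acquire(self) -> bool:
        return self.sem.acquire(blocking=False)  # never waits

    def release(self) -> None:
        self.sem.release()
```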

Failure Handling and Edge Cases

  • Redis unavailable: Fall back to local in-process state. Log a warning. Do not open the breaker based on Redis errors alone.
  • Evaluator lag: The background evaluator checks every second; a sudden error spike can cause up to 1 s of additional calls before the breaker opens. Reduce evaluator interval to 100 ms for critical dependencies, accepting higher CPU cost.
  • Probe request fails due to unrelated reason: A probe that fails due to a client-side bug (e.g., bad request) should not re-open the breaker. Classify errors: infrastructure errors (connection refused, timeout) count; business errors (404, 400) do not.
  • Half-open probe storms: Multiple instances may all detect cooldown expiry simultaneously. Use a Redis lock acquired with SET key value NX EX 5 (SETNX alone cannot attach a TTL) to ensure only one instance sends the probe. Others return fallback for that duration.
  • Breaker configuration drift: If thresholds change at runtime, the existing sliding window may immediately trigger a transition. Apply new config only to fresh windows.

Scalability Considerations

  • Many dependencies: At 1000 breakers each evaluated every second, that is 1000 Redis reads per second — negligible. The evaluator can run as a single lightweight process (circuit breaker control plane) rather than in every application instance.
  • High-throughput hot path: The per-call overhead is one local cache lookup (sub-microsecond) plus, on cache miss, one Redis GET (~0.5 ms). For 100k RPS, worst-case cache miss rate of 0.2% = 200 Redis calls/s — well within Redis limits.
  • Metrics and observability: Export breaker state, error rate, and latency percentiles to Prometheus. Alert on OPEN state duration > 5 minutes (dependency not recovering) and on HALF_OPEN -> OPEN repeated transitions (flapping).
  • Multi-region: Each region runs its own Redis cluster for breaker state. Cross-region sharing is rarely needed because network partitions between regions should open the breaker independently per region.

Summary

A production circuit breaker service is more than a flag. It requires a precise state machine, a calibrated sliding window for error rate tracking, distributed state coordination to avoid thundering herds, and a fallback layer that degrades gracefully. The core trade-off is between reaction speed (short windows, frequent evaluation) and stability (minimum volume thresholds, probe serialization). Get these parameters right per dependency and the circuit breaker becomes one of the highest-leverage reliability primitives in your stack.

FAQ

What is a circuit breaker pattern and why is it used in distributed systems?

The circuit breaker pattern prevents a service from repeatedly attempting an operation that is likely to fail, protecting the system from cascading failures. When a downstream dependency becomes slow or unavailable, the circuit breaker trips and short-circuits subsequent calls, returning a fast failure or fallback response instead of waiting for a timeout. This preserves resources, keeps latency predictable, and gives the failing dependency time to recover without being overwhelmed by continued traffic.

What are the three states of a circuit breaker and how do transitions work?

A circuit breaker has three states: Closed, Open, and Half-Open. In the Closed state, requests pass through normally and failures are counted. When failures exceed a configured threshold within a time window, the breaker transitions to Open, blocking all requests and returning failures immediately. After a cooldown period, the breaker moves to Half-Open, allowing a limited number of probe requests through. If those succeed, the breaker returns to Closed; if they fail, it returns to Open and the cooldown resets.

How does a circuit breaker service share state across multiple instances?

When multiple instances of a service each maintain their own local circuit breaker state, a failure threshold may never be reached on any single instance even though collectively the dependency is degraded. To share state, instances publish their failure counts and breaker status to a central store such as Redis or a distributed cache. Each instance reads the aggregated state and makes trip decisions based on cluster-wide metrics. Alternatively, a sidecar or service mesh (like Envoy) can centralize circuit breaker logic outside the application entirely.

How is the half-open state used to probe service recovery?

The half-open state acts as a controlled recovery probe. After the circuit has been open for the configured cooldown duration, the breaker allows a small number of requests (often just one) to pass through to the dependency. If those requests succeed within acceptable latency, it signals that the dependency has recovered and the breaker closes, resuming normal traffic. If the probes fail, the breaker trips back to open and restarts the cooldown. This approach avoids hammering a recovering service with full traffic immediately after it comes back online.

