Circuit Breaker Pattern — Low-Level Design
A circuit breaker prevents cascading failures by stopping requests to a failing downstream service. When error rates exceed a threshold, the breaker “opens” and returns failures immediately without making network calls. This pattern comes up in interviews at Netflix, Amazon, and any company operating microservices at scale.
Circuit Breaker States
CLOSED → Normal operation. Requests pass through.
- Failure counter increments on each error.
- When failures exceed the threshold: transition to OPEN.

OPEN → Short-circuit. Requests fail immediately (no network call).
- After a recovery timeout (e.g., 30 seconds): transition to HALF_OPEN.

HALF_OPEN → Probe state. Allow a limited number of requests through.
- If they succeed: transition back to CLOSED (reset counters).
- If they fail: transition back to OPEN (restart the timeout).
Implementation
```python
import time
import threading
from enum import Enum


class State(Enum):
    CLOSED = 'closed'
    OPEN = 'open'
    HALF_OPEN = 'half_open'


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is short-circuited."""


class CircuitBreaker:
    def __init__(self, name, failure_threshold=5, recovery_timeout=30,
                 half_open_max_calls=3):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self._state = State.CLOSED
        self._failure_count = 0
        self._last_failure_time = None
        self._half_open_calls = 0
        self._lock = threading.Lock()

    @property
    def state(self):
        with self._lock:
            # Lazy transition: OPEN → HALF_OPEN once the timeout elapses.
            if self._state == State.OPEN:
                if time.time() - self._last_failure_time > self.recovery_timeout:
                    self._state = State.HALF_OPEN
                    self._half_open_calls = 0
            return self._state

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == State.OPEN:
            raise CircuitOpenError(f'Circuit {self.name} is OPEN')
        if state == State.HALF_OPEN:
            with self._lock:
                if self._half_open_calls >= self.half_open_max_calls:
                    raise CircuitOpenError(
                        f'Circuit {self.name} is HALF_OPEN (probe limit)')
                self._half_open_calls += 1
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        with self._lock:
            self._failure_count = 0
            self._state = State.CLOSED

    def _on_failure(self):
        with self._lock:
            self._failure_count += 1
            self._last_failure_time = time.time()
            if self._failure_count >= self.failure_threshold:
                self._state = State.OPEN


# Usage
payment_breaker = CircuitBreaker('payment-service', failure_threshold=5)

def charge_customer(user_id, amount):
    try:
        return payment_breaker.call(payment_service.charge, user_id, amount)
    except CircuitOpenError:
        # Fallback: queue for retry later
        queue_charge_for_retry(user_id, amount)
        raise ServiceUnavailable('Payment service temporarily unavailable')
```
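Invoking the breaker explicitly via call() works, but in practice the pattern is often packaged as a decorator. Below is a minimal sketch of that shape — note it uses only a stripped-down consecutive-failure counter with no recovery logic, not the full class above:

```python
import functools


class CircuitOpenError(Exception):
    pass


def circuit_breaker(failure_threshold=5):
    """Decorator sketch: fail fast once `failure_threshold`
    consecutive errors have been observed (no OPEN → HALF_OPEN recovery)."""
    def decorator(func):
        failures = {'count': 0}  # mutable closure state

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if failures['count'] >= failure_threshold:
                raise CircuitOpenError(f'{func.__name__} circuit is OPEN')
            try:
                result = func(*args, **kwargs)
                failures['count'] = 0  # any success resets the counter
                return result
            except Exception:
                failures['count'] += 1
                raise
        return wrapper
    return decorator


@circuit_breaker(failure_threshold=3)
def flaky():
    raise RuntimeError('downstream is down')
```

After three consecutive RuntimeErrors, further calls to flaky() raise CircuitOpenError without invoking the function body at all.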
Sliding Window Failure Rate
```python
# Simple counter (above) is vulnerable to bursts at reset boundaries.
# Better: sliding window over the last N requests.
from collections import deque


class SlidingWindowBreaker:
    def __init__(self, name, window_size=20, failure_rate_threshold=0.5):
        self.name = name
        self.window_size = window_size
        self.failure_rate_threshold = failure_rate_threshold
        self._results = deque(maxlen=window_size)  # True=success, False=failure
        self._state = State.CLOSED
        self._open_until = None
        self._lock = threading.Lock()

    def _failure_rate(self):
        if not self._results:
            return 0.0
        return self._results.count(False) / len(self._results)

    def _on_result(self, success):
        with self._lock:
            self._results.append(success)
            if len(self._results) == self.window_size:
                if self._failure_rate() >= self.failure_rate_threshold:
                    self._state = State.OPEN
                    self._open_until = time.time() + 30
```
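To see why the window matters, consider a call history where failures are interleaved with successes. A consecutive-failure counter with threshold 5 never trips on the sequence below (its longest failure run is 4), yet the windowed failure rate is 0.6 and crosses a 0.5 threshold. A small standalone helper (not part of the classes above) illustrates the computation:

```python
from collections import deque


def window_failure_rate(results, window_size=20):
    """Failure rate over the last `window_size` results.
    `results` is a sequence of booleans (True = success, False = failure)."""
    window = deque(results, maxlen=window_size)  # keeps only the newest N
    if not window:
        return 0.0
    return window.count(False) / len(window)


# 16 alternating calls (8 failures) followed by a burst of 4 failures:
history = [False, True] * 8 + [False] * 4
rate = window_failure_rate(history)  # 12 failures / 20 calls = 0.6
```

A time-based window (failures in the last N seconds) is the other common variant; the count-based version shown here is simpler to reason about in an interview.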
Distributed Circuit Breaker (Redis-Backed)
```python
# In a microservice with many instances, each instance having its own
# in-memory breaker means the circuit won't open until EACH instance
# accumulates enough failures independently.
# Solution: share state via Redis.
import redis as redis_lib

redis = redis_lib.Redis()  # module-level client shared by the helpers below


def record_result(service_name, success):
    key = f'cb:{service_name}:results'
    pipe = redis.pipeline()
    pipe.rpush(key, 1 if success else 0)
    pipe.ltrim(key, -20, -1)  # Keep last 20 results
    pipe.expire(key, 60)
    pipe.execute()


def get_failure_rate(service_name):
    results = redis.lrange(f'cb:{service_name}:results', 0, -1)
    if len(results) < 10:
        return 0.0  # Not enough data
    failures = results.count(b'0')
    return failures / len(results)


def is_open(service_name):
    return redis.exists(f'cb:{service_name}:open')


def trip_breaker(service_name, timeout=30):
    # When this key expires, the breaker is effectively HALF_OPEN again.
    redis.setex(f'cb:{service_name}:open', timeout, 1)
```
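Tying the helpers together, a guarded call might look like the sketch below. The breaker helpers are repeated from above so the snippet is self-contained; `FakeRedis` is a hypothetical in-memory stand-in implementing only the commands used here, so the flow can be exercised without a server — in production the module-level `redis` would be a real `redis.Redis` client:

```python
import time


class CircuitOpenError(Exception):
    pass


class FakeRedis:
    """Minimal in-memory stand-in for the Redis commands used below.
    Only the `setex` expiry is honored; list TTLs are ignored."""

    def __init__(self):
        self._lists = {}
        self._expiry = {}  # key -> expiry timestamp

    def pipeline(self):
        return self  # commands apply immediately in this stub

    def execute(self):
        pass

    def rpush(self, key, value):
        self._lists.setdefault(key, []).append(str(value).encode())

    def _slice(self, lst, start, end):
        n = len(lst)
        if start < 0:
            start += n
        if end < 0:
            end += n
        return lst[max(start, 0):end + 1]

    def ltrim(self, key, start, end):
        self._lists[key] = self._slice(self._lists.get(key, []), start, end)

    def lrange(self, key, start, end):
        return self._slice(self._lists.get(key, []), start, end)

    def expire(self, key, ttl):
        pass  # TTL on the results list is ignored in this stub

    def setex(self, key, ttl, value):
        self._expiry[key] = time.time() + ttl

    def exists(self, key):
        expiry = self._expiry.get(key)
        return 1 if expiry is not None and expiry > time.time() else 0


redis = FakeRedis()  # in production: redis.Redis()


def record_result(service_name, success):
    key = f'cb:{service_name}:results'
    pipe = redis.pipeline()
    pipe.rpush(key, 1 if success else 0)
    pipe.ltrim(key, -20, -1)
    pipe.expire(key, 60)
    pipe.execute()


def get_failure_rate(service_name):
    results = redis.lrange(f'cb:{service_name}:results', 0, -1)
    if len(results) < 10:
        return 0.0
    return results.count(b'0') / len(results)


def is_open(service_name):
    return redis.exists(f'cb:{service_name}:open')


def trip_breaker(service_name, timeout=30):
    redis.setex(f'cb:{service_name}:open', timeout, 1)


def guarded_call(service_name, func, *args, **kwargs):
    """Check shared state, attempt the call, record the outcome,
    and trip the breaker once the failure rate reaches 50%."""
    if is_open(service_name):
        raise CircuitOpenError(f'Circuit for {service_name} is OPEN')
    try:
        result = func(*args, **kwargs)
        record_result(service_name, True)
        return result
    except Exception:
        record_result(service_name, False)
        if get_failure_rate(service_name) >= 0.5:
            trip_breaker(service_name)
        raise
```

After ten straight failures the shared failure rate reaches 1.0, the open key is set, and every instance reading that key short-circuits immediately.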
Key Interview Points
- Three states are mandatory: HALF_OPEN is essential — without it, you either stay open forever or flip back fully closed after one success, neither of which is correct.
- Distinguish circuit open from service error: The caller must handle CircuitOpenError differently from a real downstream error. Circuit open → fallback to cache or degraded response. Service error → may retry.
- Per-instance vs shared state: In-memory breakers protect each instance but require independent failure accumulation. Redis-backed shared state opens faster but adds a Redis dependency. Choose based on whether protecting the downstream or protecting your own instance is the priority.
- Timeout tuning: Recovery timeout (30s) must be longer than the downstream service’s restart time. Failure threshold (5 errors) must be low enough to open before the downstream is overwhelmed.
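One common refinement of the timeout-tuning point — not implemented in the code above — is to back off the recovery timeout exponentially when the breaker re-opens repeatedly, and to add jitter so that many instances do not all probe the downstream at the same instant. A sketch:

```python
import random


def recovery_timeout(reopen_count, base=30.0, cap=300.0, jitter=0.2):
    """Recovery timeout for successive OPEN periods: exponential backoff
    with jitter. `reopen_count` = consecutive re-opens so far."""
    timeout = min(base * (2 ** reopen_count), cap)
    # Spread probes across instances: +/- `jitter` fraction, at random.
    return timeout * (1 + random.uniform(-jitter, jitter))
```

With the defaults, the first OPEN period lasts roughly 30s, doubling on each failed probe round up to a 5-minute cap.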