Circuit Breaker Pattern — Low-Level Design
A circuit breaker prevents cascading failures by stopping requests to a failing downstream service. When the error count or error rate crosses a threshold, the breaker “opens” and fails requests immediately without making network calls. The pattern comes up frequently in system design interviews at Netflix, Amazon, and other companies operating microservices at scale.
Circuit Breaker States
CLOSED → Normal operation. Requests pass through.
    Failure counter increments on each error.
    When failures reach the threshold: transition to OPEN.
OPEN → Short-circuit. Requests fail immediately (no network call).
    After a timeout (e.g., 30 seconds): transition to HALF_OPEN.
HALF_OPEN → Probe state. Allow a limited number of requests through.
    If they succeed: transition back to CLOSED (reset counters).
    If they fail: transition back to OPEN (restart timeout).
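The transitions above can be written down as a small lookup table. This is only a sketch: the event names (`threshold_exceeded`, `timeout_elapsed`, `probe_succeeded`, `probe_failed`) are hypothetical labels chosen for illustration, not part of any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = 'closed'
    OPEN = 'open'
    HALF_OPEN = 'half_open'


# Event names are hypothetical labels for this sketch.
TRANSITIONS = {
    (State.CLOSED, 'threshold_exceeded'): State.OPEN,
    (State.OPEN, 'timeout_elapsed'): State.HALF_OPEN,
    (State.HALF_OPEN, 'probe_succeeded'): State.CLOSED,
    (State.HALF_OPEN, 'probe_failed'): State.OPEN,
}


def next_state(state, event):
    # Any (state, event) pair not listed leaves the state unchanged.
    return TRANSITIONS.get((state, event), state)
```

Keeping the table separate from the counting and timing logic makes it easy to check that there is no direct OPEN to CLOSED edge: recovery always passes through HALF_OPEN.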
Implementation
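import time
import threading
from enum import Enum


class State(Enum):
    CLOSED = 'closed'
    OPEN = 'open'
    HALF_OPEN = 'half_open'


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the target."""


class CircuitBreaker:
    def __init__(self, name, failure_threshold=5, recovery_timeout=30,
                 half_open_max_calls=3):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds
        self.half_open_max_calls = half_open_max_calls
        self._state = State.CLOSED
        self._failure_count = 0
        self._last_failure_time = None
        self._half_open_calls = 0
        self._lock = threading.Lock()

    def _refresh_state(self):
        # Caller must hold self._lock. Moves OPEN -> HALF_OPEN once the
        # recovery timeout has elapsed. time.monotonic() is used so that
        # wall-clock adjustments cannot shorten or extend the timeout.
        if self._state == State.OPEN:
            if time.monotonic() - self._last_failure_time > self.recovery_timeout:
                self._state = State.HALF_OPEN
                self._half_open_calls = 0
        return self._state

    @property
    def state(self):
        with self._lock:
            return self._refresh_state()

    def call(self, func, *args, **kwargs):
        # Check the state and reserve a probe slot under a single lock
        # acquisition so concurrent callers cannot exceed half_open_max_calls.
        with self._lock:
            state = self._refresh_state()
            if state == State.OPEN:
                raise CircuitOpenError(f'Circuit {self.name} is OPEN')
            if state == State.HALF_OPEN:
                if self._half_open_calls >= self.half_open_max_calls:
                    raise CircuitOpenError(
                        f'Circuit {self.name} is HALF_OPEN (probe limit)')
                self._half_open_calls += 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self._lock:
            self._failure_count = 0
            # Only a successful probe closes the circuit. A slow call that
            # finishes after the breaker has opened must not close it.
            if self._state == State.HALF_OPEN:
                self._state = State.CLOSED

    def _on_failure(self):
        with self._lock:
            self._failure_count += 1
            self._last_failure_time = time.monotonic()
            # A failed probe in HALF_OPEN reopens immediately; in CLOSED the
            # breaker opens once the failure count reaches the threshold.
            if (self._state == State.HALF_OPEN
                    or self._failure_count >= self.failure_threshold):
                self._state = State.OPEN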
# Usage
# payment_service, queue_charge_for_retry, and ServiceUnavailable are
# application-specific names assumed to be defined elsewhere.
payment_breaker = CircuitBreaker('payment-service', failure_threshold=5)


def charge_customer(user_id, amount):
    try:
        return payment_breaker.call(payment_service.charge, user_id, amount)
    except CircuitOpenError:
        # Fallback: queue for retry later
        queue_charge_for_retry(user_id, amount)
        raise ServiceUnavailable('Payment service temporarily unavailable')
Sliding Window Failure Rate
# The consecutive-failure counter above resets on every success, so a
# service that fails intermittently (say, every other request) never trips it.
# Better: track the failure rate over a sliding window of the last N requests.
from collections import deque


class SlidingWindowBreaker:
    def __init__(self, name, window_size=20, failure_rate_threshold=0.5,
                 recovery_timeout=30):
        self.name = name
        self.window_size = window_size
        self.failure_rate_threshold = failure_rate_threshold
        self.recovery_timeout = recovery_timeout  # seconds to stay OPEN
        self._results = deque(maxlen=window_size)  # True=success, False=failure
        self._state = State.CLOSED
        self._open_until = None
        self._lock = threading.Lock()

    def _failure_rate(self):
        # Caller must hold self._lock.
        if not self._results:
            return 0.0
        return self._results.count(False) / len(self._results)

    def _on_result(self, success):
        with self._lock:
            self._results.append(success)
            # Evaluate only once the window is full, so a handful of early
            # failures cannot trip the breaker.
            if len(self._results) == self.window_size:
                if self._failure_rate() >= self.failure_rate_threshold:
                    self._state = State.OPEN
                    self._open_until = time.time() + self.recovery_timeout
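To see why the window matters, the two trip rules can be compared on the same stream of outcomes. The sketch below is a standalone illustration, not the classes above: the function names are made up for this example, and the defaults (threshold 5, window 20, rate 0.5) simply mirror the values used earlier.

```python
from collections import deque


def consecutive_counter_trips(outcomes, threshold=5):
    # Mirrors a counter that resets to zero on every success.
    failures = 0
    for ok in outcomes:
        failures = 0 if ok else failures + 1
        if failures >= threshold:
            return True
    return False


def sliding_window_trips(outcomes, window_size=20, rate_threshold=0.5):
    # Trips once a full window has a failure rate at or above the threshold.
    window = deque(maxlen=window_size)
    for ok in outcomes:
        window.append(ok)
        if len(window) == window_size:
            if window.count(False) / window_size >= rate_threshold:
                return True
    return False


# 50% of requests fail, but never two in a row.
alternating = [True, False] * 20
print(consecutive_counter_trips(alternating))  # False
print(sliding_window_trips(alternating))       # True
```

With alternating outcomes the consecutive counter never gets past 1, while the first full window already contains 10 failures out of 20 and meets the 0.5 threshold.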
Distributed Circuit Breaker (Redis-Backed)
# In a microservice with many instances, per-instance in-memory breakers
# mean the circuit does not open on a given instance until that instance
# has accumulated enough failures on its own.
# Solution: share state via Redis.
import redis

# Assumed connection settings; adjust for your deployment.
client = redis.Redis(host='localhost', port=6379)


def record_result(service_name, success):
    key = f'cb:{service_name}:results'
    pipe = client.pipeline()
    pipe.rpush(key, 1 if success else 0)
    pipe.ltrim(key, -20, -1)  # Keep last 20 results
    pipe.expire(key, 60)
    pipe.execute()


def get_failure_rate(service_name):
    # Values come back as bytes unless the client uses decode_responses=True.
    results = client.lrange(f'cb:{service_name}:results', 0, -1)
    if len(results) < 10:
        return 0.0  # Not enough data
    failures = results.count(b'0')
    return failures / len(results)


def is_open(service_name):
    return bool(client.exists(f'cb:{service_name}:open'))


def trip_breaker(service_name, timeout=30):
    # The key expires after `timeout` seconds, which closes the breaker again.
    client.setex(f'cb:{service_name}:open', timeout, 1)
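The helpers above still need to be tied into a guarded call. One possible way to do that is sketched below. The function name `call_with_shared_breaker`, its parameters, and the local `CircuitOpenError` class are assumptions made for this example; `client` is expected to be a redis-py style client (anything providing `exists`, `setex`, `lrange`, and `pipeline`). The check, the record, and the trip are separate round trips, so this is not atomic across instances, and it has no HALF_OPEN probing: once the open key expires, full traffic resumes.

```python
class CircuitOpenError(Exception):
    """Hypothetical error raised when the shared breaker is open."""


def _record(client, results_key, success):
    pipe = client.pipeline()
    pipe.rpush(results_key, 1 if success else 0)
    pipe.ltrim(results_key, -20, -1)  # keep the last 20 results
    pipe.expire(results_key, 60)
    pipe.execute()


def call_with_shared_breaker(client, service_name, func, *args,
                             failure_rate_threshold=0.5, min_samples=10,
                             open_seconds=30, **kwargs):
    results_key = f'cb:{service_name}:results'
    open_key = f'cb:{service_name}:open'

    # Fail fast if any instance has tripped the breaker.
    if client.exists(open_key):
        raise CircuitOpenError(f'Circuit {service_name} is OPEN')

    try:
        result = func(*args, **kwargs)
    except Exception:
        _record(client, results_key, False)
        results = client.lrange(results_key, 0, -1)
        # Accept both bytes and str, depending on decode_responses.
        failures = sum(1 for r in results if r in (b'0', '0'))
        if (len(results) >= min_samples
                and failures / len(results) >= failure_rate_threshold):
            client.setex(open_key, open_seconds, 1)
        raise

    _record(client, results_key, True)
    return result
```

If the race between instances matters, the record-and-trip step could be moved into a Lua script or a transaction so that it runs atomically on the Redis side.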
Key Interview Points
- Three states are mandatory: HALF_OPEN is essential. Without it, the breaker either stays open forever or snaps back to fully closed when the timeout expires and sends full traffic to a service that may still be down. Neither is correct.
- Distinguish circuit open from service error: The caller must handle CircuitOpenError differently from a real downstream error. Circuit open → fallback to cache or degraded response. Service error → may retry.
- Per-instance vs shared state: In-memory breakers protect each instance but require independent failure accumulation. Redis-backed shared state opens faster but adds a Redis dependency. Choose based on whether protecting the downstream or protecting your own instance is the priority.
- Timeout tuning: Recovery timeout (30s) must be longer than the downstream service’s restart time. Failure threshold (5 errors) must be low enough to open before the downstream is overwhelmed.
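The second point, treating an open circuit differently from a real downstream error, might look like this on the caller side. It is a minimal sketch under stated assumptions: `fetch_with_fallback`, `guarded_call`, and `fallback` are hypothetical names, `CircuitOpenError` is redefined locally so the block is self-contained, and `max_attempts` is assumed to be at least 1. Real code would normally add backoff between retries.

```python
class CircuitOpenError(Exception):
    """Hypothetical: raised by the breaker when it rejects a call."""


def fetch_with_fallback(guarded_call, fallback, max_attempts=3):
    last_error = None
    for _ in range(max_attempts):
        try:
            return guarded_call()
        except CircuitOpenError:
            # Breaker is open: retrying cannot help, so serve the
            # cached or degraded response right away.
            return fallback()
        except Exception as exc:
            # Real downstream error: a bounded retry may succeed.
            last_error = exc
    raise last_error
```

Catching `CircuitOpenError` before the generic `Exception` handler is what keeps the two paths separate; with the order reversed, an open circuit would be retried like any other failure.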