Circuit Breaker Pattern Low-Level Design

A circuit breaker prevents cascading failures by stopping requests to a failing downstream service. When the error count or error rate crosses a threshold, the breaker “opens” and fails calls immediately without making network calls. The pattern comes up in system design interviews at Netflix, Amazon, and other companies operating microservices at scale.

Circuit Breaker States

CLOSED  → Normal operation. Requests pass through.
          Failure counter increments on each error and resets on success.
          When consecutive failures reach the threshold: transition to OPEN.

OPEN    → Short-circuit. Requests fail immediately (no network call).
          After a timeout (e.g., 30 seconds): transition to HALF_OPEN.

HALF_OPEN → Probe state. Allow a limited number of requests through.
            If they succeed: transition back to CLOSED (reset counters).
            If they fail: transition back to OPEN (restart timeout).
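
The transitions listed above can be written down as a small lookup table, which is a handy way to check that no transition has been forgotten. This is only a sketch: the event names (`threshold_reached`, `timeout_elapsed`, `probe_succeeded`, `probe_failed`) are hypothetical labels for the conditions described above, not part of any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = 'closed'
    OPEN = 'open'
    HALF_OPEN = 'half_open'


# Event names are hypothetical labels for the conditions described in the text.
TRANSITIONS = {
    (State.CLOSED, 'threshold_reached'): State.OPEN,
    (State.OPEN, 'timeout_elapsed'): State.HALF_OPEN,
    (State.HALF_OPEN, 'probe_succeeded'): State.CLOSED,
    (State.HALF_OPEN, 'probe_failed'): State.OPEN,
}


def next_state(state, event):
    """Return the state after `event`; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Any (state, event) pair missing from the table is treated as a no-op, so a success while CLOSED or a failure while OPEN leaves the state as it is.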

Implementation

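import threading
import time
from enum import Enum


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the target."""


class State(Enum):
    CLOSED = 'closed'
    OPEN = 'open'
    HALF_OPEN = 'half_open'


class CircuitBreaker:
    def __init__(self, name, failure_threshold=5, recovery_timeout=30,
                 half_open_max_calls=3):
        self.name = name
        self.failure_threshold = failure_threshold      # consecutive failures before opening
        self.recovery_timeout = recovery_timeout        # seconds to stay OPEN
        self.half_open_max_calls = half_open_max_calls  # probes allowed in HALF_OPEN

        self._state = State.CLOSED
        self._failure_count = 0
        self._opened_at = None
        self._half_open_calls = 0
        self._half_open_successes = 0
        self._lock = threading.Lock()

    def _refresh_state(self):
        # Caller must hold self._lock. Moves OPEN -> HALF_OPEN once the timeout
        # has elapsed. time.monotonic() is immune to wall-clock adjustments.
        if (self._state == State.OPEN
                and time.monotonic() - self._opened_at >= self.recovery_timeout):
            self._state = State.HALF_OPEN
            self._half_open_calls = 0
            self._half_open_successes = 0
        return self._state

    @property
    def state(self):
        with self._lock:
            return self._refresh_state()

    def call(self, func, *args, **kwargs):
        # Check the state and reserve a probe slot under a single lock so that
        # concurrent callers cannot exceed the probe limit.
        with self._lock:
            state = self._refresh_state()
            if state == State.OPEN:
                raise CircuitOpenError(f'Circuit {self.name} is OPEN')
            if state == State.HALF_OPEN:
                if self._half_open_calls >= self.half_open_max_calls:
                    raise CircuitOpenError(f'Circuit {self.name} is HALF_OPEN (probe limit)')
                self._half_open_calls += 1

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self._lock:
            if self._state == State.HALF_OPEN:
                # Close only after every allowed probe has succeeded,
                # not after the first one.
                self._half_open_successes += 1
                if self._half_open_successes >= self.half_open_max_calls:
                    self._state = State.CLOSED
                    self._failure_count = 0
            elif self._state == State.CLOSED:
                self._failure_count = 0
            # If OPEN, this is a late result from a call admitted earlier; ignore it.

    def _on_failure(self):
        with self._lock:
            if self._state == State.HALF_OPEN:
                self._trip()  # any failed probe reopens immediately
            elif self._state == State.CLOSED:
                self._failure_count += 1
                if self._failure_count >= self.failure_threshold:
                    self._trip()

    def _trip(self):
        # Caller must hold self._lock.
        self._state = State.OPEN
        self._opened_at = time.monotonic()  # (re)start the recovery timeout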
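# Usage
# payment_service and queue_charge_for_retry are assumed to be defined elsewhere;
# CircuitOpenError is the exception raised by CircuitBreaker.call when it rejects a call.
class ServiceUnavailable(Exception):
    pass

payment_breaker = CircuitBreaker('payment-service', failure_threshold=5)

def charge_customer(user_id, amount):
    try:
        return payment_breaker.call(payment_service.charge, user_id, amount)
    except CircuitOpenError:
        # Fallback: queue for retry later, then tell the caller we are degraded
        queue_charge_for_retry(user_id, amount)
        raise ServiceUnavailable('Payment service temporarily unavailable')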

Sliding Window Failure Rate

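# The simple counter above resets on every success, so a service that fails
# most calls but occasionally succeeds may never reach the threshold.
# Better: track the failure rate over a sliding window of the last N requests.

from collections import deque

class SlidingWindowBreaker:
    def __init__(self, name, window_size=20, failure_rate_threshold=0.5,
                 recovery_timeout=30):
        self.name = name
        self.window_size = window_size
        self.failure_rate_threshold = failure_rate_threshold
        self.recovery_timeout = recovery_timeout   # seconds to stay OPEN
        self._results = deque(maxlen=window_size)  # True=success, False=failure
        self._state = State.CLOSED
        self._open_until = None
        self._lock = threading.Lock()

    def _failure_rate(self):
        # Caller must hold self._lock.
        if not self._results:
            return 0.0
        return self._results.count(False) / len(self._results)

    def _on_result(self, success):
        with self._lock:
            self._results.append(success)
            # Evaluate only once the window is full, so a handful of early
            # failures cannot trip the breaker on too little data.
            if len(self._results) == self.window_size:
                if self._failure_rate() >= self.failure_rate_threshold:
                    self._state = State.OPEN
                    self._open_until = time.monotonic() + self.recovery_timeout
                    self._results.clear()  # start from a fresh window after recovery

    # The call() wrapper and HALF_OPEN handling are omitted here; they follow
    # the same shape as in CircuitBreaker above.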
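
To see why the window matters, it may help to replay the same sequence of results through both policies. The sketch below uses two hypothetical helper functions (`trips_consecutive` and `trips_sliding_window`, not part of the classes above) with the same default thresholds as the examples in this article.

```python
from collections import deque


def trips_consecutive(results, failure_threshold=5):
    """True if `failure_threshold` failures ever occur in a row."""
    count = 0
    for ok in results:
        count = 0 if ok else count + 1
        if count >= failure_threshold:
            return True
    return False


def trips_sliding_window(results, window_size=20, failure_rate_threshold=0.5):
    """True if the failure rate over a full window ever reaches the threshold."""
    window = deque(maxlen=window_size)
    for ok in results:
        window.append(ok)
        if len(window) == window_size:
            if window.count(False) / window_size >= failure_rate_threshold:
                return True
    return False


# Four failures followed by one success, repeated: an 80% failure rate.
pattern = [False, False, False, False, True] * 4

print(trips_consecutive(pattern))     # False: the streak never reaches 5
print(trips_sliding_window(pattern))  # True: 16 of 20 calls failed
```

The consecutive counter never sees more than four failures in a row, while the window sees 16 failures out of 20 and trips.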

Distributed Circuit Breaker (Redis-Backed)

# In a microservice with many instances, per-instance in-memory breakers
# mean the circuit won't open until EACH instance accumulates enough
# failures independently.
# Solution: share state via Redis.

import redis

# Connection details are deployment-specific; the defaults target localhost:6379.
client = redis.Redis()

def record_result(service_name, success):
    key = f'cb:{service_name}:results'
    pipe = client.pipeline()
    pipe.rpush(key, 1 if success else 0)
    pipe.ltrim(key, -20, -1)  # Keep last 20 results
    pipe.expire(key, 60)      # Drop stale results when the service goes quiet
    pipe.execute()

def get_failure_rate(service_name):
    results = client.lrange(f'cb:{service_name}:results', 0, -1)
    if len(results) < 10:
        return 0.0  # Not enough data
    failures = results.count(b'0')  # Values come back as bytes by default
    return failures / len(results)

def is_open(service_name):
    # exists() returns the number of matching keys, so convert to bool
    return bool(client.exists(f'cb:{service_name}:open'))

def trip_breaker(service_name, timeout=30):
    # The key expires after `timeout` seconds, which closes the circuit again
    client.setex(f'cb:{service_name}:open', timeout, 1)
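
One way the helper functions above might be combined into a single call wrapper is sketched below. Treat it as an illustration under assumptions: `guarded_call` and `_record_and_maybe_trip` are hypothetical names, the constants are illustrative, and `client` is any redis-py style client passed in by the caller. The check-then-act sequence is not atomic across instances, and there is no HALF_OPEN state; the circuit simply closes again when the open key expires.

```python
class CircuitOpenError(Exception):
    """Raised when the shared breaker is open and the call is skipped."""


WINDOW = 20        # results kept per service
MIN_SAMPLES = 10   # below this many results the rate is not evaluated
THRESHOLD = 0.5    # assumed failure-rate threshold
OPEN_SECONDS = 30  # lifetime of the open marker


def _record_and_maybe_trip(client, service_name, success):
    results_key = f'cb:{service_name}:results'
    open_key = f'cb:{service_name}:open'

    pipe = client.pipeline()
    pipe.rpush(results_key, 1 if success else 0)
    pipe.ltrim(results_key, -WINDOW, -1)
    pipe.expire(results_key, 60)
    pipe.execute()

    results = client.lrange(results_key, 0, -1)
    if len(results) < MIN_SAMPLES:
        return
    # Values are bytes unless the client was created with decode_responses=True.
    failures = sum(1 for r in results if r in (b'0', '0'))
    if failures / len(results) >= THRESHOLD:
        client.setex(open_key, OPEN_SECONDS, 1)
        client.delete(results_key)  # start from a fresh window once the key expires


def guarded_call(client, service_name, func, *args, **kwargs):
    """Run func through the shared breaker, recording the outcome in Redis."""
    if client.exists(f'cb:{service_name}:open'):
        raise CircuitOpenError(f'Circuit {service_name} is OPEN')
    try:
        result = func(*args, **kwargs)
    except Exception:
        _record_and_maybe_trip(client, service_name, False)
        raise
    _record_and_maybe_trip(client, service_name, True)
    return result
```

Every instance that calls `guarded_call` with the same service name contributes to the same window, so the circuit opens after ten shared results rather than ten per instance. Each call costs a few extra Redis round trips, which is the price of the shared view.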

Key Interview Points

  • Three states are mandatory: HALF_OPEN is essential. Without it, the breaker either stays open forever or returns to fully closed as soon as the timeout expires, sending full traffic to a service that may still be down. Neither is correct.
  • Distinguish circuit open from service error: The caller must handle CircuitOpenError differently from a real downstream error. Circuit open → fallback to cache or degraded response. Service error → may retry.
  • Per-instance vs shared state: In-memory breakers protect each instance but require independent failure accumulation. Redis-backed shared state opens faster but adds a Redis dependency. Choose based on whether protecting the downstream or protecting your own instance is the priority.
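  • Timeout tuning: The recovery timeout (30s in the examples) should be at least as long as the downstream service typically needs to restart or recover; if it is shorter, probes keep hitting a service that is still down. The failure threshold (5 errors in the examples) must be low enough to open before the downstream is overwhelmed, but high enough that a few isolated errors do not trip it.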
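
The second point above, handling an open circuit differently from a real downstream error, can be sketched on the caller side as follows. The names here (`fetch_with_fallback`, the `cache` dict, the `retries` and `backoff` parameters) are hypothetical and chosen for illustration; `call` is assumed to be a zero-argument callable that already goes through a breaker.

```python
import time


class CircuitOpenError(Exception):
    """Raised by the breaker when it rejects a call without trying it."""


def fetch_with_fallback(call, cache, key, retries=2, backoff=0.1):
    """Return a fresh value if possible, else a cached one when the circuit is open."""
    for attempt in range(retries + 1):
        try:
            value = call()
            cache[key] = value  # remember the last good value for later fallbacks
            return value
        except CircuitOpenError:
            # The breaker is open: retrying right away cannot help, so serve
            # a degraded response (None if nothing has been cached yet).
            return cache.get(key)
        except Exception:
            # A real downstream error: retry with exponential backoff,
            # re-raising once the attempts are used up.
            if attempt == retries:
                raise
            time.sleep(backoff * (2 ** attempt))
```

Retries are only attempted for real downstream errors. Once the breaker opens, the caller stops retrying and degrades immediately, which is what keeps load off the failing service.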
