Low Level Design: Circuit Breaker Pattern

State Machine

The circuit breaker is a three-state machine wrapped around each outbound call to a downstream dependency. In the CLOSED state the circuit operates normally: every call passes through, outcomes are recorded, and the failure detector monitors the error rate. When the failure rate crosses the configured threshold, the circuit transitions to OPEN. In the OPEN state every call fails immediately without invoking the downstream service, returning a fallback response; this fail-fast behavior prevents the caller’s threads from blocking on a service that is known to be degraded and stops amplifying load on an already-struggling dependency. After a configurable cooldown period (e.g., 30 seconds), the circuit transitions to HALF-OPEN and allows a limited probe of traffic through. If the probe succeeds, the circuit returns to CLOSED and normal operation resumes. If the probe fails, the circuit returns to OPEN and the cooldown timer resets. This state machine is per-dependency: a service that calls three downstream systems has three independent circuit breakers, so a failure in one dependency does not affect the others.
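The state machine above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration (class and method names are ours, not from any particular library); a production implementation would add locking, metrics, and a richer failure detector.

```python
# Minimal sketch of the three-state circuit breaker described above.
import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.state = CLOSED
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == OPEN:
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = HALF_OPEN   # cooldown expired: allow a probe
            else:
                return fallback()        # fail fast while OPEN
        try:
            result = fn()
        except Exception:
            self._on_failure()
            return fallback()
        self._on_success()
        return result

    def _on_success(self):
        if self.state == HALF_OPEN:
            self.state = CLOSED          # probe succeeded: resume normal flow
        self.failure_count = 0

    def _on_failure(self):
        if self.state == HALF_OPEN:
            self._trip()                 # probe failed: reopen, reset timer
            return
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = OPEN
        self.opened_at = time.monotonic()
```

Because the breaker is per-dependency, a service calling three downstream systems would hold three independent `CircuitBreaker` instances.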

Failure Detection

Accurate failure detection requires distinguishing genuine downstream degradation from transient noise. Count-based windows trip the circuit when N failures occur within the last M requests (e.g., 5 failures in 20 requests = 25% error rate); this is simple but sensitive to traffic volume — at low traffic, a single failure is a high percentage. Time-based sliding windows accumulate outcomes over the last N seconds (e.g., last 60 seconds) and compute error rate over that window, smoothing over traffic spikes. The failure counter tracks distinct failure types separately: exception count (connection refused, connection reset), timeout count (calls that exceeded the configured timeout), and slow call count (calls that returned a response but exceeded a latency threshold, indicating partial degradation). Each counter has its own configurable threshold; any single threshold breach can trip the circuit, or a combined formula can be used. Minimum call volume requirements prevent tripping during low-traffic periods when a single error would produce a misleading 100% error rate.
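A time-based sliding window with a minimum-call-volume guard can be sketched as follows (the class name, outcome kinds, and parameter defaults are illustrative assumptions; separate thresholds per failure type are omitted for brevity):

```python
# Sketch of a time-based sliding-window failure detector with a
# minimum call volume requirement, as described above.
import time
from collections import deque

class SlidingWindowDetector:
    def __init__(self, window_seconds=60.0, error_rate_threshold=0.5,
                 min_calls=10):
        self.window_seconds = window_seconds
        self.error_rate_threshold = error_rate_threshold
        self.min_calls = min_calls
        # (timestamp, kind) where kind is "ok", "exception", "timeout", or "slow"
        self.outcomes = deque()

    def record(self, kind, now=None):
        now = time.monotonic() if now is None else now
        self.outcomes.append((now, kind))
        self._evict(now)

    def _evict(self, now):
        # Drop outcomes that have aged out of the window.
        while self.outcomes and now - self.outcomes[0][0] > self.window_seconds:
            self.outcomes.popleft()

    def should_trip(self, now=None):
        now = time.monotonic() if now is None else now
        self._evict(now)
        total = len(self.outcomes)
        if total < self.min_calls:
            return False   # low traffic: a single error would be misleading
        bad = sum(1 for _, kind in self.outcomes if kind != "ok")
        return bad / total >= self.error_rate_threshold
```

A real detector would additionally keep per-type counters (exception, timeout, slow call) with their own thresholds.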

Timeout Handling

Every call wrapped by the circuit breaker is subject to a timeout enforced at the circuit breaker layer, independent of any timeout configured in the HTTP client or RPC framework. The circuit breaker timeout should be shorter than the upstream caller’s SLA budget for this dependency, ensuring the circuit detects degradation before cascading timeouts propagate up the call chain. A downstream service that begins responding slowly (e.g., 2-second p99 instead of 50 ms) will accumulate timeout failures in the circuit breaker’s sliding window and trip the circuit before the caller begins missing its own SLA. Timeout budget tracking across a distributed request chain uses deadline propagation: each downstream call receives a deadline derived from the remaining budget in the incoming request, preventing a slow leaf service from consuming the entire request timeout. Calls that exceed the timeout are counted as failures regardless of whether the downstream eventually responds; the circuit breaker cancels the call and records the timeout.
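The deadline-propagation idea can be illustrated with a small helper (function names and the `do_call(timeout=...)` shape are assumptions for the sketch): each hop caps its downstream timeout at the smaller of its configured per-call timeout and the remaining request-wide budget.

```python
# Sketch of deadline propagation: each downstream call is capped at
# min(configured per-call timeout, remaining request budget).
import time

def remaining_budget(deadline):
    """Seconds left before the request-wide deadline; <= 0 means expired."""
    return deadline - time.monotonic()

def call_downstream(deadline, per_call_timeout, do_call):
    budget = remaining_budget(deadline)
    if budget <= 0:
        # Fail fast rather than issuing a call that cannot complete in time.
        raise TimeoutError("request deadline already exceeded")
    timeout = min(per_call_timeout, budget)
    return do_call(timeout=timeout)
```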

Fallback Responses

When the circuit is OPEN, the circuit breaker executes a fallback function instead of calling the downstream service. The fallback must be fast, must not itself call the failing service, and must provide a response that allows the system to degrade gracefully rather than return an error to the user. Common fallback strategies: return the last cached successful response for the same request parameters (stale data is often acceptable for read-heavy paths); return a static default value configured at deploy time (e.g., an empty recommendations list, a default price, a cached homepage); return a structured error response that includes a retry-after hint so the client can back off intelligently. The fallback function is registered at circuit breaker initialization and can itself be versioned and hot-reloaded. Complex fallbacks may call a secondary "degraded mode" service (e.g., a simpler recommendation engine that does not rely on the failing personalization service). Fallback invocation count is tracked as a separate metric so engineers can observe how often the system is operating in degraded mode.
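A stale-if-error fallback, the first strategy listed above, might look like this (class and attribute names are illustrative; a real cache would have TTLs and bounded size):

```python
# Sketch of a stale-if-error fallback: serve the last cached successful
# response for the same request key, else a static default.
class CachedFallback:
    def __init__(self):
        self._cache = {}                # request key -> last good response
        self.fallback_invocations = 0   # tracked as a separate metric

    def on_success(self, key, response):
        # Called on every successful downstream response to refresh the cache.
        self._cache[key] = response

    def __call__(self, key, default=None):
        # Invoked by the breaker when the circuit is OPEN; must be fast
        # and must not call the failing service itself.
        self.fallback_invocations += 1
        return self._cache.get(key, default)
```

Tracking `fallback_invocations` separately lets engineers observe how often the system is running in degraded mode.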

Half-Open Probing

The HALF-OPEN state implements a controlled recovery probe to determine whether the downstream service has recovered without allowing a full flood of traffic to hit it simultaneously. After the cooldown duration expires, the circuit allows exactly one request through; all other concurrent requests during this probe period still receive the fallback response. If the probe request succeeds within the timeout, the circuit transitions to CLOSED, resets all failure counters, and resumes normal traffic flow. If the probe fails, the circuit transitions back to OPEN and the cooldown timer resets. This prevents the thundering herd problem: when a downstream service recovers, a circuit that transitioned straight to CLOSED would release the backlog of waiting callers as a burst of requests that could re-overwhelm the just-recovered service. Some implementations allow a configurable probe volume (e.g., 5 probe requests in HALF-OPEN) and require all probes to succeed before closing, providing more confidence in recovery before full traffic resumes.
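The configurable-probe-volume variant can be sketched as a small permit counter (names are illustrative; a thread-safe version would guard the counters with a lock):

```python
# Sketch of HALF-OPEN probing with a configurable probe volume;
# all probes must succeed before the circuit closes.
class HalfOpenProbe:
    def __init__(self, probe_volume=5):
        self.probe_volume = probe_volume
        self.permits = probe_volume   # probe requests still allowed through
        self.successes = 0

    def try_acquire(self):
        """True if this request may probe; excess concurrent callers
        get the fallback, preventing a thundering herd."""
        if self.permits > 0:
            self.permits -= 1
            return True
        return False

    def on_result(self, success):
        """Returns the next state ('CLOSED' or 'OPEN') or None while probing."""
        if not success:
            return "OPEN"             # any probe failure reopens immediately
        self.successes += 1
        if self.successes >= self.probe_volume:
            return "CLOSED"           # all probes succeeded: full traffic resumes
        return None
```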

Bulkhead Pattern

The bulkhead pattern complements the circuit breaker by isolating the resource pools used to call each downstream dependency, preventing a slow or failing dependency from exhausting shared resources and impacting calls to healthy dependencies. Thread pool bulkheads allocate a dedicated thread pool for each downstream service; if calls to service A are all blocking on slow responses, the thread pool for A fills up and new calls are rejected immediately, but the thread pools for services B and C are unaffected. Semaphore bulkheads limit the maximum concurrency of in-flight calls to a dependency without creating a dedicated thread pool, which is more efficient for non-blocking I/O frameworks. Pool size should be tuned to match the downstream service’s concurrency limit: a service that can handle 100 concurrent requests should be called through a pool of 100 or fewer threads or permits. When the pool or semaphore is saturated, the rejection triggers the fallback immediately rather than queuing, since queuing would only delay the failure. Bulkhead rejection count is a leading indicator of approaching circuit breaker trips.
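A semaphore bulkhead with reject-instead-of-queue semantics can be sketched as follows (class name and metric attribute are illustrative):

```python
# Sketch of a semaphore bulkhead: bounded concurrency per dependency,
# rejecting immediately (not queuing) when saturated.
import threading

class SemaphoreBulkhead:
    def __init__(self, max_concurrent=100):
        self._sem = threading.Semaphore(max_concurrent)
        self.rejections = 0   # leading indicator of approaching circuit trips

    def execute(self, fn, fallback):
        # Non-blocking acquire: a saturated bulkhead triggers the
        # fallback immediately rather than queuing the caller.
        if not self._sem.acquire(blocking=False):
            self.rejections += 1
            return fallback()
        try:
            return fn()
        finally:
            self._sem.release()
```

Each downstream dependency gets its own `SemaphoreBulkhead` instance, so saturation on one dependency never consumes permits belonging to another.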

Metrics and Observability

Circuit breaker health is exposed through a rich set of metrics that feed dashboards and alerting. Current state (CLOSED=0, HALF-OPEN=1, OPEN=2) is exported as a gauge per dependency; alerts fire when any critical dependency circuit opens. State transition events are emitted as structured log lines and as discrete metric increments, enabling timeline reconstruction during incident review. Failure rate and slow call rate histograms show the distribution of outcomes over time. Latency percentiles (p50, p95, p99) for calls in CLOSED state reveal degradation before the circuit trips. Fallback invocation rate indicates how often callers are receiving degraded responses even when the circuit is CLOSED (due to individual call failures below the trip threshold). A dependency health dashboard aggregates circuit state, error rate, and latency for all downstream calls in a service, providing a single-pane view for on-call engineers. Circuit breaker state transitions generate PagerDuty alerts for high-severity dependencies and Slack notifications for lower-severity ones, with the dependency name and failure reason in the alert payload.
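The state gauge and transition events described above can be sketched with a plain in-process registry (a real system would export these through a Prometheus client or similar; the class and field names here are assumptions):

```python
# Sketch of circuit breaker metrics: state exported as a numeric gauge
# (CLOSED=0, HALF-OPEN=1, OPEN=2) plus structured transition events.
STATE_VALUES = {"CLOSED": 0, "HALF_OPEN": 1, "OPEN": 2}

class BreakerMetrics:
    def __init__(self, dependency):
        self.dependency = dependency
        self.state_gauge = STATE_VALUES["CLOSED"]
        self.transitions = []           # structured events for timeline reconstruction
        self.fallback_invocations = 0   # degraded-mode indicator, even while CLOSED

    def on_transition(self, old_state, new_state, reason):
        self.state_gauge = STATE_VALUES[new_state]
        # Emitted both as a log line and a metric increment in a real system;
        # the dependency name and failure reason go into the alert payload.
        self.transitions.append({
            "dependency": self.dependency,
            "from": old_state,
            "to": new_state,
            "reason": reason,
        })

    def on_fallback(self):
        self.fallback_invocations += 1
```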

Service Mesh Integration

Envoy proxy implements outlier detection, a passive circuit breaker that operates at the load balancing layer rather than the application layer. Envoy tracks 5xx response rates and request latency per upstream host; when a host exceeds the consecutive 5xx threshold (configurable, default 5), it is ejected from the load balancing set for a base ejection interval (default 30 seconds, doubling on repeated ejections). The ejection percentage cap (default 10%) ensures that at most 10% of hosts in a cluster can be ejected simultaneously, preventing a correlated failure from ejecting all hosts and making the service unreachable. Envoy’s outlier detection is passive (it observes real traffic outcomes) and complements active application-level circuit breakers that can trip on timeout or custom exception types. Using both layers provides defense in depth: Envoy ejection handles bad hosts within a healthy cluster, while the application circuit breaker handles whole-service failures. Envoy reports ejection events to Prometheus via /stats, and these metrics feed the same observability stack as application-level circuit breaker metrics.
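As a rough illustration, the outlier detection settings described above map onto an Envoy cluster configuration fragment like this (values mirror the defaults mentioned; surrounding cluster fields are omitted):

```yaml
# Illustrative Envoy cluster fragment for passive outlier detection.
clusters:
  - name: downstream_service
    outlier_detection:
      consecutive_5xx: 5          # eject a host after 5 consecutive 5xx responses
      base_ejection_time: 30s     # base ejection interval; doubles on repeats
      max_ejection_percent: 10    # at most 10% of hosts ejected simultaneously
      interval: 10s               # how often ejection analysis sweeps run
```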
