System Design: Circuit Breaker, Retry, and Bulkhead — Resilience Patterns for Microservices

Why Resilience Patterns?

In a microservices architecture, service A calls service B, which calls service C. If C is slow or failing, B’s threads block waiting for C, B’s thread pool fills up, and A’s calls to B start failing too. The cascade continues up the chain: one failing downstream service takes down the entire system. Resilience patterns prevent this cascade. The three key patterns are Circuit Breaker (stop calling a failing service), Retry with backoff (handle transient failures), and Bulkhead (isolate failures to a dedicated resource pool).

Circuit Breaker

A circuit breaker wraps calls to an external service. Three states: CLOSED (normal operation — calls pass through), OPEN (failing — calls are rejected immediately without attempting the downstream call), HALF-OPEN (testing recovery — a limited number of probe requests are allowed through). State transitions: CLOSED → OPEN: when the failure rate exceeds a threshold (e.g., 50% failures in the last 60 seconds, with a minimum of 20 requests). OPEN → HALF-OPEN: after a timeout (e.g., 30 seconds). HALF-OPEN → CLOSED: if probe requests succeed. HALF-OPEN → OPEN: if probe requests fail again.

import time


class CircuitOpenError(Exception):
    """Raised when the circuit is OPEN and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, timeout=30, min_requests=20):
        self.state = "CLOSED"
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.min_requests = min_requests

    def call(self, fn, *args):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "HALF_OPEN"  # timeout elapsed: let a probe request through
            else:
                raise CircuitOpenError("Circuit is open")
        try:
            result = fn(*args)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        if self.state == "HALF_OPEN":
            # Probe succeeded: the downstream service has recovered.
            self.state = "CLOSED"
            self.failure_count = 0
            self.success_count = 0
        self.success_count += 1

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == "HALF_OPEN":
            # Probe failed: trip again and restart the recovery timeout.
            self.state = "OPEN"
            return
        total = self.failure_count + self.success_count
        if total >= self.min_requests and self.failure_count / total >= self.failure_threshold:
            self.state = "OPEN"

Retry with Exponential Backoff and Jitter

Retry handles transient failures (a momentary network blip, a brief service restart). Naive retry: immediately retry on failure. Problem: if the service is overloaded, immediate retries add more load — a thundering herd effect. Exponential backoff: wait 1s, then 2s, then 4s, then 8s (base * 2^attempt). This reduces load on the recovering service. Jitter: add random variance to the backoff — actual_wait = backoff * (0.5 + random() * 0.5). Without jitter, all clients back off by the same amount and retry simultaneously (a synchronized thundering herd); jitter spreads retries out. Retry only on retriable errors (5xx, timeouts). Never retry non-idempotent operations (POST payment) without idempotency keys. Max retries: 3-5. After max retries: fail fast and return an error.
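As a sketch, the backoff loop above might look like this in Python, using the "full jitter" variant (the retriable exception types and default parameters are illustrative choices, not fixed by any library):

```python
import random
import time


def retry_with_backoff(fn, max_retries=4, base=1.0, cap=30.0):
    """Call fn, retrying transient errors with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            # Only retriable errors reach here; anything else propagates immediately.
            if attempt == max_retries:
                raise  # out of retries: fail fast
            # Full jitter: sleep a random duration in [0, min(cap, base * 2^attempt)].
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Note that only the listed transient exceptions are retried; a 4xx-style client error should be allowed to propagate on the first attempt, since retrying it cannot succeed.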

Bulkhead Pattern

A bulkhead isolates failures to a limited resource pool, preventing a failing call from consuming all shared resources. Thread pool bulkhead: assign a dedicated thread pool for each downstream service. Service A’s calls to Service B use Pool B (10 threads). Calls to Service C use Pool C (5 threads). If Service B hangs and all 10 threads are blocked, Service C calls still work (using Pool C). Without bulkheads, a single hung downstream service consumes all threads in the shared pool, blocking all other calls. Semaphore bulkhead: limit concurrent calls to a downstream service to N. If N are already in flight, reject new calls with a fallback response. Hystrix (Netflix) popularized bulkheads; modern alternatives: Resilience4j (Java), Polly (.NET), Envoy proxy (service mesh level).
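A minimal semaphore-bulkhead sketch, assuming a thread-based caller; the class name and the reject-with-fallback behavior are illustrative:

```python
import threading


class SemaphoreBulkhead:
    """Cap concurrent in-flight calls to one downstream service at max_concurrent."""

    def __init__(self, max_concurrent):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args, fallback=None):
        # Non-blocking acquire: if the pool is saturated, reject immediately with
        # the fallback instead of queuing and blocking the caller's thread.
        if not self._sem.acquire(blocking=False):
            return fallback
        try:
            return fn(*args)
        finally:
            self._sem.release()
```

Each downstream dependency would get its own bulkhead instance, so a hung dependency can only exhaust its own permit count.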

Timeout

Every external call must have a timeout. Without a timeout: a hung downstream call blocks a thread indefinitely. Timeout values: P99 response time of the downstream service * 1.5. Too short: too many false timeouts. Too long: resources blocked for too long on real failures. Cascading timeouts: if A calls B which calls C, set A→B timeout > B→C timeout. This ensures B can respond to A with an error before A’s timer fires. Async timeouts: for async operations (message queue consumers), use a deadline propagated via message headers. If the message is processed after the deadline, discard the result.
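One way to enforce a client-side timeout in Python is to run the call in a worker thread and bound the wait on its result. This is a sketch of that idea, not a full deadline-propagation implementation; note that a timed-out call's thread keeps running until the underlying operation finishes, which is exactly why timeouts pair with circuit breakers and bulkheads:

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout, *args):
    """Run fn in a worker thread and give up after `timeout` seconds."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; an already-running thread is not interrupted
        raise TimeoutError(f"call exceeded {timeout}s")
```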

Fallback Strategies

When a circuit is open or a call times out, a fallback provides a degraded but functional response: return cached data (last successful response for this request), return a default value (empty list, zero count), serve a static response (pre-rendered HTML fallback page), or queue the request for later processing. Fallback selection depends on the use case: a product recommendation fallback might return bestsellers instead of personalized recommendations. A payment service fallback should never be a silent no-op — fail loudly. Design for partial functionality: if the review service is down, show the product page without reviews rather than showing an error page for the entire site.
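A sketch of the cached-data fallback described above, using an in-memory cache; the class name, TTL, and key scheme are illustrative assumptions:

```python
import time


class CachingFallback:
    """Serve the last successful response when the primary call fails."""

    def __init__(self, ttl=300):
        self._cache = {}  # key -> (value, stored_at)
        self._ttl = ttl

    def call(self, key, fn, default=None):
        try:
            value = fn()
            self._cache[key] = (value, time.time())  # refresh cache on success
            return value
        except Exception:
            # Primary failed: serve the cached value if still fresh,
            # otherwise fall back to the caller-supplied default.
            cached = self._cache.get(key)
            if cached and time.time() - cached[1] < self._ttl:
                return cached[0]
            return default
```

In production the failure would also be logged and the fallback rate tracked as a metric, rather than swallowed silently as the bare `except` here suggests.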

FAQ

Q: What are the three states of a circuit breaker, and when does it transition between them?

A: CLOSED (normal): all requests pass through to the downstream service and failures are counted. When the failure rate exceeds the threshold (e.g., 50% failures over the last 60 seconds with a minimum of 20 requests), the breaker transitions to OPEN. OPEN (tripped): all requests are rejected immediately without calling the downstream service, and a fallback is returned. After a configured timeout (e.g., 30 seconds), the breaker transitions to HALF-OPEN. HALF-OPEN (probe): a limited number of test requests are allowed through. If they succeed, the breaker transitions back to CLOSED (service recovered); if they fail, it transitions back to OPEN (still failing) and the timeout resets. The minimum request threshold prevents the circuit from opening on the first few failures when traffic has just started.

Q: Why do you need jitter in exponential backoff?

A: Without jitter, when a service fails all clients back off for the same duration (1s, then 2s, then 4s). When the backoff timer expires, all clients retry simultaneously — a synchronized thundering herd that may overwhelm the recovering service just as it comes back up. With jitter (randomized backoff), each client waits a slightly different duration — actual_wait = base_backoff * random(0.5, 1.5). Retries are spread out over time, reducing the peak load on the recovering service. Full jitter: sleep = random(0, min(cap, base * 2^attempt)). Equal jitter: sleep = min(cap, base * 2^attempt) / 2 + random(0, min(cap, base * 2^attempt) / 2). AWS recommends full jitter for most use cases. The goal is to spread client retries across the recovery window rather than concentrating them at fixed intervals.

Q: How does the bulkhead pattern prevent cascading failures?

A: Without bulkheads, all microservice calls share a single thread pool. If service B becomes slow, threads in the shared pool block waiting for B, the pool fills up, and service C calls also block (no threads available) even though C is healthy — the failure cascades. With thread pool bulkheads, service B calls use a dedicated pool of 10 threads and service C calls use a separate pool of 5 threads. When B is slow and all 10 B-threads are blocked, C calls still have their 5 threads and continue normally; the failure is contained to the B pool. Each downstream dependency gets its own isolated resource pool. Size each pool based on the expected concurrency for that dependency (target calls per second * response time). Semaphore bulkheads are lighter-weight — they limit concurrent in-flight calls without separate thread pools (suitable when the calls are non-blocking/async).

Q: What is the difference between a timeout and a circuit breaker?

A: Timeouts are defensive: they prevent a single call from blocking indefinitely. A timeout fires after a fixed duration and releases the thread. But timeouts alone don't prevent the next call from also blocking for that duration: if 100 requests/second are hitting a hung downstream service, 100 threads are blocked for the timeout duration and the pool exhausts. Circuit breakers are offensive: once the failure rate is high, they stop making calls at all. The circuit opens, and subsequent calls fail immediately (sub-millisecond) without consuming a thread or waiting. Timeouts protect individual calls; circuit breakers protect the system from repeatedly hitting a failing dependency. Use both together: timeouts feed the circuit breaker's failure metrics (a timeout counts as a failure for circuit breaker purposes).

Q: How do you implement a fallback strategy when a circuit is open?

A: The fallback is the response returned when the circuit is open or a call fails after retries. The right fallback depends on the downstream service: a cached response (the last successful response for this request, stored in Redis or an in-memory cache with a TTL), a default value (empty recommendations list, zero inventory count), a degraded response (a simplified version of the data from an alternative source), or fail fast (re-throw a specific error that the caller handles gracefully — show a friendly error page rather than a spinner). Design for partial functionality: the product page can render without reviews (show "reviews temporarily unavailable"), but the checkout page cannot render without inventory data — fail clearly. Never silently swallow errors in a fallback — log the circuit-open event and track the fallback rate as a metric.
