System Design: Microservices Communication — Sync vs Async, Service Mesh, Circuit Breaker, Retry, Timeout, Backpressure

How microservices communicate with each other determines the reliability, latency, and coupling of the overall system. This guide covers the communication patterns that production microservice architectures use — synchronous HTTP/gRPC, asynchronous messaging, service mesh, and resilience patterns like circuit breakers, retries, and backpressure. Essential knowledge for system design interviews and real-world architecture decisions.

Synchronous Communication: HTTP and gRPC

Synchronous communication: the caller waits for the callee to respond before proceeding.

HTTP REST: the standard for external-facing APIs. JSON payload, human-readable, well-supported by every language and framework. Latency: serialization + network round-trip + deserialization. JSON parsing overhead is non-trivial at high throughput.

gRPC: the standard for internal service-to-service communication at high-performance shops. Protocol Buffers (binary serialization) are 5-10x smaller and faster to serialize than JSON. HTTP/2 transport enables multiplexing (multiple requests on one TCP connection), header compression, and streaming. Code generation from .proto files provides type-safe client and server stubs in any language.

When to use synchronous: the caller needs the response to continue its work (e.g., the checkout service needs the payment service to confirm the charge before confirming the order).

Downside: temporal coupling — if the callee is slow or down, the caller is also slow or fails. All services in the synchronous call chain must be available simultaneously.
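To see why a binary, schema-driven encoding is so much more compact than JSON, compare the same record serialized both ways. This sketch uses Python's stdlib struct module as a stand-in for Protocol Buffers (it is not the protobuf wire format, and the record fields are made up): with a fixed binary layout, field names live in the schema rather than in every payload.

```python
import json
import struct

# A payment-style record, as one service might send to another (fields are illustrative).
record = {"order_id": 1234567, "amount_cents": 4999, "currency": "USD", "retries": 0}

# JSON: human-readable text; field names are repeated in every message.
json_bytes = json.dumps(record).encode("utf-8")

# Fixed binary layout (a stand-in for Protobuf-style encoding): only the
# values are transmitted. Layout: u32 order_id, u32 amount_cents,
# 3-byte currency code, u8 retry count. "!" = network byte order, no padding.
binary_bytes = struct.pack(
    "!II3sB",
    record["order_id"],
    record["amount_cents"],
    record["currency"].encode("ascii"),
    record["retries"],
)

print(len(json_bytes), len(binary_bytes))  # the binary form is several times smaller
```

Real Protocol Buffers adds field tags and varint encoding, but the size ratio is in the same ballpark, which is where the "5-10x smaller" figure comes from.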

Asynchronous Communication: Message Queues and Event Streams

Asynchronous communication: the producer sends a message and does not wait for a response. The consumer processes it later.

Message queues (RabbitMQ, Amazon SQS): point-to-point delivery. A message is consumed by exactly one consumer. Use for task distribution: a queue of image resize jobs consumed by a pool of workers. The queue provides load leveling — producers can spike without overwhelming consumers.

Event streams (Kafka): publish-subscribe. An event is delivered to all subscribed consumer groups. Use for event-driven architectures where multiple services need to react to the same event. Kafka retains events for replay.

When to use async: (1) The caller does not need an immediate response (an order confirmation email can be sent asynchronously). (2) Load leveling — absorb traffic spikes without scaling the backend. (3) Decoupling — services do not need to know about each other. (4) Reliability — messages persist in the queue even if the consumer is temporarily down.

Downside: complexity (message ordering, exactly-once processing, dead letter queues), debugging difficulty (request flows are not visible in a single call chain), and eventual consistency.
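The point-to-point vs publish-subscribe distinction can be made concrete with a toy in-memory sketch (class and group names here are illustrative, not any broker's API): a queue hands each message to exactly one consumer, while a topic fans each event out to every subscriber group.

```python
from collections import defaultdict, deque

class PointToPointQueue:
    """Queue semantics (RabbitMQ/SQS-style): each message reaches ONE consumer."""
    def __init__(self):
        self._messages = deque()

    def send(self, msg):
        self._messages.append(msg)

    def receive(self):
        return self._messages.popleft() if self._messages else None

class Topic:
    """Pub-sub semantics (Kafka-style): each event reaches EVERY subscriber group."""
    def __init__(self):
        self._groups = defaultdict(deque)

    def subscribe(self, group):
        self._groups[group]  # create the group's buffer if it doesn't exist

    def publish(self, event):
        for buffer in self._groups.values():
            buffer.append(event)

    def poll(self, group):
        buffer = self._groups[group]
        return buffer.popleft() if buffer else None

# Task distribution: two workers share one queue; each job is processed once.
jobs = PointToPointQueue()
jobs.send("resize image 1")
jobs.send("resize image 2")
worker_a, worker_b = jobs.receive(), jobs.receive()

# Event fan-out: billing and email both react to the same order event.
orders = Topic()
orders.subscribe("billing")
orders.subscribe("email")
orders.publish("order 42 placed")
```

Real brokers add durability, acknowledgements, and ordering guarantees on top of these semantics, but the routing behavior is the core difference.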

Circuit Breaker Pattern

A circuit breaker prevents cascading failures when a downstream service is unhealthy. Without it: if the payment service is down, the checkout service continues sending requests, exhausting its connection pool and thread pool, causing it to fail too — cascading to all upstream services.

Circuit breaker states:

(1) Closed (normal) — requests flow through. The circuit breaker monitors the error rate.

(2) Open (tripped) — when the error rate exceeds a threshold (e.g., 50% of recent requests failed), the circuit opens. All subsequent requests immediately fail with a fallback response (cached data, default value, or error) without contacting the downstream service. This gives the downstream service time to recover and prevents resource exhaustion on the caller.

(3) Half-open (probe) — after a timeout (e.g., 30 seconds), the circuit allows one request through. If it succeeds, the circuit closes (the service has recovered). If it fails, the circuit remains open for another timeout period.

Implementation: Resilience4j (Java), Polly (.NET), or Istio/Envoy at the service mesh layer. Envoy circuit breaking is transparent to the application — configured via an Istio DestinationRule.
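The three-state machine fits in a few dozen lines. This is a minimal sketch, not Resilience4j or Polly: it trips on a consecutive-failure count rather than a sliding-window error rate (for brevity), and the clock is injectable so the recovery timeout can be tested without sleeping.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open on repeated failures,
    open -> half-open after a recovery timeout, half-open -> closed on success."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock            # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"   # allow one probe request through
            else:
                return fallback()          # fail fast; don't touch the callee
        try:
            result = fn()
        except Exception:
            self._on_failure()
            return fallback()
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"            # trip (or re-trip after a failed probe)
            self.opened_at = self.clock()

    def _on_success(self):
        self.state = "closed"
        self.failures = 0
```

The key property: once open, `call` returns the fallback immediately, so the caller's threads and connections are not tied up waiting on a dead dependency.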

Retry and Timeout Strategies

Retries handle transient failures (network blips, temporary service unavailability).

Retry strategies:

(1) Exponential backoff — wait 100ms, 200ms, 400ms, 800ms between retries. Prevents a thundering herd (all clients retrying simultaneously).

(2) Jitter — add random jitter to the backoff: wait = base_delay * 2^attempt + random(0, base_delay). Without jitter, synchronized retries create periodic load spikes.

(3) Retry budget — limit retries to, e.g., 20% of total requests. If 1,000 requests are sent per second, at most 200 can be retries. This prevents retry amplification (a 50% failure rate with unlimited retries doubles the load on the failing service).

Timeouts: every synchronous call must have a timeout. Without timeouts, a slow downstream service consumes caller resources indefinitely.

Timeout hierarchy: connection timeout (500ms — how long to wait for a TCP connection), request timeout (2-5 seconds — how long to wait for a response), and overall operation timeout (10-30 seconds — how long the user-facing request has to complete, including all downstream calls). Set timeouts at each layer. The downstream timeouts should sum to less than the overall timeout, so the service can still return a graceful, degraded response before its own deadline expires.
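The backoff formula and the retry budget above can be sketched directly (the class and function names are illustrative, not any library's API):

```python
import random

def backoff_with_jitter(attempt, base_delay=0.1, max_delay=10.0):
    """wait = base_delay * 2^attempt + random(0, base_delay), capped at max_delay.
    attempt=0..3 with base_delay=0.1 gives ~100ms, 200ms, 400ms, 800ms plus jitter."""
    return min(max_delay, base_delay * 2 ** attempt + random.uniform(0, base_delay))

class RetryBudget:
    """Allow retries only while they stay under a fixed fraction of all requests."""
    def __init__(self, ratio=0.2):
        self.ratio = ratio
        self.requests = 0   # total requests sent (including retries)
        self.retries = 0    # retries among them

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        return self.retries < self.ratio * self.requests

    def record_retry(self):
        self.retries += 1
        self.requests += 1  # a retry is also a request on the wire

# Example: the first four backoff delays, before jitter, double each time.
delays = [backoff_with_jitter(a) for a in range(4)]
```

With `ratio=0.2`, a downstream outage can increase load on the failing service by at most 20%, instead of the unbounded amplification that naive retries produce.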

Service Mesh: Istio and Envoy

A service mesh provides communication infrastructure (traffic management, security, observability) without application code changes.

Architecture: a sidecar proxy (Envoy) runs alongside each application container. All inbound and outbound traffic flows through the sidecar. The control plane (Istio, Linkerd) configures the sidecars.

Capabilities:

(1) Traffic management — circuit breaking, retries, timeouts, rate limiting, canary routing (send 5% of traffic to v2), and header-based routing. Configured declaratively via Istio VirtualService and DestinationRule CRDs.

(2) Security — mutual TLS (mTLS) between all services, automatically. The sidecar handles certificate provisioning, rotation, and the TLS handshake. Zero application code changes. Authorization policies control which services can communicate.

(3) Observability — Envoy automatically generates metrics (request rate, error rate, latency) for every service-to-service call. Distributed tracing headers are propagated automatically. Service dependency graphs are generated from traffic data.

Trade-off: a service mesh adds latency (1-3ms per hop through the sidecar) and operational complexity (Istio control plane management, debugging proxy issues).
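As a sketch of what the declarative configuration looks like, here is roughly how a DestinationRule might enable Envoy-level circuit breaking for a payments service. The resource name, host, and thresholds are made up for illustration; consult the Istio reference for the exact fields in your version.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-circuit-breaker        # hypothetical name
spec:
  host: payments.prod.svc.cluster.local # hypothetical service host
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100             # cap concurrent TCP connections
      http:
        http1MaxPendingRequests: 50     # cap queued requests
        maxRequestsPerConnection: 10
    outlierDetection:                   # Envoy's passive health checking
      consecutive5xxErrors: 5           # eject a host after 5 consecutive 5xx responses
      interval: 10s                     # how often hosts are evaluated
      baseEjectionTime: 30s             # how long an ejected host stays out
      maxEjectionPercent: 50            # never eject more than half the hosts
```

The application code never sees this policy: the sidecar enforces connection limits and ejects unhealthy backends on its own, which is what "transparent to the application" means in practice.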

Backpressure and Load Shedding

Backpressure: when a consumer cannot keep up with the producer, it signals the producer to slow down. Without backpressure, the consumer queue grows without bound, leading to out-of-memory failures or extreme latency.

Backpressure mechanisms:

(1) Bounded queues — set a maximum queue size. When the queue is full, the producer is blocked or the message is rejected. The producer either slows down or drops low-priority messages.

(2) Rate limiting — the consumer advertises its processing rate. The producer sends at most that rate.

(3) Reactive streams (Project Reactor, RxJava) — the consumer requests N items at a time. The producer sends at most N items, then waits for the consumer to request more.

Load shedding: when a service is overloaded, deliberately reject low-priority requests to protect high-priority ones. Implementation: categorize requests by priority (health checks > paid users > free users). When CPU exceeds 80% or queue depth exceeds a threshold, reject the lowest-priority requests with 503 Service Unavailable and a Retry-After header. Shed load early (at the load balancer or API gateway) rather than deep in the service stack, where resources are already consumed.
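Both ideas can be sketched with the standard library. The bounded queue shows backpressure mechanism (1): a full buffer rejects the producer's message instead of growing without bound. The `should_shed` helper (a hypothetical function; the priority classes and 80% CPU threshold follow the text above) shows priority-based load shedding.

```python
import queue

# Backpressure via a bounded queue: when the buffer is full, put_nowait
# raises queue.Full and the producer must slow down or drop the message.
jobs = queue.Queue(maxsize=2)
jobs.put_nowait("job-1")
jobs.put_nowait("job-2")
try:
    jobs.put_nowait("job-3")   # buffer full: this is the backpressure signal
    accepted = True
except queue.Full:
    accepted = False

# Load shedding by priority: under overload, reject the lowest-priority
# class first (health checks > paid users > free users, per the text above).
PRIORITY = {"health_check": 0, "paid_user": 1, "free_user": 2}  # lower = more important

def should_shed(request_class, cpu_utilization, cpu_threshold=0.8):
    """Return True if this request should be rejected with 503 + Retry-After."""
    if cpu_utilization < cpu_threshold:
        return False               # not overloaded: serve everything
    # Overloaded: shed only the lowest-priority class.
    return PRIORITY[request_class] == max(PRIORITY.values())
```

A production shedder would shed progressively (more classes as overload deepens) and measure overload from queue depth as well as CPU, but the shape is the same: an explicit, early rejection decision instead of silent queue growth.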
