What Is a Service Mesh?
A service mesh is a dedicated infrastructure layer for managing service-to-service communication in a microservices architecture. Instead of baking communication concerns (retries, timeouts, circuit breaking, mTLS, tracing) into each service's code, a service mesh moves them into a sidecar proxy that runs alongside each service pod. Services talk to each other through these proxies, so network behavior is configured centrally without changing application code.
Architecture: Control Plane and Data Plane
Data plane: sidecar proxies (Envoy) running alongside each service pod. All inbound and outbound network traffic passes through the sidecar. The sidecar enforces policies (mTLS, retries, rate limits) and collects telemetry (request counts, latency histograms, trace spans).
Control plane: the management component (Istio’s istiod) that configures all the sidecar proxies. Operators apply high-level configuration (VirtualService, DestinationRule) to the control plane via Kubernetes CRDs; the control plane translates these into xDS (Envoy’s API) configurations and pushes them to all proxies in real time. No proxy restart needed — configuration changes propagate in seconds.
# Each pod gets an Envoy sidecar injected via a namespace label:
metadata:
  labels:
    istio-injection: enabled  # Istio auto-injects Envoy into all pods in this namespace

# Pod structure after injection:
+----pod-----------------+
| app-container  :8080   |  <- your service
| envoy-proxy    :15001  |  <- sidecar intercepts all traffic
+------------------------+
Mutual TLS (mTLS)
Without a service mesh, service-to-service communication is often plain HTTP inside the cluster, so any compromised service can eavesdrop on or spoof other services. Istio automatically provides mTLS for all service communication: each sidecar holds a short-lived X.509 certificate issued by Istio's certificate authority, carrying a SPIFFE identity derived from the pod's Kubernetes service account. When Service A calls Service B, the two Envoys perform a mutual TLS handshake and each verifies the other's identity. The application code sees plain HTTP; the sidecar transparently handles TLS. This implements zero-trust networking: every service must prove its identity on every connection, regardless of which network segment it's on.
# Enforce strict mTLS in a namespace:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT  # reject plaintext traffic entirely
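Once workload identities exist, the mesh can also enforce authorization: which identities may call which services. A hedged sketch using Istio's AuthorizationPolicy CRD (the service names and namespace here are illustrative):

```yaml
# Allow only the "orders" service account to call payment-service:
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payment-allow-orders
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-service
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE principal = trust domain / namespace / service account
        principals: ["cluster.local/ns/production/sa/orders"]
```

With an ALLOW policy in place, requests from any other identity to payment-service are rejected by the sidecar, again with no application code changes.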
Traffic Management
Canary Deployments
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: product-service
spec:
  hosts:
  - product-service
  http:
  - route:
    - destination:
        host: product-service
        subset: v2
      weight: 10   # 10% to new version
    - destination:
        host: product-service
        subset: v1
      weight: 90   # 90% to stable
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: product-service
spec:
  host: product-service
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
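Besides percentage splits, the same VirtualService mechanism supports header-based routing, which is useful for sending internal testers to the canary before opening it to real traffic. A sketch (the header name is illustrative):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: product-service
spec:
  hosts:
  - product-service
  http:
  - match:
    - headers:
        x-canary-user:       # illustrative header set by an internal gateway
          exact: "true"
    route:
    - destination:
        host: product-service
        subset: v2           # testers hit the canary
  - route:
    - destination:
        host: product-service
        subset: v1           # everyone else stays on stable
```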
Fault Injection for Chaos Engineering
# Delay 10% of requests by 100ms and abort 5% with a 503:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: product-service
spec:
  hosts:
  - product-service
  http:
  - fault:
      delay:
        percentage:
          value: 10.0
        fixedDelay: 100ms
      abort:
        percentage:
          value: 5.0
        httpStatus: 503
    route:
    - destination:
        host: product-service
Circuit Breaking
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5   # eject a host after 5 consecutive 5xx responses
      interval: 30s             # analysis sweep interval
      baseEjectionTime: 30s     # eject for 30s before retrying
      maxEjectionPercent: 50    # never eject more than 50% of endpoints
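The semantics of outlierDetection can be made concrete with a simplified Python sketch. This is not Envoy's implementation (real outlier detection sweeps on an interval, honors maxEjectionPercent, and grows the ejection time on repeated ejections); it shows only the core idea: a host is ejected after N consecutive 5xx responses and readmitted once the ejection window passes.

```python
import time

class OutlierDetector:
    """Simplified sketch of Envoy-style outlier detection: eject a host
    after N consecutive 5xx responses, readmit after base_ejection_time."""

    def __init__(self, consecutive_5xx=5, base_ejection_time=30.0):
        self.consecutive_5xx = consecutive_5xx
        self.base_ejection_time = base_ejection_time
        self.errors = {}   # host -> current consecutive 5xx count
        self.ejected = {}  # host -> timestamp of ejection

    def record(self, host, status_code, now=None):
        now = time.time() if now is None else now
        if status_code >= 500:
            self.errors[host] = self.errors.get(host, 0) + 1
            if self.errors[host] >= self.consecutive_5xx:
                self.ejected[host] = now   # eject: stop sending traffic here
        else:
            self.errors[host] = 0          # any success resets the streak

    def is_healthy(self, host, now=None):
        now = time.time() if now is None else now
        ejected_at = self.ejected.get(host)
        if ejected_at is None:
            return True
        if now - ejected_at >= self.base_ejection_time:
            del self.ejected[host]         # ejection window elapsed: retry host
            self.errors[host] = 0
            return True
        return False

d = OutlierDetector(consecutive_5xx=5, base_ejection_time=30.0)
for _ in range(5):
    d.record("pod-a", 503, now=100.0)
print(d.is_healthy("pod-a", now=110.0))  # False: still ejected
print(d.is_healthy("pod-a", now=131.0))  # True: ejection window elapsed
```

Note how a single success resets the consecutive-error count, which is why sporadic 5xxs do not trip the breaker while a genuinely failing host is removed quickly.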
Observability: The Service Mesh Advantage
With an Envoy sidecar on every pod, the mesh automatically generates L7 (HTTP, gRPC) telemetry for all service communication without any code changes:
- Metrics: requests/second, error rate, P50/P99 latency for every service-to-service call pair. Exported to Prometheus; visualized in Grafana.
- Distributed tracing: Envoy generates trace spans and exports them to Jaeger or Zipkin; the application only needs to forward the trace context headers (B3, W3C TraceContext) from inbound to outbound requests so that spans join into one trace. The full request path across 10 services is then visible in a single trace.
- Access logs: per-request structured logs with request ID, latency, response code, bytes sent/received — without adding logging to each service.
- Service topology: Kiali (Istio’s dashboard) visualizes the live service graph with health status and traffic flow.
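One caveat worth knowing for tracing: Envoy cannot correlate an inbound request with the outbound calls it triggers, so the application must copy trace context headers from the incoming request onto outgoing ones. A minimal Python sketch (the header names are the standard B3 and W3C sets; the handler wiring is illustrative):

```python
# Headers the application must forward so Envoy can stitch spans into one trace.
TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid", "x-b3-spanid", "x-b3-parentspanid",
    "x-b3-sampled", "x-b3-flags",
    "traceparent", "tracestate",   # W3C TraceContext
]

def extract_trace_headers(incoming_headers):
    """Return the subset of inbound headers that must be copied onto
    every outbound request made while handling this request."""
    lowered = {k.lower(): v for k, v in incoming_headers.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}

# Example: a handler forwarding trace context to a downstream call.
inbound = {"X-B3-TraceId": "463ac35c9f6413ad", "X-B3-SpanId": "a2fb4a1d1a96d312",
           "X-B3-Sampled": "1", "Content-Type": "application/json"}
outbound = extract_trace_headers(inbound)
print(outbound)
# {'x-b3-traceid': '463ac35c9f6413ad', 'x-b3-spanid': 'a2fb4a1d1a96d312', 'x-b3-sampled': '1'}
```

Most tracing client libraries (and many HTTP frameworks) do this forwarding for you; the point is that it happens in the application process, not in the sidecar.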
When NOT to Use a Service Mesh
Service meshes add significant operational complexity:
- Latency overhead: each request traverses two additional Envoy proxies (source and destination sidecar). Adds 1-5ms per hop. For latency-sensitive paths (under 10ms budget), this may be unacceptable.
- Memory overhead: each Envoy sidecar uses 50-200MB of RAM. With 500 pods, this is 25-100GB of additional memory — significant cost.
- Operational complexity: the control plane is another system to manage; xDS configuration is verbose; debugging mesh-level issues (certificate expiry, route misconfigurations) requires expertise.
- Not needed for monoliths or small services: if you have 5-10 services with simple communication patterns, the overhead is not worth it. Use it when: you have 20+ services, security (mTLS) is a hard requirement, or you need fine-grained traffic control.
Alternatives to a full mesh: libraries like Hystrix (circuit breaking in code), mutual TLS with cert-manager, or Kubernetes Network Policies for basic traffic isolation.
Envoy vs Nginx as Sidecar
Nginx and HAProxy are excellent L4/L7 proxies but were designed around static configuration. Envoy was built from the ground up for dynamic configuration (xDS APIs), L7 awareness (gRPC, HTTP/2, WebSocket), and observability (native Prometheus metrics, trace propagation). That makes Envoy the sidecar of choice for Istio and several other meshes; Linkerd is the notable exception, using its own lightweight Rust micro-proxy. In practice, Nginx often serves as an ingress controller (north-south traffic entering the cluster) while Envoy handles east-west (service-to-service) traffic.
Key Interview Points
- Service mesh = Envoy sidecar per pod (data plane) + control plane (Istio istiod) pushing config via xDS
- mTLS is enforced transparently by sidecars — zero-trust without code changes
- VirtualService + DestinationRule enable canary splits, header-based routing, and fault injection
- Automatic L7 metrics, tracing, and access logs for all service pairs without instrumentation
- Latency and memory overhead (~5ms, ~100MB/pod) — use only when complexity is justified (20+ services, strict security requirements)
Frequently Asked Questions
How does a service mesh implement mutual TLS without changing application code?
The service mesh's sidecar proxy (Envoy) handles TLS transparently, below the application layer. When Istio is enabled in a namespace, the Kubernetes admission webhook automatically injects an Envoy sidecar container into every pod. The operating system is configured via iptables rules to redirect all inbound and outbound TCP traffic through the Envoy proxy before it reaches the application container. The application still binds to localhost on its configured port and makes outbound connections using regular HTTP — it has no knowledge of TLS. Envoy intercepts these connections and: for outbound calls, initiates a mutual TLS handshake with the destination pod's Envoy using a short-lived X.509 certificate issued by Istio's certificate authority (Citadel/istiod). For inbound calls, Envoy terminates the mTLS session and forwards plain HTTP to the application on localhost. The certificates are automatically rotated (default every 24 hours) by Istio. The identity encoded in the certificate is the Kubernetes Service Account — service A can only authenticate as its own service account, preventing impersonation. This means zero application code changes are needed to get encrypted, authenticated service-to-service communication.
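The iptables redirection described above boils down to a pair of NAT rules. This is a simplified sketch, not the real rule set: the actual istio-init program installs dedicated ISTIO_* chains, ports 15001 (outbound) and 15006 (inbound) are Istio's defaults, and applying these requires root in the pod's network namespace.

```shell
# Simplified sketch of the redirection rules istio-init installs:
iptables -t nat -A PREROUTING -p tcp -j REDIRECT --to-ports 15006  # inbound -> Envoy
iptables -t nat -A OUTPUT     -p tcp -j REDIRECT --to-ports 15001  # outbound -> Envoy
# Envoy's own traffic (it runs as UID 1337) must bypass the rules to avoid a loop:
iptables -t nat -I OUTPUT -p tcp -m owner --uid-owner 1337 -j RETURN
```

Because the redirect operates at the kernel level, it works for any TCP client in the pod, regardless of language or HTTP library.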
What is the overhead of adding an Envoy sidecar proxy to every pod?
Adding an Envoy sidecar to every pod incurs resource overhead in three dimensions: (1) Latency: each request traverses two additional Envoy proxies — the source sidecar (outbound) and the destination sidecar (inbound). Envoy adds approximately 0.5-5ms per hop depending on request size and policy complexity. For services with a 10ms P99 budget, this is significant. For services with 100ms+ budgets, negligible. (2) Memory: each Envoy sidecar uses 50-200MB of RSS depending on the number of clusters (upstream services) in its xDS configuration. With 500 pods, this is 25-100GB of additional memory across the cluster — a real infrastructure cost. (3) CPU: idle overhead is minimal (Envoy sleeps between requests), but under high QPS, Envoy consumes CPU for L7 inspection, TLS termination, and metric computation — typically 0.1-0.3 CPU cores per pod under load. Mitigation: Istio Ambient Mesh (introduced in Istio 1.15) removes per-pod sidecars and uses shared node-level proxies (ztunnel for L4, waypoint proxies for L7) — reducing memory overhead by 60-70% while preserving most mesh capabilities. This is the direction the ecosystem is moving for cost-sensitive deployments.
When should you use a service mesh vs a simpler alternative?
A service mesh is justified when your system has: (1) 20+ microservices with complex service-to-service communication — managing retries, timeouts, and circuit breakers in every service's code is unsustainable; a mesh moves these to configuration. (2) Strict security requirements — mTLS with automated certificate rotation and SPIFFE identity is difficult to implement consistently across services without a mesh. (3) Need for fine-grained traffic control — canary deployments at the individual service level, header-based routing, fault injection for chaos testing. (4) Observability requirements — automatic L7 metrics and distributed traces for every service pair without instrumentation. Simpler alternatives worth considering: Kubernetes NetworkPolicy for L3/L4 traffic isolation (no per-pod sidecar overhead); cert-manager for automatic TLS certificate management; individual service libraries (Resilience4j, Hystrix, go-micro) for circuit breaking and retries in code; Linkerd (lighter than Istio, uses Rust-based micro-proxy with 1/5th the memory overhead). Avoid a service mesh if: your services communicate primarily with external clients (API gateway handles most concerns); you have fewer than 10 services; your team does not have operational expertise with Kubernetes and networking — the debugging complexity of mTLS configuration issues and xDS misconfiguration is significant.