Service Mesh: Low-Level Design

A service mesh is infrastructure that manages service-to-service communication in a microservices architecture. It handles cross-cutting concerns (traffic management, mutual TLS, observability, retry logic) at the infrastructure layer rather than in application code. Each service instance sends and receives its traffic through a proxy deployed next to it (a sidecar), and those proxies are centrally managed by a control plane. The mesh makes network behavior configurable without application changes.

Architecture: Data Plane and Control Plane

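Data plane: lightweight proxy sidecars (most commonly Envoy; Linkerd ships its own proxy) injected alongside each service pod. All inbound and outbound traffic for the service passes through the sidecar. The sidecar enforces policies (mTLS, retries, timeouts, circuit breaking) and emits telemetry (metrics, logs, traces). The application is unaware of the sidecar: it addresses other services as usual, and its traffic is transparently redirected through the local proxy (in Istio, by iptables rules in the pod's network namespace), so the sidecar handles all network concerns on the application's behalf.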
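To make one of the sidecar-enforced policies concrete, here is a minimal sketch of the classic circuit breaker pattern in Python. It illustrates the idea only and is not how Envoy implements circuit breaking; the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the upstream."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # hypothetical failure threshold
        self.reset_after = reset_after    # seconds to stay open before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open: failing fast")
            # Half-open: let one trial call through; one more failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

A proxy would wrap each upstream request in something like `breaker.call(...)`, so that a failing dependency is rejected immediately instead of tying up callers until timeouts expire.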

Control plane: manages and configures all sidecar proxies centrally (Istio, Linkerd, Consul Connect). Operators define policies (VirtualService, DestinationRule in Istio) that the control plane translates into proxy configuration (Envoy configuration, in Istio's case) and pushes to all sidecars. This separation means the data plane sits on the request path and must add only a small per-request overhead (typically around a millisecond or less per proxy hop), while the control plane works off the request path, where seconds of propagation latency for a configuration update are acceptable.
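The push model described above can be sketched as a toy in-memory control plane. This is an illustration under simplifying assumptions, not any mesh's real API: `ControlPlane`, `Sidecar`, `register`, and `set_policy` are hypothetical names, and real control planes stream configuration over the network rather than calling methods directly.

```python
from dataclasses import dataclass, field


@dataclass
class Sidecar:
    name: str
    config: dict = field(default_factory=dict)
    version: int = 0

    def apply(self, version, config):
        # Ignore stale or duplicate pushes; keep the last known good config.
        if version > self.version:
            self.version = version
            self.config = dict(config)


class ControlPlane:
    def __init__(self):
        self.version = 0
        self.policies = {}
        self.sidecars = []

    def register(self, sidecar):
        # A newly started proxy receives the current configuration.
        self.sidecars.append(sidecar)
        sidecar.apply(self.version, self.policies)

    def set_policy(self, name, spec):
        # Every policy change bumps the version and is pushed to all proxies.
        self.policies[name] = spec
        self.version += 1
        for sidecar in self.sidecars:
            sidecar.apply(self.version, self.policies)
```

Because each sidecar keeps its own copy of the configuration, request handling never has to call the control plane, which is why slow configuration propagation does not slow down traffic.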

Traffic Management

A service mesh enables fine-grained traffic routing without application code changes:

Canary deployments: route 5% of traffic to v2 of a service, 95% to v1, based on weight — not on application-level feature flags. If v2 error rate is acceptable, shift to 50/50, then 100%. Roll back by changing weights, not by redeploying.
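Weight-based splitting can be sketched in a few lines of Python. This is a simulation of the idea, not mesh configuration; the `STAGES` list mirrors the 5%, 50%, 100% progression from the text, and `pick_version` and `simulate` are hypothetical helper names.

```python
import random
from collections import Counter

# Canary stages from the text: 5% -> 50% -> 100% of traffic to v2.
STAGES = [{"v1": 95, "v2": 5}, {"v1": 50, "v2": 50}, {"v1": 0, "v2": 100}]


def pick_version(weights, rng):
    """Pick a destination version with probability proportional to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def simulate(weights, requests=10_000, seed=0):
    """Count how many simulated requests land on each version."""
    rng = random.Random(seed)
    return Counter(pick_version(weights, rng) for _ in range(requests))
```

Rolling back corresponds to routing with the previous entry in `STAGES`; nothing about the deployed versions themselves changes.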

Header-based routing: route requests with X-Beta-User: true to the v2 deployment, all others to v1. Enables internal testing of the new version with production traffic before general release.
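As a rough sketch, the routing decision above is a header match with a default fallback. The function name `route` and the version labels are assumptions for illustration; the lowercase normalization reflects the fact that HTTP header names are case-insensitive.

```python
def route(headers, default="v1"):
    """Return the target version for a request, given its headers."""
    # HTTP header names are case-insensitive, so normalize before matching.
    normalized = {name.lower(): value for name, value in headers.items()}
    if normalized.get("x-beta-user") == "true":
        return "v2"
    return default
```

For example, `route({"X-Beta-User": "true"})` selects v2, while a request without the header falls through to v1.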

Fault injection: configure the mesh to return HTTP 500 errors for 5% of requests, or to add 100ms delays, for specific service pairs. This tests resilience without modifying application code.
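The behavior can be approximated with a small Python wrapper. This is a sketch of what a proxy does conceptually, not real mesh configuration; `FaultPolicy`, `call_with_faults`, and the injectable `sleep` parameter are hypothetical names introduced here, with defaults taken from the 5% and 100ms figures in the text.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class FaultPolicy:
    abort_fraction: float = 0.05  # share of requests answered with HTTP 500
    delay_fraction: float = 0.0   # share of requests that get an added delay
    delay_seconds: float = 0.1    # 100 ms


def call_with_faults(handler, policy, rng, sleep=time.sleep):
    """Invoke handler() unless the policy injects a fault first."""
    if rng.random() < policy.delay_fraction:
        sleep(policy.delay_seconds)
    if rng.random() < policy.abort_fraction:
        return 500  # injected failure; the real handler is never called
    return handler()
```

Passing a seeded `random.Random` makes a resilience test repeatable, and passing a fake `sleep` lets the test observe delays without actually waiting.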

Mutual TLS (mTLS)

The mesh issues each workload a short-lived TLS certificate that encodes its identity as a SPIFFE ID (for example, spiffe://cluster/ns/default/sa/payment-service). Every service-to-service connection uses these certificates for mutual authentication: both sides prove their identity. This provides encryption in transit (no plaintext service-to-service traffic), service identity verification (payment-service cannot impersonate inventory-service), and a verified identity on which to base authorization policies (for example, inventory-service can only be called by checkout-service and payment-service). mTLS is configured at the mesh level; application code does not handle certificates.
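The authorization step can be sketched as an allowlist keyed on the caller's SPIFFE ID. The sketch assumes IDs of the exact shape shown above (trust domain, then `ns/<namespace>/sa/<service-account>`) and assumes default-deny for services with no policy; `parse_spiffe_id`, `is_allowed`, and `ALLOWED_CALLERS` are hypothetical names. In a real mesh the ID would come from the peer certificate after the TLS handshake has verified it, never from an unauthenticated string.

```python
from urllib.parse import urlparse

# Policy from the text: only checkout-service and payment-service may call
# inventory-service. Services without an entry accept no callers (assumption).
ALLOWED_CALLERS = {
    "inventory-service": {"checkout-service", "payment-service"},
}


def parse_spiffe_id(spiffe_id):
    """Split spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>."""
    parsed = urlparse(spiffe_id)
    parts = parsed.path.strip("/").split("/")
    if (parsed.scheme != "spiffe" or len(parts) != 4
            or parts[0] != "ns" or parts[2] != "sa"):
        raise ValueError(f"unexpected SPIFFE ID format: {spiffe_id!r}")
    return {
        "trust_domain": parsed.netloc,
        "namespace": parts[1],
        "service_account": parts[3],
    }


def is_allowed(caller_spiffe_id, callee):
    """Check the already-verified caller identity against the callee's allowlist."""
    caller = parse_spiffe_id(caller_spiffe_id)["service_account"]
    return caller in ALLOWED_CALLERS.get(callee, set())
```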

Observability Without Code Changes

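Because all traffic passes through sidecars, the mesh automatically generates: golden signal metrics (request rate, error rate, latency) for every service pair, with no application instrumentation needed for basic metrics; distributed trace spans for each proxied request; and access logs with full request metadata. One caveat on tracing: a sidecar cannot tell which outbound call belongs to which inbound request, so the application still has to copy trace context headers such as traceparent from incoming requests onto the outgoing requests it makes. A team can deploy a legacy service with no observability code and immediately get service-level metrics and service maps from the mesh; end-to-end traces across multiple hops need that header forwarding.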
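To show how per-pair golden signals fall out of proxy access logs, here is a minimal aggregation sketch. The `AccessLog` record shape, the choice of 5xx as the error definition, and the nearest-rank p95 are assumptions for this example, not a description of any mesh's telemetry pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AccessLog:
    source: str
    destination: str
    status: int
    latency_ms: float


def golden_signals(logs, window_seconds):
    """Aggregate request rate, error rate, and p95 latency per service pair."""
    grouped = defaultdict(list)
    for entry in logs:
        grouped[(entry.source, entry.destination)].append(entry)

    signals = {}
    for pair, entries in grouped.items():
        latencies = sorted(e.latency_ms for e in entries)
        errors = sum(1 for e in entries if e.status >= 500)
        # Nearest-rank percentile: rank = ceil(0.95 * n), in integer arithmetic.
        rank = max(1, -(-95 * len(latencies) // 100))
        signals[pair] = {
            "request_rate": len(entries) / window_seconds,
            "error_rate": errors / len(entries),
            "p95_latency_ms": latencies[rank - 1],
        }
    return signals
```

The application contributes nothing here: every field in `AccessLog` is something the proxy can observe on its own.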

Cost: Latency and Complexity

Every request passes through two extra proxy hops (the caller's sidecar and the callee's sidecar). Envoy adds ~0.5-1ms per hop for typical workloads, so 1-2ms total per service call. For services with tight latency requirements (< 5ms SLO), this overhead matters. A service mesh also adds operational complexity: the control plane must be highly available (if it fails, already-running proxies generally keep serving with their last known configuration, but new proxies cannot be configured and policy changes stop propagating), the mesh configuration language is complex (Istio's CRDs are notoriously difficult), and debugging mesh-level issues requires expertise beyond application debugging. Evaluate whether the cross-cutting benefits outweigh the overhead for your team's scale and expertise.
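The latency arithmetic above is easy to check for a specific budget. The helper names below are hypothetical, and the `sequential_calls` parameter is an added assumption to cover request paths that chain several service calls one after another.

```python
def mesh_overhead_ms(per_hop_ms, sequential_calls=1, hops_per_call=2):
    """Added latency: two sidecar hops per call, summed over sequential calls."""
    return per_hop_ms * hops_per_call * sequential_calls


def budget_fraction(per_hop_ms, slo_ms, sequential_calls=1):
    """Share of a latency SLO consumed by mesh overhead alone."""
    return mesh_overhead_ms(per_hop_ms, sequential_calls) / slo_ms
```

With the figures from the text, a single call costs 1 to 2 ms, which is 20% to 40% of a 5 ms SLO before the service does any work; a path with three sequential calls at 1 ms per hop would spend 6 ms in proxies and exceed that budget outright.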
