Service Mesh: Low-Level Design

A service mesh is infrastructure that manages service-to-service communication in a microservices architecture. It handles cross-cutting concerns — traffic management, mutual TLS, observability, retry logic — at the infrastructure layer rather than in application code. All service-to-service traffic flows through local proxies (sidecars), and those proxies are centrally managed by a control plane. The mesh makes network behavior configurable without application changes.

Architecture: Data Plane and Control Plane

Data plane: lightweight proxy sidecars (typically Envoy) injected alongside each service pod. Every inbound and outbound request passes through the sidecar. The sidecar enforces policies (mTLS, retries, timeouts, circuit breaking) and emits telemetry (metrics, logs, traces). The application is unaware of the sidecar: its traffic is transparently redirected (via iptables rules in the pod's network namespace) through the local proxy, which handles all network concerns.

Control plane: manages and configures all sidecar proxies centrally (Istio, Linkerd, Consul Connect). Operators define policies (VirtualService, DestinationRule in Istio) that the control plane translates into Envoy configuration and pushes to all sidecars. This separation means the data plane handles traffic at line rate (sub-millisecond per-request overhead), while the control plane handles configuration updates (seconds of propagation latency is acceptable).
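As a sketch of this policy model, a DestinationRule along the following lines (the `inventory` service name and all thresholds are illustrative assumptions) configures connection limits and circuit breaking once, centrally; the control plane compiles it into Envoy cluster configuration and pushes it to every sidecar:

```yaml
# Illustrative Istio DestinationRule for an assumed "inventory" service.
# The control plane translates this into Envoy configuration;
# every sidecar then enforces it on each request.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: inventory
  namespace: default
spec:
  host: inventory.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap concurrent TCP connections
      http:
        http1MaxPendingRequests: 50  # queue limit before load shedding
    outlierDetection:                # circuit breaking: eject failing endpoints
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
```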

Traffic Management

A service mesh enables fine-grained traffic routing without application code changes:

Canary deployments: route 5% of traffic to v2 of a service, 95% to v1, based on weight — not on application-level feature flags. If v2 error rate is acceptable, shift to 50/50, then 100%. Roll back by changing weights, not by redeploying.
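A weighted canary of this kind can be expressed as an Istio VirtualService plus a DestinationRule defining the version subsets; a sketch, assuming a `checkout` service whose pods carry a `version` label:

```yaml
# Illustrative canary: 95% of traffic to v1, 5% to v2.
# Shifting the split is a config change, not a redeploy.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
    - checkout
  http:
    - route:
        - destination:
            host: checkout
            subset: v1
          weight: 95
        - destination:
            host: checkout
            subset: v2
          weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout
spec:
  host: checkout
  subsets:              # subsets select pods by their version label
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```

Rolling forward (50/50, then 100/0) or rolling back is a matter of editing the weights and re-applying.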

Header-based routing: route requests with X-Beta-User: true to the v2 deployment, all others to v1. Enables internal testing of the new version with production traffic before general release.
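Sketched as an Istio VirtualService (service and subset names assumed as above), the header match is evaluated before the default route:

```yaml
# Illustrative header-based routing: beta users reach v2,
# everyone else reaches v1. Match rules are evaluated in order.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
    - checkout
  http:
    - match:
        - headers:
            x-beta-user:
              exact: "true"
      route:
        - destination:
            host: checkout
            subset: v2
    - route:               # default: all other traffic
        - destination:
            host: checkout
            subset: v1
```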

Fault injection: configure the mesh to inject 5% HTTP 500 errors or 100ms delays for specific service pairs — testing resilience without modifying application code.
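In Istio this is a fault block on a route. A sketch, with the `inventory` service name assumed for illustration:

```yaml
# Illustrative fault injection: abort 5% of requests with HTTP 500
# and delay 5% by 100ms, only for traffic to "inventory".
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inventory-faults
spec:
  hosts:
    - inventory
  http:
    - fault:
        abort:
          percentage:
            value: 5
          httpStatus: 500
        delay:
          percentage:
            value: 5
          fixedDelay: 100ms
      route:
        - destination:
            host: inventory
```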

Mutual TLS (mTLS)

The mesh issues each service a short-lived TLS certificate (SPIFFE ID: spiffe://cluster/ns/default/sa/payment-service). Every service-to-service connection uses these certificates for mutual authentication — both sides prove their identity. This provides: encryption in transit (no plaintext service-to-service traffic), service identity verification (payment-service cannot impersonate inventory-service), and authorization policies (inventory-service can only be called by checkout-service and payment-service). mTLS is configured at the mesh level — application code does not handle certificates.
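In Istio, the policies above map to two resources; a sketch, under the assumption that the services run in the `default` namespace under service accounts of the same names:

```yaml
# Illustrative: require mTLS for every workload in the namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT
---
# Illustrative: inventory-service accepts calls only from the
# checkout-service and payment-service identities; all other
# callers are denied once an ALLOW policy is present.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: inventory-allow
  namespace: default
spec:
  selector:
    matchLabels:
      app: inventory-service
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/default/sa/checkout-service
              - cluster.local/ns/default/sa/payment-service
```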

Observability Without Code Changes

Because all traffic passes through sidecars, the mesh automatically generates: golden signal metrics (request rate, error rate, latency) for every service pair — no application instrumentation needed for basic metrics; distributed traces (the sidecar propagates the traceparent header); access logs with full request metadata. A team can deploy a legacy service with no observability code and immediately get service-level metrics, service maps, and tracing from the mesh.

Cost: Latency and Complexity

Every request traverses two extra network hops (through the caller’s sidecar and the callee’s sidecar). Envoy adds ~0.5-1ms per hop for typical workloads — 1-2ms total per service call. For services with tight latency requirements (< 5ms SLO), this overhead matters. A service mesh also adds operational complexity: the control plane must be highly available (if it fails, new proxies cannot be configured), mesh configuration language is complex (Istio's CRDs are notoriously difficult), and debugging mesh-level issues requires expertise beyond application debugging. Evaluate whether the cross-cutting benefits outweigh the overhead for your team's scale and expertise.

