Question 1

What is the difference between the data plane and control plane in a service mesh?

Accepted Answer

Data plane: lightweight proxy sidecars (Envoy in Istio, linkerd2-proxy in Linkerd) injected alongside each service pod. All inbound and outbound traffic for the service passes through the sidecar, which enforces policies (mTLS, retries, timeouts, circuit breaking) and emits telemetry. The application makes ordinary network calls and the pod's traffic is transparently redirected to the sidecar (typically by iptables rules), so the sidecar handles network concerns without application changes, adding on the order of a millisecond or less per request.

Control plane: centrally manages and configures all sidecar proxies (istiod in Istio, for example). Operators define policies as Kubernetes CRDs; the control plane translates them into proxy configuration and pushes updates to the sidecars. Configuration propagates in seconds rather than per request. That is much slower than data plane processing, but it is acceptable because configuration changes are infrequent and the control plane is not on the request path.
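The split described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class names (ProxyConfig, Sidecar, ControlPlane) and the single max_retries policy are hypothetical and do not correspond to any real mesh API. It shows the two roles: the sidecar does per-request work, while the control plane only pushes new configuration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ProxyConfig:
    """Hypothetical policy object pushed by the control plane."""
    version: int
    max_retries: int


class Sidecar:
    """Data plane: sits on the request path and enforces the current config."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.config = ProxyConfig(version=0, max_retries=0)

    def apply(self, config: ProxyConfig) -> None:
        # Called by the control plane, never on the request path.
        self.config = config

    def call(self, upstream: Callable[[], str]) -> str:
        # Per-request work: retry the upstream call according to policy.
        last_error: Optional[Exception] = None
        for _ in range(self.config.max_retries + 1):
            try:
                return upstream()
            except ConnectionError as exc:
                last_error = exc
        assert last_error is not None
        raise last_error


class ControlPlane:
    """Control plane: holds desired policy and pushes it to every sidecar."""

    def __init__(self) -> None:
        self.sidecars: List[Sidecar] = []
        self.version = 0

    def register(self, sidecar: Sidecar) -> None:
        self.sidecars.append(sidecar)

    def set_policy(self, max_retries: int) -> ProxyConfig:
        # Infrequent operation: build a new config version and fan it out.
        self.version += 1
        config = ProxyConfig(version=self.version, max_retries=max_retries)
        for sidecar in self.sidecars:
            sidecar.apply(config)
        return config
```

In this model a call through the sidecar fails on the first ConnectionError until the control plane pushes a policy with max_retries greater than zero; the application callable itself never changes.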

Question 2

How does a service mesh enable canary deployments?

Accepted Answer

A service mesh routes traffic by weight between service versions without application code changes. Example: define a VirtualService in Istio routing 5% of traffic to checkout-v2 and 95% to checkout-v1. The sidecar proxies enforce this split at the network level. Monitor error rate and latency on v2; if healthy, shift to 50/50, then to 100% on v2. Roll back by setting the v2 weight back to 0; no redeployment is needed. Header-based routing sends requests with X-Beta-User: true to v2 and all others to v1, enabling internal testing with production traffic before general release. Fault injection (for example, returning HTTP 500 for 5% of requests, or adding a 100ms delay on specific routes) tests resilience without modifying application code.
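As a rough illustration of the canary setup above, the sketch below builds an Istio VirtualService manifest as a Python dict. Several things here are assumptions, not part of the answer: the helper name canary_virtual_service is hypothetical, the host is assumed to be a single service called checkout with subsets v1 and v2 (which would have to be defined in a separate DestinationRule), and the apiVersion may differ for your Istio release. The header key is written in lowercase because Istio header matches expect lowercase keys.

```python
import json
from typing import Any, Dict


def canary_virtual_service(host: str, canary_weight: int,
                           stable_subset: str = "v1",
                           canary_subset: str = "v2") -> Dict[str, Any]:
    """Build a VirtualService dict: beta header goes to the canary, the rest is split by weight."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")

    def destination(subset: str) -> Dict[str, Any]:
        return {"destination": {"host": host, "subset": subset}}

    return {
        "apiVersion": "networking.istio.io/v1beta1",  # assumption: adjust to your Istio version
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    # Rule 1: requests carrying the beta header always hit the canary.
                    "match": [{"headers": {"x-beta-user": {"exact": "true"}}}],
                    "route": [destination(canary_subset)],
                },
                {
                    # Rule 2: everything else is split by weight; weights sum to 100.
                    "route": [
                        {**destination(stable_subset), "weight": 100 - canary_weight},
                        {**destination(canary_subset), "weight": canary_weight},
                    ],
                },
            ],
        },
    }


if __name__ == "__main__":
    # Start at 5%; later rollout steps would call this with 50, then 100.
    print(json.dumps(canary_virtual_service("checkout", 5), indent=2))
```

Rolling forward or back is then a matter of regenerating the manifest with a different canary_weight and applying it; the service deployments themselves are untouched.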

Question 3

What is mutual TLS in a service mesh and what does it provide?

Accepted Answer

Mutual TLS (mTLS) means both sides of a service-to-service connection authenticate with certificates. The mesh issues each workload a short-lived certificate carrying a SPIFFE identity (spiffe://cluster/ns/default/sa/payment-service) and rotates it automatically, for example every 24 hours. Benefits: encryption in transit (no plaintext between services, even inside the cluster); service identity verification (payment-service proves it is payment-service, not a compromised pod pretending to be it); authorization policies (for example, inventory-service only accepts connections from checkout-service and payment-service). All certificate management is handled by the mesh; application code does not touch TLS. This removes one of the largest sources of implementation complexity in zero-trust networking.
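The identity-based authorization described above can be sketched in a few lines. This is an illustration only: the function names and the ALLOWED_CALLERS set are hypothetical, the identity layout (trust domain, then ns/<namespace>/sa/<service-account>) is assumed from the example ID in the answer, and in a real mesh the sidecar performs this check after the TLS handshake has already verified the peer certificate.

```python
from typing import NamedTuple, Set
from urllib.parse import urlparse


class WorkloadIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(spiffe_id: str) -> WorkloadIdentity:
    """Parse IDs shaped like spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>."""
    parsed = urlparse(spiffe_id)
    parts = parsed.path.strip("/").split("/")
    well_formed = (
        parsed.scheme == "spiffe"
        and bool(parsed.netloc)
        and len(parts) == 4
        and all(parts)
        and parts[0] == "ns"
        and parts[2] == "sa"
    )
    if not well_formed:
        raise ValueError(f"not a recognised workload identity: {spiffe_id!r}")
    return WorkloadIdentity(parsed.netloc, parts[1], parts[3])


# Hypothetical policy for inventory-service: only these callers are accepted.
ALLOWED_CALLERS: Set[str] = {"checkout-service", "payment-service"}


def is_allowed(peer_spiffe_id: str,
               allowed: Set[str] = ALLOWED_CALLERS,
               namespace: str = "default") -> bool:
    """Return True if the verified peer identity is on the allow list."""
    try:
        identity = parse_spiffe_id(peer_spiffe_id)
    except ValueError:
        return False  # Malformed or non-SPIFFE identities are rejected.
    return identity.namespace == namespace and identity.service_account in allowed
```

The point of the sketch is that the decision is keyed on the certificate identity rather than on IP addresses, so it keeps working as pods are rescheduled.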

Question 4

What latency overhead does a service mesh add?

Accepted Answer

Every service-to-service request traverses two extra Envoy hops: the caller's sidecar (egress) and the callee's sidecar (ingress). Each Envoy hop adds approximately 0.5-1ms for typical workloads, so 1-2ms of total overhead per request. For services with p99 latency targets below 5ms (high-frequency trading, real-time gaming), this overhead is significant. For services with 100ms+ SLOs (most web APIs), 1-2ms is acceptable. The overhead comes from user-space packet processing, TLS handshake and encryption CPU, and the extra loopback connections. If latency is critical, consider eBPF-based service meshes (Cilium) that enforce policies in the kernel rather than through sidecar proxies, which can reduce overhead to roughly 100 microseconds depending on the workload and the features enabled.
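The trade-off above is easier to judge as a share of the latency budget. The helper below is a small sketch (the function name mesh_overhead_share is hypothetical) that uses only the per-hop figures quoted in the answer.

```python
def mesh_overhead_share(slo_ms: float, per_hop_ms: float, hops: int = 2) -> float:
    """Fraction of the latency budget consumed by sidecar hops."""
    if slo_ms <= 0:
        raise ValueError("slo_ms must be positive")
    return hops * per_hop_ms / slo_ms


if __name__ == "__main__":
    for slo_ms in (5.0, 100.0):
        low = mesh_overhead_share(slo_ms, per_hop_ms=0.5)
        high = mesh_overhead_share(slo_ms, per_hop_ms=1.0)
        print(f"SLO {slo_ms:g}ms: mesh uses {low:.0%} to {high:.0%} of the budget")
```

With two hops at 0.5-1ms each, the mesh consumes 20-40% of a 5ms budget but only 1-2% of a 100ms budget, which is why the same overhead is significant for one class of service and negligible for the other.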

Service Mesh: Low-Level Design

Architecture: Data Plane and Control Plane

Traffic Management

Mutual TLS (mTLS)

Observability Without Code Changes

Cost: Latency and Complexity