Service Mesh Low-Level Design: Sidecar Proxy, mTLS, Traffic Policies, and Observability

What Is a Service Mesh?

A service mesh is an infrastructure layer that manages service-to-service communication in a microservices architecture. Rather than embedding network concerns (retries, timeouts, circuit breaking, mTLS) in each service's code, the mesh externalizes them into sidecar proxies that intercept all traffic. Istio (using Envoy) and Linkerd are the dominant implementations. At a low level, the mesh consists of a data plane (sidecars on every pod) and a control plane (configuration distribution, certificate issuance, telemetry aggregation).

Sidecar Proxy Injection

Each meshed pod runs two containers: the application container and the Envoy sidecar proxy. The control plane registers a Kubernetes mutating admission webhook (MutatingAdmissionWebhook) that automatically injects the sidecar into any pod created in a labeled namespace, so no application code change is required. iptables rules in the pod's network namespace, typically installed by an init container or a CNI plugin, redirect all inbound traffic to the sidecar's port 15006 and all outbound traffic to port 15001, which makes the interception transparent to the application.
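
As a rough illustration of the injection step, the sketch below builds an AdmissionReview response whose JSON Patch appends a sidecar container to the pod spec. The container name, the image, and the build_injection_response helper are hypothetical; a real injector such as Istio's also adds init containers, volumes, and annotations.

```python
import base64
import json

# Hypothetical sidecar definition; the name and image are illustrative only.
SIDECAR_CONTAINER = {
    "name": "mesh-proxy",
    "image": "example.invalid/mesh-proxy:latest",
    "ports": [{"containerPort": 15001}, {"containerPort": 15006}],
}

def build_injection_response(admission_review: dict) -> dict:
    """Return an AdmissionReview response that appends the sidecar container."""
    request = admission_review["request"]
    containers = request["object"]["spec"].get("containers", [])

    response = {"uid": request["uid"], "allowed": True}
    # Skip pods that already carry the sidecar so injection stays idempotent.
    if not any(c["name"] == SIDECAR_CONTAINER["name"] for c in containers):
        patch = [{"op": "add", "path": "/spec/containers/-", "value": SIDECAR_CONTAINER}]
        response["patchType"] = "JSONPatch"
        # The Kubernetes API expects the patch as base64-encoded JSON.
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

The webhook always answers allowed = True; it mutates pods rather than rejecting them, and the patch is omitted when there is nothing to add.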

The sidecar handles:

  • Inbound traffic: authenticate the caller via mTLS, apply rate limits, record metrics, forward to localhost application port.
  • Outbound traffic: resolve destination service endpoint, apply retry/timeout/circuit-breaker policy, negotiate mTLS with destination sidecar, emit trace spans.

Mutual TLS (mTLS)

mTLS gives every workload a cryptographically verifiable identity, which is the basis for zero-trust service-to-service authentication. The control plane's certificate authority issues short-lived X.509 certificates (typically 24-hour TTL) to each sidecar, signed by the mesh CA. Following the SPIFFE standard, the identity is carried in the certificate's Subject Alternative Name as a URI; Istio's convention for that URI is spiffe://trust-domain/ns/namespace/sa/service-account.

On each connection:

  1. The calling sidecar opens a TLS connection to the destination sidecar, which presents its certificate.
  2. The caller verifies that certificate against the mesh CA root.
  3. The destination requests a client certificate; the caller presents its own, and the destination verifies it against the same root.
  4. A mutually authenticated TLS session is established; all traffic is encrypted in transit.

Authorization policies (which service is allowed to call which other service on which port/path) are enforced at the destination sidecar using the verified SPIFFE identity from the certificate.
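
A minimal sketch of the identity check at the destination sidecar, assuming the caller's SPIFFE URI has already been extracted from a verified certificate. The ALLOWED_CALLERS table, the service names in it, and both helper functions are hypothetical and only illustrate a deny-by-default policy keyed on identity.

```python
from urllib.parse import urlparse

# Hypothetical allow-list: destination service -> permitted caller identities.
ALLOWED_CALLERS = {
    "payments": {"spiffe://cluster.local/ns/shop/sa/checkout"},
}

def parse_spiffe_id(uri: str) -> dict:
    """Split a spiffe://trust-domain/ns/<namespace>/sa/<service-account> URI."""
    parsed = urlparse(uri)
    parts = parsed.path.strip("/").split("/")
    if (parsed.scheme != "spiffe" or len(parts) != 4
            or parts[0] != "ns" or parts[2] != "sa"):
        raise ValueError(f"not a mesh SPIFFE ID: {uri!r}")
    return {
        "trust_domain": parsed.netloc,
        "namespace": parts[1],
        "service_account": parts[3],
    }

def is_authorized(destination_service: str, caller_spiffe_id: str) -> bool:
    """Deny by default; allow only identities listed for the destination."""
    try:
        parse_spiffe_id(caller_spiffe_id)  # reject malformed identities
    except ValueError:
        return False
    return caller_spiffe_id in ALLOWED_CALLERS.get(destination_service, set())
```

Because the identity comes from the certificate rather than from a network address, the decision stays valid when pods are rescheduled and their IPs change.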

Control Plane Schema

CREATE TABLE ServicePolicy (
  service_name     VARCHAR(128) PRIMARY KEY,
  retry_attempts   INT NOT NULL DEFAULT 3,
  timeout_ms       INT NOT NULL DEFAULT 5000,
  cb_threshold     INT NOT NULL DEFAULT 50,  -- % error rate to open circuit
  cb_window_sec    INT NOT NULL DEFAULT 30,
  updated_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE CertificateRecord (
  service_name  VARCHAR(128) NOT NULL,
  cert_pem      TEXT NOT NULL,
  private_key   TEXT NOT NULL,  -- encrypted at rest
  expiry        TIMESTAMPTZ NOT NULL,
  issued_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  PRIMARY KEY (service_name, issued_at)
);

CREATE TABLE TrafficSplit (
  id                  BIGSERIAL PRIMARY KEY,
  source_service      VARCHAR(128),  -- NULL = applies to all callers
  destination_service VARCHAR(128) NOT NULL,
  v1_weight           INT NOT NULL DEFAULT 100,
  v2_weight           INT NOT NULL DEFAULT 0,
  header_match        JSONB,  -- optional header-based routing rules
  updated_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE ServiceEndpoint (
  service_name  VARCHAR(128) NOT NULL,
  pod_ip        INET NOT NULL,
  port          INT NOT NULL,
  version       VARCHAR(32),
  healthy       BOOLEAN NOT NULL DEFAULT TRUE,
  PRIMARY KEY (service_name, pod_ip, port)
);

Traffic Policies: Retries, Timeouts, and Circuit Breaking

Traffic policies are configured per destination service in the control plane and pushed to sidecars as Envoy configuration (via xDS APIs — LDS, RDS, CDS, EDS).

  • Retry policy: max_attempts = 3, retry_on = [connect-failure, retriable-4xx, retriable-status-codes (503)]. Retries use exponential backoff with jitter. In this design, non-idempotent methods (POST, PATCH) are not retried by default, since a retry could apply the same side effect twice.
  • Timeout: per-request timeout enforced by the calling sidecar. If the upstream does not respond within timeout_ms, the sidecar returns 504 to the caller and records the timeout as a failure for circuit breaker accounting.
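  • Circuit breaker: the sidecar tracks the error rate over a rolling window of cb_window_sec seconds. When the error rate exceeds cb_threshold percent, the circuit opens and subsequent requests fast-fail with 503 without hitting the upstream. After a recovery timeout (recovery_timeout, which is not modeled in the ServicePolicy table above), the circuit half-opens and allows probe requests; a successful probe closes it and a failed probe re-opens it.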
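
The retry backoff and circuit-breaker behavior described above can be sketched as follows. This is an illustrative model, not Envoy's implementation: the base_ms, cap_ms, min_requests, and recovery_timeout defaults are assumptions, and the injectable clock exists only to make the class testable.

```python
import random
import time

def backoff_delay(attempt: int, base_ms: int = 25, cap_ms: int = 1000) -> float:
    """Full-jitter backoff: uniform in [0, min(cap_ms, base_ms * 2**attempt)] ms."""
    return random.uniform(0, min(cap_ms, base_ms * (2 ** attempt)))

class CircuitBreaker:
    """Error-rate circuit breaker over a rolling time window."""

    def __init__(self, cb_threshold=50, cb_window_sec=30, recovery_timeout=10,
                 min_requests=5, clock=time.monotonic):
        self.cb_threshold = cb_threshold          # % errors that opens the circuit
        self.cb_window_sec = cb_window_sec        # rolling window length
        self.recovery_timeout = recovery_timeout  # seconds before half-open
        self.min_requests = min_requests          # assumption: ignore tiny samples
        self.clock = clock
        self.events = []       # (timestamp, succeeded) pairs inside the window
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: once recovery_timeout has elapsed, let probes through.
        return self.clock() - self.opened_at >= self.recovery_timeout

    def record(self, succeeded: bool) -> None:
        now = self.clock()
        if self.opened_at is not None:
            # Probe result: close on success, restart the timeout on failure.
            self.opened_at = None if succeeded else now
            self.events.clear()
            return
        self.events.append((now, succeeded))
        self.events = [(t, ok) for t, ok in self.events
                       if now - t <= self.cb_window_sec]
        failures = sum(1 for _, ok in self.events if not ok)
        if (len(self.events) >= self.min_requests
                and 100 * failures / len(self.events) > self.cb_threshold):
            self.opened_at = now
```

A caller would check allow_request() before each upstream call, return 503 immediately when it is False, and otherwise call record() with the outcome, counting timeouts as failures.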

Load Balancing

The sidecar uses the endpoint list from the control plane's EDS (Endpoint Discovery Service) to load balance across healthy pods:

  • Round-robin: default; distributes requests evenly.
  • Least-request: routes to the upstream with the fewest active requests; better for variable latency services.
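  • Consistent hashing: hashes on a request attribute such as a header carrying user_id or session_id to provide sticky sessions. The same key keeps mapping to the same upstream pod for cache locality as long as that pod stays in the endpoint set; when endpoints are added or removed, only a fraction of keys are remapped.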
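
To make the last two strategies concrete, here is a small sketch of a consistent-hash ring with virtual nodes and a least-request picker. The HashRing class, the vnodes parameter, and the endpoint strings are illustrative assumptions; production proxies typically use more elaborate variants, for example sampling two random endpoints for least-request instead of scanning all of them.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    """Consistent-hash ring with virtual nodes for smoother distribution."""

    def __init__(self, endpoints, vnodes: int = 100):
        self._ring = sorted(
            (_hash(f"{ep}#{i}"), ep) for ep in endpoints for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def pick(self, hash_key: str) -> str:
        """Return the first endpoint clockwise from the key's position."""
        if not self._ring:
            raise LookupError("no healthy endpoints")
        idx = bisect.bisect_right(self._keys, _hash(hash_key)) % len(self._ring)
        return self._ring[idx][1]

def pick_least_request(active_requests: dict) -> str:
    """Least-request: choose the endpoint with the fewest in-flight requests."""
    return min(active_requests, key=active_requests.get)
```

Removing an endpoint only remaps the keys that were assigned to it; keys owned by the remaining endpoints keep their placement, which is what preserves cache locality during scaling events.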

Traffic Shaping for Canary Releases

Weighted routing splits traffic between service versions without DNS changes. A TrafficSplit record sets v1_weight = 95 and v2_weight = 5 to send 5% of traffic to the canary. Header-based routing allows QA teams to force all their traffic to v2 via a custom header (X-Canary: true), independent of weight.

Python Control Plane

import datetime
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from db import get_db

CERT_TTL_HOURS = 24
TRUST_DOMAIN = 'cluster.local'

# Load the mesh CA material once at startup; context managers close the files.
with open('/etc/mesh-ca/ca.crt') as f:
    CA_CERT_PEM = f.read()
with open('/etc/mesh-ca/ca.key') as f:
    CA_KEY_PEM = f.read()

def issue_certificate(service_name: str, namespace: str, service_account: str) -> dict:
    """Issue a short-lived X.509 certificate for a service sidecar."""
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

    spiffe_uri = f"spiffe://{TRUST_DOMAIN}/ns/{namespace}/sa/{service_account}"
    subject    = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, service_name)])

    ca_cert = x509.load_pem_x509_certificate(CA_CERT_PEM.encode())
    ca_key  = serialization.load_pem_private_key(CA_KEY_PEM.encode(), password=None)

    # Compute the validity window once so the certificate and the DB row agree.
    now    = datetime.datetime.now(datetime.timezone.utc)
    expiry = now + datetime.timedelta(hours=CERT_TTL_HOURS)

    cert = (
        x509.CertificateBuilder()
        .subject_name(subject)
        .issuer_name(ca_cert.subject)
        .public_key(private_key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(expiry)
        .add_extension(
            x509.SubjectAlternativeName([x509.UniformResourceIdentifier(spiffe_uri)]),
            critical=False
        )
        .sign(ca_key, hashes.SHA256())
    )

    cert_pem = cert.public_bytes(serialization.Encoding.PEM).decode()
    key_pem  = private_key.private_bytes(
        serialization.Encoding.PEM,
        serialization.PrivateFormat.TraditionalOpenSSL,
        serialization.NoEncryption()
    ).decode()

    # NOTE: key_pem is serialized unencrypted here. The schema expects the stored
    # private_key to be encrypted at rest, which this listing leaves to the storage layer.
    db = get_db()
    db.execute("""
        INSERT INTO CertificateRecord (service_name, cert_pem, private_key, expiry)
        VALUES (%s, %s, %s, %s)
    """, (service_name, cert_pem, key_pem, expiry))
    db.commit()

    return {'cert_pem': cert_pem, 'private_key_pem': key_pem, 'spiffe_uri': spiffe_uri}

def apply_policy(service_name: str, policy: dict):
    """Upsert traffic policy for a service; control plane pushes to sidecars via xDS."""
    db = get_db()
    db.execute("""
        INSERT INTO ServicePolicy (service_name, retry_attempts, timeout_ms, cb_threshold, cb_window_sec)
        VALUES (%s, %s, %s, %s, %s)
        ON CONFLICT (service_name) DO UPDATE SET
          retry_attempts = EXCLUDED.retry_attempts,
          timeout_ms     = EXCLUDED.timeout_ms,
          cb_threshold   = EXCLUDED.cb_threshold,
          cb_window_sec  = EXCLUDED.cb_window_sec,
          updated_at     = NOW()
    """, (service_name,
          policy.get('retry_attempts', 3),
          policy.get('timeout_ms', 5000),
          policy.get('cb_threshold', 50),
          policy.get('cb_window_sec', 30)))
    db.commit()

def compute_traffic_split(destination_service: str, request_headers: dict) -> str:
    """Determine which service version to route to based on weights and header rules."""
    import random
    db = get_db()
    split = db.execute("""
        SELECT v1_weight, v2_weight, header_match FROM TrafficSplit
        WHERE destination_service = %s
        ORDER BY updated_at DESC LIMIT 1
    """, (destination_service,)).fetchone()

    if not split:
        return 'v1'

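    # Header-based routing takes precedence over weights.
    # HTTP header names are case-insensitive, so compare them lowercased.
    header_match = split['header_match'] or {}
    normalized = {name.lower(): value for name, value in request_headers.items()}
    for header_name, expected_value in header_match.items():
        if normalized.get(header_name.lower()) == expected_value:
            return 'v2'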

    # Weighted random
    total = split['v1_weight'] + split['v2_weight']
    if total == 0:
        return 'v1'
    return 'v2' if random.randint(1, total) <= split['v2_weight'] else 'v1'

Observability: Distributed Tracing and Metrics

The sidecar creates a span for each hop and reports it to the tracing backend (Jaeger, Zipkin, or Tempo), generating trace headers (B3 or W3C Trace Context) when a request arrives without them. Because the sidecar cannot correlate an inbound request with the outbound calls it triggers, the application must forward the received trace headers on its downstream calls; that is the only change it needs. Per-service metrics (request rate, error rate, P50/P99 latency) are exposed by the sidecar in Prometheus format and scraped by Prometheus, with no instrumentation in the application code.
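
As a hedged sketch of trace propagation, the function below derives the traceparent header for a downstream call from an incoming W3C Trace Context header, keeping the trace ID and minting a new span ID. The child_traceparent helper is hypothetical, and it assumes header names have already been lowercased.

```python
import secrets

def child_traceparent(incoming_headers: dict) -> dict:
    """Return headers for a downstream call that continue the incoming trace."""
    parts = incoming_headers.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        # Format: version-traceid-parentid-flags; keep everything but the parent.
        version, trace_id, _parent_span, flags = parts
    else:
        # No usable context: start a new trace, marked as sampled.
        version, trace_id, flags = "00", secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # new 16-hex-digit span ID for this hop
    return {"traceparent": f"{version}-{trace_id}-{span_id}-{flags}"}
```

Every hop in a request shares the same trace ID, which is what lets the tracing backend stitch the individual spans into one end-to-end trace.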
