Distributed Tracing: Low-Level Design

Distributed tracing tracks a single request as it traverses multiple services, recording the timing and relationships of each operation. In a microservices system, a single user request may touch 10-20 services — distributed tracing makes this call graph visible, turning opaque latency into actionable data. Without tracing, debugging why a specific request took 2 seconds is nearly impossible when the request touched 15 services.

Core Concepts

Trace: a complete record of a request’s journey through the system, identified by a globally unique trace ID.

Span: a named, timed operation within a trace, such as “database query”, “HTTP call to inventory service”, or “cache lookup”. Spans form a tree: a root span (the incoming HTTP request) contains child spans (the downstream calls it makes).

Context propagation: the mechanism for passing trace IDs and parent span IDs between services, typically via HTTP headers (traceparent in W3C Trace Context format).
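The parent/child relationship between spans can be made concrete with a small data structure. The following is only a sketch: the field names, the short span IDs, and the build_tree helper are illustrative assumptions, not the schema of any particular tracing backend.

```python
from __future__ import annotations

import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One named, timed operation. Field names are illustrative."""
    name: str
    trace_id: str              # shared by every span in the same trace
    span_id: str
    parent_id: Optional[str]   # None marks the root span
    start_ms: float
    end_ms: float
    children: list[Span] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def build_tree(spans: list[Span]) -> Span:
    """Link a flat list of spans into a tree via parent_id and return the root."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
flat = [
    Span("GET /checkout", trace_id, "a1", None, 0.0, 120.0),
    Span("HTTP call to inventory service", trace_id, "b2", "a1", 5.0, 80.0),
    Span("database query", trace_id, "c3", "b2", 10.0, 70.0),
    Span("cache lookup", trace_id, "d4", "a1", 85.0, 90.0),
]
root = build_tree(flat)
```

Here the root span has two children (the inventory call and the cache lookup), and the database query hangs under the inventory call, which is the shape a tracing UI renders as a waterfall.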

Instrumentation

Automatic Instrumentation

Most tracing libraries auto-instrument common frameworks: HTTP servers (gin, Express, FastAPI), HTTP clients (axios, requests), database drivers (sqlx, pg, SQLAlchemy), and message queue clients (kafka-go, amqplib). Automatic instrumentation captures spans for these operations without code changes — just initialize the tracing library and it patches the relevant libraries. This provides immediate visibility into the majority of latency sources.
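To illustrate what "patching the relevant libraries" means, here is a minimal sketch of the wrapping technique. FakeHttpClient, instrument, and RECORDED are hypothetical stand-ins invented for this example; real tracing libraries ship their own instrumentation packages and do considerably more (context handling, error status, attributes).

```python
import functools
import time

RECORDED = []  # finished spans as (name, duration_seconds); stand-in for an exporter


class FakeHttpClient:
    """Hypothetical library class standing in for a real HTTP client."""

    def get(self, url):
        return f"200 OK from {url}"


def instrument(cls, method_name, span_name):
    """Replace cls.method_name with a wrapper that records one span per call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            # Record the span even if the wrapped call raises.
            RECORDED.append((span_name, time.perf_counter() - start))

    setattr(cls, method_name, wrapper)


instrument(FakeHttpClient, "get", "HTTP GET")  # done once, at startup

# Application code is unchanged, yet each call now produces a span.
response = FakeHttpClient().get("http://inventory.internal/items")
```

The application still calls get() exactly as before; only the one-time instrument() call at startup is new.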

Manual Instrumentation

For business logic that auto-instrumentation cannot capture (a complex calculation, a batch processing loop, a custom cache lookup), add spans manually: start_span("price_calculation"), perform the work, end_span(). Add attributes to spans: span.set_attribute("user_id", user.id), span.set_attribute("item_count", len(items)). These attributes become searchable in the tracing UI, enabling queries like “show all traces where user_id=12345 and item_count>100”.
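A minimal sketch of manual instrumentation follows. The start_span and set_attribute names mirror the pseudocode above, but this tiny tracer is an assumption written for illustration, not the API of a specific library. A context manager stands in for the explicit end_span() call so the span is closed even when the work raises.

```python
import time
from contextlib import contextmanager

FINISHED = []  # stand-in for an exporter that ships spans to a backend


class Span:
    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.start = time.perf_counter()
        self.duration = None

    def set_attribute(self, key, value):
        self.attributes[key] = value


@contextmanager
def start_span(name):
    """Start a span, hand it to the caller, and end it when the block exits."""
    span = Span(name)
    try:
        yield span
    finally:
        span.duration = time.perf_counter() - span.start
        FINISHED.append(span)


def calculate_price(user_id, items):
    """Business logic that auto-instrumentation would not see."""
    with start_span("price_calculation") as span:
        span.set_attribute("user_id", user_id)
        span.set_attribute("item_count", len(items))
        return sum(price * quantity for price, quantity in items)


total = calculate_price(12345, [(10, 2), (5, 1)])
```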

Context Propagation

When service A calls service B, it must pass the trace ID and parent span ID so service B can create a child span correctly linked to the trace. The W3C Trace Context standard defines the traceparent header as four hyphen-separated hex fields, version-trace_id-parent_id-flags, for example 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. Service B reads this header, creates a child span whose parent is the received parent_id, and injects an updated header (same trace_id, with its own span ID as the new parent_id) into any downstream calls it makes. This chain of header passing is context propagation: it connects spans across service boundaries into a complete trace graph.
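The extract, create-child, inject cycle can be sketched with the standard library alone. The function names (extract, start_child, inject) and the SpanContext type are assumptions chosen for this example. The sketch handles only the version 00 layout and assumes header names have already been lowercased.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version(2 hex) - trace_id(32 hex) - parent_id(16 hex) - flags(2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    flags: str


def extract(headers: dict) -> Optional[SpanContext]:
    """Parse the incoming traceparent header; return None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return SpanContext(trace_id, parent_id, flags)


def start_child(parent: Optional[SpanContext]) -> SpanContext:
    """Create this service's span: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)
    if parent is None:
        # No usable incoming context, so this service starts a new trace.
        return SpanContext(secrets.token_hex(16), span_id, "01")
    return SpanContext(parent.trace_id, span_id, parent.flags)


def inject(ctx: SpanContext, headers: dict) -> None:
    """Write the header for a downstream call; our span ID becomes its parent_id."""
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags}"


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
parent = extract(incoming)
child = start_child(parent)
outgoing = {}
inject(child, outgoing)
```

The outgoing header keeps the trace ID and flags unchanged and swaps in the new span ID, which is what lets the next service attach its span under this one.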

Sampling Strategy

At high traffic, storing every trace is expensive — 10,000 requests/second × 50 spans/trace × 1KB/span = 500MB/second of trace data. Sampling reduces this to a manageable volume:

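Head-based sampling: the decision to sample is made at the start of the trace (at the root service) and carried downstream in the traceparent flags. Simple to implement, but the decision is made before the outcome of the request is known, so a 1% sample rate also discards roughly 99% of error and slow traces. Head sampling can vary the rate only on what is known up front, such as the route or tenant (for example, 100% of checkout requests and 1% of everything else); keeping 100% of errors and slow requests requires tail-based sampling.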
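One common way to implement a head-based decision is to derive it deterministically from the trace ID, so that no coordination is needed and the choice is reproducible. The sketch below assumes hypothetical per-route rates (ROUTE_RATES, DEFAULT_RATE) and compares the low 64 bits of the trace ID against a threshold; it is an illustration of the idea, not the exact algorithm of any particular SDK.

```python
import random

# Hypothetical per-route rates; routes not listed fall back to DEFAULT_RATE.
ROUTE_RATES = {"/checkout": 1.0}
DEFAULT_RATE = 0.01


def should_sample(trace_id: str, route: str) -> bool:
    """Decide at the root service, before any work has been done.

    The low 64 bits of the trace ID are compared with rate * 2**64, so the
    decision is a pure function of the trace ID and the route.
    """
    rate = ROUTE_RATES.get(route, DEFAULT_RATE)
    low_bits = int(trace_id[-16:], 16)
    return low_bits < int(rate * 2**64)


def trace_flags(sampled: bool) -> str:
    """traceparent flags value that carries the decision to downstream services."""
    return "01" if sampled else "00"


rng = random.Random(42)  # seeded only so the demonstration is repeatable
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(100_000)]
kept_default = sum(should_sample(t, "/search") for t in trace_ids)
kept_checkout = sum(should_sample(t, "/checkout") for t in trace_ids)
```

With these assumed rates, every /checkout trace is kept and about 1% of /search traces are. Note that nothing in should_sample can look at status codes or latency, because neither exists yet when the decision is made.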

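Tail-based sampling: the decision is made after the trace is complete, based on the full trace data. Enables keeping 100% of interesting traces (errors, high latency, specific endpoints) because the policies are evaluated against the finished trace rather than guessed at the start. More complex: requires buffering all spans from a trace, and routing them to the same collector instance, before making the sampling decision. It usually runs in a collector tier in front of the storage backend, for example the OpenTelemetry Collector's tail sampling processor feeding Jaeger or Grafana Tempo.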
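The buffering requirement is easiest to see in code. This is a minimal in-memory sketch under stated assumptions: spans are plain dicts with hypothetical keys (trace_id, start_ms, end_ms, error), and the caller signals when a trace is complete. Production collectors typically infer completion by waiting a fixed time after the first span arrives.

```python
from collections import defaultdict


class TailSampler:
    def __init__(self, slow_ms: float, baseline_rate: float):
        self.slow_ms = slow_ms
        self.baseline_rate = baseline_rate
        self.buffer = defaultdict(list)  # trace_id -> spans seen so far

    def add_span(self, span: dict) -> None:
        """Buffer every span; nothing can be decided yet."""
        self.buffer[span["trace_id"]].append(span)

    def finish_trace(self, trace_id: str) -> list:
        """Decide once the trace is considered complete; return the spans to keep."""
        spans = self.buffer.pop(trace_id, [])
        if not spans:
            return []
        has_error = any(s["error"] for s in spans)
        duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
        # Keep a small deterministic share of ordinary traces as a baseline.
        in_baseline = int(trace_id[-16:], 16) < int(self.baseline_rate * 2**64)
        keep = has_error or duration > self.slow_ms or in_baseline
        return spans if keep else []


def span(trace_id, start_ms, end_ms, error=False):
    return {"trace_id": trace_id, "start_ms": start_ms, "end_ms": end_ms, "error": error}


sampler = TailSampler(slow_ms=1000, baseline_rate=0.01)
ERR, SLOW, FAST = "a" * 32, "b" * 32, "f" * 32
sampler.add_span(span(ERR, 0, 50))
sampler.add_span(span(ERR, 10, 40, error=True))
sampler.add_span(span(SLOW, 0, 2500))
sampler.add_span(span(FAST, 0, 30))
kept = {t: sampler.finish_trace(t) for t in (ERR, SLOW, FAST)}
```

In this run the error trace and the slow trace are kept in full, while the fast successful trace is dropped. The cost is that every span sits in memory until its trace is finished.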

Trace Storage and Querying

Trace data is time-series: written once (when the trace completes), queried by trace ID or by attribute filters. Storage backends: Jaeger uses Elasticsearch or Cassandra; Tempo uses object storage (S3/GCS) with Parquet, enabling cheap long-term retention. Trace search queries look like “show traces from the last hour where service=checkout and duration>1s and http.status=500”. Serving them efficiently requires an index on (service, time, duration, status), or an equivalent columnar layout, kept separate from the lookup of raw traces by ID.
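The search query above can be sketched against a small index table. SQLite is used here purely as a stand-in to show the shape of the index and the query; the table and column names are hypothetical, and this is not how Jaeger or Tempo store data internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE trace_index (
        trace_id    TEXT PRIMARY KEY,
        service     TEXT,
        start_ts    INTEGER,  -- seconds since epoch
        duration_ms INTEGER,
        http_status INTEGER
    )
    """
)
# Composite index covering the search filters; raw spans would live elsewhere.
conn.execute(
    "CREATE INDEX idx_search ON trace_index (service, start_ts, duration_ms, http_status)"
)

now = 10_000  # fixed "current time" so the example is repeatable
rows = [
    ("t1", "checkout", 9_000, 1_500, 500),   # matches every filter
    ("t2", "checkout", 9_500, 200, 500),     # too fast
    ("t3", "checkout", 5_000, 2_000, 500),   # older than one hour
    ("t4", "inventory", 9_800, 3_000, 500),  # different service
    ("t5", "checkout", 9_900, 1_200, 200),   # not an error
]
conn.executemany("INSERT INTO trace_index VALUES (?, ?, ?, ?, ?)", rows)

matches = conn.execute(
    """
    SELECT trace_id FROM trace_index
    WHERE service = ? AND start_ts >= ? AND duration_ms > ? AND http_status = ?
    ORDER BY start_ts
    """,
    ("checkout", now - 3600, 1000, 500),
).fetchall()
```

The query returns only trace IDs. The full span data for each match would then be fetched from raw trace storage by ID.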

Connecting Traces, Metrics, and Logs

Maximum observability comes from correlating all three signals. Include trace_id and span_id in every log line — when an error log fires, click through to the trace to see the full call graph for that request. Link from traces to metrics: a slow trace shows which service was slow; the service’s metrics show whether it was slow for all requests (resource exhaustion) or only this one (specific data path). Grafana’s observability stack (Loki + Tempo + Mimir) enables this correlation natively through shared trace IDs in log streams.
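Stamping trace_id and span_id onto every log line can be done with the standard logging module. The sketch below assumes the current IDs are held in context variables (current_trace_id and current_span_id are hypothetical names); in a real service the tracing library would set them when a span starts.

```python
import contextvars
import io
import logging

# Hypothetical holders for the active span's identifiers.
current_trace_id = contextvars.ContextVar("trace_id", default="-")
current_span_id = contextvars.ContextVar("span_id", default="-")


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True


stream = io.StringIO()  # captured in memory here; normally stdout or a file
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    )
)
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Simulate being inside a span while handling a request.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
current_span_id.set("00f067aa0ba902b7")
logger.error("payment declined")

log_line = stream.getvalue().strip()
```

Because the IDs appear as key=value pairs in a fixed position, a log backend can extract trace_id and turn it into a link to the corresponding trace.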
