Distributed tracing tracks a single request as it traverses multiple services, recording the timing and relationships of each operation. In a microservices system, a single user request may touch 10-20 services — distributed tracing makes this call graph visible, turning opaque latency into actionable data. Without tracing, debugging why a specific request took 2 seconds is nearly impossible when the request touched 15 services.
Core Concepts
Trace: a complete record of a request’s journey through the system, identified by a globally unique trace ID.
Span: a named, timed operation within a trace, such as “database query”, “HTTP call to inventory service”, or “cache lookup”. Spans form a tree: a root span (the incoming HTTP request) contains child spans (the downstream calls it makes).
Context propagation: the mechanism for passing trace IDs and parent span IDs between services, typically via HTTP headers (traceparent in W3C Trace Context format).
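The relationship between traces and spans can be sketched as a small data structure. This is a minimal illustration, not any tracing library's data model: the Span class, its field names, and the new_trace and render helpers are all assumptions made for the example.

```python
from __future__ import annotations

import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One named, timed operation. Field names are illustrative only."""
    name: str
    trace_id: str                       # shared by every span in the trace
    span_id: str
    parent_id: Optional[str] = None     # None marks the root span
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None
    children: list[Span] = field(default_factory=list)

    def child(self, name: str) -> Span:
        """Create a child span that inherits this span's trace ID."""
        span = Span(name, self.trace_id, secrets.token_hex(8), parent_id=self.span_id)
        self.children.append(span)
        return span

    def finish(self) -> None:
        self.end = time.monotonic()


def new_trace(root_name: str) -> Span:
    """Start a trace with a 16-byte trace ID and an 8-byte span ID (hex-encoded)."""
    return Span(root_name, secrets.token_hex(16), secrets.token_hex(8))


def render(span: Span, depth: int = 0) -> list[str]:
    """Flatten the span tree into indented lines, parents before children."""
    lines = ["  " * depth + span.name]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


root = new_trace("GET /checkout")
inventory = root.child("HTTP call to inventory service")
inventory.child("database query")
root.child("cache lookup")
print("\n".join(render(root)))
```

Every span carries the same trace_id, and the parent_id links are what let a tracing backend rebuild the tree from spans that arrive separately from different services.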
Instrumentation
Automatic Instrumentation
Most tracing libraries auto-instrument common frameworks: HTTP servers (gin, Express, FastAPI), HTTP clients (axios, requests), database drivers (sqlx, pg, SQLAlchemy), and message queue clients (kafka-go, amqplib). Automatic instrumentation captures spans for these operations without code changes — just initialize the tracing library and it patches the relevant libraries. This provides immediate visibility into the majority of latency sources.
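Conceptually, "patching the relevant libraries" means replacing a library function with a wrapper that records a span around the original call. The sketch below shows the idea using only the standard library. FakeHttpClient, instrument, and RECORDED_SPANS are hypothetical names for illustration, and real instrumentation packages are considerably more involved.

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for an exporter; a real library would ship spans to a backend


class FakeHttpClient:
    """Hypothetical third-party client that application code calls directly."""

    def get(self, url):
        return f"200 OK from {url}"


def instrument(cls, method_name, span_name):
    """Replace cls.method_name with a wrapper that times each call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.monotonic()
        try:
            return original(self, *args, **kwargs)
        finally:
            # Record the span even if the wrapped call raised.
            RECORDED_SPANS.append(
                {"name": span_name, "args": args, "duration_s": time.monotonic() - start}
            )

    setattr(cls, method_name, wrapper)


# Done once at startup; application code below is unchanged.
instrument(FakeHttpClient, "get", "http.client.get")
response = FakeHttpClient().get("http://inventory/items/42")
```

The application still calls get() exactly as before, which is why this style of instrumentation needs no changes to business code.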
Manual Instrumentation
For business logic that auto-instrumentation cannot capture (a complex calculation, a batch processing loop, a custom cache lookup), add spans manually: start_span("price_calculation"), perform the work, end_span(). Add attributes to spans: span.set_attribute("user_id", user.id), span.set_attribute("item_count", len(items)). These attributes become searchable in the tracing UI, enabling queries like “show all traces where user_id=12345 and item_count>100”.
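A manual span is easiest to get right as a context manager, so the span is always ended even when the work raises. The following is a minimal sketch: start_span, Span, and FINISHED are hypothetical stand-ins written for this example, not a real tracing API.

```python
import time
from contextlib import contextmanager

FINISHED = []  # stand-in for an exporter


class Span:
    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.start = time.monotonic()
        self.duration_s = None

    def set_attribute(self, key, value):
        self.attributes[key] = value


@contextmanager
def start_span(name):
    """Open a span, yield it to the caller, and always close it on exit."""
    span = Span(name)
    try:
        yield span
    except Exception as exc:
        span.set_attribute("error", repr(exc))  # mark failed spans so they can be found later
        raise
    finally:
        span.duration_s = time.monotonic() - span.start
        FINISHED.append(span)


def price_calculation(user_id, items):
    """items is a list of (unit_price_cents, quantity) pairs."""
    with start_span("price_calculation") as span:
        span.set_attribute("user_id", user_id)
        span.set_attribute("item_count", len(items))
        return sum(price * qty for price, qty in items)


total = price_calculation(12345, [(10, 2), (5, 1)])
```

In OpenTelemetry's Python API the corresponding calls are tracer.start_as_current_span(...) and span.set_attribute(...), which follow the same with-block shape.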
Context Propagation
When service A calls service B, it must pass the trace ID and parent span ID so service B can create a child span correctly linked to the trace. The W3C Trace Context standard defines the traceparent header format: version-trace_id-parent_id-flags, where the fields are 2, 32, 16, and 2 lowercase hex characters respectively. Service B reads this header, creates a child span whose parent is the received parent_id, and injects an updated header (same trace_id, with parent_id replaced by its own span ID) into any downstream calls it makes. This chain of header passing is context propagation: it connects spans across service boundaries into a complete trace graph.
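The read-then-reinject step can be sketched with the standard library alone. This is a simplified illustration that only accepts the four-field layout described above; parse_traceparent and handle_request are hypothetical helper names, and the choice to mark brand-new traces as sampled is an assumption of the example.

```python
import re
import secrets

# version (2 hex) - trace_id (32 hex) - parent_id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    m = TRACEPARENT_RE.match(header.strip())
    if not m:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def handle_request(headers):
    """Service B: continue the caller's trace, or start a new one."""
    parsed = parse_traceparent(headers.get("traceparent", ""))
    if parsed:
        trace_id, parent_id, sampled = parsed
    else:
        # No usable header: this service becomes the root (assumed sampled here).
        trace_id, parent_id, sampled = secrets.token_hex(16), None, True
    span_id = secrets.token_hex(8)  # ID of service B's own span
    # Outgoing header keeps the trace ID but names B's span as the parent.
    outgoing = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
    return {"trace_id": trace_id, "parent_id": parent_id, "span_id": span_id, "outgoing": outgoing}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
result = handle_request(incoming)
```

The trace ID is copied through unchanged at every hop, while the parent field is rewritten by each service, which is what turns a flat set of spans into a tree.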
Sampling Strategy
At high traffic, storing every trace is expensive — 10,000 requests/second × 50 spans/trace × 1KB/span = 500MB/second of trace data. Sampling reduces this to a manageable volume:
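Head-based sampling: the decision to sample is made at the start of the trace (at the root service) and propagated downstream. Simple to implement, but the outcome of the request is not yet known, so a 1% sample rate also discards roughly 99% of error traces. Head-based rules can only use information available when the request arrives (endpoint, tenant, a debug flag), for example sampling 100% of a critical endpoint and 1% of everything else. Keeping 100% of errors and slow requests while keeping only 1% of fast successful ones requires a decision made after the request finishes, which is tail-based sampling.
Tail-based sampling: the decision is made after the trace is complete, based on the full trace data. Enables sampling 100% of interesting traces (errors, high latency, specific endpoints) without having to predict at the start which requests will be interesting. More complex: requires buffering all spans from a trace, from every service, before making the sampling decision. In practice the decision is usually made in a collector tier in front of the storage backend; the OpenTelemetry Collector's tail sampling processor is one such implementation, and it can feed backends such as Jaeger and Grafana Tempo.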
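The two strategies, and the volume arithmetic behind them, can be sketched as follows. This is an illustration under stated assumptions: head_sample and tail_sample are hypothetical functions, spans are plain dictionaries with made-up keys, and the 1000 ms slow threshold is an arbitrary example value.

```python
import hashlib

# Volume estimate from the text: 10,000 req/s x 50 spans x 1 KB = 500 MB/s unsampled.
requests_per_s, spans_per_trace, kb_per_span = 10_000, 50, 1
full_mb_per_s = requests_per_s * spans_per_trace * kb_per_span / 1000
sampled_mb_per_s = full_mb_per_s * 1 / 100  # volume at a 1% sample rate


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide from the trace ID alone, before any work is done.

    Hashing the ID (rather than rolling a die) means every service that sees
    the same trace ID reaches the same decision.
    """
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)  # 0 .. 2**32 - 1
    return bucket < rate * 2**32


def tail_sample(spans: list[dict], slow_ms: float = 1000.0, baseline_rate: float = 0.01) -> bool:
    """Tail-based: decide after every span of the trace has been buffered."""
    if any(s.get("error") for s in spans):
        return True                      # keep every trace that contains an error
    root = next(s for s in spans if s.get("parent_id") is None)
    if root["duration_ms"] >= slow_ms:
        return True                      # keep every slow trace
    # Fast, successful traces fall back to a low baseline rate.
    return head_sample(root["trace_id"], baseline_rate)


error_trace = [
    {"trace_id": "t1", "parent_id": None, "duration_ms": 20},
    {"trace_id": "t1", "parent_id": "a1", "duration_ms": 5, "error": True},
]
keep_error = tail_sample(error_trace)
```

The tail-based function needs the complete span list as input, which is exactly the buffering cost described above; the head-based function needs only the trace ID.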
Trace Storage and Querying
Trace data is write-once and time-ordered: each trace is written when it completes, then queried by trace ID or by attribute filters. Storage backends differ: Jaeger commonly uses Elasticsearch or Cassandra; Tempo writes Parquet blocks to object storage (S3/GCS), which makes long-term retention cheap. A typical search query is “show traces from the last hour where service=checkout and duration>1s and http.status=500”. Answering it efficiently requires either an index on (service, time, duration, status) kept separate from the raw trace storage, or a columnar layout that can scan only those fields.
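The split between bulk trace storage and a small search index can be sketched in memory. This is a toy model for illustration only: TraceStore, TraceSummary, and their fields are hypothetical and do not describe how Jaeger or Tempo are implemented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSummary:
    """The few fields a search needs; much smaller than the full trace."""
    trace_id: str
    service: str
    start_ts: float      # Unix seconds
    duration_s: float
    http_status: int


class TraceStore:
    def __init__(self):
        self.raw = {}                         # trace_id -> full span list (bulk data)
        self.by_service = defaultdict(list)   # service -> summaries (search index)

    def write(self, summary, spans):
        """Called once, when the trace completes."""
        self.raw[summary.trace_id] = spans
        self.by_service[summary.service].append(summary)

    def search(self, service, since_ts, min_duration_s, http_status):
        """Filter on the index only; the raw spans are never touched."""
        return [
            s.trace_id
            for s in self.by_service.get(service, [])
            if s.start_ts >= since_ts
            and s.duration_s > min_duration_s
            and s.http_status == http_status
        ]


now = 1_700_000_000.0
store = TraceStore()
store.write(TraceSummary("t1", "checkout", now - 60, 1.8, 500), [{"name": "POST /checkout"}])
store.write(TraceSummary("t2", "checkout", now - 30, 0.2, 200), [{"name": "POST /checkout"}])
store.write(TraceSummary("t3", "checkout", now - 7200, 2.5, 500), [{"name": "POST /checkout"}])

# "traces from the last hour where service=checkout and duration>1s and status=500"
matches = store.search("checkout", since_ts=now - 3600, min_duration_s=1.0, http_status=500)
full_trace = store.raw[matches[0]]  # lookup by trace ID fetches the complete trace
```

Search returns trace IDs, and fetching a full trace is then a direct key lookup, mirroring the two access patterns named above.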
Connecting Traces, Metrics, and Logs
Maximum observability comes from correlating all three signals. Include trace_id and span_id in every log line — when an error log fires, click through to the trace to see the full call graph for that request. Link from traces to metrics: a slow trace shows which service was slow; the service’s metrics show whether it was slow for all requests (resource exhaustion) or only this one (specific data path). Grafana’s observability stack (Loki + Tempo + Mimir) enables this correlation natively through shared trace IDs in log streams.
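Stamping trace_id and span_id onto every log line can be done with Python's standard logging module. The sketch below assumes the active trace context lives in a contextvars variable named current_trace; that variable and the TraceContextFilter class are hypothetical, and the handler writes to an in-memory buffer only so the result is easy to inspect.

```python
import contextvars
import io
import logging

# Assumed to be set by the tracing layer whenever a span becomes active.
current_trace = contextvars.ContextVar("current_trace", default=("-", "-"))


class TraceContextFilter(logging.Filter):
    """Attach the active trace_id and span_id to every log record."""

    def filter(self, record):
        record.trace_id, record.span_id = current_trace.get()
        return True  # never drop the record, only enrich it


stream = io.StringIO()  # a real service would log to stdout or a file
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s")
)
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.error("payment provider timed out")
log_line = stream.getvalue().strip()
```

Because the IDs are added by a filter rather than by each call site, application code keeps calling logger.error(...) as usual and every line still carries the identifiers needed to jump from a log entry to its trace.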