Distributed Tracing: Low-Level Design

Distributed tracing tracks a single request as it traverses multiple services, recording the timing and relationships of each operation. In a microservices system, a single user request may touch 10-20 services — distributed tracing makes this call graph visible, turning opaque latency into actionable data. Without tracing, debugging why a specific request took 2 seconds is nearly impossible when the request touched 15 services.

Core Concepts

Trace: a complete record of a request’s journey through the system, identified by a globally unique trace ID. Span: a named, timed operation within a trace — “database query”, “HTTP call to inventory service”, “cache lookup”. Spans form a tree: a root span (the incoming HTTP request) contains child spans (the downstream calls it makes). Context propagation: the mechanism for passing trace IDs and parent span IDs between services — typically via HTTP headers (traceparent in W3C Trace Context format).
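The trace/span relationship can be sketched as a minimal data model. This is an illustrative sketch, not a real tracing library's API; the names and ID lengths follow common convention (32 hex chars for a trace ID, 16 for a span ID):

```python
import time
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str              # shared by every span in the same trace
    span_id: str
    parent_id: Optional[str]   # None marks the root span
    start: float

def new_root_span(name):
    # The root span mints a fresh 32-hex-char trace ID; children inherit it.
    return Span(name, uuid.uuid4().hex, uuid.uuid4().hex[:16], None, time.time())

def new_child_span(parent, name):
    # A child keeps the trace ID and records its parent's span ID,
    # which is what makes spans form a tree.
    return Span(name, parent.trace_id, uuid.uuid4().hex[:16], parent.span_id, time.time())
```

Walking parent_id pointers from any span leads back to the root, which is how the tracing backend reassembles the tree.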

Instrumentation

Automatic Instrumentation

Most tracing libraries auto-instrument common frameworks: HTTP servers (gin, Express, FastAPI), HTTP clients (axios, requests), database drivers (sqlx, pg, SQLAlchemy), and message queue clients (kafka-go, amqplib). Automatic instrumentation captures spans for these operations without code changes — just initialize the tracing library and it patches the relevant libraries. This provides immediate visibility into the majority of latency sources.
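The patching mechanism can be illustrated with a toy example. Everything here is a stand-in: http_get is a fake client function and instrument() is a simplified version of what auto-instrumentation agents do when they wrap a library's entry points:

```python
import functools
import time

# Stand-in for a library function (e.g. an HTTP client's get()).
def http_get(url):
    return {"status": 200, "url": url}

collected = []  # spans recorded by the toy "tracer"

def instrument(fn, op_name):
    # Wrap the original function so every call records a timed span,
    # without the caller changing any code.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            collected.append({"name": op_name, "duration": time.time() - start})
    return wrapper

# "Initializing the tracing library" amounts to replacing the symbol:
http_get = instrument(http_get, "http.client.get")
```

Real agents do this at import time across many libraries at once, which is why a single init call yields spans for HTTP, database, and queue operations.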

Manual Instrumentation

For business logic that auto-instrumentation cannot capture — a complex calculation, a batch-processing loop, a custom cache lookup — add spans manually: start_span("price_calculation"), perform the work, end_span(). Add attributes to spans: span.set_attribute("user_id", user.id), span.set_attribute("item_count", len(items)). These attributes become searchable in the tracing UI, enabling queries like "show all traces where user_id=12345 and item_count>100".
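A minimal sketch of the manual pattern, using a hand-rolled context manager rather than a real tracing SDK (the span shape and start_span name are illustrative):

```python
import time
from contextlib import contextmanager

spans = []  # stand-in for the tracing backend

@contextmanager
def start_span(name):
    # Open a span, let the caller attach attributes, close it on exit
    # even if the wrapped code raises.
    span = {"name": name, "attributes": {}, "start": time.time()}
    try:
        yield span
    finally:
        span["end"] = time.time()
        spans.append(span)

# Usage: wrap the business logic and attach searchable attributes.
with start_span("price_calculation") as span:
    items = [10.0, 20.0, 5.5]
    total = sum(items)
    span["attributes"]["item_count"] = len(items)
    span["attributes"]["total"] = total
```

Real SDKs follow the same shape: a context manager (or try/finally) guarantees the span is ended and exported even when the business logic throws.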

Context Propagation

When service A calls service B, it must pass the trace ID and parent span ID so service B can create a child span correctly linked to the trace. The W3C Trace Context standard defines the traceparent header format: version-trace_id-parent_id-flags. Service B reads this header, creates a child span with the received parent_id, and injects the updated header into any downstream calls it makes. This chain of header passing is context propagation — it connects spans across service boundaries into a complete trace graph.
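The traceparent header format is concrete enough to parse directly. A sketch of the receive-and-reinject cycle service B performs (helper names are illustrative; the format itself is the W3C one: version-trace_id-parent_id-flags):

```python
import re
import secrets

# version(2 hex) - trace_id(32 hex) - parent_id(16 hex) - flags(2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None  # malformed header: start a new trace instead
    version, trace_id, parent_id, flags = m.groups()
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}

def child_traceparent(header):
    # Service B keeps the trace ID but substitutes its own span ID as
    # parent_id before calling downstream services.
    ctx = parse_traceparent(header)
    new_span_id = secrets.token_hex(8)  # 16 hex chars
    header_out = "00-{}-{}-{}".format(ctx["trace_id"], new_span_id, ctx["flags"])
    return header_out, new_span_id
```

Each hop repeats this: read the header, create a span parented on the received parent_id, inject a fresh header downstream. The trace ID is the one field that never changes.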

Sampling Strategy

At high traffic, storing every trace is expensive — 10,000 requests/second × 50 spans/trace × 1KB/span = 500MB/second of trace data. Sampling reduces this to a manageable volume:

Head-based sampling: the decision to sample is made at the start of the trace (at the root service). Simple to implement, but a 1% sample rate means 99% of errors are not captured. Use adaptive head sampling: sample 100% of errors and slow requests, 1% of successful fast requests.
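The adaptive decision can be sketched as a single function. One caveat the sketch makes explicit: the error and latency signals only exist once the root span finishes, so in practice this check runs when the root span ends, not at the first byte of the request. Thresholds and names here are illustrative:

```python
import random

def should_sample(is_error, duration_ms,
                  slow_threshold_ms=1000, base_rate=0.01, rng=random):
    # Always keep errors and slow requests; keep base_rate of the rest.
    if is_error or duration_ms >= slow_threshold_ms:
        return True
    return rng.random() < base_rate
```

The rng parameter exists so the probabilistic branch is testable; production code would just call random.random().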

Tail-based sampling: the decision is made after the trace is complete, based on the full trace data. This enables keeping 100% of interesting traces (errors, high latency, specific endpoints) while discarding routine ones, since the policy can look at outcomes rather than guess up front. More complex: requires buffering all spans from a trace before making the sampling decision, typically in a collector tier — the OpenTelemetry Collector ships a tail-sampling processor, and the Grafana stack supports tail sampling in front of Tempo.
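The buffering requirement is the heart of tail-based sampling. A toy sketch of the collector-side logic, with an illustrative keep-if-error-or-slow policy:

```python
from collections import defaultdict

buffers = defaultdict(list)  # trace_id -> list of spans, held in memory

def record_span(span):
    # Every span is buffered until its trace is known to be complete.
    buffers[span["trace_id"]].append(span)

def decide(trace_id, latency_threshold_ms=1000):
    # Called once the trace is complete: keep the whole trace if any
    # span errored or any span exceeded the latency threshold.
    spans = buffers.pop(trace_id)
    keep = (any(s.get("error") for s in spans)
            or max(s["duration_ms"] for s in spans) >= latency_threshold_ms)
    return spans if keep else None
```

The buffer is the cost: at 10,000 requests/second, the collector holds every in-flight trace in memory until each one completes, which is why tail sampling is usually a dedicated tier rather than in-process.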

Trace Storage and Querying

Trace data is write-once: a trace is written when it completes, then queried by trace ID or by attribute filters. Storage backends: Jaeger uses Elasticsearch or Cassandra; Tempo uses object storage (S3/GCS) with Parquet, enabling cheap long-term retention. A trace search query like "show traces from the last hour where service=checkout and duration>1s and http.status=500" requires an index on (service, time, duration, status), kept separate from the raw trace storage.
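The split between index and raw storage can be shown with a toy in-memory version: the index holds only the searchable tuple per trace, and the query touches the index alone, returning trace IDs to fetch from bulk storage. All data here is fabricated for illustration:

```python
# Index entries: (service, start_unix, duration_ms, http_status, trace_id).
# The raw spans would live elsewhere (object storage), keyed by trace_id.
index = [
    ("checkout",  1700000000, 1450, 500, "trace-a"),
    ("checkout",  1700000100,   80, 200, "trace-b"),
    ("inventory", 1700000200, 2300, 500, "trace-c"),
]

def search(service, min_duration_ms, status, since):
    # Equivalent of: service=checkout AND duration>1s AND http.status=500.
    return [tid for (svc, start, dur, st, tid) in index
            if svc == service and dur > min_duration_ms
            and st == status and start >= since]
```

A real backend replaces the list scan with a time-partitioned index, but the access pattern is the same: filter on the small index, then fetch full traces by ID.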

Connecting Traces, Metrics, and Logs

Maximum observability comes from correlating all three signals. Include trace_id and span_id in every log line — when an error log fires, click through to the trace to see the full call graph for that request. Link from traces to metrics: a slow trace shows which service was slow; the service’s metrics show whether it was slow for all requests (resource exhaustion) or only this one (specific data path). Grafana’s observability stack (Loki + Tempo + Mimir) enables this correlation natively through shared trace IDs in log streams.
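The log side of the correlation is just two extra fields on every structured log line. A minimal sketch (field names follow common convention; the log() helper is illustrative):

```python
import json

def log(level, message, trace_id, span_id, **fields):
    # Emit a structured log line carrying the trace context, so an
    # error log can be joined back to its exact trace and span.
    record = {"level": level, "msg": message,
              "trace_id": trace_id, "span_id": span_id, **fields}
    return json.dumps(record, sort_keys=True)
```

In practice these fields are injected automatically by a logging integration that reads the active span from context, so application code never passes them by hand.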

