Question 1

What is a trace, a span, and context propagation in distributed tracing?

Accepted Answer

A trace is the complete record of one request's journey through the system, identified by a globally unique trace ID. A span is a named, timed operation within a trace, such as 'database query', 'HTTP call to inventory service', or 'cache lookup'. Spans form a parent-child tree: the root span (typically the incoming HTTP request) has a child span for each downstream call it makes. Context propagation is the mechanism that carries the trace ID and the parent span ID across service boundaries, most commonly in HTTP headers such as the W3C traceparent header. When service A calls service B, A injects a traceparent header into the outgoing request; B extracts it and creates a child span with the same trace ID. Repeating this at every hop builds the complete call graph.
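The inject and extract steps can be sketched with the standard library alone. This is a minimal illustration, not a real SDK: it assumes version 00 of the W3C traceparent format (version-traceid-parentid-flags in lowercase hex) and lowercase header keys. The names SpanContext, inject, extract, and start_span are hypothetical helpers introduced for this example.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str   # 32 hex chars, shared by every span in the trace
    span_id: str    # 16 hex chars, unique per span
    sampled: bool   # low bit of the trace-flags byte


def inject(ctx: SpanContext, headers: dict) -> None:
    """Caller side (service A): write the current span into outgoing headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: dict) -> Optional[SpanContext]:
    """Callee side (service B): read the caller's span, or None if absent or invalid."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 1))


def start_span(parent: Optional[SpanContext]):
    """Return (new span context, parent span ID); the parent ID is None for a root span."""
    if parent is None:
        # No incoming context: start a new trace (sampled here for simplicity).
        return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True), None
    # Incoming context: reuse the trace ID, mint a fresh span ID, keep the sampling flag.
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled), parent.span_id
```

In this sketch, service A calls start_span(None) and inject(...), and service B calls extract(...) followed by start_span(...). B's span then shares A's trace ID and records A's span ID as its parent.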

Question 2

What is the difference between head-based and tail-based sampling?

Accepted Answer

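Head-based sampling makes the keep-or-drop decision at the start of the trace (at the root service), before any spans are collected, and propagates that decision downstream. It is simple to implement, but the sampler cannot yet know whether the request will fail or be slow, so a fixed 1% sample rate discards roughly 99% of error traces along with everything else. A partial mitigation is to vary the rate by what is known up front: a higher rate for critical endpoints or specific tenants, or forced sampling when a debug header is present. Tail-based sampling makes the decision after the full trace is collected, based on the complete trace data. This enables keeping 100% of interesting traces (errors, high latency, specific users) while discarding most routine traces. It is more powerful but more complex: every span of every trace must be buffered until the trace is judged complete, and all spans of a trace must reach the same sampler instance. Tail-based sampling is usually implemented in a collector tier in front of the storage backend; the OpenTelemetry Collector's tail_sampling processor, for example, can feed backends such as Jaeger and Grafana Tempo.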
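The two decision points can be contrasted in a small sketch. It is illustrative only: FinishedSpan, head_sample, and tail_sample are hypothetical names, the 1000 ms threshold and 1% baseline are assumed values, and a real tail sampler would also need buffering and a rule for deciding when a trace is complete.

```python
from dataclasses import dataclass


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide from the trace ID alone, before any span exists.

    Deriving the decision from the ID (rather than a fresh random draw) means
    every service that sees the same trace ID reaches the same answer.
    """
    bucket = int(trace_id[-14:], 16)      # low 56 bits of the 128-bit hex ID
    return bucket < rate * (1 << 56)


@dataclass
class FinishedSpan:
    name: str
    duration_ms: float
    is_error: bool


def tail_sample(trace_id: str, spans: list,
                slow_ms: float = 1000.0, baseline_rate: float = 0.01) -> bool:
    """Tail-based: decide after the trace is complete, using its contents."""
    if any(s.is_error for s in spans):
        return True                       # keep every trace containing an error
    if max((s.duration_ms for s in spans), default=0.0) >= slow_ms:
        return True                       # keep every trace with a slow span
    # Routine traffic: fall back to a small ID-based share.
    return head_sample(trace_id, baseline_rate)
```

The head sampler never inspects span data, which is why it cannot favor errors. The tail sampler can, at the cost of holding every span until the trace finishes.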

Question 3

How do you correlate traces with logs and metrics?

Accepted Answer

Include trace_id and span_id as fields in every structured log line. When an error log fires, the trace_id lets you click through to the complete trace for that specific request and see which services were called, in what order, and how much latency each contributed. Link traces to metrics by correlating on service name and time: a slow trace identifies which service was slow, and that service's p99 latency metric shows whether it was slow for all requests (suggesting a service-wide cause such as resource exhaustion) or only for this one (suggesting a specific data path). Grafana's Loki (logs) + Tempo (traces) + Mimir (metrics) stack supports this three-way correlation natively, using trace IDs embedded in log lines and in metric exemplars.
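One way to get trace_id and span_id onto every log line is a logging filter that reads the active context. The sketch below uses only the standard library. The context variables, the "checkout" logger name, and the JSON field layout are assumptions for illustration; a tracing SDK would normally supply the active IDs.

```python
import contextvars
import json
import logging

# Hypothetical holders for the active trace context; a real tracing SDK
# would supply these values.
current_trace_id = contextvars.ContextVar("trace_id", default=None)
current_span_id = contextvars.ContextVar("span_id", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True  # never drop the record, only annotate it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the log backend can index the IDs."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes trace-annotated JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceContextFilter())
    logger = logging.getLogger("checkout")  # illustrative service name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With the IDs set for the current request, a call such as logger.error("payment declined") produces a line whose trace_id field can be used to look up the matching trace.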

Question 4

What should you add as span attributes for maximum tracing value?

Accepted Answer

High-value span attributes: user_id and tenant_id (enables filtering such as 'show all traces for this user'), request_id (correlates with external identifiers), db.query and db.rows_returned (identifies slow or oversized queries), http.status_code and error.message (surfaces errors), cache.hit (identifies cache miss patterns), and business-domain attributes like order_id or product_id (enables 'show all traces where this order was involved'). Follow the OpenTelemetry semantic conventions for common attributes: standard names (http.method, db.system, rpc.service) work with pre-built dashboards and reduce cognitive overhead when onboarding new engineers to the tracing system. The conventions have renamed some attributes across releases (for example, http.method became http.request.method and http.status_code became http.response.status_code), so use the names from the version your SDK and backend target.
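To show why these attributes pay off at query time, here is a small stand-in model. The Span class, annotate_order_lookup, and traces_matching are hypothetical and do not belong to any real tracing library; the attribute names are the ones listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for a tracing SDK's span object."""
    trace_id: str
    name: str
    attributes: dict = field(default_factory=dict)

    def set_attribute(self, key: str, value) -> None:
        self.attributes[key] = value


def annotate_order_lookup(span: Span, *, tenant_id: str, order_id: str,
                          rows_returned: int, cache_hit: bool) -> None:
    """Attach identity, business, and performance attributes to one span."""
    span.set_attribute("tenant_id", tenant_id)
    span.set_attribute("order_id", order_id)
    span.set_attribute("db.rows_returned", rows_returned)
    span.set_attribute("cache.hit", cache_hit)


def traces_matching(spans: list, key: str, value) -> set:
    """Answer 'show all traces where <key> == <value>' over collected spans."""
    return {s.trace_id for s in spans if s.attributes.get(key) == value}
```

A query such as traces_matching(spans, "order_id", "o-42") only works because the attribute was recorded on the span in the first place. That is the practical argument for adding business identifiers at instrumentation time.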

Distributed Tracing: Low-Level Design

Core Concepts

Instrumentation

Automatic Instrumentation

Manual Instrumentation

Context Propagation

Sampling Strategy

Trace Storage and Querying

Connecting Traces, Metrics, and Logs