Low Level Design: Distributed Tracing Design and Internals

Distributed tracing tracks a single request as it propagates through multiple microservices, capturing timing, errors, and context at each service boundary. Without tracing, debugging a slow or failing request in a microservices system requires correlating logs from dozens of services — effectively impossible at scale. Jaeger, Zipkin, and AWS X-Ray implement distributed tracing; OpenTelemetry standardizes the instrumentation that feeds them. Understanding trace propagation, sampling strategies, and trace storage is essential for designing observable distributed systems.

Trace Structure: Spans and Context Propagation

A trace represents the end-to-end journey of one request and consists of spans: each span represents one unit of work (one service call, one database query, one external call). A span records:

  • trace_id — unique across the entire request journey
  • span_id — unique within the trace
  • parent_span_id — the span that called this one
  • operation name, start time, and duration
  • service name and key-value tags (HTTP method, status code, user_id, SQL query)

Context propagation: the trace_id and current span_id are passed between services via HTTP headers (W3C Trace Context: the traceparent header) or gRPC metadata. Each service creates a child span with the incoming span as parent, establishing the causal hierarchy.

// OpenTelemetry Go instrumentation
// (getUserID and inventoryURL are assumed to be defined elsewhere in the service)
import (
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/propagation"
)

func handleOrder(w http.ResponseWriter, r *http.Request) {
    // Extract incoming trace context from HTTP headers
    ctx := otel.GetTextMapPropagator().Extract(r.Context(),
        propagation.HeaderCarrier(r.Header))

    // Start a new span as a child of the incoming context
    tracer := otel.Tracer("order-service")
    ctx, span := tracer.Start(ctx, "handleOrder")
    defer span.End()

    // Add attributes to the span
    span.SetAttributes(
        attribute.String("order.id", r.URL.Query().Get("id")),
        attribute.String("user.id", getUserID(r)),
    )

    // When calling a downstream service, inject trace context into outgoing headers
    req, _ := http.NewRequestWithContext(ctx, "GET", inventoryURL, nil)
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
    // The inventory service receives a traceparent header with the same trace_id
    // and this span as parent

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return
    }
    defer resp.Body.Close()
}
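The traceparent header exchanged above has a fixed W3C Trace Context shape: version(2 hex)-trace_id(32 hex)-parent_span_id(16 hex)-flags(2 hex), where the lowest bit of the flags byte is the sampled flag. A minimal stdlib-only parser sketches the format (simplified — the full W3C spec additionally validates lowercase hex and rejects all-zero IDs):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// TraceContext holds the fields carried by a W3C traceparent header.
type TraceContext struct {
	TraceID  string
	ParentID string
	Sampled  bool // lowest bit of the flags byte
}

// ParseTraceparent splits "version-trace_id-parent_id-flags" into its fields.
// Length checks stand in for the spec's full hex validation.
func ParseTraceparent(h string) (TraceContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return TraceContext{}, fmt.Errorf("malformed traceparent: %q", h)
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return TraceContext{}, fmt.Errorf("bad flags field: %q", parts[3])
	}
	return TraceContext{
		TraceID:  parts[1],
		ParentID: parts[2],
		Sampled:  flags&1 == 1,
	}, nil
}

func main() {
	// Example header value from the W3C Trace Context specification.
	tc, _ := ParseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Printf("trace=%s parent=%s sampled=%v\n", tc.TraceID, tc.ParentID, tc.Sampled)
}
```

Because the sampled flag travels in the header, a head-based sampling decision made at the edge propagates automatically to every downstream service.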

Sampling Strategies

A high-traffic service (10K req/s) generates 10,000 traces per second — storing every trace is prohibitively expensive. Sampling reduces the fraction of traces stored.

Head-based sampling: decide at the start of the request whether to sample it, before any processing. Simple to implement — one decision propagates to all downstream services via context headers. Problem: it cannot prefer interesting (slow, errored) traces over boring ones, because the decision is made before outcomes are known.

Tail-based sampling: buffer all span data and make the sampling decision after the request completes, so slow or errored traces can be kept preferentially. Requires a trace aggregation layer that collects all spans before deciding; Honeycomb's Refinery and the OpenTelemetry Collector's tail-sampling processor work this way. Typical strategy: always sample errors (100%), sample slow requests (duration > P99), and sample a small fraction of successful fast requests (0.1-1%).

Trace Storage and Query

Spans are sent from services to a collector (OpenTelemetry Collector, Jaeger Agent) which batches and forwards them to storage. Storage requirements: high write throughput (millions of spans/second), efficient retrieval by trace_id (O(1) lookup), index by service name, operation, duration, error status for analytical queries (“find all slow traces in the order service from the last hour”). Jaeger uses Cassandra or Elasticsearch as backends: Cassandra for write-heavy trace ingestion (fast writes, TTL-based expiry), Elasticsearch for full-text and tag-based trace search. Typical retention: 7-30 days (traces are diagnostic, not long-term data). Trace sampling rates and storage TTLs are balanced to stay within budget.

Key Interview Discussion Points

  • Correlation with logs and metrics: the three pillars of observability are traces, logs, and metrics; inject trace_id into all log statements (structured logging with trace_id field) so logs can be correlated with traces in the same debugging session
  • OpenTelemetry standard: OTel (OpenTelemetry) is the CNCF standard for instrumentation; one SDK instruments your service for traces, metrics, and logs, with pluggable exporters to Jaeger, Zipkin, Prometheus, or any backend — avoids vendor lock-in
  • Async span propagation: for async messaging (Kafka, SQS), inject trace context into message headers; consumers extract and continue the trace on the other side of the queue, enabling end-to-end tracing across message boundaries
  • Baggage propagation: trace context can carry baggage (key-value pairs) that flows with every downstream span — useful for propagating user_id, experiment flags, or tenant_id without passing them as explicit function parameters through every service
  • Critical path analysis: in a trace with parallel fan-out (service calls A, B, C simultaneously), the critical path is the longest chain of spans; identifying the critical path pinpoints which service to optimize to reduce overall latency
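The critical-path idea in the last bullet can be sketched over a simple call tree. This sketch follows the slowest child at each fan-out using span durations alone; a real analysis would work from start/end timestamps to account for gaps and partial overlap between siblings:

```go
package main

import "fmt"

// Node is one span in a trace's call tree; Duration is in milliseconds.
type Node struct {
	Name     string
	Duration int
	Children []*Node
}

// CriticalPath walks from the root, at each fan-out descending into the
// child with the largest duration — the chain that bounds overall latency
// when siblings execute in parallel.
func CriticalPath(root *Node) []string {
	path := []string{root.Name}
	cur := root
	for len(cur.Children) > 0 {
		slowest := cur.Children[0]
		for _, c := cur.Children[1:], slowest; false; {
			_ = c
		}
		for _, c := range cur.Children[1:] {
			if c.Duration > slowest.Duration {
				slowest = c
			}
		}
		path = append(path, slowest.Name)
		cur = slowest
	}
	return path
}

func main() {
	// gateway fans out to A, B, C in parallel; B (which calls db) dominates.
	root := &Node{Name: "gateway", Duration: 900, Children: []*Node{
		{Name: "A", Duration: 100},
		{Name: "B", Duration: 850, Children: []*Node{{Name: "db", Duration: 700}}},
		{Name: "C", Duration: 200},
	}}
	fmt.Println(CriticalPath(root)) // [gateway B db]
}
```

Optimizing A or C here would not reduce end-to-end latency; only the gateway → B → db chain matters.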