Low Level Design: Distributed Tracing System

Distributed tracing tracks a single request as it propagates through multiple services in a microservices architecture. Without tracing, debugging latency issues requires correlating logs from a dozen services — a painful and error-prone process. A tracing system assigns each request a globally unique trace ID and records spans (units of work) with timing and metadata, enabling engineers to visualize the critical path and identify bottlenecks. Jaeger, Zipkin, Honeycomb, and Tempo are popular tracing backends.

Trace and Span Data Model

A trace represents the entire journey of a request: one trace ID shared across all services involved. A span represents a single unit of work within the trace (a function call, a database query, an HTTP request to a downstream service). Each span carries: trace_id, span_id, parent_span_id (zero for the root span), operation_name, service_name, start_time, duration, status (ok/error), and tags (key-value metadata). The parent-child relationships form a tree; the critical path is the root-to-leaf chain of spans that accounts for the most end-to-end latency.

// Span structure (OpenTelemetry)
type Span struct {
    TraceID      [16]byte          // 128-bit globally unique
    SpanID       [8]byte           // 64-bit unique within trace
    ParentSpanID [8]byte           // 0 for root span
    Name         string            // e.g. "POST /api/orders"
    ServiceName  string            // e.g. "order-service"
    StartTime    time.Time
    EndTime      time.Time
    Status       Status            // Ok, Error, Unset
    Attributes   map[string]any    // http.status_code, db.statement, etc.
    Events       []Event           // timestamped log entries within span
}

// Context propagation via HTTP headers (W3C Trace Context)
// traceparent: 00-{traceID}-{spanID}-{flags}
// Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
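The span tree above can be walked to surface the critical path. A minimal sketch, with illustrative names and millisecond durations; it picks the root-to-leaf chain with the largest total duration, a simplification of real critical-path analysis, which also accounts for overlap between sibling spans:

```go
package main

import "fmt"

// SpanNode is a minimal view of a span for critical-path analysis
// (field names loosely mirror the Span struct above).
type SpanNode struct {
	SpanID   string
	ParentID string // "" for the root span
	Name     string
	Duration int // milliseconds
}

// criticalPath returns the root-to-leaf chain with the largest
// cumulative duration.
func criticalPath(spans []SpanNode) []string {
	children := map[string][]SpanNode{}
	var root SpanNode
	for _, s := range spans {
		if s.ParentID == "" {
			root = s
		} else {
			children[s.ParentID] = append(children[s.ParentID], s)
		}
	}
	// DFS: at each span, follow the child subtree with the largest
	// cumulative duration.
	var dfs func(s SpanNode) (int, []string)
	dfs = func(s SpanNode) (int, []string) {
		best, bestPath := 0, []string(nil)
		for _, c := range children[s.SpanID] {
			if d, p := dfs(c); d > best {
				best, bestPath = d, p
			}
		}
		return s.Duration + best, append([]string{s.Name}, bestPath...)
	}
	_, path := dfs(root)
	return path
}

func main() {
	spans := []SpanNode{
		{SpanID: "a", ParentID: "", Name: "POST /api/orders", Duration: 120},
		{SpanID: "b", ParentID: "a", Name: "auth check", Duration: 10},
		{SpanID: "c", ParentID: "a", Name: "db insert", Duration: 80},
		{SpanID: "d", ParentID: "c", Name: "fsync", Duration: 40},
	}
	fmt.Println(criticalPath(spans)) // [POST /api/orders db insert fsync]
}
```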

Context Propagation

Trace context must be threaded through every service call automatically. Each service extracts the trace context from incoming request headers (W3C traceparent header or Zipkin B3 headers), creates a child span, propagates the context to downstream calls (adding the new span ID as the parent), and reports the span to the tracing backend. Auto-instrumentation libraries (OpenTelemetry SDK) hook into HTTP clients, gRPC, database drivers, and message queue clients to inject/extract headers without manual code changes.

Sampling Strategies

At 100,000 requests/second, storing every span is prohibitively expensive, so sampling reduces trace volume.

  • Head-based sampling: decide at the first service (trace start) whether to record, e.g. 1% random sampling. Simple, but it drops rare error traces.
  • Tail-based sampling: collect all spans and decide at trace completion based on outcome, always sampling traces with errors or high latency. Requires buffering spans until the trace completes. Honeycomb and Grafana Tempo support tail-based sampling.
  • Combined: 1% random head sample plus a 100% sample for errors.

Storage and Query Architecture

Tracing backends receive spans via Kafka (a high-throughput ingestion buffer) and store them in a columnar store (ClickHouse, Cassandra, or Elasticsearch). The primary query patterns are: lookup by trace_id (single-trace view) and search by attributes (find all traces with error=true, service=checkout, duration>500ms). Optimize storage for both: partition the span table by trace_id for the first pattern; maintain inverted indexes on service_name, status, and duration for the second. Retain high-value traces (errors, slow requests) longer than routine ones.

Key Interview Discussion Points

  • OpenTelemetry: vendor-neutral standard for trace/metric/log SDKs and collector; avoids vendor lock-in
  • Baggage: key-value data propagated through the entire trace (user_id, experiment variant) for correlation without log joining
  • Span links: connect related traces (a background job processing an async task links back to the originating request)
  • Exemplars: link a metric data point to the specific trace that caused it — bridges metrics and tracing
  • Clock skew: spans from different services may have inconsistent timestamps; use logical clocks or accept bounded skew in visualization