Distributed Tracing Service Low-Level Design: Span Collection, Context Propagation, and Sampling

A distributed tracing service provides end-to-end visibility into request flows across microservices by collecting spans, propagating trace context, applying intelligent sampling, and rendering flamegraphs. This design covers the full pipeline from instrumentation to storage to query.

Requirements

Functional

  • Propagate trace context across HTTP, gRPC, and message queue boundaries using W3C Trace Context headers.
  • Collect spans from all instrumented services via a local agent or direct SDK export.
  • Apply tail-based sampling to retain 100% of error and slow traces while sampling routine ones.
  • Store spans in a queryable backend with retention policies.
  • Render flamegraphs and service-call graphs for individual traces.

Non-Functional

  • Instrumentation overhead under 1 ms per span creation.
  • End-to-end ingestion latency under 30 seconds.
  • Query latency under 2 seconds for traces within the last 24 hours.

Data Model

  • Span — traceId (128-bit hex), spanId (64-bit hex), parentSpanId, serviceName, operationName, startTimeUnixNano, durationNano, statusCode (OK, ERROR, UNSET), attributes (key-value map), events (list of timestamped messages), links (for fan-in/fan-out traces).
  • Trace — logical grouping of all spans sharing a traceId. Not stored as a row; assembled at query time from the span index.
  • SamplingDecision — traceId, decision (KEEP or DROP), reason, decidedAt. Written by the tail sampler after the root span arrives.
  • ServiceGraph edge — source, destination, callCount, errorCount, p99LatencyMs — aggregated per minute from span data.

Context Propagation

Follow the W3C Trace Context specification. Inject traceparent: 00-{traceId}-{spanId}-{traceFlags} and optionally tracestate into every outbound HTTP header and gRPC metadata entry. For Kafka messages, add trace context to message headers. Each service extracts the parent context, creates a child span with a new spanId, and sets parentSpanId to the extracted span ID. The traceFlags byte carries the sampling bit — if the head sampler marked the trace for collection at the edge, all downstream services honor it without re-sampling.
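A minimal sketch of the extract/inject round trip for the version-00 traceparent format (the helper names are illustrative; a real SDK would also carry tracestate):

```python
import re
import secrets

# version 00: 32-hex traceId, 16-hex spanId, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def extract(headers):
    """Parse an incoming traceparent; return (trace_id, parent_span_id, sampled) or None."""
    m = TRACEPARENT_RE.match(headers.get("traceparent") or "")
    if not m:
        return None
    trace_id, parent_span_id, flags = m.groups()
    return trace_id, parent_span_id, bool(int(flags, 16) & 0x01)  # bit 0 = sampled

def inject(headers, trace_id, span_id, sampled):
    """Write the outbound traceparent for a child span."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

# A service extracts the parent context, mints a new spanId, and forwards it:
trace_id, parent_span_id, sampled = extract(
    {"traceparent": "00-" + "ab" * 16 + "-" + "cd" * 8 + "-01"})
child_span_id = secrets.token_hex(8)   # new 64-bit span id
outbound = {}
inject(outbound, trace_id, child_span_id, sampled)
```

The extracted `parent_span_id` becomes the child span's parentSpanId, and the sampled bit is forwarded unchanged, so downstream services honor the head decision without re-sampling.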

Core Algorithms

Head-Based vs Tail-Based Sampling

Head-based sampling decides at trace ingress: flip a biased coin based on sampleRate and propagate the decision via the sampling bit in traceparent. It is cheap but cannot preferentially retain error traces discovered mid-flight. Tail-based sampling buffers all spans for a trace in memory for a configurable window (e.g., 30 seconds) and makes the keep/drop decision only after the root span arrives and the full trace shape is known. Keep all traces with any ERROR span or with root span duration above the p99 threshold. Drop routine traces at the configured rate.
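The tail-sampling policy described above can be sketched as a single function; span field names and the injected `rand` value are assumptions for testability:

```python
def tail_decision(spans, p99_threshold_nano, keep_rate, rand):
    """Evaluate the keep/drop policy once the full trace shape is buffered.

    Keep any trace with an ERROR span, or whose root span exceeds the p99
    duration threshold; otherwise keep routine traces at keep_rate.
    rand is a uniform [0, 1) draw passed in by the caller.
    """
    if any(s["status_code"] == "ERROR" for s in spans):
        return "KEEP", "error"
    root = next(s for s in spans if s["parent_span_id"] is None)
    if root["duration_nano"] > p99_threshold_nano:
        return "KEEP", "slow"
    return ("KEEP", "random") if rand < keep_rate else ("DROP", "routine")
```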

Tail Sampler Buffer

The collector maintains a hash map from traceId to a span buffer. Spans arrive out of order. When the root span arrives (parentSpanId is null), evaluate the sampling policy. If KEEP, flush all buffered spans to storage and write a KEEP decision. If DROP, discard the buffer and write a DROP decision. Spans arriving after the decision for the same trace inherit it. Expire buffers with no root span after the window timeout to prevent memory leaks.
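The buffer logic above can be sketched as follows; `policy` and `flush` are caller-supplied hooks, and the span-dict field names are assumptions:

```python
import time

class TailSamplerBuffer:
    """Minimal sketch of the per-trace buffer: policy(spans) -> "KEEP"/"DROP",
    flush(spans) writes kept spans to storage."""

    def __init__(self, policy, flush, window_sec=30):
        self.policy, self.flush, self.window = policy, flush, window_sec
        self.buffers = {}      # trace_id -> buffered spans (out-of-order arrival)
        self.first_seen = {}   # trace_id -> arrival time, for window expiry
        self.decisions = {}    # trace_id -> "KEEP" / "DROP"

    def on_span(self, span):
        tid = span["trace_id"]
        if tid in self.decisions:                  # late span inherits the decision
            if self.decisions[tid] == "KEEP":
                self.flush([span])
            return
        self.buffers.setdefault(tid, []).append(span)
        self.first_seen.setdefault(tid, time.monotonic())
        if span["parent_span_id"] is None:         # root arrived: decide now
            decision = self.policy(self.buffers[tid])
            self.decisions[tid] = decision
            spans = self.buffers.pop(tid)
            self.first_seen.pop(tid, None)
            if decision == "KEEP":
                self.flush(spans)                  # DROP: buffer is simply discarded

    def expire(self):
        """Discard buffers whose root never arrived within the window."""
        now = time.monotonic()
        for tid in [t for t, ts in self.first_seen.items() if now - ts > self.window]:
            self.buffers.pop(tid, None)
            self.first_seen.pop(tid, None)
```

A production collector would also persist the KEEP/DROP decision record and bound total buffer memory; both are omitted here.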

Flamegraph Rendering

Assemble a trace into a tree by linking each span to its parent via parentSpanId. Spans without a parent in the current dataset become roots. Sort children by startTimeUnixNano. Render each node as a horizontal bar scaled by durationNano, positioned by wall-clock offset from the trace start. Color by service name. Compute self-time as durationNano - sum(child durationNano) to identify where time is actually spent versus where it is delegated.
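The tree assembly and self-time computation can be sketched like this (span-dict field names assumed as in the data model):

```python
def build_tree(spans):
    """Link spans into a tree via parent_span_id; spans whose parent is not
    in the dataset become roots. Children are sorted by start time, and
    self_nano = duration_nano - sum(child durations)."""
    by_id = {s["span_id"]: dict(s, children=[]) for s in spans}
    roots = []
    for node in by_id.values():
        parent = by_id.get(node["parent_span_id"])
        (parent["children"] if parent else roots).append(node)
    for node in by_id.values():
        node["children"].sort(key=lambda c: c["start_time_unix_nano"])
        node["self_nano"] = node["duration_nano"] - sum(
            c["duration_nano"] for c in node["children"])
    return roots
```

The renderer then walks each root depth-first, drawing a bar per node offset by `start_time_unix_nano` relative to the trace start and scaled by `duration_nano`.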

API Design

  • POST /v1/traces (OTLP HTTP) — OTLP-compatible span ingestion endpoint accepting protobuf or JSON.
  • GET /traces/{traceId} — return all spans for a trace, assembled into a tree.
  • GET /traces?service=X&minDuration=100ms&status=error&since=1h — search traces by attribute filters.
  • GET /services — list all instrumented services seen in the last 24 hours.
  • GET /service-graph — return the aggregated call graph as a directed edge list with latency stats.
  • GET /traces/{traceId}/flamegraph — return flamegraph data as a nested JSON tree for frontend rendering.
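The filter semantics of the search endpoint can be sketched as a predicate over spans; the accepted duration units and field names are assumptions:

```python
def parse_duration(s):
    """Parse filter values like '100ms' or '2s' into nanoseconds."""
    units = {"ns": 1, "us": 1_000, "ms": 1_000_000, "s": 1_000_000_000}
    # check longer suffixes first so '100ms' is not misread as seconds
    for suffix in sorted(units, key=len, reverse=True):
        if s.endswith(suffix):
            return int(float(s[: -len(suffix)]) * units[suffix])
    raise ValueError(f"unknown duration: {s!r}")

def matches(span, service=None, min_duration=None, status=None):
    """Apply the ?service=&minDuration=&status= filters to one span."""
    if service and span["service_name"] != service:
        return False
    if min_duration and span["duration_nano"] < parse_duration(min_duration):
        return False
    if status and span["status_code"].lower() != status.lower():
        return False
    return True
```

In practice these filters translate to WHERE clauses in the columnar store rather than per-span predicates, but the semantics are the same.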

Scalability and Storage Tiering

  • Store recent spans (last 24 hours) in a columnar store (ClickHouse or Apache Parquet on S3 with Athena) for fast attribute-based queries.
  • Archive older spans to cold object storage with a TTL cleanup job.
  • Partition spans by traceId prefix so all spans of a trace land on the same storage shard, minimizing cross-shard joins at query time.
  • Emit tracer_spans_ingested_total{service} and tracer_sampling_decision{outcome} counters to monitor instrumentation coverage and tail sampler behavior.
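The traceId-prefix partitioning can be sketched in a few lines; the prefix width and modulo scheme are assumptions:

```python
def shard_for_trace(trace_id: str, num_shards: int) -> int:
    """Route a span to a shard by its traceId prefix. Because every span of a
    trace shares the traceId, all of them land on the same shard, so trace
    assembly at query time never crosses shards."""
    return int(trace_id[:4], 16) % num_shards  # leading 16 bits of the 128-bit id
```

Trace IDs are generated uniformly at random, so the prefix distributes load evenly; a production system would use consistent hashing to allow resharding without moving every trace.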


