Distributed Tracing System Low-Level Design: W3C Traceparent, Span Recording, Sampling, and Trace Visualization

A distributed tracing system records the journey of a request as it propagates through multiple services — API gateway, auth service, database, cache, downstream microservices. It reconstructs the full call graph with timing information, enabling engineers to identify which service is slow, where errors originate, and how much time is spent at each hop. This design covers trace context propagation, span recording, storage, and the query interface for finding slow traces.

Core Data Model

CREATE TABLE Trace (
    trace_id       VARCHAR(32) NOT NULL,      -- 128-bit hex, W3C traceparent format
    service_name   VARCHAR(100) NOT NULL,     -- root service
    operation      VARCHAR(200) NOT NULL,     -- "POST /api/checkout"
    status         VARCHAR(10) NOT NULL,      -- ok, error
    duration_ms    INT NOT NULL,
    started_at     TIMESTAMPTZ NOT NULL,
    user_id        BIGINT,
    tags           JSONB NOT NULL DEFAULT '{}',
    -- PostgreSQL requires the partition key in every primary key on a partitioned table
    PRIMARY KEY (trace_id, started_at)
) PARTITION BY RANGE (started_at);

CREATE TABLE Span (
    span_id        VARCHAR(16) NOT NULL,     -- 64-bit hex
    trace_id       VARCHAR(32) NOT NULL,
    parent_span_id VARCHAR(16),             -- NULL for root span
    service_name   VARCHAR(100) NOT NULL,
    operation      VARCHAR(200) NOT NULL,
    kind           VARCHAR(20) NOT NULL,     -- server, client, producer, consumer, internal
    status         VARCHAR(10) NOT NULL,     -- ok, error, unset
    status_message TEXT,
    started_at     TIMESTAMPTZ NOT NULL,
    duration_ms    INT NOT NULL,
    tags           JSONB NOT NULL DEFAULT '{}',  -- {"db.type":"postgres","http.status_code":200}
    events         JSONB NOT NULL DEFAULT '[]',  -- [{name,ts,attrs}] for logs within span
    -- PostgreSQL requires the partition key in every primary key on a partitioned table
    PRIMARY KEY (span_id, started_at)
) PARTITION BY RANGE (started_at);

CREATE INDEX ON Span(trace_id);
CREATE INDEX ON Span(service_name, started_at DESC);
CREATE INDEX ON Trace(service_name, duration_ms DESC, started_at DESC);
CREATE INDEX ON Trace(status, started_at DESC) WHERE status='error';
-- GIN index for tag queries: find traces where http.status_code=500
CREATE INDEX ON Span USING GIN (tags);

Trace Context Propagation (W3C traceparent)

import secrets, time
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpanContext:
    trace_id: str    # 32 hex chars (128-bit)
    span_id: str     # 16 hex chars (64-bit)
    sampled: bool    # whether to record this trace

def new_trace(sampled: bool = True) -> SpanContext:
    return SpanContext(
        trace_id=secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        sampled=sampled,
    )

def child_span(parent: SpanContext) -> SpanContext:
    """Create a child span context inheriting the parent's trace_id."""
    return SpanContext(
        trace_id=parent.trace_id,
        span_id=secrets.token_hex(8),
        sampled=parent.sampled,
    )

def inject_headers(ctx: SpanContext) -> dict:
    """
    W3C traceparent header: 00-{trace_id}-{span_id}-{flags}
    flags: '01' = sampled, '00' = not sampled
    """
    flags = '01' if ctx.sampled else '00'
    return {'traceparent': f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"}

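_HEX = set('0123456789abcdef')

def _is_hex(s: str, length: int) -> bool:
    return len(s) == length and set(s) <= _HEX

def extract_context(headers: dict) -> Optional[SpanContext]:
    """Parse W3C traceparent from incoming request headers; None if absent or malformed."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    tp = {k.lower(): v for k, v in headers.items()}.get('traceparent')
    if not tp:
        return None
    parts = tp.strip().split('-')
    if len(parts) != 4 or parts[0] != '00':
        return None  # only version 00 is understood here
    _, trace_id, span_id, flags = parts
    if not (_is_hex(trace_id, 32) and _is_hex(span_id, 16) and _is_hex(flags, 2)):
        return None
    if set(trace_id) == {'0'} or set(span_id) == {'0'}:
        return None  # all-zero IDs are invalid per the spec
    # "sampled" is the lowest bit of the flags byte; other bits may also be set.
    return SpanContext(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))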
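
To make the propagation flow concrete, here is a minimal, self-contained sketch of a single service hop: read the caller's traceparent, keep its trace_id and sampling flag, mint a new span_id, and build the header for the downstream call. The names parse_traceparent, format_traceparent, and outgoing_headers are hypothetical stand-ins (compact versions of the functions above so the block runs on its own), and a new trace is always marked sampled here purely for brevity.

```python
import re
import secrets
from typing import Optional, Tuple

# version 00, 32-hex trace-id, 16-hex parent-id, 2-hex flags
_TRACEPARENT = re.compile(r'00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})')

def parse_traceparent(value: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    m = _TRACEPARENT.fullmatch(value.strip())
    if not m:
        return None
    trace_id, parent_id, flags = m.groups()
    if set(trace_id) == {'0'} or set(parent_id) == {'0'}:
        return None  # all-zero IDs are invalid
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)

def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def outgoing_headers(incoming: dict) -> dict:
    """Continue the caller's trace if one is present, otherwise start a new one."""
    lowered = {k.lower(): v for k, v in incoming.items()}
    parsed = parse_traceparent(lowered.get('traceparent', ''))
    if parsed:
        trace_id, _parent_id, sampled = parsed
    else:
        # Assumption for this sketch: new traces are always sampled.
        trace_id, sampled = secrets.token_hex(16), True
    span_id = secrets.token_hex(8)  # this service's own span
    return {'traceparent': format_traceparent(trace_id, span_id, sampled)}
```

The downstream service sees the same trace_id with this service's span_id in the parent position, which is what lets the backend stitch the spans into one call graph.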

Span Recording SDK

import time, json
from contextlib import contextmanager
from collections import deque
import threading

# Thread-local active span context
_local = threading.local()
_batch = deque()
_batch_lock = threading.Lock()
BATCH_FLUSH_SIZE = 100

@contextmanager
def start_span(operation: str, service_name: str, kind: str = 'server',
               parent_ctx: Optional[SpanContext] = None, tags: Optional[dict] = None):
    """
    Context manager for recording a span.
    Usage:
        with start_span('SELECT * FROM orders', 'orders-service', kind='client') as span:
            result = db.query(...)
            span.set_tag('db.rows', len(result))
    """
    # Remember the enclosing span so it can be restored when this one ends.
    previous = getattr(_local, 'active_ctx', None)
    parent = parent_ctx or previous
    ctx = child_span(parent) if parent else new_trace(sampled=_should_sample())
    _local.active_ctx = ctx
    t0 = time.monotonic()  # monotonic clock for the duration, wall clock for the timestamp

    span = {
        'span_id': ctx.span_id,
        'trace_id': ctx.trace_id,
        'parent_span_id': parent.span_id if parent else None,
        'service_name': service_name,
        'operation': operation,
        'kind': kind,
        'status': 'ok',
        'status_message': None,
        'started_at': time.time(),  # epoch seconds; convert to TIMESTAMPTZ at insert time
        'duration_ms': None,
        'tags': tags or {},
        'events': [],
    }

    try:
        # status stays 'ok' unless the body raises or calls set_error()
        yield _SpanHandle(span)
    except Exception as e:
        span['status'] = 'error'
        span['status_message'] = str(e)
        raise
    finally:
        span['duration_ms'] = int((time.monotonic() - t0) * 1000)
        # Restore the enclosing span so later siblings get the correct parent.
        _local.active_ctx = previous
        if ctx.sampled:
            _enqueue_span(span)

class _SpanHandle:
    def __init__(self, span): self._span = span
    def set_tag(self, key, val): self._span['tags'][key] = val
    def add_event(self, name, attrs=None):
        self._span['events'].append({'name': name, 'ts': time.time(), 'attrs': attrs or {}})
    def set_error(self, msg): self._span['status'] = 'error'; self._span['status_message'] = msg

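def _enqueue_span(span: dict):
    with _batch_lock:
        _batch.append(span)
        should_flush = len(_batch) >= BATCH_FLUSH_SIZE
    # Flush only after releasing the lock: threading.Lock is not reentrant,
    # so calling _flush_batch() while holding it would deadlock.
    if should_flush:
        _flush_batch()

def _flush_batch():
    # Drain under the lock, then send outside it so other threads can keep enqueuing.
    with _batch_lock:
        spans = list(_batch)
        _batch.clear()
    if spans:
        # Batch insert to collector (async HTTP POST or Kafka).
        # _send_to_collector is transport-specific and is not defined in this listing.
        _send_to_collector(spans)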

def _should_sample() -> bool:
    """Head-based sampling: sample 10% of traces. Can be dynamic per service."""
    import random
    return random.random() < 0.10
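
The listing above flushes from whichever thread fills the batch. As a hedged sketch of the background-thread variant, the class below moves the send onto a daemon thread fed by a bounded queue. BackgroundFlusher, its parameters, and the drop-on-full and drop-on-send-failure policies are all assumptions made for illustration; the send callable stands in for the undefined _send_to_collector.

```python
import queue
import threading
from typing import Callable, List

class BackgroundFlusher:
    """Collects spans on a bounded queue and sends them in batches from a daemon thread."""

    def __init__(self, send: Callable[[List[dict]], None],
                 batch_size: int = 100, interval_s: float = 1.0,
                 max_queue: int = 10_000):
        self._send = send
        self._batch_size = batch_size
        self._interval_s = interval_s
        self._queue = queue.Queue(maxsize=max_queue)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def enqueue(self, span: dict) -> bool:
        """Never blocks the request path; returns False if the span was dropped."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            return False

    def _drain(self) -> List[dict]:
        """Take up to batch_size spans without waiting."""
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _flush_all(self):
        while True:
            batch = self._drain()
            if not batch:
                return
            try:
                self._send(batch)
            except Exception:
                pass  # assumed policy: drop the batch rather than disturb the application

    def _run(self):
        while not self._stop.is_set():
            self._stop.wait(self._interval_s)  # wakes early when close() is called
            self._flush_all()

    def close(self):
        """Stop the thread and send whatever is still queued."""
        self._stop.set()
        self._thread.join()
        self._flush_all()
```

With this shape, the finally block of a span only pays for a put_nowait call, and a slow or unavailable collector results in dropped spans instead of slow requests.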

Query: Finding Slow and Errored Traces

-- Top 20 slowest traces in the last hour for a service
SELECT trace_id, operation, duration_ms, started_at, user_id, tags
FROM Trace
WHERE service_name='checkout-service'
  AND started_at >= NOW() - INTERVAL '1 hour'
ORDER BY duration_ms DESC
LIMIT 20;

-- All spans in a trace (for waterfall visualization)
SELECT span_id, parent_span_id, service_name, operation, kind,
       status, status_message, started_at,
       started_at + (duration_ms || ' milliseconds')::INTERVAL AS ended_at,
       duration_ms, tags
FROM Span
WHERE trace_id='abc123def456...'
ORDER BY started_at ASC;

-- Services with error rate > 1% in the last 5 minutes
SELECT service_name,
       COUNT(*) AS total_spans,
       SUM(CASE WHEN status='error' THEN 1 ELSE 0 END) AS error_spans,
       ROUND(100.0 * SUM(CASE WHEN status='error' THEN 1 ELSE 0 END) / COUNT(*), 2) AS error_rate_pct
FROM Span
WHERE started_at >= NOW() - INTERVAL '5 minutes'
GROUP BY service_name
HAVING SUM(CASE WHEN status='error' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) > 1
ORDER BY error_rate_pct DESC;
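
The rows returned by the "all spans in a trace" query still have to be arranged into a waterfall. The sketch below is one way to do that, assuming each row is a dict with span_id, parent_span_id, duration_ms, and started_at as epoch seconds (as the SDK records it; a driver returning datetime values would need a conversion first). build_waterfall is a hypothetical helper name.

```python
from typing import Dict, List, Optional

def build_waterfall(spans: List[dict]) -> List[dict]:
    """
    Order spans depth-first along parent/child links and attach:
      depth     - nesting level (0 for a root)
      offset_ms - start time relative to the earliest span in the trace
    """
    if not spans:
        return []
    known_ids = {s['span_id'] for s in spans}
    children: Dict[Optional[str], List[dict]] = {}
    for s in spans:
        parent = s['parent_span_id']
        # A span whose parent was never recorded is shown as a root, not dropped.
        key = parent if parent in known_ids else None
        children.setdefault(key, []).append(s)
    for siblings in children.values():
        siblings.sort(key=lambda s: s['started_at'])

    trace_start = min(s['started_at'] for s in spans)
    rows = []
    # Explicit stack instead of recursion; reversed() keeps earliest-first order.
    stack = [(s, 0) for s in reversed(children.get(None, []))]
    while stack:
        span, depth = stack.pop()
        rows.append({**span, 'depth': depth,
                     'offset_ms': round((span['started_at'] - trace_start) * 1000)})
        for child in reversed(children.get(span['span_id'], [])):
            stack.append((child, depth + 1))
    return rows
```

A renderer can then draw each row as a bar starting at offset_ms with width duration_ms, indented by depth. Spans caught in a parent cycle are unreachable from any root and are left out of the result.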

Key Design Decisions

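  • W3C traceparent standard: traceparent is defined by the W3C Trace Context specification and is the default propagation format in OpenTelemetry, so any library or backend that speaks it can participate without custom integration. Systems that natively use other headers (Zipkin's B3, AWS X-Ray's X-Amzn-Trace-Id) need a propagator or collector that translates. A service only needs to parse the incoming traceparent, create a child span context, and forward the header to downstream calls.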
  • Head-based sampling at 10%: recording every span for every request is prohibitively expensive at high traffic. Head-based sampling decides at trace creation time (the root span) whether to record the trace, then propagates that decision to all child spans via the flags field in traceparent. All spans of a sampled trace are recorded; none of an unsampled trace are. The alternative is tail-based sampling: record everything, then keep only the traces that turn out to contain errors or to be slow. It is more useful for debugging but requires buffering all spans until the trace completes.
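  • Batched async writes: sending each span to the collector individually costs a network round trip per span, on the order of 1–5 ms. Batching 100 spans into a single POST amortizes that cost to below 0.1 ms per span. The SDK listing above flushes from whichever request thread fills the batch, so that one request still pays for the POST; moving the flush to a background thread keeps the hot path from ever blocking on I/O.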
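  • GIN index on tags: spans carry structured tags (http.status_code, db.table, rpc.method). A GIN index on the tags JSONB column lets tag-based queries such as "find all spans where db.table = 'orders' and duration_ms > 500" avoid a full table scan. The index serves the JSONB containment operator, so the tag predicate should be written as tags @> '{"db.table":"orders"}'; a comparison of the form tags->>'db.table' = 'orders' does not use it. The duration_ms filter is then applied to the rows the index returns.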
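
The tail-based alternative mentioned in the sampling bullet above can be sketched as follows. This is a simplified illustration, not a production design: TailSampler and slow_threshold_ms are hypothetical names, and the sketch assumes that every span of a trace reaches the same collector process and that the root span finishes last. Real tail samplers typically also use a timeout, because asynchronous children can outlive the root.

```python
from collections import defaultdict
from typing import Dict, List

class TailSampler:
    """Buffers finished spans per trace and decides keep/drop when the root span arrives."""

    def __init__(self, slow_threshold_ms: int = 500):
        self.slow_threshold_ms = slow_threshold_ms
        self._buffers: Dict[str, List[dict]] = defaultdict(list)

    def add(self, span: dict) -> List[dict]:
        """Buffer one finished span. Returns the spans to store, or [] if none yet."""
        buf = self._buffers[span['trace_id']]
        buf.append(span)
        if span['parent_span_id'] is not None:
            return []  # trace still in flight
        # Root span finished: treat the trace as complete (assumption stated above).
        del self._buffers[span['trace_id']]
        has_error = any(s['status'] == 'error' for s in buf)
        is_slow = span['duration_ms'] >= self.slow_threshold_ms
        return buf if (has_error or is_slow) else []
```

The memory cost is visible in _buffers: every in-flight trace is held in full until its root completes, which is the buffering requirement that makes tail-based sampling more expensive than the head-based approach.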
