Distributed Tracing System: Low-Level Design
A distributed tracing system records the journey of a request as it propagates through multiple services — API gateway, auth service, database, cache, downstream microservices. It reconstructs the full call graph with timing information, enabling engineers to identify which service is slow, where errors originate, and how much time is spent at each hop. This design covers trace context propagation, span recording, storage, and the query interface for finding slow traces.
Core Data Model
CREATE TABLE Trace (
    trace_id VARCHAR(32) NOT NULL, -- 128-bit id as 32 hex chars, W3C traceparent format
    service_name VARCHAR(100) NOT NULL, -- root service
    operation VARCHAR(200) NOT NULL, -- "POST /api/checkout"
    status VARCHAR(10) NOT NULL, -- ok, error
    duration_ms INT NOT NULL,
    started_at TIMESTAMPTZ NOT NULL,
    user_id BIGINT,
    tags JSONB NOT NULL DEFAULT '{}',
    -- Postgres requires the partition key in any unique constraint
    -- on a partitioned table, so started_at joins the primary key
    PRIMARY KEY (trace_id, started_at)
) PARTITION BY RANGE (started_at);
CREATE TABLE Span (
    span_id VARCHAR(16) NOT NULL, -- 64-bit id as 16 hex chars
    trace_id VARCHAR(32) NOT NULL,
    parent_span_id VARCHAR(16), -- NULL for root span
    service_name VARCHAR(100) NOT NULL,
    operation VARCHAR(200) NOT NULL,
    kind VARCHAR(20) NOT NULL, -- server, client, producer, consumer, internal
    status VARCHAR(10) NOT NULL, -- ok, error, unset
    status_message TEXT,
    started_at TIMESTAMPTZ NOT NULL,
    duration_ms INT NOT NULL,
    tags JSONB NOT NULL DEFAULT '{}', -- {"db.type":"postgres","http.status_code":200}
    events JSONB NOT NULL DEFAULT '[]', -- [{name,ts,attrs}] for logs within a span
    -- span ids are only guaranteed unique within a trace, and the
    -- partition key must be part of the primary key
    PRIMARY KEY (trace_id, span_id, started_at)
) PARTITION BY RANGE (started_at);
CREATE INDEX ON Span(trace_id);
CREATE INDEX ON Span(service_name, started_at DESC);
CREATE INDEX ON Trace(service_name, duration_ms DESC, started_at DESC);
CREATE INDEX ON Trace(status, started_at DESC) WHERE status='error';
-- GIN index for tag queries: find traces where http.status_code=500
CREATE INDEX ON Span USING GIN (tags);
Trace Context Propagation (W3C traceparent)
import secrets
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpanContext:
    trace_id: str  # 32 hex chars (128-bit)
    span_id: str   # 16 hex chars (64-bit)
    sampled: bool  # whether to record this trace

def new_trace(sampled: bool = True) -> SpanContext:
    """Start a new trace with fresh random ids."""
    return SpanContext(
        trace_id=secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        sampled=sampled,
    )

def child_span(parent: SpanContext) -> SpanContext:
    """Create a child span context inheriting the parent's trace_id."""
    return SpanContext(
        trace_id=parent.trace_id,
        span_id=secrets.token_hex(8),
        sampled=parent.sampled,
    )
def inject_headers(ctx: SpanContext) -> dict:
    """
    W3C traceparent header: 00-{trace_id}-{span_id}-{flags}
    flags: '01' = sampled, '00' = not sampled
    """
    flags = '01' if ctx.sampled else '00'
    return {'traceparent': f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"}
def extract_context(headers: dict) -> Optional[SpanContext]:
    """Parse the W3C traceparent header from an incoming request."""
    tp = headers.get('traceparent') or headers.get('Traceparent')
    if not tp:
        return None
    parts = tp.split('-')
    if len(parts) != 4 or parts[0] != '00':
        return None
    if len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    try:
        flags = int(parts[3], 16)
    except ValueError:
        return None
    return SpanContext(
        trace_id=parts[1],
        span_id=parts[2],
        sampled=bool(flags & 0x01),  # flags is a bit field; bit 0 is the sampled flag
    )
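The inject/extract pair above can be exercised end to end. The sketch below is a self-contained round trip of the same header format; `make_traceparent` and `parse_traceparent` are illustrative stand-ins for `inject_headers` and `extract_context`, with the flags field treated as a bit field per the W3C spec:

```python
import re
import secrets

def make_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    # Version 00; flags is a bit field where bit 0 means "sampled"
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str):
    # Strict W3C shape: 2-32-16-2 lowercase hex groups separated by '-'
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, bool(int(flags, 16) & 0x01)

trace_id = secrets.token_hex(16)
span_id = secrets.token_hex(8)
header = make_traceparent(trace_id, span_id, True)
assert parse_traceparent(header) == (trace_id, span_id, True)
```

A malformed header (wrong length, uppercase hex, missing fields) parses to None, which callers should treat as "start a new trace".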
Span Recording SDK
import threading
import time
from collections import deque
from contextlib import contextmanager

# Thread-local active span context
_local = threading.local()
_batch = deque()
_batch_lock = threading.Lock()
BATCH_FLUSH_SIZE = 100
@contextmanager
def start_span(operation: str, service_name: str, kind: str = 'server',
               parent_ctx: Optional[SpanContext] = None, tags: Optional[dict] = None):
    """
    Context manager for recording a span.
    Usage:
        with start_span('SELECT * FROM orders', 'orders-service', kind='client') as span:
            result = db.query(...)
            span.set_tag('db.rows', len(result))
    """
    parent = parent_ctx or getattr(_local, 'active_ctx', None)
    ctx = child_span(parent) if parent else new_trace(sampled=_should_sample())
    _local.active_ctx = ctx
    started = time.time()
    span = {
        'span_id': ctx.span_id,
        'trace_id': ctx.trace_id,
        'parent_span_id': parent.span_id if parent else None,
        'service_name': service_name,
        'operation': operation,
        'kind': kind,
        'status': 'ok',
        'status_message': None,
        'started_at': started,
        'duration_ms': None,
        'tags': dict(tags or {}),
        'events': [],
    }
    try:
        yield _SpanHandle(span)
    except Exception as e:
        span['status'] = 'error'
        span['status_message'] = str(e)
        raise
    finally:
        # Restore the parent as the active context so sibling spans nest correctly
        _local.active_ctx = parent
        span['duration_ms'] = int((time.time() - started) * 1000)
        if ctx.sampled:
            _enqueue_span(span)
class _SpanHandle:
    """Mutable view of the span dict handed to user code inside the context manager."""

    def __init__(self, span):
        self._span = span

    def set_tag(self, key, val):
        self._span['tags'][key] = val

    def add_event(self, name, attrs=None):
        self._span['events'].append({'name': name, 'ts': time.time(), 'attrs': attrs or {}})

    def set_error(self, msg):
        self._span['status'] = 'error'
        self._span['status_message'] = msg
def _enqueue_span(span: dict):
    flush = False
    with _batch_lock:
        _batch.append(span)
        flush = len(_batch) >= BATCH_FLUSH_SIZE
    # Flush outside the lock: _flush_batch acquires the same lock, and
    # threading.Lock is not reentrant, so flushing while holding it would deadlock.
    if flush:
        _flush_batch()

def _flush_batch():
    spans = []
    with _batch_lock:
        while _batch:
            spans.append(_batch.popleft())
    if spans:
        # Batch export to the collector (async HTTP POST or Kafka)
        _send_to_collector(spans)
def _should_sample() -> bool:
    """Head-based sampling: sample 10% of traces. Can be dynamic per service."""
    import random
    return random.random() < 0.10
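A random per-service decision can disagree across services if the propagated flag is ever lost. A deterministic alternative, sketched here as a hypothetical drop-in for `_should_sample` (the idea behind OpenTelemetry's TraceIdRatioBased sampler), derives the decision from the trace id itself, so every service reaches the same verdict for the same trace:

```python
def ratio_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministic sampling: compare the low 64 bits of the 128-bit trace id
    (the last 16 hex chars) against a threshold derived from the rate.
    Same trace_id always yields the same decision on every service."""
    threshold = int(rate * (1 << 64))
    return int(trace_id[-16:], 16) < threshold
```

Because secrets.token_hex produces uniformly random ids, roughly `rate` of all traces fall below the threshold.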
Query: Finding Slow and Errored Traces
-- Top 20 slowest traces in the last hour for a service
SELECT trace_id, operation, duration_ms, started_at, user_id, tags
FROM Trace
WHERE service_name='checkout-service'
AND started_at >= NOW() - INTERVAL '1 hour'
ORDER BY duration_ms DESC
LIMIT 20;
-- All spans in a trace (for waterfall visualization)
SELECT span_id, parent_span_id, service_name, operation, kind,
status, status_message, started_at,
started_at + (duration_ms || ' milliseconds')::INTERVAL AS ended_at,
duration_ms, tags
FROM Span
WHERE trace_id='abc123def456...'
ORDER BY started_at ASC;
-- Services with error rate > 1% in the last 5 minutes
SELECT service_name,
COUNT(*) AS total_spans,
SUM(CASE WHEN status='error' THEN 1 ELSE 0 END) AS error_spans,
ROUND(100.0 * SUM(CASE WHEN status='error' THEN 1 ELSE 0 END) / COUNT(*), 2) AS error_rate_pct
FROM Span
WHERE started_at >= NOW() - INTERVAL '5 minutes'
GROUP BY service_name
HAVING SUM(CASE WHEN status='error' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) > 1
ORDER BY error_rate_pct DESC;
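The waterfall query returns a flat list of rows; the viewer reconstructs the tree from parent_span_id. A minimal sketch of that reconstruction (field names follow the Span table; `waterfall` is a hypothetical helper, not part of the SDK above):

```python
def waterfall(spans):
    """Indent each span under its parent, siblings ordered by start time,
    mirroring what a trace viewer renders from the 'all spans in a trace' query."""
    children = {}
    for s in spans:
        children.setdefault(s["parent_span_id"], []).append(s)

    lines = []

    def walk(parent_id, depth):
        for s in sorted(children.get(parent_id, []), key=lambda s: s["started_at"]):
            lines.append(f"{'  ' * depth}{s['service_name']}: {s['operation']} ({s['duration_ms']}ms)")
            walk(s["span_id"], depth + 1)

    walk(None, 0)  # root spans have parent_span_id = NULL
    return lines
```

Feeding it the three spans of a checkout request would produce the root span at depth 0 with its auth and database children indented beneath it.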
Key Design Decisions
- W3C traceparent standard: using the W3C standard header means any OpenTelemetry-compatible library or vendor (Jaeger, Zipkin, Datadog, AWS X-Ray) can participate without custom integration. A service only needs to parse the incoming traceparent, create a child span context, and forward the header to downstream calls.
- Head-based sampling at 10%: recording every span for every request is prohibitively expensive at high traffic. Head-based sampling decides at trace creation time (the root span) whether to record the trace, then propagates the sampling decision through all child spans via the flags field in traceparent. All spans for a sampled trace are recorded; none for an unsampled trace. Alternative: tail-based sampling (always record; filter to keep errors and slow traces) — more useful for debugging but requires buffering all spans until the trace completes.
- Batched async writes: flushing each span individually adds 1–5ms of I/O overhead per span. Batching 100 spans into a single POST reduces this to <0.1ms per span. Use a background thread for flushing so the hot path is never blocked.
- GIN index on tags: spans carry structured tags (http.status_code, db.table, rpc.method). A GIN index on the tags JSONB column enables tag-based queries — "find all spans where db.table='orders' and duration_ms > 500" — without a full table scan.
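The background-thread flushing described in the batched-writes decision can be sketched as follows. This is a minimal illustration under stated assumptions, not the collector's real client: `send` stands in for whatever transport ships the batch (HTTP POST to a collector, Kafka producer), and the size and interval knobs are illustrative defaults.

```python
import queue
import threading
import time

class BatchExporter:
    """Background flusher: the hot path only enqueues; a daemon thread
    drains the queue and ships batches, so span recording never blocks on I/O."""

    def __init__(self, send, batch_size=100, flush_interval=1.0):
        self._send = send              # placeholder for the real transport
        self._q = queue.Queue()
        self._batch_size = batch_size
        self._interval = flush_interval
        threading.Thread(target=self._run, daemon=True).start()

    def enqueue(self, span):
        self._q.put(span)  # O(1); never blocks the request path

    def _run(self):
        batch = []
        deadline = time.monotonic() + self._interval
        while True:
            timeout = max(0.0, deadline - time.monotonic())
            try:
                batch.append(self._q.get(timeout=timeout))
            except queue.Empty:
                pass
            # Flush on size or on timer, whichever comes first
            if len(batch) >= self._batch_size or time.monotonic() >= deadline:
                if batch:
                    self._send(batch)
                    batch = []
                deadline = time.monotonic() + self._interval
```

The timer bound matters as much as the size bound: without it, a low-traffic service would hold its last few spans indefinitely waiting for a full batch.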
Frequently Asked Questions

What is the difference between distributed tracing, logging, and metrics?
Logs are timestamped records of individual events ("user 42 logged in at 14:03:22"): unstructured or semi-structured, high volume, best for debugging specific known events. Metrics are aggregated numerical measurements over time (request_rate=142 req/s, p99_latency=230ms): low cardinality, queryable over time ranges, best for dashboards and alerting. Traces capture the causal chain of operations for a single request across multiple services. A trace shows that request X took 340ms: 5ms in the API gateway, 8ms in the auth service, 290ms in the orders DB, 37ms in the cache. Traces answer "why was this specific request slow?", a question logs and metrics cannot answer alone. The three are complementary: metrics alert you that p99 spiked, logs give you the events around the spike, and traces show exactly which service and operation caused it. In practice, correlate all three with trace_id: include the trace_id in log lines and as a metric tag so you can pivot from a metric alert to the logs and trace for the same request.

What is sampling and what are the trade-offs between head-based and tail-based sampling?
Sampling reduces the volume of trace data recorded; recording every span for every request at 100K RPS would generate millions of span writes per second, too expensive to store and query. Head-based sampling decides at trace creation (the root span) whether to record the trace, based on a fixed rate (e.g., 10%) or a dynamic rate, and propagates the decision through all child spans via the traceparent flags field. It is simple to implement and has zero overhead for unsampled traces, but errors and slow traces are sampled at the same rate as fast ones, so you may miss 90% of rare errors. Tail-based sampling buffers all spans for a trace, waits for the trace to complete, then decides whether to keep it based on its characteristics (status=error, duration > 1s, specific user IDs). It is much more useful for debugging, keeping 100% of error and slow traces while sampling out fast successful ones, but it requires buffering all spans in a collector before the decision is made, adding latency and complexity.

How do you correlate traces with logs for effective debugging?
When a trace shows a slow span in the payments-service, you want the log lines emitted within that span. The key is to include trace_id and span_id in every log line emitted during a request, e.g. via logging.LoggerAdapter(logger, extra={'trace_id': ctx.trace_id, 'span_id': ctx.span_id}) in the request middleware. Log aggregators (ELK, Datadog, Splunk) index the trace_id field: when investigating a slow trace, copy the trace_id from the trace viewer and search logs for trace_id:"abc123def456", and all log lines from all services during that request appear in chronological order. With structured JSON logging, a line looks like {"ts":"2026-04-17T14:03:22Z","level":"ERROR","msg":"DB timeout","trace_id":"abc123","span_id":"def456","latency_ms":5001}; the embedded trace_id enables drill-down from the trace viewer to the exact error message that caused the span to fail.

How do you implement tracing for async operations like job queues?
A synchronous HTTP call naturally passes the traceparent header; an async job (enqueue a job, process it 30 seconds later) breaks the trace chain because there is no header to pass. Solutions: (1) Store the trace context in the job payload: when enqueueing, include the output of inject_headers(current_ctx) in the payload; when the worker starts, call extract_context(job.payload["traceparent"]) and create a child span with that as parent. The original HTTP request trace then shows the job being enqueued, and the worker trace shows the processing. (2) Use trace links (OpenTelemetry): instead of a parent-child relationship, a "link" references the originating trace without making the async work its child, which is semantically more accurate since async work is not a synchronous child. Most trace viewers (Jaeger, Tempo) support linked traces. (3) Propagate through Kafka message headers or SQS message attributes using the same traceparent format.

How do you use distributed traces to identify N+1 query problems in a microservices architecture?
The N+1 pattern in microservices: a service fetches a list of 50 orders, then makes 50 individual API calls to the user-service to get each user's name instead of one batched call. In a trace waterfall view, one "list_orders" span has 50 child spans each calling "get_user", a clear visual pattern of many sequential identical operations. To detect it in the trace query interface, look for traces where (1) span.operation matches a pattern ("get_user") with more than 10 occurrences per trace, and (2) total duration far exceeds any single span's duration (50 calls of 5ms each sum to 250ms where one batched call would take 10ms). The trace makes the problem immediately visible; without tracing you would only see elevated total latency with no indication of the root cause. The fix is a batch get_users_by_ids endpoint on the user-service, called once with all 50 IDs; verify it by rerunning the same flow and watching the trace collapse from 51 spans to 2.