Question 1

What is the difference between distributed tracing, logging, and metrics?

Accepted Answer

Logs: timestamped records of individual events ("user 42 logged in at 14:03:22"). They are unstructured or semi-structured, high in volume, and best for debugging specific, known events.

Metrics: numerical measurements aggregated over time (request_rate=142 req/s, p99_latency=230ms). They are low in cardinality, queryable over time ranges, and best for dashboards and alerting.

Traces: the causal chain of operations for a single request across multiple services. A trace shows that request X took 340ms: 5ms in the API gateway, 8ms in the auth service, 290ms in the orders DB, and 37ms in the cache. Traces answer "why was this specific request slow?", a question that logs and metrics cannot answer on their own.

The three are complementary: metrics alert you that p99 spiked, logs give you the events around the spike, and traces show you which service and operation caused it. In practice, correlate all three with trace_id: include the trace_id in every log line, and attach it to metrics as an exemplar (a sample trace_id stored alongside a data point) rather than as a regular tag, since one tag value per request would destroy the low cardinality that keeps metrics cheap. You can then pivot from a metric alert to the logs and traces for the same request.

Question 2

What is sampling and what are the trade-offs between head-based and tail-based sampling?

Accepted Answer

Sampling reduces the volume of trace data recorded. At 100K RPS, with tens of spans per request, recording every span would generate millions of span writes per second, which is too expensive to store and query.

Head-based sampling: decide at trace creation (the root span) whether to record the trace, based on a fixed rate (e.g., 10%) or a dynamic rate. The decision propagates to all child spans through the sampled bit in the traceparent trace-flags field. It is simple to implement and adds almost no overhead for unsampled traces. Disadvantage: errors and slow traces are sampled at the same rate as fast ones, so at a 10% rate you miss about 90% of rare errors.

Tail-based sampling: buffer all spans for a trace, wait for the trace to complete, then decide whether to keep it based on its characteristics (status=error, duration > 1s, specific user IDs). This is much more useful for debugging: it keeps 100% of error and slow traces while sampling out fast, successful ones. Disadvantage: every span must be buffered in a collector until the decision is made, and all spans of a trace must reach the same collector instance, which adds memory cost, latency, and operational complexity.
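The two strategies can be sketched in a few lines of Python. This is a minimal illustration, not a production sampler: the `Span` type, the function names, and the `baseline_rate` parameter are all assumptions made for the example. The head-based decision hashes nothing; it simply reuses the low 64 bits of the trace ID so that every service seeing the same ID reaches the same verdict.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str        # 32 hex chars, as in W3C traceparent
    name: str
    duration_ms: float
    error: bool = False


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at the root span, using only the trace ID."""
    # Compare the low 64 bits of the ID against a threshold derived from
    # the rate; the result is deterministic for a given trace ID.
    threshold = int(rate * 2**64)
    return int(trace_id[-16:], 16) < threshold


def trace_flags(sampled: bool) -> str:
    """traceparent trace-flags field: bit 0 is the 'sampled' flag."""
    return "01" if sampled else "00"


def tail_sample(spans: list[Span], slow_ms: float = 1000.0,
                baseline_rate: float = 0.01) -> bool:
    """Tail-based: decide once all spans of the trace have been buffered."""
    has_error = any(s.error for s in spans)
    is_slow = max(s.duration_ms for s in spans) > slow_ms
    # Keep every error or slow trace, plus a small share of ordinary ones.
    return has_error or is_slow or head_sample(spans[0].trace_id, baseline_rate)
```

Note that `tail_sample` needs the complete span list as input, which is exactly the buffering cost described above; `head_sample` needs nothing but the ID.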

Question 3

How do you correlate traces with logs for effective debugging?

Accepted Answer

When a trace shows a slow span in the payments-service, you want to see the log lines emitted within that span. To correlate them, include trace_id and span_id in every log line emitted during a request. In Python: import logging; logger = logging.getLogger(__name__). In the request middleware, wrap the logger: logging.LoggerAdapter(logger, extra={'trace_id': ctx.trace_id, 'span_id': ctx.span_id}). The adapter attaches both fields to every record; the formatter must also reference them, or they will not appear in the output.

Log aggregators (ELK, Datadog, Splunk) index the trace_id field. When investigating a slow trace in the trace viewer, copy the trace_id and search the logs: trace_id:"abc123def456". Every log line that carries that trace_id, from every service the request touched, appears in chronological order.

With structured (JSON) logging, a line looks like: {"ts":"2026-04-17T14:03:22Z","level":"ERROR","msg":"DB timeout","trace_id":"abc123","span_id":"def456","latency_ms":5001}. The trace_id in the log enables drill-down from the trace viewer to the log viewer, where you find the exact error message that caused the span to fail.
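A minimal sketch of this pattern using only the standard library follows. The `JsonFormatter` class and the `request_logger` helper are names invented for the example, and the IDs are hard-coded where a real middleware would read them from the active span context.

```python
import io
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the trace fields."""
    converter = time.gmtime  # emit UTC so the trailing "Z" is accurate

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "msg": record.getMessage(),
            # Set by the LoggerAdapter; None if logged outside a request.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })


def request_logger(base: logging.Logger, trace_id: str, span_id: str):
    """Hypothetical middleware helper: bind the IDs to a per-request logger."""
    return logging.LoggerAdapter(base, {"trace_id": trace_id, "span_id": span_id})


# Demo: log to an in-memory stream so the output can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

log = request_logger(logger, "abc123", "def456")
log.error("DB timeout")
line = json.loads(stream.getvalue().splitlines()[0])
```

Because the adapter is created once per request, application code calls `log.error(...)` as usual and never has to pass the IDs explicitly.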

Question 4

How do you implement tracing for async operations like job queues?

Accepted Answer

A synchronous HTTP call carries the traceparent header naturally. An async job (enqueue now, process 30 seconds later) breaks the trace chain, because there is no request header to carry the context. Solutions:

(1) Store the trace context in the job payload. When enqueueing, serialize the current context into the payload, e.g. {"traceparent": inject_headers(current_ctx)}. When the worker starts, call extract_context(job.payload["traceparent"]) and create a child span with the extracted context as its parent. The worker's spans then join the original trace, so a single trace shows the HTTP request, the enqueue, and the processing.

(2) Span links (OpenTelemetry). Instead of a parent-child relationship, the worker starts a new trace and attaches a "link" that references the enqueuing span. This is semantically more accurate, because async work is not a synchronous child of the request. Most trace viewers (Jaeger, Tempo) can display links, so you can navigate between the two traces.

(3) When the queue is a message broker, carry the context in Kafka message headers or SQS message attributes, using the same traceparent format.
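Option (1) can be sketched without any tracing library, assuming the version-00 traceparent layout (`00-<trace-id>-<parent-id>-<flags>`). Every name below (`SpanContext`, `inject`, `extract`, `enqueue`, `process`) is hypothetical, and a plain list stands in for the queue; an instrumented service would normally rely on its tracing SDK's propagator instead of hand-rolling this.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanContext:
    trace_id: str          # 32 lowercase hex chars
    span_id: str           # 16 lowercase hex chars
    sampled: bool = True


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex chars


def inject(ctx: SpanContext) -> str:
    """Serialize a context as a version-00 traceparent value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(header: str) -> SpanContext:
    """Parse a version-00 traceparent value back into a context."""
    version, trace_id, span_id, flags = header.split("-")
    if version != "00" or len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError(f"unsupported traceparent: {header!r}")
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 1))


def enqueue(queue: list, ctx: SpanContext, body: dict) -> None:
    # The context travels inside the payload because no HTTP header exists.
    queue.append({"traceparent": inject(ctx), "body": body})


def process(job: dict) -> dict:
    parent = extract(job["traceparent"])
    # Child span: same trace_id, fresh span_id, parent recorded explicitly.
    return {
        "trace_id": parent.trace_id,
        "span_id": new_span_id(),
        "parent_span_id": parent.span_id,
        "name": "process_job",
    }


# Demo: the request enqueues a job; a worker picks it up later.
queue: list = []
request_ctx = SpanContext(secrets.token_hex(16), new_span_id())
enqueue(queue, request_ctx, {"order_id": 42})
worker_span = process(queue.pop(0))
```

For option (2), the worker would instead generate a new `trace_id` and store the extracted context as a link on its root span, leaving `parent_span_id` empty.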

Question 5

How do you use distributed traces to identify N+1 query problems in a microservices architecture?

Accepted Answer

N+1 in microservices: a service fetches a list of 50 orders, then makes 50 individual API calls to the user-service to get each user's name, instead of one batched call. In a trace waterfall view, one "list_orders" span has 50 child "get_user" spans: a clear visual pattern of many sequential, identical operations.

Detection: in the trace query interface, find traces where (1) span.operation matches a pattern ("get_user") and COUNT(*) per trace > 10; and (2) total duration is much greater than any single span's duration (50 sequential calls × 5ms = 250ms, where one batched call would take about 10ms). The trace makes the pattern immediately visible; without tracing you would see only elevated total latency, with no indication of the root cause.

Fix: add a batch "get_users_by_ids" endpoint to the user-service and call it once with all 50 IDs. Verify the fix by rerunning the same flow and confirming that the trace collapses from 51 spans (1 parent + 50 calls) to 2.
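If spans can be exported as plain records, the first detection rule reduces to a group-and-count. The sketch below assumes each span is a dict with `trace_id`, `span_id`, `parent_id`, and `operation` keys; that schema and the function name `find_n_plus_one` are illustrative assumptions, not any trace backend's query API.

```python
from collections import Counter


def find_n_plus_one(spans: list[dict], threshold: int = 10) -> dict:
    """Return {(trace_id, parent_id, operation): count} for every operation
    repeated more than `threshold` times under the same parent span."""
    counts = Counter(
        (s["trace_id"], s["parent_id"], s["operation"]) for s in spans
    )
    return {key: n for key, n in counts.items() if n > threshold}


# Demo: one list_orders span with 50 get_user children of 5ms each.
spans = [{"trace_id": "t1", "span_id": "s0", "parent_id": None,
          "operation": "list_orders", "duration_ms": 260}]
spans += [{"trace_id": "t1", "span_id": f"s{i}", "parent_id": "s0",
           "operation": "get_user", "duration_ms": 5} for i in range(1, 51)]

suspects = find_n_plus_one(spans)
wasted_ms = sum(s["duration_ms"] for s in spans if s["operation"] == "get_user")
```

Grouping by parent span rather than by trace alone avoids flagging an operation that is legitimately called a few times from many different places in the same request.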

Distributed Tracing System Low-Level Design: W3C Traceparent, Span Recording, Sampling, and Trace Visualization

Distributed Tracing System: Low-Level Design

Core Data Model

Trace Context Propagation (W3C traceparent)

Span Recording SDK

Query: Finding Slow and Errored Traces

Key Design Decisions