Low Level Design: Distributed Tracing Design and Internals

Distributed tracing tracks a single request as it propagates through multiple microservices, capturing timing, errors, and context at each service boundary. Without tracing, debugging a slow or failing request in a microservices system requires correlating logs from dozens of services — effectively impossible at scale. Jaeger, Zipkin, and AWS X-Ray implement distributed tracing; OpenTelemetry standardizes the instrumentation that feeds them. Understanding trace propagation, sampling strategies, and trace storage is essential for designing observable distributed systems.

Trace Structure: Spans and Context Propagation

A trace represents the end-to-end journey of one request and consists of spans: each span represents one unit of work (one service call, one database query, one external call). A span records:

  • trace_id — unique across the entire request journey
  • span_id — unique within the trace
  • parent_span_id — the span that called this one
  • operation name, start time, and duration
  • service name and key-value tags (HTTP method, status code, user_id, SQL query)

Context propagation: the trace_id and current span_id are passed between services via HTTP headers (W3C Trace Context: the traceparent header) or gRPC metadata. Each service creates a child span with the incoming span as parent, establishing the causal hierarchy.

// OpenTelemetry Go instrumentation
// (getUserID and inventoryURL are assumed to be defined elsewhere in the service)
import (
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/propagation"
)

func handleOrder(w http.ResponseWriter, r *http.Request) {
    // Extract incoming trace context from HTTP headers
    ctx := otel.GetTextMapPropagator().Extract(r.Context(),
        propagation.HeaderCarrier(r.Header))

    // Start a new span as a child of the incoming context
    tracer := otel.Tracer("order-service")
    ctx, span := tracer.Start(ctx, "handleOrder")
    defer span.End()

    // Add attributes to the span
    span.SetAttributes(
        attribute.String("order.id", r.URL.Query().Get("id")),
        attribute.String("user.id", getUserID(r)),
    )

    // When calling a downstream service, inject trace context into outgoing headers
    req, _ := http.NewRequestWithContext(ctx, "GET", inventoryURL, nil)
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
    // The inventory service receives a traceparent header with the same trace_id
    // and this span as parent

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return
    }
    defer resp.Body.Close()
}
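The traceparent header exchanged above has a fixed W3C Trace Context shape: version(2 hex)-trace_id(32 hex)-parent_span_id(16 hex)-flags(2 hex), where the lowest bit of the flags byte is the sampled flag. A minimal stdlib-only parser sketches the format (simplified — the full W3C spec additionally validates lowercase hex and rejects all-zero IDs):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// TraceContext holds the fields carried by a W3C traceparent header.
type TraceContext struct {
	TraceID  string
	ParentID string
	Sampled  bool // lowest bit of the flags byte
}

// ParseTraceparent splits "version-trace_id-parent_id-flags" into its fields.
// Length checks stand in for the spec's full hex validation.
func ParseTraceparent(h string) (TraceContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return TraceContext{}, fmt.Errorf("malformed traceparent: %q", h)
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return TraceContext{}, fmt.Errorf("bad flags field: %q", parts[3])
	}
	return TraceContext{
		TraceID:  parts[1],
		ParentID: parts[2],
		Sampled:  flags&1 == 1,
	}, nil
}

func main() {
	// Example header value from the W3C Trace Context specification.
	tc, _ := ParseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Printf("trace=%s parent=%s sampled=%v\n", tc.TraceID, tc.ParentID, tc.Sampled)
}
```

Because the sampled flag travels in the header, a head-based sampling decision made at the edge propagates automatically to every downstream service.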

Sampling Strategies

A high-traffic service (10K req/s) generates 10,000 traces per second — storing every trace is prohibitively expensive. Sampling reduces the fraction of traces stored.

Head-based sampling: decide at the start of the request whether to sample it, before any processing. Simple to implement — one decision propagates to all downstream services via context headers. Problem: it cannot prefer interesting (slow, errored) traces over boring ones, because the decision is made before outcomes are known.

Tail-based sampling: buffer all span data and make the sampling decision after the request completes, so slow or errored traces can be kept preferentially. Requires a trace aggregation layer that collects all spans before deciding; Honeycomb's Refinery and the OpenTelemetry Collector's tail-sampling processor work this way. Typical strategy: always sample errors (100%), sample slow requests (duration > P99), and sample a small fraction of successful fast requests (0.1-1%).

Trace Storage and Query

Spans are sent from services to a collector (OpenTelemetry Collector, Jaeger Agent) which batches and forwards them to storage. Storage requirements: high write throughput (millions of spans/second), efficient retrieval by trace_id (O(1) lookup), index by service name, operation, duration, error status for analytical queries (“find all slow traces in the order service from the last hour”). Jaeger uses Cassandra or Elasticsearch as backends: Cassandra for write-heavy trace ingestion (fast writes, TTL-based expiry), Elasticsearch for full-text and tag-based trace search. Typical retention: 7-30 days (traces are diagnostic, not long-term data). Trace sampling rates and storage TTLs are balanced to stay within budget.

Key Interview Discussion Points

  • Correlation with logs and metrics: the three pillars of observability are traces, logs, and metrics; inject trace_id into all log statements (structured logging with trace_id field) so logs can be correlated with traces in the same debugging session
  • OpenTelemetry standard: OTel (OpenTelemetry) is the CNCF standard for instrumentation; one SDK instruments your service for traces, metrics, and logs, with pluggable exporters to Jaeger, Zipkin, Prometheus, or any backend — avoids vendor lock-in
  • Async span propagation: for async messaging (Kafka, SQS), inject trace context into message headers; consumers extract and continue the trace on the other side of the queue, enabling end-to-end tracing across message boundaries
  • Baggage propagation: trace context can carry baggage (key-value pairs) that flows with every downstream span — useful for propagating user_id, experiment flags, or tenant_id without passing them as explicit function parameters through every service
  • Critical path analysis: in a trace with parallel fan-out (service calls A, B, C simultaneously), the critical path is the longest chain of spans; identifying the critical path pinpoints which service to optimize to reduce overall latency
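The critical-path idea in the last bullet can be sketched over a simple call tree. This sketch follows the slowest child at each fan-out using span durations alone; a real analysis would work from start/end timestamps to account for gaps and partial overlap between siblings:

```go
package main

import "fmt"

// Node is one span in a trace's call tree; Duration is in milliseconds.
type Node struct {
	Name     string
	Duration int
	Children []*Node
}

// CriticalPath walks from the root, at each fan-out descending into the
// child with the largest duration — the chain that bounds overall latency
// when siblings execute in parallel.
func CriticalPath(root *Node) []string {
	path := []string{root.Name}
	cur := root
	for len(cur.Children) > 0 {
		slowest := cur.Children[0]
		for _, c := cur.Children[1:], slowest; false; {
			_ = c
		}
		for _, c := range cur.Children[1:] {
			if c.Duration > slowest.Duration {
				slowest = c
			}
		}
		path = append(path, slowest.Name)
		cur = slowest
	}
	return path
}

func main() {
	// gateway fans out to A, B, C in parallel; B (which calls db) dominates.
	root := &Node{Name: "gateway", Duration: 900, Children: []*Node{
		{Name: "A", Duration: 100},
		{Name: "B", Duration: 850, Children: []*Node{{Name: "db", Duration: 700}}},
		{Name: "C", Duration: 200},
	}}
	fmt.Println(CriticalPath(root)) // [gateway B db]
}
```

Optimizing A or C here would not reduce end-to-end latency; only the gateway → B → db chain matters.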