A span collector is the ingestion backbone of a distributed tracing system. It receives high-throughput span streams from many services, assembles parent-child relationships into complete traces, and routes spans to the appropriate storage tier based on age and access patterns. This design covers the ingestion pipeline, trace assembly algorithm, and storage tiering with TTL cleanup.
Requirements
Functional
- Ingest spans over OTLP/gRPC and OTLP/HTTP from SDK clients and agents.
- Buffer spans in memory and group them by `traceId` for assembly.
- Detect complete traces and flush to hot storage within 30 seconds.
- Tier completed traces to warm and cold storage based on age.
- Run TTL cleanup to expire data beyond retention policy.
Non-Functional
- Sustain 500,000 spans/second per collector node.
- Memory footprint for in-flight trace buffers bounded at 4 GB per node.
- Span loss rate under 0.01% under normal load.
Data Model
- IngestBatch — the OTLP payload unit: a list of `ResourceSpans`, each containing a `Resource` (service name, host, version attributes) and one or more `ScopeSpans` lists.
- TraceBuffer — `traceId`, `spans` (list), `hasRootSpan` (bool), `firstSpanAt`, `lastSpanAt`, `spanCount`. Stored in a sharded in-memory map.
- StoredTrace — `traceId`, `rootServiceName`, `rootOperationName`, `startTimeUnixNano`, `durationNano`, `spanCount`, `hasError`, `spans` (serialized), `tier` (HOT, WARM, COLD), `createdAt`, `expiresAt`.
- TierPolicy — `policyId`, `hotRetentionHours`, `warmRetentionDays`, `coldRetentionDays`, `coldEnabled`.
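The TraceBuffer record above can be sketched as a Python dataclass. This is a minimal illustration of the data model, not a prescribed implementation; field names follow the list above, and the timestamp representation (unix seconds) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TraceBuffer:
    """In-flight assembly state for one trace (fields from the data model)."""
    trace_id: str
    spans: list = field(default_factory=list)
    has_root_span: bool = False
    first_span_at: float = 0.0   # unix seconds: arrival of the first span
    last_span_at: float = 0.0    # unix seconds: arrival of the most recent span

    @property
    def span_count(self) -> int:
        # Derived rather than stored, so it cannot drift out of sync.
        return len(self.spans)
```

Keeping `spanCount` derived avoids a counter that could drift from the list under concurrent updates.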
Ingestion Pipeline
The collector exposes a gRPC endpoint implementing the OTLP TraceService.Export RPC. On receipt, it deserializes the protobuf payload, enriches each span with the Resource attributes (service name, version), and routes spans to a shard of the in-memory trace buffer map using traceId % shardCount. Each shard is protected by its own mutex to reduce lock contention. Enriched spans are also written to a Kafka topic for downstream consumers (tail sampler, metrics aggregator) so the collector is not a processing bottleneck.
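The shard-routing step can be sketched as follows. The shard count, lock layout, and dict-of-lists buffer shape are illustrative assumptions; the OTLP decode and the Kafka write are elided.

```python
import threading
from collections import defaultdict


class ShardedBufferMap:
    """In-memory trace buffers, sharded by traceId with a mutex per shard."""

    def __init__(self, shard_count: int = 16):
        self.shards = [defaultdict(list) for _ in range(shard_count)]
        self.locks = [threading.Lock() for _ in range(shard_count)]

    def shard_index(self, trace_id: str) -> int:
        # OTLP trace IDs are 128-bit values serialized as hex; the
        # traceId % shardCount routing treats the ID as an integer.
        return int(trace_id, 16) % len(self.shards)

    def add_span(self, trace_id: str, span: dict) -> None:
        i = self.shard_index(trace_id)
        with self.locks[i]:          # per-shard mutex limits contention
            self.shards[i][trace_id].append(span)


buffers = ShardedBufferMap()
buffers.add_span("4bf92f3577b34da6a3ce929d0e0e4736", {"name": "GET /cart"})
```

Because each shard has its own lock, two ingestion workers only contend when their spans hash to the same shard.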
Core Algorithms
Trace Assembly
Each incoming span is added to the TraceBuffer for its traceId. If the span has no parentSpanId, set hasRootSpan = true. A background flusher thread scans all buffers every second and flushes a buffer when both conditions are met: hasRootSpan is true AND lastSpanAt is more than assemblyGracePeriodSeconds ago. The grace period allows late-arriving child spans to be included. Buffers with no root span after maxBufferAgeSeconds are flushed as incomplete traces, marked with a flag so queries can distinguish them.
Parent-Child Linking
During flush, sort spans by startTimeUnixNano. Build a map from spanId to span. For each span with a parentSpanId, look up the parent in the map. If found, link the child to the parent in the tree structure. If not found (parent arrived in a different collector shard or was dropped), attach the span as a secondary root and record a parentMissing flag. This handles partial traces gracefully without discarding data.
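The linking pass above can be sketched like this, with spans as plain dicts for illustration (the real span type would come from the OTLP protobuf):

```python
def build_trace_tree(spans: list[dict]) -> list[dict]:
    """Link spans into a tree; spans whose parent is absent become
    secondary roots flagged with parentMissing."""
    spans = sorted(spans, key=lambda s: s["startTimeUnixNano"])
    by_id = {s["spanId"]: s for s in spans}
    roots = []
    for span in spans:
        span.setdefault("children", [])
        parent_id = span.get("parentSpanId")
        if parent_id is None:
            roots.append(span)                  # true root
        elif parent_id in by_id:
            # setdefault guards against a child sorting before its parent
            by_id[parent_id].setdefault("children", []).append(span)
        else:
            span["parentMissing"] = True        # parent dropped or elsewhere
            roots.append(span)                  # secondary root
    return roots
```

Returning a list of roots rather than a single root is what lets partial traces survive: orphaned subtrees stay queryable instead of being discarded.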
Storage Tiering
On flush, write the assembled trace to HOT storage (ClickHouse or Elasticsearch) with expiresAt = now + hotRetentionHours. A tiering job runs hourly, queries HOT storage for traces older than hotRetentionHours, serializes them to Parquet files, and writes them to WARM storage (S3 with a metadata index in DynamoDB). Traces older than warmRetentionDays are further compressed and moved to COLD storage (Glacier or equivalent). The metadata index in DynamoDB always knows which tier a traceId lives in.
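The tiering job's core decision is mapping a trace's age onto a tier under a TierPolicy. A sketch, assuming the retention windows are cumulative from trace creation (i.e. `warmRetentionDays` and `coldRetentionDays` are measured from `createdAt`, not from entry into the previous tier):

```python
from datetime import datetime, timedelta, timezone


def target_tier(created_at: datetime, now: datetime, policy: dict) -> str:
    """Return the tier a trace should live in, given its age and a
    TierPolicy-like dict (field names from the data model above)."""
    age = now - created_at
    if age < timedelta(hours=policy["hotRetentionHours"]):
        return "HOT"
    if age < timedelta(days=policy["warmRetentionDays"]):
        return "WARM"
    if policy["coldEnabled"] and age < timedelta(days=policy["coldRetentionDays"]):
        return "COLD"
    return "EXPIRED"   # eligible for TTL cleanup
```

The hourly job would compare each trace's current `tier` against `target_tier` and move only those whose answer has changed.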
TTL Cleanup
A cleanup worker runs daily, queries each tier for expired traces (where expiresAt < now), and deletes them in batches. For ClickHouse, use TTL table expressions: TTL toDateTime(expiresAt) DELETE. For S3, use lifecycle rules keyed on object prefix. For DynamoDB metadata, use DynamoDB TTL on the expiresAt attribute.
API Design
- `POST /v1/traces` — OTLP HTTP span ingestion (JSON or protobuf).
- gRPC `opentelemetry.proto.collector.trace.v1.TraceService/Export` — OTLP gRPC ingestion.
- `GET /internal/buffers/stats` — operator endpoint: active buffer count, total spans buffered, oldest buffer age.
- `POST /internal/buffers/flush` — force-flush all buffers (used during graceful shutdown).
- `GET /internal/tiering/status` — last tiering job run time, traces moved, errors.
Scalability and Observability
- Scale collector nodes horizontally. Use consistent hashing on `traceId` at the load balancer layer so all spans of a trace land on the same collector node, enabling single-node assembly without cross-node coordination.
- If a collector node fails mid-assembly, spans on that node are lost for in-flight traces. Mitigate by also writing spans to Kafka, allowing a recovery worker to re-assemble from the log.
- Emit `collector_spans_received_total`, `collector_traces_flushed_total{complete|incomplete}`, `collector_buffer_age_seconds` (histogram), and `collector_memory_bytes` gauges.
- Alert when `collector_buffer_age_seconds` p99 exceeds `assemblyGracePeriodSeconds * 2` — this indicates the flusher is falling behind.