Log Pipeline Low-Level Design: Collection, Aggregation, Parsing, and Storage Tiering

A production log pipeline moves log data from every host and container to durable storage where it can be searched and analyzed. The key challenges are: collecting reliably without losing data, parsing unstructured text into queryable fields, and controlling storage costs across hot, warm, and cold tiers.

Collection: Log Agents

Deploy a log agent (Filebeat or Fluentd) on every host. The agent tails log files and ships events to Kafka. For Kubernetes workloads, run the agent as a DaemonSet so every node is covered, or as a sidecar container injected into each pod when per-pod configuration is needed.

Each agent maintains a local disk buffer. If Kafka becomes unavailable the agent continues writing to disk and flushes the backlog once connectivity is restored. This prevents log loss during downstream outages without blocking the application.

Transport: Kafka

Kafka decouples collection from processing. Create topics segmented by service and severity (e.g., logs.payments.error, logs.api.info). Partition by hostname to preserve per-host ordering. Retention on the Kafka topic is set to 24 hours, acting as a buffer in case the parsing consumer falls behind.
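
The topic and partition selection described above can be sketched as a pure routing function. The partition count and event shape are assumptions for illustration; a real producer would delegate the send to a Kafka client.

```python
import zlib

def route_event(event, partitions_per_topic=12):
    """Choose a Kafka topic and partition for a log event. Topic follows
    the logs.<service>.<severity> scheme above; partition is a stable
    hash of the hostname so one host's events stay in order. The
    partition count is an illustrative assumption."""
    topic = f"logs.{event['service']}.{event['level'].lower()}"
    # crc32 is deterministic across processes and restarts, unlike
    # Python's salted built-in hash(), so the host-to-partition mapping
    # is stable.
    partition = zlib.crc32(event["hostname"].encode()) % partitions_per_topic
    return topic, partition
```

Because the mapping is deterministic, every event from a given host lands on the same partition, which is what preserves per-host ordering inside Kafka.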

Parsing Pipeline

A consumer reads from Kafka and applies grok or regex patterns to extract structured fields from unstructured log lines. A typical parsed application log event carries fields such as:

  • timestamp — normalized to ISO 8601 UTC
  • level — DEBUG, INFO, WARN, ERROR
  • message — human-readable text
  • trace_id — distributed trace correlation
  • user_id — identity for audit queries

Application logs should emit JSON natively to avoid brittle regex. Enforce a consistent field naming schema across all services so queries work the same regardless of origin.
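
A common way to combine both paths is a JSON fast path with a regex fallback for legacy plain-text lines. The fallback pattern below is an illustrative assumption about one text format, not a universal grok library; anything that matches neither path returns None so it can be routed to a dead-letter queue instead of being dropped.

```python
import json
import re

# Illustrative fallback for a plain-text line such as:
#   2024-05-01T12:00:00Z ERROR payment failed
# Field names match the schema above; the exact pattern is an assumption.
TEXT_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>DEBUG|INFO|WARN|ERROR)\s+(?P<message>.*)$"
)

def parse_line(raw):
    """JSON fast path first; regex fallback for legacy text lines.
    Returns None for unparseable input (dead-letter queue candidate)."""
    try:
        event = json.loads(raw)
        if isinstance(event, dict) and "timestamp" in event:
            return event
    except ValueError:
        pass
    m = TEXT_LINE.match(raw)
    return m.groupdict() if m else None
```

The JSON branch is both faster and less brittle, which is why native JSON emission is the preferred contract and regex is only a migration aid.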

Enrichment

After parsing, the consumer enriches each event before routing:

  • Kubernetes metadata — pod name, namespace, deployment name (fetched from the Kubernetes API at agent start, cached locally)
  • Geo lookup — country and city from the client IP using a local MaxMind database to avoid external latency
  • Service version — resolved from a deployment registry by matching pod labels to a release version
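
A minimal sketch of the enrichment step, assuming the Kubernetes metadata and geo database have already been loaded into local caches (the cache shapes below are assumptions, not real client APIs; the version lookup is omitted for brevity):

```python
def enrich(event, pod_cache, geo_db):
    """Attach cached Kubernetes metadata and a local geo lookup to an
    event. pod_cache maps pod name -> metadata fetched once from the
    Kubernetes API; geo_db stands in for a local MaxMind-style lookup
    keyed by client IP. Both shapes are illustrative."""
    out = dict(event)  # never mutate the input event in place
    meta = pod_cache.get(event.get("pod"), {})
    out["namespace"] = meta.get("namespace")
    out["deployment"] = meta.get("deployment")
    geo = geo_db.get(event.get("client_ip"), {})
    out["geo_country"] = geo.get("country")
    return out
```

Keeping both lookups local (cached metadata, on-disk geo database) is what keeps enrichment off the critical-latency path.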

Routing

After enrichment, route each log event based on severity and service:

  • ERROR / CRITICAL → Elasticsearch hot tier + PagerDuty alert trigger
  • WARN / INFO → Elasticsearch hot tier only
  • DEBUG → S3 cold storage only (too high volume for Elasticsearch)

Routing rules are configuration-driven so they can be updated without redeploying the consumer.
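
One way to express configuration-driven routing is a severity-to-destinations table that can be hot-reloaded. The destination names below are placeholders, and the fallback for unknown levels is an assumption, not a rule from the design above:

```python
# Illustrative routing table matching the rules above; in production this
# would be loaded from a config store rather than hardcoded.
ROUTES = {
    "CRITICAL": ["es-hot", "pagerduty"],
    "ERROR":    ["es-hot", "pagerduty"],
    "WARN":     ["es-hot"],
    "INFO":     ["es-hot"],
    "DEBUG":    ["s3-cold"],
}

def destinations(event, routes=ROUTES):
    """Look up destinations by severity. Unknown levels fall back to
    cold storage so nothing is silently dropped (an assumption)."""
    return routes.get(event.get("level"), ["s3-cold"])
```

Because the table is plain data, swapping a rule (say, sending WARN to PagerDuty during an incident) is a config push, not a consumer redeploy.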

Storage Tiers

  • Hot — Elasticsearch: SSD-backed nodes, fast full-text search, 7-day retention. Sized for peak query load from on-call engineers and dashboards.
  • Warm — Elasticsearch frozen tier: HDD- or object-store-backed frozen indices, 30-day retention. Queries are slower, but cost is 5–10x lower than hot. Used for incident retrospectives.
  • Cold — S3 Parquet: Indefinite retention. Logs are converted to columnar Parquet on write. Athena queries S3 directly via SQL for ad-hoc analysis and compliance requirements.

Index Lifecycle Management

Elasticsearch ILM policies automate tier transitions. A policy triggers:

  • Hot → Warm after 7 days or when the index exceeds 50 GB
  • Warm → Delete after 30 days
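
The transitions above map onto an Elasticsearch ILM policy roughly like the following. The JSON shape follows the ILM API (rollover, allocate, delete actions), but verify field names against your Elasticsearch version; the warm-tier node attribute is an assumption about your cluster's node labeling.

```python
# Sketch of the ILM policy described above, expressed as the dict that
# would be PUT to _ilm/policy/<name>. Assumes data nodes carry a custom
# "data: warm" attribute for allocation.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll the index at 50 GB or 7 days, whichever first.
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "7d"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {"allocate": {"require": {"data": "warm"}}},
            },
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
}
```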

S3 lifecycle rules move Parquet files to Glacier after 90 days for further cost reduction.

Query Interfaces

  • Kibana — full-text search and dashboards over Elasticsearch; primary interface for on-call engineers
  • Athena SQL — ad-hoc SQL over S3 Parquet; used for long-range queries and compliance reporting
  • Grafana Loki — lightweight log streaming with label-based filtering; low-cost alternative for teams with simpler search needs

Log Rate Limiting

During traffic spikes, DEBUG and INFO log volume can grow 10–100x and drive storage costs up sharply. Implement a per-service rate cap at the agent level: once a service exceeds its configured events-per-second budget, the agent drops lower-severity events and emits a counter metric indicating how many were dropped. ERROR events are never dropped.
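
A per-service token bucket implements this cleanly. The sketch below is illustrative (rate values and the injectable clock are assumptions for testability): low-severity events consume tokens and are dropped with a counter increment once the bucket is empty, while ERROR and CRITICAL always pass.

```python
import time

class ServiceRateLimiter:
    """Illustrative per-service token bucket: once a service exhausts
    its events-per-second budget, lower-severity events are dropped and
    counted; ERROR and CRITICAL always pass."""

    ALWAYS_PASS = {"ERROR", "CRITICAL"}

    def __init__(self, events_per_sec, clock=time.monotonic):
        self.rate = events_per_sec
        self.clock = clock
        self.tokens = float(events_per_sec)
        self.last = clock()
        self.dropped = 0  # exported as a counter metric

    def allow(self, level):
        # Refill tokens proportionally to elapsed time, capped at the rate.
        now = self.clock()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if level in self.ALWAYS_PASS:
            return True  # never rate-limit errors
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        self.dropped += 1
        return False
```

Running this in the agent rather than the consumer means excess volume never even reaches Kafka, which is where the cost pressure originates.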

Structured Logging Best Practices

  • Emit JSON from the application — avoids brittle grok patterns and parsing errors
  • Use consistent field names across all services (e.g., always trace_id, never traceId or request_id)
  • Include a correlation ID in every log line that links to the trace in your tracing system
  • Never log secrets, PII, or credentials — enforce with a linting rule in CI
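
A minimal helper following these practices might look like the sketch below: JSON output, consistent snake_case field names, and a correlation ID on every line. It is illustrative, not a drop-in replacement for your logging library.

```python
import datetime
import json

def log_event(level, message, trace_id, **fields):
    """Emit one structured log line as JSON. trace_id is mandatory so
    every line links back to the distributed trace; extra fields are
    passed through with the caller's (snake_case) names."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc)
                     .isoformat(timespec="milliseconds"),
        "level": level,
        "message": message,
        "trace_id": trace_id,  # same key in every service, never traceId
        **fields,
    }
    return json.dumps(record)
```

Making trace_id a required positional argument is one way to enforce the correlation-ID rule at the API level rather than by convention.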

Frequently Asked Questions

How do you design the log collection layer to handle bursty traffic without dropping events?

Deploy a sidecar or node-level agent (e.g., Fluentd, Vector) that tails application log files or listens on a local Unix socket. The agent buffers events in a bounded in-memory queue and spills to a local disk buffer when the queue is full, preventing drops during downstream backpressure. Use a durable message bus (Kafka) as the collection backbone: agents batch-produce to Kafka with acknowledgment from the leader only (acks=1) for throughput, or from all in-sync replicas (acks=all) for durability; choose based on loss tolerance. Size Kafka partitions so each handles a predictable event rate, and auto-scale consumers. The disk-spill buffer on the agent provides resilience if Kafka is briefly unreachable.

Describe an aggregation architecture that computes per-service error rates over a sliding window.

Use a stream processing engine (Flink, Kafka Streams) consuming from the log Kafka topic. Define a keyed stream partitioned by service name, then apply a sliding window (e.g., a 5-minute window sliding every 30 seconds). Within each window, count total log events and events with level=ERROR, then emit a rate metric. Store windowed aggregates in a time-series database (Prometheus remote write, or InfluxDB) for alerting and dashboarding. Handle late-arriving events with an allowed-lateness parameter: events arriving up to 60 seconds late are folded into the correct window before it is finalized. For very high cardinality (per-endpoint error rates), use approximate counting with a Count-Min Sketch to bound memory usage per window.

What parsing strategy handles heterogeneous log formats from dozens of services reliably?

Mandate structured logging (JSON) at the application level as the primary contract: services emit a canonical envelope with required fields (timestamp, service, level, trace_id) plus an arbitrary "fields" map. For legacy services emitting unstructured text, maintain a per-service grok pattern library in a config store. The parser stage looks up the service name in a routing table, selects the appropriate parser (JSON fast path or grok pattern), and emits a normalized event. Invalid or unparseable events are routed to a dead-letter queue rather than dropped, and a counter metric tracks parse failure rate per service so teams are alerted when their log format breaks. Validate mandatory fields at parse time and enrich with Kubernetes pod metadata via the node agent.

How do you implement storage tiering to balance query speed against retention cost?

Define three tiers. Hot: recent logs (0–7 days) stored in Elasticsearch or ClickHouse with full-text indexes for sub-second queries, sized for your peak query concurrency. Warm: 7–30 days stored in columnar format (Parquet on S3 or GCS) with a query engine like Trino or Athena; queries take seconds to minutes but storage cost is 10–20x cheaper than hot. Cold: 30 days to 2 years stored as compressed Parquet in object storage with a lifecycle policy; queries are ad hoc and slow. Automate tier transitions with a compaction job that reads from Elasticsearch, writes Parquet to S3, and deletes the Elasticsearch index after verifying the S3 write. Maintain a metadata catalog (Glue or Hive Metastore) so the query layer can route to the correct tier based on the query's time range.
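
The sliding-window error-rate computation described in the FAQ can be sketched in batch form. A stream engine (Flink, Kafka Streams) maintains these windows incrementally with watermarks; this illustrative version just shows the windowing arithmetic over a finished list of (timestamp, service, level) tuples, with hypothetical parameter names.

```python
import collections

def error_rates(events, window=300, slide=30):
    """Per-service error rate over a sliding window of `window` seconds
    advancing every `slide` seconds. Returns {(service, window_start):
    rate}. Batch sketch of what a stream engine does incrementally."""
    if not events:
        return {}
    start = min(t for t, _, _ in events)
    end = max(t for t, _, _ in events)
    out = {}
    w_start = start
    while w_start <= end:
        w_end = w_start + window
        counts = collections.defaultdict(lambda: [0, 0])  # svc -> [errors, total]
        for t, svc, level in events:
            if w_start <= t < w_end:
                counts[svc][1] += 1
                if level == "ERROR":
                    counts[svc][0] += 1
        for svc, (err, total) in counts.items():
            out[(svc, w_start)] = err / total
        w_start += slide
    return out
```

Overlapping windows mean each event is counted in window / slide windows, which is why streaming engines keep incremental per-pane state instead of rescanning events.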
