Low Level Design: Centralized Logging System

Log Ingestion Pipeline

Every host in the fleet runs a lightweight log shipping agent. Filebeat or Fluent Bit tails log files from disk (or reads from stdout via the container runtime) and forwards records to an aggregator tier. The aggregator—typically another Fluent Bit or Logstash instance—buffers incoming records in memory or on disk, batches them, and writes to Kafka. Kafka is the durable transport layer between producers and the indexing backend: it absorbs write spikes, provides replay capability if the indexer falls behind, and decouples producers from consumers. Backpressure is applied upstream: if the Kafka topic’s consumer lag exceeds a threshold, the aggregator slows its flush rate and the agent pauses tailing until the buffer drains. This prevents a slow indexer from causing log loss at the source.
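The backpressure behavior above can be sketched as a small control loop. This is a minimal illustration, not the API of any shipping agent: the lag threshold, flush intervals, and class names are all assumptions chosen for clarity.

```python
from collections import deque

class Aggregator:
    """Minimal sketch of the aggregator's backpressure behavior.
    LAG_THRESHOLD and the flush intervals are illustrative values,
    not recommendations from any specific tool."""
    LAG_THRESHOLD = 100_000   # max acceptable downstream consumer lag (records)
    NORMAL_FLUSH = 1.0        # seconds between flushes under normal load
    THROTTLED_FLUSH = 5.0     # slowed flush rate while the indexer catches up

    def __init__(self):
        self.buffer = deque()
        self.paused = False   # signals the agent to pause tailing

    def flush_interval(self, consumer_lag: int) -> float:
        # Slow the flush rate when downstream lag crosses the threshold.
        if consumer_lag > self.LAG_THRESHOLD:
            return self.THROTTLED_FLUSH
        return self.NORMAL_FLUSH

    def on_lag_report(self, consumer_lag: int) -> None:
        # Pause tailing until the buffer drains, rather than dropping records.
        self.paused = consumer_lag > self.LAG_THRESHOLD
```

The key design point is that pressure propagates backward (indexer → Kafka lag → aggregator → agent) instead of records being dropped mid-pipeline.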

Structured Log Format

Unstructured free-text logs are operationally expensive to query. Every service must emit JSON logs with a mandatory set of fields: timestamp (RFC3339 with millisecond precision), level (DEBUG/INFO/WARN/ERROR/FATAL normalized to uppercase), service (canonical service name from deployment config), trace_id, span_id, host, and message. The trace_id is the most operationally important field: it allows you to jump from a trace in your APM tool directly to all log lines emitted during that request across every service. Additional structured fields—user_id, order_id, HTTP status code—should be top-level keys rather than buried in the message string. Structured fields enable faceted search ("all ERROR logs for service=checkout where status_code=500") and cardinality-safe aggregations in Elasticsearch.
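A helper that emits the mandatory envelope described above might look like the following sketch. The function name and signature are illustrative; in practice this lives inside each language's logging library configuration.

```python
import json
import datetime

def make_log_record(level, service, message, trace_id, span_id, host, **fields):
    """Sketch of the mandatory structured-log envelope. Additional
    structured fields (user_id, status_code, ...) are promoted to
    top-level keys rather than buried in the message string."""
    record = {
        # RFC3339 timestamp with millisecond precision
        "timestamp": datetime.datetime.now(datetime.timezone.utc)
                         .isoformat(timespec="milliseconds"),
        "level": level.upper(),   # normalize DEBUG/INFO/WARN/ERROR/FATAL
        "service": service,
        "trace_id": trace_id,
        "span_id": span_id,
        "host": host,
        "message": message,
    }
    record.update(fields)         # extra fields stay top-level and queryable
    return json.dumps(record)
```

Because `status_code` is a top-level key rather than interpolated into `message`, the faceted search in the text ("all ERROR logs for service=checkout where status_code=500") becomes an exact-match filter instead of a full-text scan.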

Log Parsing and Enrichment

Not all logs arrive pre-structured. Legacy services, third-party libraries, and system logs emit free-text lines. A stream processing layer—Logstash or Apache Flink—consumes raw records from Kafka, applies Grok patterns to extract fields from unstructured messages (e.g., parsing nginx access log format into method, path, status, bytes, duration), and normalizes log levels across language ecosystems where conventions differ (Python’s WARNING vs Java’s WARN). After parsing, enrichment adds context that the service itself doesn’t know: geographic region from host labels, team ownership from a service catalog lookup, environment tag (prod/staging) from the Kafka topic name. Enriched records are written back to a separate Kafka topic for consumption by the indexer. Separating parsing from indexing lets you replay raw logs through an updated parser without re-ingesting from sources.
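The parse-normalize-enrich steps above can be sketched as three small functions. The regex covers only the common prefix of the nginx combined log format, and the level-alias table and catalog lookup are illustrative, not an exhaustive mapping.

```python
import re

# Prefix of the nginx combined log format; named groups match the
# fields discussed above (method, path, status, bytes).
NGINX = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

# Per-ecosystem level aliases (illustrative, not exhaustive):
# Python's WARNING vs Java's WARN, syslog's ERR, etc.
LEVEL_MAP = {"WARNING": "WARN", "ERR": "ERROR", "CRITICAL": "FATAL"}

def parse_nginx(line: str):
    m = NGINX.match(line)
    if m is None:
        return None  # leave unparsed lines for a fallback pipeline
    d = m.groupdict()
    d["status"] = int(d["status"])
    d["bytes"] = int(d["bytes"])
    return d

def normalize_level(level: str) -> str:
    level = level.upper()
    return LEVEL_MAP.get(level, level)

def enrich(record: dict, catalog: dict, topic: str) -> dict:
    # Add context the emitting service doesn't know: ownership from a
    # service-catalog lookup, environment from the Kafka topic name.
    record["team"] = catalog.get(record.get("service"), "unknown")
    record["environment"] = "prod" if topic.endswith(".prod") else "staging"
    return record
```

In a real pipeline these run inside Logstash filters or a Flink job; the sketch only shows the data transformation each stage performs.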

Indexing with Elasticsearch

The indexing layer writes enriched records into Elasticsearch. The standard index naming convention is logs-{service}-{YYYY.MM.dd}: one index per service per day. An Index Lifecycle Management (ILM) policy controls rollover at 50 GB or 24 hours (whichever comes first) and manages the hot-warm-cold tier migration automatically. The hot tier runs on NVMe SSDs with high I/O and holds recent data. Elasticsearch’s inverted index on the message field enables full-text search across billions of log lines. Structured fields (level, service, trace_id) are indexed as keyword type for exact-match filtering and aggregations. Avoid indexing high-cardinality fields like raw user IDs as text—it bloats the index and slows queries. Use keyword with doc values for those instead.
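The naming convention and field-type choices above can be made concrete. The mapping below is a partial sketch in Elasticsearch mapping syntax (expressed as a Python dict), not a complete production mapping.

```python
import datetime

def index_name(service: str, day: datetime.date) -> str:
    # logs-{service}-{YYYY.MM.dd}: one index per service per day.
    return f"logs-{service}-{day:%Y.%m.%d}"

# Partial field mapping: keyword for exact-match/aggregation fields,
# text only for the free-text message field.
MAPPING = {
    "properties": {
        "timestamp": {"type": "date"},
        "level":     {"type": "keyword"},
        "service":   {"type": "keyword"},
        "trace_id":  {"type": "keyword"},
        "user_id":   {"type": "keyword"},  # high cardinality: keyword + doc values, not text
        "message":   {"type": "text"},     # inverted index for full-text search
    }
}
```

Keyword fields skip analysis entirely, which is what makes exact-match filters and terms aggregations cheap compared to analyzed text.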

Search and Query Interface

Kibana provides the primary query interface. Engineers write KQL (Kibana Query Language) queries to filter by service, level, time range, and structured fields. Common debugging patterns—"all errors for trace_id X", "last 100 lines for pod Y"—are saved as named searches and shared via team wiki. Grafana Loki is an alternative for teams already using Grafana: LogQL is similar to PromQL and integrates naturally with metric dashboards, making it easy to jump from a latency spike panel directly to correlated logs. Critically, every alert notification should include a pre-built Kibana or Loki URL with the relevant trace_id, time range, and service filters already applied so the on-call engineer lands directly in context instead of starting from scratch.
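Embedding a pre-filtered link in an alert payload can be sketched as below. Note the hedge: real Kibana Discover URLs encode their filter state in rison, so the `base_url` here stands in for a hypothetical saved-search endpoint; the principle (filters pre-applied, engineer lands in context) is the same.

```python
from urllib.parse import urlencode

def alert_log_link(base_url: str, service: str, trace_id: str,
                   start: str, end: str) -> str:
    """Build a deep link for an alert notification with the service,
    trace_id, and time range already applied. base_url is a hypothetical
    endpoint; adapt the parameter encoding to your Kibana/Loki setup."""
    params = {
        "query": f'service:"{service}" and trace_id:"{trace_id}"',  # KQL-style filter
        "from": start,
        "to": end,
    }
    return f"{base_url}?{urlencode(params)}"
```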

Sampling and Volume Control

High-volume services can generate terabytes of debug logs per day. Indexing everything is expensive and the signal-to-noise ratio at DEBUG level is low. The solution is tiered sampling: always capture ERROR and above without sampling; sample WARN at 10%; sample INFO at 50% for most services; sample DEBUG at 1% or disable entirely in production. The sampling decision is made at the agent level—before records enter the network—to avoid paying ingestion and transport costs for logs that will be dropped later. Importantly, sampling should be coordinated with distributed tracing: if a trace is sampled for recording (head-based decision), all log records for that trace_id should bypass sampling and be fully captured. This ensures that sampled traces have complete correlated logs.
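The tiered sampling policy above, including the trace-coordination bypass, fits in a few lines. The rates mirror the text; the function name and the injectable random source are illustrative.

```python
import random

# Per-level sampling rates from the tiering described above.
SAMPLE_RATES = {"FATAL": 1.0, "ERROR": 1.0, "WARN": 0.10, "INFO": 0.50, "DEBUG": 0.01}

def should_ship(level: str, trace_sampled: bool, rng=random.random) -> bool:
    """Agent-side sampling decision. If this request's trace was sampled
    (head-based decision, propagated with the trace context), all of its
    log records bypass sampling so sampled traces keep complete logs."""
    if trace_sampled:
        return True
    return rng() < SAMPLE_RATES.get(level.upper(), 1.0)
```

Running this in the agent, before the record leaves the host, is what avoids paying transport and ingestion costs for logs that would be dropped anyway.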

Retention and Tiering

Retention policy is a cost and compliance decision. A practical default tiering: hot tier on SSD for 7 days (fast query, high cost); warm tier on HDD for 30 days (moderate query speed, lower cost, ILM migrates shards automatically); cold tier on object storage (S3 or GCS) as frozen/searchable snapshots for 90 days; delete after 1 year. Audit logs for regulated industries (authentication events, admin actions, payment records) require longer retention—typically 7 years—and must be stored immutably. Implement object storage versioning and S3 Object Lock for audit log buckets to prevent tampering. Cost optimization: compress indices before warm tier migration (best_compression codec); use force merge to reduce segment count on cold data.
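The tiering above maps directly onto an ILM policy. The sketch below (expressed as a Python dict in Elasticsearch's ILM policy shape) is a partial illustration; the snapshot repository name is hypothetical and the exact action set should be adapted to your cluster.

```python
# Sketch of an ILM policy for the hot/warm/cold/delete tiering above.
# "logs-s3-repo" is a hypothetical snapshot repository name.
ILM_POLICY = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over at 50 GB or 24 hours, whichever comes first.
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "24h"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "forcemerge": {"max_num_segments": 1},      # fewer segments on read-mostly data
                    "allocate": {"require": {"data": "warm"}}   # migrate shards to HDD nodes
                }
            },
            "cold": {
                "min_age": "30d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "logs-s3-repo"}
                }
            },
            "delete": {
                "min_age": "365d",
                "actions": {"delete": {}}
            }
        }
    }
}
```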

Alerting on Log Patterns

Log-based alerting complements metric-based alerting and catches errors that don’t surface in metrics. Elasticsearch Watcher or the open-source ElastAlert monitors for: error rate per service exceeding a threshold (count of level=ERROR in last 5 minutes > N); regex match on critical patterns like "OutOfMemoryError" or "deadlock detected"; spike detection where error volume increases by 3x relative to the previous hour baseline. Alerts route to PagerDuty for critical patterns requiring immediate response and to Slack for warnings. Every alert payload must include the service name, matched pattern, count, and a direct Kibana link pre-filtered to the relevant time window and service—without this, on-call engineers waste minutes reconstructing context during incidents.
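The 3x spike rule above needs one refinement in practice: a minimum-count floor, so a baseline of 1 error jumping to 3 does not page anyone. A minimal sketch, with illustrative defaults:

```python
def is_spike(current_count: int, baseline_count: int,
             factor: float = 3.0, min_count: int = 50) -> bool:
    """Sketch of the 3x spike rule: compare the current window's error
    count to the previous hour's baseline. min_count is an illustrative
    floor that keeps tiny baselines from triggering pages."""
    if current_count < min_count:
        return False
    # max(..., 1) guards against a zero baseline (e.g. a new service).
    return current_count >= factor * max(baseline_count, 1)
```

Watcher and ElastAlert express this as query aggregations rather than application code, but the decision logic they evaluate is the same.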
