A centralized logging system collects, indexes, and makes searchable the logs from hundreds or thousands of microservices. Without it, debugging a production issue requires SSH-ing into individual servers and grepping through files — impossible at scale. This guide covers the architecture of a production logging pipeline from collection to search, with a focus on cost optimization — the dominant concern at scale, where logging infrastructure often costs more than the application itself.
Structured Logging: The Foundation
Unstructured logs ("User 123 placed order 456 at 2026-04-20T10:30:00Z") require regex parsing to extract fields. At scale, parsing millions of logs per second with regex is CPU-intensive and error-prone. Structured logging emits logs as JSON with consistent field names: {"timestamp": "2026-04-20T10:30:00Z", "level": "info", "service": "order-service", "trace_id": "abc123", "user_id": 123, "order_id": 456, "message": "Order placed", "duration_ms": 45}. Benefits: (1) No parsing needed — the log collector forwards JSON directly to the indexer. (2) Consistent field names enable efficient indexing and filtering: service:"order-service" AND level:"error". (3) Correlation — trace_id links logs across services for distributed tracing. (4) Machine-readable — automated alerting and anomaly detection can process structured fields without regex. Every log entry must include: timestamp (ISO 8601), level (debug/info/warn/error), service name, trace_id (for request correlation), and message. Additional context fields (user_id, order_id, duration_ms) make logs more useful for debugging.
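An entry like the one above can be produced with a small JSON formatter on top of Python's standard logging module. This is a minimal sketch: the field names follow the example above, and in a real service trace_id would be pulled from request context rather than passed manually.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with the required fields."""
    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "trace_id": getattr(record, "trace_id", None),  # set via extra=
            "message": record.getMessage(),
        }
        # Optional context fields (user_id, order_id, duration_ms, ...)
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler(sys.stdout)  # containers log to stdout
handler.setFormatter(JsonFormatter("order-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Order placed", extra={
    "trace_id": "abc123",
    "context": {"user_id": 123, "order_id": 456, "duration_ms": 45},
})
```

Because the output goes to stdout as one JSON object per line, the collection pipeline described below can forward it without any parsing rules.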
Log Collection Pipeline
Collection architecture: (1) Application writes logs to stdout (container best practice). The container runtime captures stdout and writes to a log file on the node (/var/log/containers/). (2) A log collector agent runs as a DaemonSet on each Kubernetes node. Fluent Bit (lightweight, C-based) or Fluentd (feature-rich, Ruby-based) tails the container log files, parses JSON, enriches with Kubernetes metadata (pod name, namespace, node, labels), and forwards to the log aggregator. (3) A buffer/aggregator (Kafka or Fluentd aggregator) absorbs traffic spikes and provides durability. Without a buffer, a logging backend outage causes log loss. Kafka retains logs for hours/days, allowing the backend to catch up. (4) The log indexer (Elasticsearch, Loki, or a cloud service) receives logs, indexes them, and makes them searchable. Why Fluent Bit over Fluentd: Fluent Bit uses 10-50 MB of memory per node vs 100-500 MB for Fluentd. For a 1000-node cluster, this saves significant resources. Use Fluent Bit for collection and Fluentd for aggregation/routing (if complex routing rules are needed).
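Steps 2-3 of the pipeline can be sketched as a Fluent Bit configuration along these lines. This is illustrative, not a drop-in file: the parser name depends on your container runtime, and the Kafka broker address and topic name are assumptions.

```ini
[INPUT]
    Name           tail
    Path           /var/log/containers/*.log
    Parser         cri              # or 'docker', depending on runtime
    Tag            kube.*
    Mem_Buf_Limit  16MB             # bound per-node memory use

[FILTER]
    Name           kubernetes       # enrich with pod, namespace, labels
    Match          kube.*
    Merge_Log      On               # lift the JSON log body into fields

[OUTPUT]
    Name           kafka
    Match          kube.*
    Brokers        kafka.logging.svc:9092   # illustrative address
    Topics         container-logs
```

Keeping the per-node agent this simple — tail, enrich, forward — and pushing routing and filtering logic to the aggregation tier is what makes the DaemonSet footprint small.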
Elasticsearch vs Grafana Loki
Elasticsearch: indexes the full content of every log line using an inverted index. Enables fast full-text search across any field (“find all logs containing timeout error”). Powerful but expensive: the inverted index consumes significant CPU, memory, and storage. A cluster ingesting 1 TB of logs per day typically needs 10-20 nodes with 64 GB RAM each. Storage: the index can be larger than the raw log data. Grafana Loki: indexes only labels (service name, namespace, pod, log level), not log content. Log content is stored as compressed chunks in object storage (S3). Queries filter by labels first (fast, indexed), then grep through matching chunks (slower for broad text searches). Cost: 10-100x cheaper than Elasticsearch because object storage is cheap and there is no content indexing overhead. Trade-off: Loki is slower for broad text searches across large time ranges. Searching for a specific error message across all services for 30 days is fast in Elasticsearch (indexed) but slow in Loki (must scan chunks). Loki excels when you know the service and time range — the most common debugging pattern. Decision: use Elasticsearch when full-text search across all logs is critical (security analysis, compliance auditing). Use Loki when cost is a concern and most queries filter by service + time range (the typical debugging workflow).
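The two query models can be illustrated with roughly equivalent queries (sketches; the label and field names are carried over from the structured-logging example above). Elasticsearch resolves everything against the inverted index; Loki narrows by indexed labels first and then greps the matching chunks.

```
# Loki (LogQL): label filters are indexed, the |= content match is a
# grep over the chunks that survive the label filter
{service="order-service", level="error"} |= "timeout"

# Elasticsearch (Lucene query string): every field, including the
# message content, is resolved from the inverted index
service:"order-service" AND level:"error" AND message:timeout
```

For a narrow time range the two perform similarly; the difference appears when the content match must scan weeks of chunks across many services.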
Log Retention and Cost Optimization
Log storage is the largest cost in observability. A medium-sized company generating 5 TB of logs per day stores 150 TB per month. At typical Elasticsearch prices, indexing and retaining that volume in full is prohibitively expensive. Cost optimization strategies: (1) Log level filtering — drop debug and trace logs in production. Only forward info, warn, and error. Debug logs are 50-80% of total volume. Filter at the collection agent (Fluent Bit) before sending to the backend. (2) Sampling — for high-volume, low-value logs (health check access logs, heartbeat logs), sample 10% instead of keeping all. (3) Tiered retention — keep full-resolution logs for 7 days in hot storage (Elasticsearch/Loki with SSD). Move to warm storage (cheaper nodes, compressed) for 30 days. Archive to S3 for 1 year (for compliance). Delete after the compliance period. (4) Index lifecycle management (ILM) — Elasticsearch ILM automates the hot -> warm -> cold -> delete transitions based on index age. (5) Field reduction — do not index fields that are never searched. Stack traces are valuable in error logs but do not need to be indexed as individual terms. Store as a non-indexed field. (6) Compression — Loki compresses log chunks 5-10x. Elasticsearch uses LZ4 compression on stored fields. Both reduce storage significantly.
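The tiered retention in (3) and the automation in (4) can be expressed as a single Elasticsearch ILM policy, sketched below. The ages mirror the 7/30/365-day tiers above; the rollover sizes and the snapshot repository name are assumptions, and the cold-phase searchable_snapshot action requires a license tier that supports it (allocating to cheaper nodes is the alternative).

```json
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "s3-archive" }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Once attached to an index template, the policy moves each daily index through the tiers with no operator involvement.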
Log-Based Alerting and Anomaly Detection
Logs complement metrics for alerting: metrics tell you something is wrong (error rate increased); logs tell you what is wrong (the specific error message and context). Log-based alerts: (1) Error rate from logs — count error-level logs per service per minute. Alert if the count exceeds a threshold or deviates from the baseline. (2) Specific error patterns — alert when a log matches a critical pattern: “database connection refused”, “out of memory”, “certificate expired”. (3) Absence detection — alert if a specific log is NOT seen within a time window. Example: a scheduled job should log “daily report completed” every day. Alert if the log is missing for 25 hours. Implementation: Elasticsearch Watcher or Loki alerting rules (LogQL queries evaluated periodically). Anomaly detection: ML-based log analysis detects unusual patterns: (1) Log volume anomalies — a service suddenly emitting 10x its normal log volume indicates a problem (error loop, verbose logging accidentally enabled). (2) New error types — an error message that has never been seen before may indicate a new bug. Tools: Elastic ML (built into Elasticsearch), or custom anomaly detection on log metrics exported to Prometheus.
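The error-rate alert in (1) and the absence detection in (3) can be written as Loki ruler rules — Prometheus-style rule files whose expressions are LogQL, evaluated periodically. This is a sketch: the service labels and thresholds are assumptions, and absent_over_time is used for the absence case because a plain count returns no samples at all (rather than zero) when nothing matches.

```yaml
groups:
  - name: log-alerts
    rules:
      # (1) Error rate: more than 10 error-level logs/sec over 5 minutes
      - alert: HighErrorLogRate
        expr: sum(rate({service="order-service"} | json | level="error" [5m])) > 10
        for: 5m
        labels:
          severity: page

      # (3) Absence detection: no "daily report completed" log in 25 hours
      - alert: DailyReportMissing
        expr: absent_over_time({service="report-job"} |= "daily report completed" [25h])
        labels:
          severity: ticket
```

Because these rules run inside the logging backend, they catch failure modes — like a job that silently stops running — that metric-based alerts on the application itself cannot.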
{"@context":"https://schema.org","@type":"FAQPage","mainEntity":[{"@type":"Question","name":"Why is structured logging important and what fields should every log include?","acceptedAnswer":{"@type":"Answer","text":"Structured logging emits logs as JSON with consistent field names instead of unstructured text strings. Benefits: no regex parsing needed (the log collector forwards JSON directly), consistent fields enable efficient filtering (service:order-service AND level:error), trace_id links logs across services for distributed tracing, and automated systems can process structured fields without regex. Every log entry must include: timestamp (ISO 8601), level (debug/info/warn/error), service name, trace_id (for cross-service request correlation), and message. Additional context fields (user_id, order_id, duration_ms) make debugging faster. Without structured logging, finding all error logs for a specific user across 50 microservices requires different regex patterns for each service log format."}},{"@type":"Question","name":"When should you use Elasticsearch versus Grafana Loki for logs?","acceptedAnswer":{"@type":"Answer","text":"Elasticsearch: indexes full log content using an inverted index. Enables fast full-text search across any field. Powerful but expensive: a cluster ingesting 1 TB/day typically needs 10-20 nodes with 64 GB RAM each. The index can be larger than raw log data. Use when: full-text search across all logs is critical (security analysis, compliance auditing). Grafana Loki: indexes only labels (service, namespace, level), not content. Log content is compressed in S3. Queries filter by labels first (fast), then grep content (slower for broad searches). Cost: 10-100x cheaper than Elasticsearch. Use when: most queries filter by service + time range (typical debugging), and cost is a concern. The trade-off is clear: Elasticsearch provides Google-like search over all logs. Loki provides grep over a well-filtered subset. For most operational debugging (I know the service and approximate time), Loki is sufficient and dramatically cheaper."}},{"@type":"Question","name":"How do you optimize logging costs at scale?","acceptedAnswer":{"@type":"Answer","text":"Log storage dominates observability costs. For 5 TB/day: 150 TB/month. Optimization strategies: (1) Drop debug/trace logs in production — they are 50-80% of volume. Filter at the collection agent (Fluent Bit) before reaching the backend. (2) Sample high-volume, low-value logs — health check access logs at 10% sampling instead of 100%. (3) Tiered retention — full resolution for 7 days (hot, SSD), compressed for 30 days (warm), archived to S3 for 1 year (cold, compliance), then delete. (4) Index lifecycle management (ILM) — Elasticsearch automates hot->warm->cold->delete transitions. (5) Field reduction — do not index fields never searched. Stack traces do not need per-word indexing. (6) Compression — Loki compresses 5-10x. These combined typically reduce costs by 5-10x compared to indexing every log at full resolution indefinitely."}},{"@type":"Question","name":"How does the log collection pipeline work in Kubernetes?","acceptedAnswer":{"@type":"Answer","text":"Pipeline: (1) Applications write to stdout (container best practice). The container runtime captures stdout to /var/log/containers/ on the node. (2) Fluent Bit runs as a DaemonSet on each node. It tails container log files, parses JSON, enriches with Kubernetes metadata (pod name, namespace, node, labels), and forwards to the aggregator. Fluent Bit uses 10-50 MB RAM per node (vs 100-500 MB for Fluentd). (3) Kafka acts as a buffer between collectors and the indexer. Absorbs traffic spikes and provides durability — if the logging backend is down, Kafka retains logs until it recovers. (4) The indexer (Elasticsearch or Loki) receives logs, indexes them, and serves search queries. Without Kafka buffer: a logging backend outage causes log loss. With Kafka: logs are retained for hours/days, and the backend processes the backlog when it recovers."}}]}