Observability is the ability to understand the internal state of a system by examining its external outputs. The three pillars of observability — metrics, logs, and traces — provide complementary views of system behavior. This guide covers the architecture of a production observability stack, from instrumentation with OpenTelemetry to storage with Prometheus and Elasticsearch, to visualization with Grafana — essential knowledge for SRE and system design interviews.
The Three Pillars of Observability
Metrics are numeric measurements over time: request rate, error rate, latency percentiles, CPU utilization, memory usage. Metrics are aggregated and stored as time-series data. They answer: "what is happening?" and "is the system healthy?" Metrics are cheap to collect and query, making them ideal for dashboards and alerting.

Logs are timestamped, structured records of discrete events: a request was received, an error occurred, a user logged in. Logs provide context that metrics cannot: the specific error message, the request parameters that caused the failure, the stack trace. Logs answer: "why did this happen?" Logs are expensive to store (high volume, high cardinality) but essential for debugging.

Traces follow a single request as it traverses multiple services. A trace contains spans — each span represents one operation (an HTTP call, a database query, a function execution). Traces answer: "where is the bottleneck?" and "which service is slow?" Traces are the most expensive to collect but the most powerful for debugging distributed systems.
Metrics with Prometheus
Prometheus is the standard open-source metrics system for cloud-native applications. Architecture: (1) Instrumentation — applications expose metrics at an HTTP endpoint (/metrics) in the Prometheus exposition format. Client libraries (Go, Java, Python, Node.js) provide counters, gauges, histograms, and summaries. (2) Scraping — the Prometheus server pulls metrics from targets at a configured interval (typically 15-30 seconds). Service discovery (Kubernetes, Consul, DNS) automatically finds scrape targets. (3) Storage — Prometheus stores time-series data in a local time-series database (TSDB). Each time series is identified by a metric name and a set of labels: http_requests_total{method="GET", status="200", service="api"}. (4) Querying — PromQL (Prometheus Query Language) enables powerful queries: rate(http_requests_total[5m]) computes the per-second rate over a 5-minute window, and histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) computes the P99 latency. Retention: Prometheus typically keeps 15-30 days of data locally; for long-term storage, use Thanos or Cortex to write to object storage (S3).
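To make the exposition format in step (1) concrete, here is a hand-rolled, stdlib-only sketch of what a /metrics response looks like for a labeled counter. In production you would use an official client library (e.g. prometheus_client for Python) rather than rendering the format yourself; the metric and label names here are illustrative.

```python
# Hand-rolled sketch of the Prometheus text exposition format (stdlib only).
# A real service would use an official client library instead.
from collections import Counter as MultiSet

counters = MultiSet()  # (method, status) -> count

def observe_request(method: str, status: str) -> None:
    """Increment the counter for one handled request."""
    counters[(method, status)] += 1

def render_metrics() -> str:
    """Render counters in the Prometheus text exposition format,
    as served from a /metrics endpoint."""
    lines = [
        "# HELP http_requests_total Total HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counters.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"

observe_request("GET", "200")
observe_request("GET", "200")
observe_request("POST", "500")
print(render_metrics())
```

Prometheus scrapes this text output, parses each line into a time series keyed by the metric name and label set, and appends the sample to its TSDB.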
Logging with the ELK Stack
The ELK stack (Elasticsearch, Logstash, Kibana) is the most widely deployed logging solution; the modern variant is EFK (Elasticsearch, Fluentd/Fluent Bit, Kibana). Architecture: (1) Collection — Fluent Bit runs as a DaemonSet on each Kubernetes node, tailing container log files from /var/log/containers/. It parses structured logs (JSON), adds metadata (pod name, namespace, node), and forwards to Elasticsearch. (2) Processing — Fluentd (or Logstash) optionally sits between collectors and Elasticsearch for advanced parsing, filtering, and routing: drop debug logs in production, parse unstructured logs into fields, route security logs to a separate index. (3) Storage — Elasticsearch indexes log documents for full-text search. Use index lifecycle management (ILM) to automatically roll over indexes daily and delete indexes older than 30 days. (4) Querying — Kibana provides a search UI, log stream view, and dashboards. KQL (Kibana Query Language) enables filtering: kubernetes.namespace: "production" AND level: "error" AND message: "timeout". Structured logging is critical: emit logs as JSON with consistent field names (timestamp, level, service, trace_id, message). This enables efficient indexing and querying without regex parsing.
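A structured-logging formatter that emits the consistent field names above can be sketched with the standard library alone. This is a minimal illustration, not a production formatter: the field names follow the convention in the text, and in a real service the trace_id would be injected from the active span's context rather than passed manually.

```python
# Structured JSON logging sketch (stdlib only). Every record carries the
# consistent field names the collectors index: timestamp, level, service,
# trace_id, message. Names are the convention described in the text.
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "service": self.service,
            # In real code the trace_id comes from the active span context
            # (MDC in Java, contextvars in Python); here it is passed via
            # the `extra` dict for illustration.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="api"))
logger.addHandler(handler)
logger.error("upstream timeout",
             extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

Because each line is a single JSON object, Fluent Bit can parse it without regexes and Elasticsearch (or Loki) can index the fields directly.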
Distributed Tracing with OpenTelemetry
OpenTelemetry (OTel) is the CNCF standard for telemetry collection, providing a single set of APIs, SDKs, and tools for metrics, logs, and traces. Tracing architecture: (1) Instrumentation — the OTel SDK creates spans for each operation. Auto-instrumentation libraries automatically create spans for HTTP clients/servers, database queries, gRPC calls, and message queue operations. Manual instrumentation wraps custom business logic in spans. (2) Context propagation — the trace_id and span_id are propagated across service boundaries via HTTP headers (W3C Trace Context: traceparent header). This links spans from different services into a single trace. (3) Export — the OTel SDK exports spans to the OTel Collector via OTLP (OpenTelemetry Protocol). The Collector processes spans (sampling, filtering, enrichment) and exports to a backend: Jaeger, Grafana Tempo, Datadog, or Honeycomb. (4) Visualization — the trace backend provides a waterfall view showing all spans in a trace with their durations, enabling identification of the slowest span (bottleneck). Sampling: at high traffic volumes, collecting 100% of traces is expensive. Head-based sampling (decide at trace start) or tail-based sampling (decide after trace completes, keeping errors and slow traces) reduces volume while retaining actionable data.
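The context-propagation step (2) can be illustrated in plain Python: a version-00 W3C traceparent header is `version-traceid-spanid-flags`, with a 32-hex-character trace_id and 16-hex-character span_id. In practice the OTel SDK's propagators build and parse this header automatically; this stdlib-only sketch just shows what is on the wire.

```python
# Sketch of W3C Trace Context propagation (the traceparent header).
# Real services let the OpenTelemetry SDK's propagators do this.
import re
import secrets
from typing import Optional

def make_traceparent(trace_id: Optional[str] = None,
                     span_id: Optional[str] = None,
                     sampled: bool = True) -> str:
    """Build a version-00 traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 8 bytes  -> 16 hex chars
    flags = "01" if sampled else "00"             # bit 0 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Extract (trace_id, parent_span_id, sampled) from an incoming header,
    or None if the header is malformed."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, bool(int(flags, 16) & 0x01)

# A downstream service reuses the caller's trace_id and starts a child span,
# which is what links spans from different services into one trace:
header = make_traceparent()
trace_id, parent_span_id, sampled = parse_traceparent(header)
child_header = make_traceparent(trace_id=trace_id)
```

Every service that keeps the same trace_id while minting a fresh span_id contributes its spans to the same trace in the backend.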
Alerting Architecture
Alerting converts observability data into actionable notifications. Architecture: (1) Alert rules — defined in Prometheus as PromQL expressions with thresholds and durations. Example: alert ErrorRateTooHigh if rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01 for 5m. The "for 5m" clause prevents alerting on brief spikes. (2) Alertmanager — receives firing alerts from Prometheus and handles: deduplication (multiple Prometheus instances may fire the same alert), grouping (group related alerts into a single notification), silencing (suppress alerts during maintenance windows), routing (send critical alerts to PagerDuty, warnings to Slack). (3) Notification channels — PagerDuty (paging on-call engineers for critical alerts), Slack (team channels for warnings and informational alerts), email (for non-urgent notifications). Alerting best practices: alert on symptoms (user-visible impact like error rate, latency), not causes (CPU usage, memory — these are dashboard metrics, not alert triggers). Use SLO-based alerting: alert when the error budget burn rate threatens the monthly SLO, not on arbitrary thresholds.
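The ErrorRateTooHigh example above would live in a Prometheus rule file. The sketch below shows the standard rule-file shape; the group name, severity label, and annotation wording are illustrative choices, not fixed conventions.

```yaml
# prometheus-rules.yaml — sketch of the alert rule described above.
groups:
  - name: slo-alerts
    rules:
      - alert: ErrorRateTooHigh
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.01
        for: 5m                  # must hold for 5 minutes before firing
        labels:
          severity: critical     # Alertmanager routes on labels like this
        annotations:
          summary: "Error rate above 1% for service {{ $labels.service }}"
```

Alertmanager then matches on the severity label to route this alert to PagerDuty while warnings go to Slack.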
Grafana: Unified Observability Dashboard
Grafana is the visualization layer that unifies metrics, logs, and traces into a single UI. Data sources: Prometheus (metrics), Elasticsearch/Loki (logs), Tempo/Jaeger (traces), and 100+ other integrations. Key features: (1) Dashboard-as-code — dashboards are defined in JSON and stored in version control. Grafana provisioning automatically loads dashboards from a ConfigMap or file system on startup. (2) Explore mode — ad-hoc querying across data sources. Start with a metric anomaly, drill down to logs filtered by the anomalous time window, then jump to a trace to identify the root cause. (3) Correlations — link metrics to logs to traces. Click on a metric spike, see logs from that time window. Click on a log entry with a trace_id, jump to the full trace in Tempo. This metrics-to-logs-to-traces workflow is the golden path for incident investigation. (4) Grafana Loki — a log aggregation system designed for cost efficiency. Unlike Elasticsearch (which indexes log content), Loki stores log streams indexed only by labels (service, namespace, pod) and uses grep-like queries. It is 10-100x cheaper than Elasticsearch for log storage but slower for full-text search.
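The dashboard-as-code feature in (1) relies on a provisioning file that tells Grafana where to find dashboard JSON on disk. A sketch of such a provider definition, with illustrative names and paths (only the apiVersion/providers structure is Grafana's):

```yaml
# Grafana dashboard provisioning sketch, conventionally placed under
# /etc/grafana/provisioning/dashboards/. Names and paths are illustrative.
apiVersion: 1
providers:
  - name: service-dashboards
    folder: Services           # Grafana folder the dashboards appear in
    type: file
    allowUiUpdates: false      # dashboards are owned by version control
    options:
      path: /var/lib/grafana/dashboards   # mounted from a ConfigMap or repo
```

With allowUiUpdates disabled, edits made in the UI cannot drift from the JSON in version control, which keeps the dashboards reviewable like any other code.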
Designing an Observability Strategy
Practical strategy for a microservices architecture: (1) Instrument everything with OpenTelemetry. Use auto-instrumentation for HTTP, database, and messaging. Add manual instrumentation for critical business operations. (2) Define RED metrics for every service: Rate (requests per second), Errors (error rate), Duration (latency distribution). These are the primary dashboard and alerting metrics. (3) Define USE metrics for infrastructure: Utilization (CPU, memory, disk), Saturation (queue depth, thread pool usage), Errors (hardware errors, OOM kills). (4) Emit structured JSON logs with mandatory fields: timestamp, level, service, trace_id, and message. The trace_id enables log-to-trace correlation. (5) Sample traces at 100% for errors and slow requests (tail-based sampling), 1-10% for healthy requests. (6) Set up SLO-based alerting on the RED metrics. Alert on error budget burn rate, not raw thresholds. (7) Build a standard Grafana dashboard template that every service deploys: RED metrics, resource usage, deployment markers, and links to logs and traces.
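The sampling policy in step (5) maps onto the tail_sampling processor from the OpenTelemetry Collector contrib distribution. The sketch below keeps all error traces and slow traces while probabilistically sampling the healthy baseline; the wait time, latency threshold, and percentage are illustrative tuning choices.

```yaml
# OTel Collector tail-based sampling sketch (tail_sampling processor,
# opentelemetry-collector-contrib). Thresholds are illustrative.
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans until the trace completes
    policies:
      - name: keep-errors         # keep 100% of traces with an error status
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow           # keep 100% of traces slower than 500 ms
        type: latency
        latency:
          threshold_ms: 500
      - name: healthy-baseline    # sample 5% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Because the decision is made after the trace completes, the Collector must buffer spans for the decision_wait window, which is the memory cost that tail-based sampling trades for keeping every actionable trace.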
Frequently Asked Questions

What is the difference between Prometheus pull-based and push-based metrics collection?

Prometheus uses a pull model: the Prometheus server periodically scrapes (HTTP GET) metrics endpoints exposed by applications. The application exposes /metrics, and Prometheus fetches it every 15-30 seconds. Advantages of pull: (1) The Prometheus server controls the scrape rate — applications cannot overwhelm it by pushing too fast. (2) If an application is down, Prometheus detects it immediately (the scrape fails) — there is no ambiguity between an application being silent and being dead. (3) Applications do not need to know where to send metrics — they just expose an endpoint. Push-based systems (Graphite, InfluxDB, the Datadog Agent) require applications to actively send metrics to a collector. Advantages of push: it works for short-lived jobs (batch jobs that complete before Prometheus scrapes), through firewalls (the application pushes outbound, so no inbound port is needed), and for serverless functions. Prometheus supports push via the Pushgateway for short-lived jobs: the job pushes metrics to the Pushgateway, and Prometheus scrapes the Pushgateway. OpenTelemetry supports both models: the OTel Collector can scrape Prometheus endpoints and also receive pushed OTLP data.

How do you correlate metrics, logs, and traces for incident investigation?

Correlation is the golden path for incident investigation: metrics tell you something is wrong, logs explain what happened, and traces show where the bottleneck is. Implementation: (1) Include trace_id in all log entries. When the OTel SDK creates a span, it sets the trace_id in the logging context (MDC in Java, context variables in Python), and the structured JSON log includes trace_id as a field. (2) Add deployment annotations to Grafana dashboards. When you see a metric anomaly, the annotation shows whether a deployment occurred at that time. (3) Configure Grafana data source linking: link from a metric panel to Loki logs filtered by the same time window and service label, and from a log entry with a trace_id to Tempo or Jaeger to view the full trace. Investigation workflow: (1) An alert fires: the error-rate SLO burn rate exceeds its threshold. (2) Open the Grafana dashboard: identify the time window and affected service from the error rate graph. (3) Click through to Loki logs: filter by service and time window, and read the error messages and stack traces. (4) Find a trace_id in an error log and click through to Tempo: view the full request trace and identify the failing span (a database timeout, a downstream service returning 500). (5) Root cause identified in under five minutes.

How does Grafana Loki differ from Elasticsearch for log storage?

Elasticsearch indexes the full content of every log line, creating an inverted index that enables fast full-text search across any field. This is powerful but expensive: indexing requires significant CPU and memory, and the index itself can be larger than the raw log data. Elasticsearch logging infrastructure often costs 5-10x the application infrastructure it monitors. Grafana Loki takes a different approach: it indexes only the log labels (service name, namespace, pod name, log level) and stores the log content as compressed chunks in object storage (S3). Queries filter by labels first (fast, indexed), then grep through the matching log chunks (slower for full-text search) — like grep with an index on the filename but not the file content. Loki storage costs are 10-100x lower than Elasticsearch because object storage is cheap and there is no content-indexing overhead. The trade-off: Loki is slower for broad text searches across large time ranges. Searching for a specific error message across all services for the past 30 days is fast in Elasticsearch (indexed) but slow in Loki (it must scan all matching chunks). Loki excels when you know the service and time range and need to see recent logs — the most common debugging pattern.

What are the RED and USE methods for monitoring microservices?

RED and USE are complementary monitoring frameworks. RED (Rate, Errors, Duration) monitors the request-driven workload of each service. Rate: requests per second (throughput). Errors: the number or rate of failed requests (HTTP 5xx, gRPC errors). Duration: the distribution of request latency (P50, P95, P99). RED answers: is the service handling traffic? Is it failing? Is it slow? Every microservice should have a RED dashboard. RED was proposed by Tom Wilkie (Grafana) as the service-level counterpart of Google's Four Golden Signals (latency, traffic, errors, saturation). USE (Utilization, Saturation, Errors) monitors infrastructure resources. For each resource (CPU, memory, disk, network): Utilization is the percentage of the resource in use (e.g. CPU at 75%). Saturation is the degree to which the resource is overloaded (request queue length, number of threads waiting). Errors are hardware or software errors related to the resource (disk I/O errors, OOM kills). USE answers: is the infrastructure healthy? Is anything at capacity? USE was proposed by Brendan Gregg. In practice: use RED for application dashboards and alerting, and USE for infrastructure dashboards and capacity planning.