What an Access Log Service Provides
Every HTTP request to a web service generates a log record. At scale, this is millions of records per minute. The access log service must ingest this stream durably at high throughput, make logs searchable for recent debugging, archive efficiently for long-term compliance, and support real-time anomaly detection without becoming the bottleneck in the request path.
Log Schema
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2024-01-15T10:23:45.123Z",
  "user_id": "user:12345",
  "ip_address": "203.0.113.42",
  "method": "GET",
  "path": "/api/v1/products",
  "query_params": "category=electronics&page=2",
  "status_code": 200,
  "latency_ms": 47,
  "user_agent": "Mozilla/5.0 ...",
  "referer": "https://example.com/",
  "bytes_sent": 8192,
  "trace_id": "abc123def456",
  "service_name": "product-api"
}
All fields use consistent names across all services. Structured JSON format makes every field machine-parseable without regex extraction downstream.
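As a rough illustration of the schema, the sketch below builds one record and writes it as a single JSON line. The helper names build_access_log and emit, and their parameters, are hypothetical; only the field names come from the schema above.

```python
import json
import sys
import uuid
from datetime import datetime, timezone


def build_access_log(method, path, status_code, latency_ms, service_name,
                     user_id=None, ip_address=None, query_params="",
                     user_agent="", referer="", bytes_sent=0,
                     trace_id=None, now=None):
    """Return one access-log record using the shared field names."""
    # `now` must be a timezone-aware UTC datetime when supplied (assumption).
    now = now or datetime.now(timezone.utc)
    return {
        "request_id": str(uuid.uuid4()),
        # ISO 8601 with millisecond precision and a trailing "Z" for UTC.
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "user_id": user_id,
        "ip_address": ip_address,
        "method": method,
        "path": path,
        "query_params": query_params,
        "status_code": status_code,
        "latency_ms": latency_ms,
        "user_agent": user_agent,
        "referer": referer,
        "bytes_sent": bytes_sent,
        "trace_id": trace_id,
        "service_name": service_name,
    }


def emit(record, stream=sys.stdout):
    """Write the record as one compact JSON line so an agent can tail it."""
    stream.write(json.dumps(record, separators=(",", ":")) + "\n")
```

One record per line keeps the output trivially splittable by whatever agent tails it.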
Write Pipeline
The path from application to durable storage:
- Application: writes structured JSON log to stdout or a local Unix socket. No network call in the request path.
- Local log agent (Vector, Fluentd, Filebeat): tails stdout, buffers events in a local disk-backed ring buffer so they survive brief agent restarts, and batch-sends to Kafka every 1 second or 1,000 events, whichever comes first.
- Kafka topic access-logs: durable, partitioned by service_name for ordered processing per service. Retention: 7 days.
- Kafka consumers: fan out to parallel sinks (Elasticsearch for hot search, S3 for archival, Flink for real-time processing).
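The agent's flush rule (1 second or 1,000 events, whichever comes first) can be sketched as below. This is a minimal in-memory illustration, not how Vector, Fluentd, or Filebeat implement it; the Batcher class and the send callback are hypothetical, and a real send would hand the batch to a Kafka producer.

```python
import time


class Batcher:
    """Flush when max_events is reached or max_age_s has passed
    since the first event in the current batch was buffered."""

    def __init__(self, send, max_events=1000, max_age_s=1.0,
                 clock=time.monotonic):
        self._send = send          # hypothetical callback taking a list of events
        self._max_events = max_events
        self._max_age_s = max_age_s
        self._clock = clock        # injectable so the age rule can be tested
        self._buf = []
        self._first_at = None

    def add(self, event):
        if not self._buf:
            self._first_at = self._clock()
        self._buf.append(event)
        if len(self._buf) >= self._max_events:
            self.flush()

    def tick(self):
        """Call periodically (for example from a timer) to enforce the age limit."""
        if self._buf and self._clock() - self._first_at >= self._max_age_s:
            self.flush()

    def flush(self):
        if self._buf:
            batch, self._buf = self._buf, []
            self._send(batch)
```

Measuring age from the first buffered event bounds the delay any single event can experience to roughly max_age_s plus the tick interval.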
Batching Benefits
Writing 1,000 log events as a single compressed batch to Kafka costs roughly the same network round-trip as writing one event. Batching reduces:
- Write amplification on Kafka brokers
- Number of S3 PUT requests (S3 charges per request)
- CPU overhead from per-request framing and TLS record processing on many small writes
The local agent absorbs burst traffic in its ring buffer, smoothing the write rate to Kafka.
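To make the compression side of batching concrete, the sketch below compares gzip-compressing events one at a time against compressing them as one batch. The synthetic events and the compressed_sizes helper are illustrative assumptions; actual ratios depend on the real log content and on the codec the producer uses.

```python
import gzip
import json


def compressed_sizes(events):
    """Return (total bytes when each event is gzipped alone,
    bytes when all events are gzipped as one newline-joined batch)."""
    lines = [json.dumps(e).encode("utf-8") for e in events]
    per_event = sum(len(gzip.compress(line)) for line in lines)
    batched = len(gzip.compress(b"\n".join(lines)))
    return per_event, batched


# Synthetic, highly repetitive events (an assumption for illustration only).
events = [
    {"request_id": f"req-{i}", "method": "GET", "path": "/api/v1/products",
     "status_code": 200, "latency_ms": 40 + i % 20,
     "service_name": "product-api"}
    for i in range(1000)
]
per_event, batched = compressed_sizes(events)
print(f"per-event: {per_event} bytes, batched: {batched} bytes")
```

Log records share most of their keys and many of their values, so a compressor that sees the whole batch can exploit redundancy that is invisible when each event is compressed alone.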
S3 Archival for Long-Term Storage
Kafka consumers write to S3 in Parquet format, partitioned by time and service:
s3://logs/{service_name}/year=2024/month=01/day=15/hour=10/part-0001.parquet.gz
Parquet's columnar format enables Athena to scan only the columns needed for a query (e.g., only status_code and latency_ms), dramatically reducing scan costs compared to JSON. GZIP compression reduces storage and transfer costs by 5-10x over raw JSON.
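A consumer needs to derive the partition prefix from each record. The sketch below follows the path layout shown above; the function name s3_partition_prefix and the default bucket name are assumptions, and file naming within the prefix is left out.

```python
from datetime import datetime, timezone


def s3_partition_prefix(service_name, timestamp, bucket="logs"):
    """Map a record's service and ISO 8601 timestamp to its hourly prefix."""
    # Normalize the trailing "Z" so fromisoformat accepts it on older Pythons.
    ts = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    ts = ts.astimezone(timezone.utc)  # partition by UTC hour
    return (f"s3://{bucket}/{service_name}/"
            f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/")
```

For the sample record in the schema, this yields s3://logs/product-api/year=2024/month=01/day=15/hour=10/.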
Athena Query Interface
AWS Athena (or any query engine over S3) is configured with an external table:
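CREATE EXTERNAL TABLE access_logs (
  request_id STRING,
  `timestamp` TIMESTAMP,  -- reserved word, so it must be backquoted
  status_code INT,
  latency_ms INT,
  ...
)
-- Partition columns are declared only here, not in the column list,
-- and PARTITIONED BY must precede STORED AS.
PARTITIONED BY (service_name STRING, year INT, month INT, day INT, hour INT)
STORED AS PARQUET
LOCATION 's3://logs/'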
Queries that filter on a specific service and date range scan only the matching partitions, so typical analyses over billions of archived records finish in seconds instead of requiring a full scan.
Real-Time Stream Processing
A Flink job consumes from the access-logs Kafka topic in real time, computing:
- Error rate per service per minute (alert if > 1% for 3 consecutive minutes)
- Latency p99 per endpoint per minute
- Unusual IP address behavior (single IP > 1,000 requests per minute → scraper alert)
- Sudden traffic spike per service (3x normal rate → capacity alert)
Flink emits alerts to PagerDuty and publishes real-time metrics to Prometheus via a metrics sink.
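The error-rate rule (above 1% for 3 consecutive minutes) can be sketched in plain Python as below. This is a batch stand-in for the windowed aggregation, not Flink API code; the function name, the tuple input format, and the choice to count only 5xx responses as errors are assumptions.

```python
from collections import defaultdict


def error_rate_alerts(records, threshold=0.01, consecutive=3):
    """records: iterable of (service_name, epoch_seconds, status_code).
    Returns (service, minute) pairs at which the error rate has exceeded
    `threshold` for `consecutive` back-to-back one-minute windows."""
    # (service, minute) -> [request count, error count]
    totals = defaultdict(lambda: [0, 0])
    for service, ts, status in records:
        cell = totals[(service, int(ts // 60))]
        cell[0] += 1
        if status >= 500:  # assumption: only 5xx counts as an error
            cell[1] += 1

    rates = defaultdict(dict)
    for (service, minute), (n, errs) in totals.items():
        rates[service][minute] = errs / n

    alerts = []
    for service in sorted(rates):
        streak, prev = 0, None
        for minute in sorted(rates[service]):
            breached = rates[service][minute] > threshold
            contiguous = prev is not None and minute == prev + 1
            if breached:
                # A gap in traffic resets the streak.
                streak = streak + 1 if contiguous and streak > 0 else 1
            else:
                streak = 0
            if streak == consecutive:  # fire once per sustained breach
                alerts.append((service, minute))
            prev = minute
    return alerts
```

Requiring several consecutive windows trades a few minutes of detection delay for fewer pages on short-lived blips.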
Log Sampling
For very high-traffic services (millions of requests per minute), 100% logging is expensive. Selective sampling:
- Success responses (2xx): sample 10%
- Error responses (4xx, 5xx): 100%
- Slow responses (latency > p99 threshold): 100%
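The sampling decision is made at the log agent using a consistent hash of request_id, so the same request is always kept or dropped no matter which agent evaluates it. Dashboards scale counts from sampled records by the inverse of the sampling rate (10x at 10% sampling); errors and slow responses are logged in full and are not scaled.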
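A minimal sketch of this sampling rule follows. The function name should_log, the use of SHA-256, and the slow_threshold_ms parameter are assumptions; the text does not say how 3xx responses are treated, so this sketch samples everything below 400 that is not slow.

```python
import hashlib


def should_log(request_id, status_code, latency_ms, slow_threshold_ms,
               success_rate=0.10):
    """Keep every error and slow response; keep a deterministic
    fraction of the remaining responses based on a hash of request_id."""
    if status_code >= 400 or latency_ms > slow_threshold_ms:
        return True
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate
```

Because the decision depends only on request_id, rerunning the pipeline over the same input keeps exactly the same records.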
Log Retention Policy
- Hot (Elasticsearch): 7 days. Full-text search, sub-second queries for recent debugging. Expensive per GB.
- Warm (S3 Parquet): 90 days. SQL queries via Athena. Low cost, seconds to minutes for complex queries.
- Cold (S3 Glacier): 7 years. Covers long retention mandates under regimes such as PCI-DSS, SOC 2, and HIPAA (exact periods vary by regime). Retrieval takes hours; cost is minimal.
S3 lifecycle policies automate the transition: objects move from S3 Standard → S3 Infrequent Access → S3 Glacier automatically based on age.
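One way to express these transitions is a lifecycle configuration in the dictionary shape that boto3's put_bucket_lifecycle_configuration accepts as its LifecycleConfiguration argument. The sketch below only builds the dictionary. The rule ID, the 30-day move to Infrequent Access, and expiring objects after 7 years of total age are assumptions; the 90-day Glacier transition follows the warm-tier window above.

```python
def lifecycle_configuration(ia_after_days=30, glacier_after_days=90,
                            expire_after_days=7 * 365):
    """Build an S3 lifecycle configuration for the log bucket."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("transition ages must be strictly increasing")
    return {
        "Rules": [
            {
                "ID": "access-logs-tiering",   # hypothetical rule name
                "Filter": {"Prefix": ""},      # apply to every object
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                # Assumption: objects are deleted once retention ends.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }
```

Keeping the ages as parameters makes it easy to tighten or extend retention per bucket without editing the rule structure.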
PII in Logs and Log-Based Metrics
Before writing to any sink, the log agent strips or hashes PII from query_params and path: remove password fields, hash user-identifying values, drop request bodies. Refer to the PII scrubber design for the detection pipeline.
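The query_params part of this scrubbing might look like the sketch below. The key lists, the function name, and the truncated keyed hash are illustrative assumptions, not the referenced scrubber's design; a real deployment would load the key from a secret store rather than a default argument.

```python
import hashlib
import hmac
from urllib.parse import parse_qsl, urlencode

DROP_KEYS = {"password", "passwd", "token"}   # hypothetical: removed entirely
HASH_KEYS = {"user_id", "email"}              # hypothetical: replaced by a hash


def scrub_query_params(query, key=b"example-key"):
    """Drop secret-bearing parameters and hash user-identifying ones."""
    cleaned = []
    for name, value in parse_qsl(query, keep_blank_values=True):
        lowered = name.lower()
        if lowered in DROP_KEYS:
            continue
        if lowered in HASH_KEYS:
            # Keyed hash so values stay joinable across logs without
            # being reversible by a simple dictionary lookup.
            mac = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
            value = mac.hexdigest()[:16]
        cleaned.append((name, value))
    return urlencode(cleaned)
```

Hashing instead of dropping user identifiers preserves the ability to group requests by user during debugging.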
The Kafka consumer also extracts RED metrics (Rate, Errors, Duration) from the log stream and pushes them to Prometheus, providing per-endpoint metrics without requiring services to instrument individually.
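A rough sketch of deriving RED metrics from a window of log records is shown below. It only computes the aggregates; pushing them to Prometheus is omitted. The function name, the per-endpoint grouping key, and counting only 5xx as errors are assumptions.

```python
from collections import defaultdict


def red_metrics(records, window_s=60):
    """Aggregate one window of access-log records into
    Rate, Errors, and Duration figures per (service, method, path)."""
    acc = defaultdict(lambda: {"requests": 0, "errors": 0,
                               "duration_ms_sum": 0})
    for r in records:
        m = acc[(r["service_name"], r["method"], r["path"])]
        m["requests"] += 1
        m["errors"] += 1 if r["status_code"] >= 500 else 0  # assumption: 5xx only
        m["duration_ms_sum"] += r["latency_ms"]
    return {
        key: {**m, "rate_per_s": m["requests"] / window_s}
        for key, m in acc.items()
    }
```

If sampling is enabled upstream, these counts describe only the kept records and need the same upscaling as the dashboards.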