Load Testing Service Low-Level Design: Scenario Definition, Worker Distribution, and Real-Time Metrics

What Is a Load Testing Service?

A load testing service generates synthetic traffic against a target system to measure its behavior under stress. Unlike single-machine tools, a distributed load testing service coordinates many worker nodes to produce traffic volumes that exceed what one machine can generate, while aggregating results into coherent real-time metrics and final reports.

Requirements

Functional Requirements

  • Define test scenarios with target URLs, HTTP methods, headers, body templates, and think-time distributions.
  • Specify load profiles: constant rate, ramp-up, step function, or spike patterns.
  • Distribute scenario execution across a configurable number of worker nodes.
  • Collect per-request latency, status codes, and throughput in real time.
  • Generate a final report with percentile latencies, error rate, and throughput over time.

Non-Functional Requirements

  • Support generating up to 500,000 requests per second across the worker fleet.
  • Real-time metrics must update with less than 2 seconds of lag during a live test.
  • Worker failure must not abort the test; the coordinator must redistribute load.
  • Scenario definitions must be storable and reusable across multiple test runs.

Data Model

Scenario

  • scenario_id (UUID)
  • name, description
  • steps (JSONB array: method, url_template, headers, body_template, think_time_ms)
  • virtual_users, duration_seconds, ramp_up_seconds
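As a concrete sketch, a scenario record matching the fields above might look like the following. All values (URLs, think times, the body template shape) are illustrative assumptions, not part of the design:

```python
import uuid

# Hypothetical scenario definition; field names follow the data model above,
# all values are placeholders for illustration.
scenario = {
    "scenario_id": str(uuid.uuid4()),
    "name": "checkout-flow",
    "description": "Browse a product, then add it to the cart",
    "steps": [
        {"method": "GET", "url_template": "/products/{product_id}",
         "headers": {"Accept": "application/json"},
         "body_template": None, "think_time_ms": 500},
        {"method": "POST", "url_template": "/cart",
         "headers": {"Content-Type": "application/json"},
         "body_template": '{"product_id": "{product_id}", "qty": 1}',
         "think_time_ms": 1000},
    ],
    "virtual_users": 1000,
    "duration_seconds": 300,
    "ramp_up_seconds": 60,
}
```

The `steps` value maps directly to the JSONB array column; each element carries the per-step request shape plus the think time to pause before the next step.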

TestRun

  • run_id (UUID)
  • scenario_id (foreign key)
  • status (ENUM: queued, running, completed, failed, aborted)
  • worker_count, started_at, ended_at
  • target_base_url

Raw per-request events are written to a time-series store rather than the relational DB. The relational store holds scenario definitions, run metadata, and aggregated report snapshots.

Core Algorithms

Load Distribution

The coordinator divides the target virtual user count evenly across available workers. If virtual_users = 1000 and worker_count = 10, each worker runs 100 virtual users. Each virtual user is a coroutine that loops through the scenario steps, pausing for the configured think time between steps. The coordinator sends each worker its assigned virtual user count and the scenario payload via a control channel (gRPC stream or message queue topic).
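A minimal sketch of the split. The text does not specify what happens when the count does not divide evenly; here the remainder is assumed to go to the first workers:

```python
def split_virtual_users(virtual_users: int, worker_count: int) -> list[int]:
    """Divide virtual users as evenly as possible across workers.

    The first (virtual_users % worker_count) workers each take one extra
    user; how a real coordinator breaks this tie is an implementation choice.
    """
    base, extra = divmod(virtual_users, worker_count)
    return [base + (1 if i < extra else 0) for i in range(worker_count)]

print(split_virtual_users(1000, 10))  # ten workers x 100 users each
print(split_virtual_users(1000, 7))   # uneven case: counts differ by at most 1
```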

Token Bucket Rate Limiting Per Worker

For scenarios specifying a fixed request rate rather than a virtual user count, each worker uses a token bucket to pace its request emission. The bucket refills at rate/worker_count tokens per second. This prevents bursty behavior and ensures aggregate throughput matches the configured target rate.
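A sketch of the per-worker bucket. The refill rate is the global target divided by the worker count, as described above; the burst capacity value is an assumption:

```python
import time

class TokenBucket:
    """Paces request emission at `rate` tokens/second, with bursts
    capped at `capacity` tokens."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        # Refill based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# Example: a 500,000 rps target spread over 100 workers gives each worker
# a 5,000 tokens/sec bucket (capacity chosen here to allow a 1-second burst).
bucket = TokenBucket(rate=500_000 / 100, capacity=5_000)
```

A worker's send loop would call `try_acquire()` before each request and sleep briefly when it returns false, smoothing emission to the configured rate.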

Real-Time Metric Aggregation

Each worker maintains a local HDR histogram of response latencies, reset every 5 seconds. At each flush interval, the worker serializes the histogram and sends it to the coordinator. The coordinator merges histograms from all workers using HDR Histogram merge semantics, producing accurate global percentiles (p50, p95, p99) without requiring all raw data points to travel to a central location.
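The merge step can be sketched with plain bucketed counters standing in for HDR histograms (a real implementation would use an HDR Histogram library; the bucket boundaries here are illustrative). Because all workers share identical bucket boundaries, the coordinator simply adds counts bucket-wise:

```python
from collections import Counter

def merge_histograms(histograms):
    """Merge per-worker latency histograms (bucket -> count) by summing
    counts; with shared bucket boundaries this loses no precision."""
    merged = Counter()
    for h in histograms:
        merged.update(h)
    return merged

def percentile(hist, p):
    """Approximate percentile: the first bucket where the cumulative
    count reaches p percent of the total."""
    total = sum(hist.values())
    threshold = total * p / 100.0
    cum = 0
    for bucket in sorted(hist):
        cum += hist[bucket]
        if cum >= threshold:
            return bucket
    return None

# Two workers' 5-second snapshots; keys are latency buckets in ms.
w1 = Counter({10: 90, 50: 8, 200: 2})
w2 = Counter({10: 80, 50: 15, 200: 5})
merged = merge_histograms([w1, w2])
print(percentile(merged, 99))  # 200
```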

Scalability

Workers are stateless executors and can be added or removed mid-test. When a new worker joins, the coordinator redistributes virtual users: existing workers shed users proportionally, and the new worker takes on its share. When a worker drops off, its virtual users are reassigned to surviving workers up to their configured maximum capacity. If capacity is exhausted, the test continues at reduced load rather than failing.
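The failure path above can be sketched as follows. Worker names, the fill order, and the "drop leftover load" behavior when capacity runs out are assumptions consistent with the description:

```python
def rebalance_on_leave(assignments: dict, failed: str, capacity: dict) -> dict:
    """Reassign a failed worker's virtual users to survivors, respecting
    each survivor's configured maximum; any leftover load is dropped and
    the test continues degraded."""
    assignments = dict(assignments)
    orphaned = assignments.pop(failed)
    for worker in sorted(assignments):
        if orphaned == 0:
            break
        headroom = capacity[worker] - assignments[worker]
        take = min(headroom, orphaned)
        assignments[worker] += take
        orphaned -= take
    return assignments

before = {"w1": 100, "w2": 100, "w3": 100}
cap = {"w1": 150, "w2": 150, "w3": 150}
print(rebalance_on_leave(before, "w3", cap))  # {'w1': 150, 'w2': 150}
```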

The coordinator uses a distributed lock (Redis SETNX with TTL) to ensure only one coordinator instance is active per run, preventing duplicate metric processing during coordinator failover.
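The lease semantics can be sketched with an in-memory stand-in for Redis (a real deployment would issue `SET key value NX EX ttl` against Redis itself; the key format here is illustrative):

```python
import time

class LeaseStore:
    """In-memory stand-in for Redis SET ... NX EX: acquire succeeds only
    if the key is absent or its previous lease has expired."""

    def __init__(self):
        self._data = {}  # key -> (owner, expires_at)

    def acquire(self, key: str, owner: str, ttl: float) -> bool:
        now = time.monotonic()
        current = self._data.get(key)
        if current is None or current[1] <= now:
            self._data[key] = (owner, now + ttl)
            return True
        return False

store = LeaseStore()
assert store.acquire("run:42:coordinator", "coord-a", ttl=10.0)      # primary wins
assert not store.acquire("run:42:coordinator", "coord-b", ttl=10.0)  # standby blocked
```

The active coordinator refreshes its lease well before the TTL elapses; a standby that sees the lease expire takes over, which matches the failover behavior described under Failure Modes.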

API Design

  • POST /scenarios — create a reusable test scenario definition.
  • POST /runs — start a test run from a scenario; specify target URL, worker count, and optional overrides.
  • GET /runs/{run_id}/metrics — stream real-time aggregated metrics (throughput, p99 latency, error rate) via Server-Sent Events.
  • POST /runs/{run_id}/abort — stop a running test immediately.
  • GET /runs/{run_id}/report — retrieve the final report including time-series data, percentile summaries, and per-step breakdown.
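As a sketch, a `POST /runs` request body might carry the fields below. The exact override shape is an assumption; the field names follow the TestRun model:

```python
import json
import uuid

# Hypothetical POST /runs payload: start a run from a stored scenario,
# overriding its virtual user count and duration for this run only.
start_run = {
    "scenario_id": str(uuid.uuid4()),
    "target_base_url": "https://staging.example.com",
    "worker_count": 20,
    "overrides": {"virtual_users": 2000, "duration_seconds": 600},
}
body = json.dumps(start_run)
```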

Failure Modes

  • Target system returns 5xx: Errors are recorded and counted; workers continue sending traffic unless the error rate exceeds a configured abort threshold.
  • Worker loses network to coordinator: Worker continues executing locally, buffering metrics. On reconnect, buffered histogram snapshots are flushed in order.
  • Coordinator crash: Standby coordinator detects expired lease, takes over, and re-registers with all workers via their advertised addresses.
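The worker-side buffering described above can be sketched with a bounded queue; the snapshot retention window and the `send` callback shape are assumptions:

```python
from collections import deque

class MetricBuffer:
    """Worker-side buffer: histogram snapshots queue locally while the
    coordinator link is down and flush oldest-first on reconnect."""

    def __init__(self, maxlen: int = 120):  # e.g. 10 minutes of 5-second snapshots
        self.pending = deque(maxlen=maxlen)

    def record(self, snapshot) -> None:
        self.pending.append(snapshot)

    def flush(self, send) -> bool:
        """Drain in order; `send` returns False if the link drops again,
        leaving unsent snapshots queued for the next attempt."""
        while self.pending:
            if not send(self.pending[0]):
                return False
            self.pending.popleft()
        return True
```

Bounding the queue means a very long partition drops the oldest snapshots rather than exhausting worker memory, a trade-off consistent with "the test continues at reduced fidelity."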

Observability

Expose live charts of requests per second, p50/p95/p99 latency, error rate, and active virtual user count per worker. Post-run reports should include a time-series graph for each metric, a latency histogram, and a table of per-endpoint statistics. Store reports indefinitely so teams can compare load test results across releases.

Frequently Asked Questions

What is a scenario definition DSL in a load testing system?

A scenario DSL is a declarative language (often YAML or a thin scripting layer) that lets engineers describe virtual user flows, including HTTP request sequences, think times, ramp-up/ramp-down curves, and parametrized data sets, without writing orchestration code. The DSL is compiled into an execution plan that worker nodes interpret independently.

How do distributed worker nodes coordinate during a load test?

A central controller distributes the scenario plan and target virtual-user count across worker nodes, each responsible for a shard of the load. Workers report heartbeats and partial metric aggregates back to the controller on a fixed interval. The controller merges aggregates, enforces ramp schedules, and can signal workers to stop or adjust rate if targets drift.

How are real-time throughput and latency metrics collected during a load test?

Each worker maintains a local ring buffer of request timestamps and response durations. On each reporting tick the worker computes throughput (requests/second) and latency percentiles (p50, p95, p99) from the buffer and streams them to a time-series aggregator. The controller merges per-worker streams using a merge-sort or HDR histogram merge to produce cluster-wide metrics.

What does a load test report generation pipeline produce?

The pipeline consumes the stored time-series metric data after a test run and produces a structured report containing throughput and latency charts over time, error rate breakdown by status code, percentile summary tables, and a comparison diff if a baseline run is provided. Reports can be rendered as HTML, PDF, or JSON for CI integration.
