Chaos Engineering Platform Low-Level Design: Fault Injection, Blast Radius Control, and Steady-State Hypothesis

A chaos engineering platform enables teams to intentionally inject failures into production and staging systems, validate that steady-state behavior is maintained within defined thresholds, and automatically halt experiments when conditions deteriorate. It operationalizes the principles from the Netflix Simian Army and the chaos engineering discipline into a structured, safe, and observable workflow.

Requirements

Functional

  • Define experiments with a steady-state hypothesis: a set of metrics and their acceptable ranges
  • Execute fault injection primitives: network latency, packet loss, CPU stress, memory pressure, process kill, disk I/O throttling, DNS failure
  • Scope blast radius by targeting specific services, instances, pods, or percentage of traffic
  • Continuously evaluate the steady-state hypothesis during experiment execution and auto-halt on violation
  • Provide a rollback mechanism that reverses all injected faults within 30 seconds of halt
  • Record experiment runs with timelines of injected faults, metric values, and halt events

Non-Functional

  • Auto-halt latency: under 10 seconds from hypothesis violation detection to fault rollback
  • Support 20 concurrent experiments across different services
  • Zero permanent infrastructure impact after experiment completion or halt

Data Model

  • experiments: experiment_id (UUID), name, description, hypothesis_spec (JSONB), fault_spec (JSONB), blast_radius_spec (JSONB), duration_seconds (INT), owner_id, created_at
  • experiment_runs: run_id (UUID), experiment_id, started_at, ended_at, status (ENUM: running, completed, halted, failed), halt_reason (TEXT), triggered_by (TEXT)
  • run_events: event_id (UUID), run_id, event_type (ENUM: fault_injected, fault_rolled_back, hypothesis_checked, hypothesis_violated, manual_halt), payload (JSONB), occurred_at (TIMESTAMP)
  • fault_agents: agent_id (UUID), host (TEXT), pod_name (TEXT), namespace (TEXT), capabilities (ARRAY), last_heartbeat (TIMESTAMP), status (ENUM: active, unreachable)
  • hypothesis_checks: check_id (UUID), run_id, checked_at, metric_name (TEXT), expected_range (JSONB), observed_value (FLOAT), passed (BOOL)
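
For illustration, the three JSONB spec columns for a latency experiment might look like the following (field names beyond those defined in the data model are hypothetical, not a fixed schema):

```json
{
  "hypothesis_spec": {
    "probes": [
      {
        "name": "checkout_error_rate",
        "source": "prometheus",
        "query": "sum(rate(http_requests_total{job=\"checkout\",code=~\"5..\"}[1m])) / sum(rate(http_requests_total{job=\"checkout\"}[1m]))",
        "range": { "min": 0.0, "max": 0.001 },
        "interval_seconds": 5
      }
    ]
  },
  "fault_spec": {
    "type": "network_latency",
    "params": { "delay_ms": 200, "jitter_ms": 50 }
  },
  "blast_radius_spec": {
    "strategy": "percentage",
    "service": "checkout",
    "percent": 10
  }
}
```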

Core Algorithms

Blast Radius Scoping

The blast_radius_spec supports three targeting strategies:

  • Percentage-based: inject the fault into X percent of a target service's instances, selected randomly using reservoir sampling
  • Label-based (Kubernetes): target pods matching a label selector (e.g., app=payment-service, env=staging)
  • Explicit: a fixed list of agent IDs

Before execution, the platform resolves the target set, caps it at a configurable maximum (default 20 percent of a service's instances), and requires operator approval for experiments exceeding the cap. This prevents accidental full-fleet fault injection.
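
A minimal sketch of percentage-based target resolution with the safety cap, assuming a simple list of instance IDs (function and parameter names are illustrative, not the platform's actual API):

```python
import random


def resolve_targets(instances, percent, cap_percent=20, seed=None):
    """Select `percent` of instances via reservoir sampling; reject
    requests exceeding the blast-radius cap (operator approval path)."""
    if percent > cap_percent:
        raise ValueError(
            f"requested {percent}% exceeds cap of {cap_percent}%; "
            "operator approval required"
        )
    k = max(1, len(instances) * percent // 100)
    rng = random.Random(seed)
    # Reservoir sampling (Algorithm R): fill the reservoir with the
    # first k items, then replace items with decreasing probability so
    # every instance has an equal chance of selection.
    reservoir = []
    for i, inst in enumerate(instances):
        if i < k:
            reservoir.append(inst)
        else:
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = inst
    return reservoir
```

Reservoir sampling is a good fit here because the instance list may be streamed from a service-discovery API of unknown length.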

Steady-State Hypothesis Evaluation

The hypothesis_spec defines a list of probes, each with a metric source (Prometheus query, HTTP health check, or SLO compliance percentage), an acceptable range (min, max), and a measurement interval. A hypothesis evaluator runs each probe on the interval and compares the observed value to the range. A violation is defined as the observed value falling outside the range for two consecutive checks (to filter transient noise). On violation, the evaluator emits a hypothesis_violated event to the run event log and signals the fault controller to initiate rollback.
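
The two-consecutive-checks rule can be sketched as a small state machine per probe (class and method names are illustrative):

```python
class HypothesisProbe:
    """Flags a violation only after N consecutive out-of-range checks,
    filtering transient metric noise (N=2 per the design above)."""

    def __init__(self, min_value, max_value, consecutive=2):
        self.min = min_value
        self.max = max_value
        self.consecutive = consecutive
        self.failures = 0

    def check(self, observed):
        """Record one observation; return True if the steady-state
        hypothesis is now considered violated."""
        if self.min <= observed <= self.max:
            self.failures = 0  # any in-range reading resets the streak
            return False
        self.failures += 1
        return self.failures >= self.consecutive
```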

Fault Injection and Rollback

Fault agents run as privileged DaemonSet pods (Kubernetes) or systemd services (VMs). Each agent exposes a gRPC API: InjectFault(FaultSpec) and RollbackFault(FaultID). Network faults use Linux tc (traffic control) with netem to add latency or simulate packet loss. CPU stress uses the stress-ng tool. Process kill sends SIGKILL to the target PID. Each injected fault is registered in the agent with a fence token; rollback commands use the token to ensure idempotency. On rollback, the agent reverses tc rules and terminates stress-ng processes, then confirms completion back to the platform.
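
The fence-token idempotency can be sketched with an in-memory registry (a simplified stand-in for the agent's gRPC handlers; the actual fault application is shown only as comments):

```python
class FaultAgent:
    """Sketch of the agent-side fault registry. Tokens make rollback
    idempotent: rolling back an unknown or already-cleared token is a
    no-op, so a late retry never disturbs state that was already
    cleaned up manually."""

    def __init__(self):
        self._active = {}      # fence token -> fault spec
        self._next_token = 0

    def inject(self, fault_spec):
        """Register a fault and return its fence token."""
        self._next_token += 1
        token = self._next_token
        self._active[token] = fault_spec
        # Real agent: apply the fault here, e.g. shell out to
        # `tc qdisc add dev eth0 root netem delay 200ms` for latency.
        return token

    def rollback(self, token):
        """Return True if a fault was reversed, False if already clean."""
        fault = self._active.pop(token, None)
        if fault is None:
            return False       # idempotent: nothing to do
        # Real agent: reverse the fault here, e.g.
        # `tc qdisc del dev eth0 root netem`, or SIGTERM the stress-ng
        # processes, then confirm completion to the platform.
        return True
```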

Scalability and Architecture

The platform has three main components: the Experiment Controller (orchestrates run lifecycle, communicates with agents via gRPC), the Hypothesis Evaluator (queries metrics sources on a polling loop), and the Agent Fleet (per-host fault executors). Experiment state is stored in Postgres; run events are written as an append-only log for audit purposes.

  • The Experiment Controller uses a distributed lock (Postgres advisory lock or Redis SETNX) to prevent duplicate runs of the same experiment concurrently
  • Agent heartbeat: each agent writes last_heartbeat every 5 seconds; agents not seen for 30 seconds are marked unreachable and their active faults are flagged for manual review
  • Auto-halt is implemented as a separate goroutine within the controller that subscribes to hypothesis evaluator events; it does not depend on the polling loop, ensuring sub-10-second halt latency
  • Metrics integration: the evaluator supports PromQL queries against a Prometheus-compatible backend (Thanos, VictoriaMetrics) and HTTP probes with configurable timeouts
  • Audit log: all run_events are exported to an immutable S3-backed store (object lock enabled) for compliance and post-incident analysis
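
The heartbeat-based liveness check above can be sketched as follows (the dict-of-timestamps input shape is illustrative; the real platform reads last_heartbeat from the fault_agents table):

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_TIMEOUT = timedelta(seconds=30)


def classify_agents(agents, now=None):
    """Partition agents into active vs unreachable based on
    last_heartbeat, mirroring the 5s-write / 30s-timeout scheme.
    `agents` maps agent_id -> last_heartbeat datetime."""
    now = now or datetime.now(timezone.utc)
    active, unreachable = [], []
    for agent_id, last_seen in agents.items():
        if now - last_seen > HEARTBEAT_TIMEOUT:
            # Unreachable agents keep their active faults flagged
            # for manual review rather than auto-rolled-back.
            unreachable.append(agent_id)
        else:
            active.append(agent_id)
    return active, unreachable
```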

API Design

Experiment Management

  • POST /v1/experiments — create an experiment definition with hypothesis, fault, and blast radius specs
  • GET /v1/experiments/{experiment_id} — retrieve experiment definition
  • POST /v1/experiments/{experiment_id}/runs — start a new run; returns run_id. Supports a dry-run option that validates specs and resolves targets without injecting faults.
  • GET /v1/runs/{run_id} — run status, current hypothesis check results, active faults
  • POST /v1/runs/{run_id}/halt — manual halt; triggers immediate fault rollback
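
For illustration, a dry-run request and response might look like this (the payload fields and the truncated run_id are hypothetical beyond the options named above):

```
POST /v1/experiments/{experiment_id}/runs
{
  "dry_run": true,
  "triggered_by": "gameday-2024-q3"
}

HTTP/1.1 200 OK
{
  "run_id": "b7e1c9d4-...",
  "status": "validated",
  "resolved_target_count": 4
}
```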

Run Observability

  • GET /v1/runs/{run_id}/events?type=STRING — filtered event timeline for a run
  • GET /v1/runs/{run_id}/hypothesis-checks — all hypothesis check results with pass/fail status and metric values

Agent Management

  • GET /v1/agents?status=active — list active agents and their capabilities
  • GET /v1/agents/{agent_id}/active-faults — faults currently injected by a specific agent

Interview Tips

Key interview angles:

  • Preventing a chaos experiment from cascading into an actual outage: blast radius caps, mandatory dry-run mode, auto-halt, and a kill switch that revokes all active faults platform-wide
  • Handling the case where the rollback itself fails: idempotent rollback with retry and alerting; the fence token pattern ensures a late rollback command does not re-inject a fault that was already cleaned up manually
  • The gameday workflow: define the hypothesis before the experiment, not after, to prevent retrofitting the hypothesis to observed behavior

Also discuss the difference between chaos engineering and fault injection testing: chaos engineering targets production with real traffic to discover unknown weaknesses, while fault injection testing targets staging with synthetic load to verify known failure modes.

FAQ

What are the core fault injection primitives in a chaos engineering platform?

Core primitives include latency injection (add artificial delay to network calls), error injection (return HTTP 5xx or an exception at a configured rate), resource exhaustion (CPU spike, memory pressure, disk fill), network partition (drop or block traffic between service pairs), and process kill (terminate a pod or instance). Each primitive is parameterized by target selector, fault type, magnitude, and duration.

How is blast radius scoped in a chaos engineering experiment?

Blast radius is constrained by targeting a specific subset of instances (e.g., 10 percent of pods in one AZ, one canary deployment, or a single user cohort via feature flag). The experiment definition requires an explicit target selector (labels, instance IDs, or traffic percentage), and the platform rejects experiments that would affect more than a configured ceiling of production capacity, enforced at scheduling time.

What is a steady-state hypothesis and how is it validated?

A steady-state hypothesis defines measurable conditions that must hold before and after an experiment to confirm the system behaved normally: e.g., error rate < 0.1%, p99 latency < 300 ms, health check returns 200. The platform probes these metrics before injecting faults (baseline), continuously during the experiment, and again after recovery. If post-experiment metrics do not return to baseline, the hypothesis fails and the system is considered not resilient to that fault class.

How does auto-halt on SLO breach work during a chaos experiment?

The platform continuously polls SLO indicators (error budget burn rate, latency percentiles, availability) during execution. If any indicator crosses a pre-configured abort threshold, the platform immediately rolls back all injected faults, emits a halt event with the triggering metric and value, and locks the experiment from auto-rerunning. The abort threshold is typically set tighter than the SLO itself to provide a safety margin before real user impact occurs.

