Chaos engineering is the discipline of intentionally injecting failures into production systems to discover weaknesses before they cause outages. The premise: distributed systems fail in unexpected ways, and the only way to know how a system behaves under failure is to test it. A well-designed chaos engineering practice transforms unknown unknowns (failures you don’t know can happen) into known knowns (failure modes you understand and have mitigated).
The Chaos Engineering Hypothesis
Every chaos experiment starts with a hypothesis: “We believe that the payment service will continue to process orders even if the inventory service is unavailable, because we have implemented a circuit breaker and graceful degradation.” The experiment either validates the hypothesis (the system behaved as expected) or falsifies it (a new weakness is discovered). Chaos engineering is not random destruction — it is controlled scientific experimentation.
Experiment Design
Define Steady State
Before introducing chaos, define what “normal” looks like: p99 latency < 200ms, error rate < 0.1%, availability > 99.5%. These are the metrics you will monitor during the experiment. If the experiment causes steady state to deviate beyond acceptable bounds, the abort condition triggers and chaos injection stops. Steady state metrics come from your monitoring system; if you don’t have reliable metrics, fix observability first and defer chaos engineering until you can measure its impact.
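A steady-state abort check can be sketched as a small polling script. The Prometheus address, the PromQL query, and the 200ms bound below are illustrative assumptions, not a prescribed setup:

```shell
# Hedged sketch of a steady-state abort check. The Prometheus address, the
# PromQL query, and the 200ms bound are assumptions for illustration.
PROM_URL="http://prometheus:9090"
QUERY='histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))'
THRESHOLD_MS=200

# Returns success (0) when the observed p99, in ms, exceeds the abort bound.
breached() {
  awk -v v="$1" -v t="$THRESHOLD_MS" 'BEGIN { exit !(v > t) }'
}

p99_ms=$(curl -s --max-time 2 "$PROM_URL/api/v1/query" \
  --data-urlencode "query=$QUERY" \
  | jq -r '(.data.result[0].value[1] | tonumber) * 1000' 2>/dev/null)

if breached "$p99_ms"; then
  echo "steady state violated (p99=${p99_ms}ms): aborting chaos injection"
fi
```

In practice this loop runs continuously during the experiment and the abort branch calls the kill switch rather than just printing.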
Failure Injection Methods
Network failures: packet loss, latency injection, network partition between specific services. Tools: tc (Linux traffic control), Toxiproxy (proxy that adds configurable network conditions), AWS Fault Injection Simulator.
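The latency and packet-loss injections above can be reproduced directly with tc/netem. The interface name eth0 and the parameter values are assumptions, and the commands require root:

```shell
# Add 100ms latency and 5% packet loss on egress (requires root; eth0 assumed):
tc qdisc add dev eth0 root netem delay 100ms loss 5%

# Verify the qdisc is active:
tc qdisc show dev eth0

# Always clean up when the experiment ends or the abort condition fires:
tc qdisc del dev eth0 root netem
```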
Process failures: kill a process, crash a container, trigger an OOM kill. In Kubernetes, delete pods at random (Chaos Monkey, Chaos Mesh). This simulates service instances becoming unavailable, as happens during rolling deployments.
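A minimal random pod-kill can be sketched with kubectl alone; the “payments” namespace is an assumption, and cluster access is required. Tools like Chaos Monkey and Chaos Mesh do the same thing with scheduling and guardrails:

```shell
# Pick one pod at random in the target namespace and delete it, simulating an
# instance crash. NAMESPACE is an assumption for illustration.
NAMESPACE="payments"
victim=$(kubectl get pods -n "$NAMESPACE" -o name | shuf -n 1)
echo "killing $victim"
kubectl delete "$victim" -n "$NAMESPACE" --wait=false
```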
Resource exhaustion: CPU spike (cpu-burn), memory pressure (memory-hog), disk full, file descriptor exhaustion. This identifies services that fail ungracefully under resource pressure.
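stress-ng is one common tool that covers several of these exhaustion modes in a single binary (an alternative to the single-purpose tools named above); the worker counts, sizes, and timeouts below are illustrative assumptions:

```shell
# Each run is bounded by --timeout so the experiment self-terminates.
stress-ng --cpu 4 --timeout 60s                 # burn 4 CPU workers for 60s
stress-ng --vm 2 --vm-bytes 75% --timeout 60s   # hold ~75% of memory for 60s
stress-ng --hdd 1 --hdd-bytes 1G --timeout 60s  # disk write pressure, 1GB
```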
Dependency failures: inject errors into database calls (500ms latency, 50% error rate), make a third-party API return 503. Validates that circuit breakers and fallbacks work as designed.
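With Toxiproxy, the 500ms-latency dependency failure above looks roughly like this. The proxy name, ports, and the default toxic name are assumptions, and a running toxiproxy-server is required:

```shell
# Route the app's database traffic through a proxy, then add 500ms of latency.
toxiproxy-cli create -l localhost:5433 -u localhost:5432 postgres
toxiproxy-cli toxic add -t latency -a latency=500 postgres

# Remove the toxic to restore normal conditions (default toxic name assumed):
toxiproxy-cli toxic remove -n latency_downstream postgres
```

The application connects to localhost:5433 instead of the real database port, so toxics can be added and removed without touching the service.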
Blast Radius Control
Start small and expand: begin with a single container in a staging environment, then a single availability zone in production, then multiple AZs. Always define the blast radius before starting: which services are affected, what percentage of traffic is impacted, what the abort condition is. Never run chaos experiments on your primary database without extensive preparation. Use feature flags to enable/disable chaos injection without deployment.
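The feature-flag kill switch can be sketched as a gate that every injection must pass; the variable names and the 1% default are assumptions:

```shell
# Hedged sketch of a chaos kill switch: every injection is gated behind a flag
# and a traffic percentage, so chaos can be disabled without a deployment.
CHAOS_ENABLED="${CHAOS_ENABLED:-false}"
CHAOS_PERCENT="${CHAOS_PERCENT:-1}"   # affect at most this % of requests

should_inject() {
  [ "$CHAOS_ENABLED" = "true" ] || return 1
  [ $(( RANDOM % 100 )) -lt "$CHAOS_PERCENT" ]
}

if should_inject; then
  echo "injecting fault"
fi
```

Flipping CHAOS_ENABLED to false stops all injection immediately, which doubles as the abort mechanism.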
GameDay Practice
A GameDay is a scheduled chaos experiment where the on-call team runs planned failure scenarios and observes the system’s response. Format: 2-4 hours, pre-defined scenarios (AZ failure, dependency outage, traffic spike), live monitoring, team observes and responds as if it were a real incident. The goal is not just to test the system but to test the team’s incident response: Can they detect the failure? How quickly? Do runbooks work? Is on-call tooling effective? GameDays build muscle memory and expose gaps in both systems and processes.
Observability Prerequisites
Chaos engineering without observability is dangerous. Before running experiments: ensure every service emits metrics (request rate, error rate, latency), distributed tracing is enabled (to trace failures through the call graph), structured logging is in place (to correlate events), and alerting works (to detect when steady state is violated). If you inject a failure and cannot observe its impact in the metrics within 30 seconds, your observability is insufficient for chaos engineering.
Tools
Netflix Chaos Monkey: randomly terminates EC2 instances during business hours. Chaos Mesh: Kubernetes-native chaos — pod kill, network delay, IO fault injection via CRDs. AWS Fault Injection Simulator: managed service for EC2, RDS, EKS failures with rollback. Gremlin: commercial platform with experiment templates and blast radius controls. Toxiproxy: proxy for simulating network conditions in testing environments. Start with Toxiproxy in staging; graduate to Chaos Mesh in production as your practice matures.
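As a concrete starting point with Chaos Mesh, an experiment is declared as a CRD and applied with kubectl. This sketch assumes Chaos Mesh is installed, and the namespaces and label selector are assumptions:

```shell
# Apply a minimal Chaos Mesh PodChaos experiment that kills one matching pod.
kubectl apply -f - <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                # kill exactly one matching pod
  selector:
    namespaces: ["payments"]
    labelSelectors:
      app: payment-service
EOF
```

Deleting the PodChaos resource ends the experiment, which keeps cleanup as simple as `kubectl delete`.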