Chaos engineering is the discipline of intentionally injecting failures into production systems to discover weaknesses before they cause outages. The premise: distributed systems fail in unexpected ways, and the only way to know how a system behaves under failure is to test it. A well-designed chaos engineering practice transforms unknown unknowns (failures you don’t know can happen) into known knowns (failure modes you understand and have mitigated).
The Chaos Engineering Hypothesis
Every chaos experiment starts with a hypothesis: “We believe that the payment service will continue to process orders even if the inventory service is unavailable, because we have implemented a circuit breaker and graceful degradation.” The experiment either validates the hypothesis (the system behaved as expected) or falsifies it (a new weakness is discovered). Chaos engineering is not random destruction — it is controlled scientific experimentation.
Experiment Design
Define Steady State
Before introducing chaos, define what “normal” looks like, for example p99 latency < 200ms and a request success rate of at least 99.5%. These are the metrics you will monitor during the experiment. If the experiment pushes them beyond acceptable bounds, the abort condition triggers and chaos injection stops. Steady-state metrics come from your monitoring system, so if you don’t have reliable metrics, fix observability first and start chaos experiments only after that.
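The steady-state check and abort condition described above can be sketched as a small function. This is a minimal illustration, not a monitoring integration: the names SteadyState, p99, and should_abort are hypothetical, the 200 ms bound follows the text, and reading the 99.5% figure as a minimum success rate is an assumption. Treating a window with no traffic as a violation is also a design choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Acceptable bounds for the metrics watched during an experiment."""
    p99_latency_limit_ms: float  # p99 must stay strictly below this value
    min_success_rate: float      # fraction between 0.0 and 1.0


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n, which avoids floating-point rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def should_abort(latencies_ms, successes, total, steady):
    """Return True when the observed window violates steady state."""
    if total == 0 or not latencies_ms:
        # Assumption: no observed traffic is treated as a violation,
        # because the experiment can no longer be evaluated.
        return True
    success_rate = successes / total
    return (p99(latencies_ms) >= steady.p99_latency_limit_ms
            or success_rate < steady.min_success_rate)
```

In practice the latency samples and request counts would come from the monitoring system for a short window (say, the last minute), and the experiment runner would stop injection as soon as should_abort returns True.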
Failure Injection Methods
Network failures: packet loss, latency injection, network partition between specific services. Tools: tc (Linux traffic control), Toxiproxy (proxy that adds configurable network conditions), AWS Fault Injection Simulator.
Process failures: kill a process, crash a container, trigger OOM. On Kubernetes, delete pods at random (Chaos Mesh); Chaos Monkey does the equivalent at the VM-instance level. These experiments simulate service instances becoming unavailable, as happens during rolling deployments.
Resource exhaustion: CPU spike (CPU-burn), memory pressure (memory-hog), disk full, file descriptor exhaustion. Identifies services that fail ungracefully under resource pressure.
Dependency failures: inject errors into database calls (500ms latency, 50% error rate), make a third-party API return 503. Validates that circuit breakers and fallbacks work as designed.
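As a rough illustration of dependency failure injection, the sketch below wraps a dependency call so that it adds latency and fails with a configurable probability, then checks that the caller falls back instead of propagating the error. FaultInjector, InjectedFault, and get_stock are hypothetical names, and the fallback here is a plain try/except rather than a full circuit breaker.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency error."""


class FaultInjector:
    """Wraps dependency calls with added latency and random failures."""

    def __init__(self, error_rate, latency_s, rng=None, sleep=time.sleep):
        self.error_rate = error_rate      # probability in [0.0, 1.0]
        self.latency_s = latency_s        # delay added before every call
        self.rng = rng or random.Random()
        self.sleep = sleep                # injectable so tests need not wait

    def call(self, func, *args, **kwargs):
        self.sleep(self.latency_s)
        if self.rng.random() < self.error_rate:
            raise InjectedFault("injected dependency failure")
        return func(*args, **kwargs)


def get_stock(injector, fetch_stock, sku, fallback=None):
    """Caller under test: degrade to a fallback when the dependency fails."""
    try:
        return injector.call(fetch_stock, sku)
    except InjectedFault:
        return fallback
```

The figures from the text would map to FaultInjector(error_rate=0.5, latency_s=0.5). Passing a seeded random.Random makes a run reproducible, which helps when comparing results before and after a fix.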
Blast Radius Control
Start small and expand: begin with a single container in a staging environment, then a single availability zone in production, then multiple AZs. Always define the blast radius before starting: which services are affected, what percentage of traffic is impacted, what the abort condition is. Never run chaos experiments on your primary database without extensive preparation. Use feature flags to enable/disable chaos injection without deployment.
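One way to limit the share of traffic an experiment touches is to combine a kill-switch flag with deterministic bucketing of request IDs. The sketch below assumes requests carry a string ID; in_blast_radius and its parameters are hypothetical names, and the enabled flag stands in for whatever feature-flag system is already in use.

```python
import hashlib


def in_blast_radius(request_id, percent, enabled):
    """Decide whether a request should receive injected chaos.

    request_id: stable string identifying the request or user
    percent:    share of traffic to affect, from 0 to 100
    enabled:    kill switch, e.g. the value of a feature flag
    """
    if not enabled:
        return False
    # Hash the ID so the same request always lands in the same bucket.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000
    # Buckets 0..9999 give a resolution of 0.01 percent.
    return bucket < percent * 100
```

Because the decision depends only on the ID and the percentage, raising the percentage from 1 to 5 keeps the originally affected requests inside the blast radius, and flipping the flag off stops injection without a deployment.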
GameDay Practice
A GameDay is a scheduled chaos experiment where the on-call team runs planned failure scenarios and observes the system’s response. Format: 2-4 hours, pre-defined scenarios (AZ failure, dependency outage, traffic spike), live monitoring, team observes and responds as if it were a real incident. The goal is not just to test the system but to test the team’s incident response: Can they detect the failure? How quickly? Do runbooks work? Is on-call tooling effective? GameDays build muscle memory and expose gaps in both systems and processes.
Observability Prerequisites
Chaos engineering without observability is dangerous. Before running experiments: ensure every service emits metrics (request rate, error rate, latency), distributed tracing is enabled (to trace failures through the call graph), structured logging is in place (to correlate events), and alerting works (to detect when steady state is violated). If you inject a failure and cannot observe its impact in the metrics within 30 seconds, your observability is insufficient for chaos engineering.
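The 30-second rule above can be checked after a dry run by comparing the injection time with the first metric sample that shows the impact. This is a minimal sketch under stated assumptions: samples are (timestamp in seconds, error rate) pairs exported from the monitoring system, and detection_delay_s and observability_sufficient are hypothetical helper names.

```python
def detection_delay_s(injected_at, samples, threshold):
    """Seconds from injection to the first sample above the threshold.

    Returns None when no sample at or after the injection time
    exceeds the threshold, i.e. the failure was never visible.
    """
    for timestamp, value in sorted(samples):
        if timestamp >= injected_at and value > threshold:
            return timestamp - injected_at
    return None


def observability_sufficient(delay_s, limit_s=30.0):
    """Apply the 30-second rule to a measured detection delay."""
    return delay_s is not None and delay_s <= limit_s
```

A delay of None, or one above the limit, suggests that scrape intervals, aggregation windows, or missing instrumentation need attention before moving on to larger experiments.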
Tools
Netflix Chaos Monkey: randomly terminates EC2 instances during business hours.
Chaos Mesh: Kubernetes-native chaos, with pod kill, network delay, and IO fault injection defined via CRDs.
AWS Fault Injection Simulator (since renamed AWS Fault Injection Service): managed service for EC2, RDS, and EKS failures with rollback.
Gremlin: commercial platform with experiment templates and blast radius controls.
Toxiproxy: proxy for simulating network conditions in testing environments.
Start with Toxiproxy in staging; graduate to Chaos Mesh in production as your practice matures.