Chaos Engineering: Low-Level Design

Chaos engineering is the discipline of intentionally injecting failures into production systems to discover weaknesses before they cause outages. The premise: distributed systems fail in unexpected ways, and the only way to know how a system behaves under failure is to test it. A well-designed chaos engineering practice transforms unknown unknowns (failures you don’t know can happen) into known knowns (failure modes you understand and have mitigated).

The Chaos Engineering Hypothesis

Every chaos experiment starts with a hypothesis: “We believe that the payment service will continue to process orders even if the inventory service is unavailable, because we have implemented a circuit breaker and graceful degradation.” The experiment either validates the hypothesis (the system behaved as expected) or falsifies it (a new weakness is discovered). Chaos engineering is not random destruction — it is controlled scientific experimentation.
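The mechanism named in that example hypothesis can be sketched in a few lines. This is a minimal illustration, not a production circuit breaker: the class `CircuitBreaker`, the functions `process_order` and `inventory_down`, and the thresholds are all hypothetical names and values chosen for the example.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; allows a trial call after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, skip the dependency entirely until the reset window has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None   # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def inventory_down():
    """Stands in for the injected failure: the inventory service is unreachable."""
    raise ConnectionError("inventory service unavailable")


def process_order(breaker, check_inventory):
    # Graceful degradation (an assumed policy): accept the order, verify stock later.
    in_stock = breaker.call(check_inventory, fallback=lambda: None)
    return {"accepted": True, "stock_verified": in_stock is not None}
```

Calling `process_order(CircuitBreaker(), inventory_down)` repeatedly is the experiment in miniature: the hypothesis holds if every order is still accepted and the breaker stops calling the failing dependency once it opens.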

Experiment Design

Define Steady State

Before introducing chaos, define what “normal” looks like in measurable terms, for example p99 latency < 200ms, an error rate below an agreed threshold, and availability > 99.5%. These are the metrics you will monitor during the experiment. If the experiment pushes any of them beyond acceptable bounds, the abort condition triggers and chaos injection stops. Steady-state metrics come from your monitoring system, so if you don’t have reliable metrics, fix observability first and introduce chaos engineering afterwards.
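A steady-state check and its abort condition could look roughly like the sketch below. It assumes the 99.5% figure is a minimum success rate and uses a nearest-rank p99. The names `SteadyState`, `violations`, and `should_abort` are hypothetical, and a real setup would read these numbers from the monitoring system rather than from in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    max_p99_latency_ms: float = 200.0   # healthy means p99 strictly below this
    min_success_rate: float = 0.995     # assumption: 99.5% read as a success-rate floor


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = max(1, -(-len(ordered) * 99 // 100))  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def violations(state, latencies_ms, successes, total):
    """Return human-readable steady-state violations; an empty list means healthy."""
    problems = []
    observed = p99(latencies_ms)
    if observed >= state.max_p99_latency_ms:
        problems.append(f"p99 latency {observed}ms >= {state.max_p99_latency_ms}ms")
    success_rate = successes / total
    if success_rate < state.min_success_rate:
        problems.append(f"success rate {success_rate:.4f} < {state.min_success_rate}")
    return problems


def should_abort(state, latencies_ms, successes, total):
    """Abort condition: stop injecting as soon as any steady-state bound is broken."""
    return bool(violations(state, latencies_ms, successes, total))
```

An experiment loop would call `should_abort` on each fresh window of metrics and stop the injection the first time it returns `True`.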

Failure Injection Methods

Network failures: packet loss, latency injection, network partition between specific services. Tools: tc (Linux traffic control), Toxiproxy (proxy that adds configurable network conditions), AWS Fault Injection Simulator.

Process failures: kill a process, crash a container, trigger OOM. On Kubernetes, delete pods at random (Chaos Mesh); on VM fleets, terminate instances (Chaos Monkey). These experiments simulate service instances becoming unavailable, as also happens during rolling deployments.
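Random pod or instance deletion still needs a rule for how many victims to pick. The sketch below covers only that selection step. The function `pick_victims` and its parameters are hypothetical, and the actual termination is left to whichever tool performs it.

```python
import random


def pick_victims(pods, max_fraction=0.25, min_survivors=1, rng=None):
    """Choose a random subset of pods to kill, capped so some replicas always survive.

    `max_fraction` and `min_survivors` are illustrative safety limits, not values
    taken from any particular tool.
    """
    rng = rng or random.Random()
    pods = list(pods)
    # The budget is the smaller of the fractional cap and what the survivor floor allows.
    budget = min(int(len(pods) * max_fraction), len(pods) - min_survivors)
    if budget <= 0:
        return []
    return rng.sample(pods, budget)
```

With eight replicas and the defaults, at most two pods are selected; with a single replica, nothing is selected.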

Resource exhaustion: CPU spike (CPU-burn), memory pressure (memory-hog), disk full, file descriptor exhaustion. Identifies services that fail ungracefully under resource pressure.

Dependency failures: inject errors into database calls (500ms latency, 50% error rate), make a third-party API return 503. Validates that circuit breakers and fallbacks work as designed.
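At the application level, dependency faults like these can be approximated with a wrapper around the client call. The following is a minimal sketch, assuming in-process injection is acceptable for the test in question. The decorator `inject_faults` and the exception `InjectedFault` are hypothetical names, and the defaults mirror the 500ms latency and 50% error rate mentioned above.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency error while an experiment is running."""


def inject_faults(latency_s=0.5, error_rate=0.5, rng=None, sleep=time.sleep):
    """Decorator factory: delay every call, then fail a random share of calls.

    `rng` and `sleep` are injectable so the behaviour can be tested deterministically.
    """
    rng = rng or random.Random()

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            sleep(latency_s)                    # added latency on every call
            if rng.random() < error_rate:       # fail roughly error_rate of calls
                raise InjectedFault(f"injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper

    return decorator
```

Wrapping a database or HTTP client function with `@inject_faults()` lets a test confirm that the caller's circuit breaker opens and its fallback path returns a sensible result.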

Blast Radius Control

Start small and expand: begin with a single container in a staging environment, then a single availability zone in production, then multiple AZs. Always define the blast radius before starting: which services are affected, what percentage of traffic is impacted, what the abort condition is. Never run chaos experiments on your primary database without extensive preparation. Use feature flags to enable/disable chaos injection without deployment.
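One way to encode a blast radius behind a feature flag is shown below. This sketch assumes requests carry a stable ID and that the flag can be changed at runtime. `ChaosFlag` and `in_blast_radius` are hypothetical names, not part of any flag library.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ChaosFlag:
    enabled: bool = False             # kill switch: flip at runtime, no deployment needed
    traffic_percent: float = 0.0      # share of requests eligible for injection (0-100)
    services: frozenset = frozenset() # services explicitly inside the blast radius


def in_blast_radius(flag, service, request_id):
    """Decide deterministically whether this request may receive injected chaos."""
    if not flag.enabled or service not in flag.services:
        return False
    # Hash the request ID into one of 10,000 buckets so the same request always
    # gets the same answer and the affected share tracks traffic_percent.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < flag.traffic_percent * 100
```

Expanding the experiment then means raising `traffic_percent` or adding a service to `services`, and aborting means setting `enabled` to `False`.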

GameDay Practice

A GameDay is a scheduled chaos experiment where the on-call team runs planned failure scenarios and observes the system’s response. Format: 2-4 hours, pre-defined scenarios (AZ failure, dependency outage, traffic spike), live monitoring, team observes and responds as if it were a real incident. The goal is not just to test the system but to test the team’s incident response: Can they detect the failure? How quickly? Do runbooks work? Is on-call tooling effective? GameDays build muscle memory and expose gaps in both systems and processes.

Observability Prerequisites

Chaos engineering without observability is dangerous. Before running experiments: ensure every service emits metrics (request rate, error rate, latency), distributed tracing is enabled (to trace failures through the call graph), structured logging is in place (to correlate events), and alerting works (to detect when steady state is violated). If you inject a failure and cannot observe its impact in the metrics within 30 seconds, your observability is insufficient for chaos engineering.
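The 30-second rule can be turned into a simple check against recorded metrics. The sketch below assumes samples are `(timestamp_seconds, error_rate)` pairs exported from the monitoring system. The function names and the choice of error rate as the signal are illustrative.

```python
def detection_delay_s(samples, injected_at, error_rate_threshold):
    """Seconds from injection to the first sample at or above the threshold.

    Returns None if the failure never shows up in the samples.
    """
    for timestamp, error_rate in sorted(samples):
        if timestamp >= injected_at and error_rate >= error_rate_threshold:
            return timestamp - injected_at
    return None


def observability_sufficient(samples, injected_at, error_rate_threshold, budget_s=30.0):
    """True if the injected failure became visible in the metrics within the budget."""
    delay = detection_delay_s(samples, injected_at, error_rate_threshold)
    return delay is not None and delay <= budget_s
```

If this returns `False` for a deliberately obvious failure, the gap is in scrape intervals, dashboards, or alert rules, and that gap should be closed before running riskier experiments.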

Tools

Netflix Chaos Monkey: randomly terminates EC2 instances during business hours.

Chaos Mesh: Kubernetes-native chaos (pod kill, network delay, IO fault injection) driven by CRDs.

AWS Fault Injection Simulator: managed service for EC2, RDS, EKS failures with rollback.

Gremlin: commercial platform with experiment templates and blast radius controls.

Toxiproxy: proxy for simulating network conditions in testing environments.

Start with Toxiproxy in staging; graduate to Chaos Mesh in production as your practice matures.
