System Design: Chaos Engineering — Netflix Chaos Monkey, Fault Injection, Game Days, Resilience Testing, Blast Radius

Chaos engineering is the discipline of experimenting on a production system to build confidence in its ability to withstand turbulent conditions. Netflix pioneered this approach with Chaos Monkey, and it has become a standard practice at companies running distributed systems at scale. This guide covers chaos engineering principles, tools, and practical implementation — essential knowledge for SRE and system design interviews.

Why Chaos Engineering Exists

Distributed systems fail in complex, unpredictable ways. Traditional testing (unit tests, integration tests, load tests) cannot reproduce the emergent failures that occur when multiple components interact under real-world conditions. Examples: a network partition between service A and service B causes a cascade that takes down service C (which depends on neither A nor B) because of a shared connection pool. A DNS TTL expiry combined with a load balancer health check race condition causes 30 seconds of dropped traffic every 48 hours. A garbage collection pause on one node causes a Raft leader election timeout, which triggers a cluster rebalance, which overloads the remaining nodes. These failures are discovered in production during incidents, not in test environments. Chaos engineering proactively injects failures to discover weaknesses before they cause real incidents.

Chaos Engineering Principles

The Principles of Chaos Engineering (published by Netflix) define the methodology: (1) Start by defining “steady state” — the normal behavior of the system measured by business metrics (orders per minute, stream starts per second, API success rate). (2) Hypothesize that the steady state will continue in both the control group and the experimental group. (3) Introduce real-world events: server crashes, network partitions, disk failures, clock skew, dependency unavailability. (4) Try to disprove the hypothesis — look for differences in steady state between control and experiment. If the system maintains steady state despite the injected failure, confidence increases. If steady state is disrupted, you have found a weakness to fix before a real incident. The key insight: chaos experiments test the system behavior, not individual components. A service may handle a database failure correctly in isolation but fail when the retry storm from 100 instances overwhelms the database.
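The control-vs-experiment comparison above can be sketched as a small check. This is an illustrative sketch only: the metric, the 5% tolerance, and the function name are assumptions, not part of any chaos tool's API.

```python
# Sketch of the control-vs-experiment steady-state check described above.
# The metric (e.g. orders per minute) and the 5% tolerance are illustrative.

def steady_state_holds(control: list[float], experiment: list[float],
                       max_relative_drop: float = 0.05) -> bool:
    """The hypothesis survives if the experiment group's mean stays within
    max_relative_drop of the control group's mean."""
    control_mean = sum(control) / len(control)
    experiment_mean = sum(experiment) / len(experiment)
    return experiment_mean >= control_mean * (1 - max_relative_drop)

# Hypothesis holds: experiment traffic stays within 5% of control.
print(steady_state_holds([100, 102, 98], [99, 97, 101]))  # True
# Hypothesis disproved: the injected failure cut throughput ~30%.
print(steady_state_holds([100, 102, 98], [70, 68, 72]))   # False
```

In a real experiment the two series would come from the monitoring system, and a disproved hypothesis would trigger the abort path rather than a boolean return.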

Netflix Chaos Monkey and the Simian Army

Chaos Monkey randomly terminates production EC2 instances during business hours. Purpose: force every team to build services that tolerate instance failure. If your service goes down because one instance was killed, it was not resilient enough. The Simian Army extended this concept: Latency Monkey injects artificial delays into RESTful client-server communication to simulate service degradation. Conformity Monkey finds instances that do not adhere to best practices and shuts them down. Chaos Gorilla simulates an entire AWS Availability Zone going down. Chaos Kong simulates an entire AWS Region failure. Netflix runs these tools continuously in production, not just during planned experiments. The result: Netflix services are designed from day one to tolerate failures at every level — instance, zone, and region. When AWS actually experiences an outage, Netflix is typically the last service to be affected and the first to recover.

Fault Injection Techniques

Common fault injection types: (1) Process killing — terminate a service instance to test automatic restart and failover. Tools: Chaos Monkey for VM instances, kill -9 for a local process, kubectl delete pod on Kubernetes. (2) Network faults — inject latency, packet loss, or partition between services. Tools: tc (Linux traffic control) to add 500ms latency to a network interface, iptables to block traffic to a specific IP, Toxiproxy to simulate network conditions at the application level. (3) Disk faults — fill disk to 100% to test log rotation and alerting. Simulate slow I/O with dm-delay or fio. (4) CPU/memory stress — consume CPU or memory to test autoscaling and graceful degradation. Tools: stress-ng to generate CPU load, memory pressure cgroups. (5) Clock skew — shift the system clock forward or backward to test time-dependent logic (certificate validation, token expiry, scheduled jobs). (6) Dependency unavailability — block access to a database, cache, or external API to test fallback behavior and circuit breakers. Envoy proxy can inject faults at the service mesh level without modifying application code.
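The tc and iptables invocations mentioned above can be scripted. The sketch below only builds the argument lists (a dry run); actually executing them requires root privileges and a rollback command at hand. The interface name and IP are placeholders.

```python
# Build (but do not run) the fault-injection commands described above.

def netem_delay_cmd(iface: str, delay_ms: int) -> list[str]:
    """tc command that adds fixed latency to every packet leaving iface."""
    return ["tc", "qdisc", "add", "dev", iface, "root",
            "netem", "delay", f"{delay_ms}ms"]

def block_ip_cmd(ip: str) -> list[str]:
    """iptables command that drops all outbound traffic to ip,
    simulating an unreachable dependency."""
    return ["iptables", "-A", "OUTPUT", "-d", ip, "-j", "DROP"]

print(" ".join(netem_delay_cmd("eth0", 500)))
# tc qdisc add dev eth0 root netem delay 500ms
print(" ".join(block_ip_cmd("10.0.0.5")))
# iptables -A OUTPUT -d 10.0.0.5 -j DROP
```

The corresponding cleanup commands (`tc qdisc del dev eth0 root` and `iptables -D OUTPUT -d 10.0.0.5 -j DROP`) should be prepared before injection, not improvised during the experiment.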

Game Days: Structured Chaos Experiments

A game day is a planned chaos experiment with a specific hypothesis, controlled blast radius, and a team ready to observe and respond. Game day structure: (1) Define the hypothesis: “If the Redis cache becomes unavailable, the checkout service will fall back to direct database queries and maintain sub-2-second response time.” (2) Define the blast radius: limit to one environment, one region, or one percentage of traffic. Start small. (3) Define the abort criteria: if error rate exceeds 5% or P99 exceeds 10 seconds, abort immediately. (4) Execute the experiment during business hours with the team watching dashboards. (5) Observe: did the system maintain steady state? What alerts fired? How did the on-call respond? (6) Document findings: what worked, what broke, what remediation is needed. Game days are also excellent for testing incident response processes — the team practices using runbooks, escalation paths, and communication channels under controlled stress.
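The abort criteria in step (3) should be automated, not eyeballed. A minimal sketch of the check, using the example thresholds from the text (5% error rate, 10-second P99); the function name and signature are illustrative:

```python
# Automated abort check for a game day, using the thresholds from the text:
# abort if error rate exceeds 5% or P99 latency exceeds 10 seconds.

def should_abort(error_rate: float, p99_seconds: float,
                 max_error_rate: float = 0.05,
                 max_p99_seconds: float = 10.0) -> bool:
    return error_rate > max_error_rate or p99_seconds > max_p99_seconds

print(should_abort(error_rate=0.01, p99_seconds=1.8))   # False: keep going
print(should_abort(error_rate=0.08, p99_seconds=1.8))   # True: error-rate breach
print(should_abort(error_rate=0.01, p99_seconds=12.0))  # True: latency breach
```

In practice this check runs in a loop against live metrics for the duration of the experiment, and a True result triggers the rollback procedure immediately.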

Chaos Engineering Tools

Production-ready chaos engineering tools: (1) Litmus (CNCF) — Kubernetes-native chaos engineering platform. Injects pod kill, network loss, disk fill, and CPU stress via ChaosExperiment CRDs. Integrates with Argo Workflows for automated chaos pipelines. (2) Chaos Mesh (CNCF) — another Kubernetes-native tool with a web dashboard. Supports time chaos (clock skew), IO chaos (filesystem faults), and kernel chaos (kernel panic injection). (3) Gremlin — commercial chaos engineering platform with a SaaS control plane. Supports infrastructure (CPU, memory, disk, network), application (process kill, time travel), and state (shutdown, reboot) attacks across Kubernetes, VMs, and bare metal. (4) AWS Fault Injection Simulator (FIS) — AWS-native chaos tool that injects faults into EC2, ECS, EKS, and RDS. Supports terminating instances, throttling API calls, and simulating AZ failures. Integrated with CloudWatch for automatic rollback. (5) Toxiproxy (Shopify) — TCP proxy that simulates network conditions. Add latency, jitter, bandwidth limits, or connection resets to any TCP connection. Useful for testing application behavior under degraded network conditions in development and CI.
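As an example of the Toxiproxy pattern, the sketch below builds the JSON payloads for creating a proxy in front of Redis and attaching a latency toxic. The field names follow the Toxiproxy documentation (its control API listens on port 8474 by default), but the proxy name, ports, and helper functions here are assumptions for illustration; the sketch builds payloads only and does not require a running Toxiproxy server.

```python
# Payloads for Toxiproxy's HTTP control API (assumed default port 8474).
# Ports and the proxy name are placeholders for illustration.

def proxy_payload(name: str, listen: str, upstream: str) -> dict:
    """Proxy definition: clients connect to `listen`, traffic is relayed
    to `upstream` with any configured toxics applied."""
    return {"name": name, "listen": listen, "upstream": upstream}

def latency_toxic(latency_ms: int, jitter_ms: int = 0) -> dict:
    """Latency toxic applied to the downstream direction."""
    return {"type": "latency", "stream": "downstream",
            "attributes": {"latency": latency_ms, "jitter": jitter_ms}}

# POST these to /proxies and /proxies/redis/toxics respectively.
print(proxy_payload("redis", "127.0.0.1:26379", "127.0.0.1:6379"))
print(latency_toxic(500, jitter_ms=50))
```

The application under test is then pointed at the proxy's listen address instead of Redis directly, so latency can be dialed up or down without touching application code.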

Building a Chaos Engineering Practice

Maturity progression: (1) Level 1 — manual experiments. Kill a pod in staging, observe what happens. No automation, no formal process. Good for building initial confidence. (2) Level 2 — planned game days in production. Monthly game days with specific hypotheses and documented results. Cross-team participation. (3) Level 3 — automated chaos in CI/CD. Chaos experiments run automatically as part of the deployment pipeline. A canary deployment includes an automated chaos test (kill 1 pod, inject 200ms latency) and verifies the service maintains its SLOs before promoting. (4) Level 4 — continuous chaos in production. Chaos Monkey runs continuously. The system is designed to tolerate ongoing failures without human intervention. This is the Netflix model. Start at Level 1 and progress as the team builds confidence and the system builds resilience. Do not jump to Level 4 — continuous production chaos on a system not designed for it causes real outages.
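The Level 3 pattern (chaos as a canary gate) reduces to a simple promote/rollback decision. A sketch under stated assumptions: `inject_fault` and `measure_success_rate` are stand-ins for real pipeline steps, and the 99.9% SLO is the example figure, not a recommendation.

```python
from typing import Callable

# Sketch of the Level 3 pattern: run a chaos probe against a canary and
# promote only if the SLO held. The callbacks are pipeline stand-ins.

def chaos_gate(inject_fault: Callable[[], None],
               measure_success_rate: Callable[[], float],
               slo: float = 0.999) -> str:
    """Return 'promote' if the canary meets its SLO under fault, else 'rollback'."""
    inject_fault()                      # e.g. kill 1 pod, add 200ms latency
    observed = measure_success_rate()   # e.g. success rate over a 5-minute window
    return "promote" if observed >= slo else "rollback"

print(chaos_gate(lambda: None, lambda: 0.9995))  # promote
print(chaos_gate(lambda: None, lambda: 0.95))    # rollback
```

Keeping the decision as a pure function of observed metrics makes the gate easy to wire into any CI/CD system and easy to test without real infrastructure.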

Frequently Asked Questions

How does Netflix Chaos Monkey work and what does it test?

Chaos Monkey is a service that randomly terminates virtual machine instances in the Netflix production environment during business hours. It runs on a configurable schedule (typically every weekday during US business hours) and selects random instances from enabled application groups. When an instance is terminated, the application must continue serving traffic using its remaining instances. If the application cannot handle the loss of one instance (the load balancer does not detect the failure, auto-scaling does not replace it, remaining instances cannot handle the redistributed load), this reveals a resilience gap. Chaos Monkey tests three things: (1) Auto-scaling — does the system detect the missing instance and launch a replacement? (2) Load balancing — does the load balancer detect the unhealthy instance and stop routing traffic to it? (3) Statelessness — can the application function correctly with one fewer instance? If the terminated instance held critical state (session data, cache), the application should degrade gracefully, not fail. Netflix runs Chaos Monkey continuously, not as a periodic test. Every Netflix service is designed from day one with the assumption that any instance can be terminated at any time. This is a cultural practice as much as a technical tool.

What is the difference between chaos engineering and traditional testing?

Traditional testing (unit, integration, end-to-end) verifies that the system works correctly under expected conditions: you define inputs and expected outputs, and tests are deterministic and reproducible. Chaos engineering tests system behavior under unexpected conditions — conditions that are difficult or impossible to reproduce in a test environment. Key differences: (1) Scope — traditional tests verify individual components or interactions; chaos experiments test the entire system including infrastructure, networking, and dependencies. (2) Environment — traditional tests run in isolated test environments; chaos experiments run in production (or production-like staging) to capture real-world complexity. (3) Hypothesis — traditional tests assert specific outcomes; chaos experiments test a hypothesis about steady-state behavior (the system will maintain 99.9% availability when one database replica fails). (4) Discovery — traditional tests verify known requirements; chaos experiments discover unknown failure modes (cascading failures, race conditions, capacity limits) that were not anticipated during design. (5) Frequency — traditional tests run on every commit; chaos experiments run periodically or continuously depending on the organization's maturity level. The two approaches are complementary, not competing. A system needs both traditional tests and chaos experiments.

How do you safely run chaos experiments in production without causing outages?

Safe chaos experiment execution requires: (1) Define abort criteria before starting — specific metric thresholds that trigger immediate experiment termination. Example: if error rate exceeds 2% or P99 latency exceeds 5 seconds, abort immediately. Automate the abort: the chaos tool should monitor metrics and stop the experiment if thresholds are breached. (2) Start with the smallest blast radius — inject faults into a single instance, a single availability zone, or a small percentage of traffic. Never start with region-level or full-cluster experiments. (3) Run during business hours when the team is available to respond, not at 3 AM. If the experiment reveals a real problem, engineers are awake and available to mitigate. (4) Have a rollback plan — know exactly how to stop the experiment and restore normal conditions. For Kubernetes pod kills, the scheduler replaces the pod. For network faults injected via tc or iptables, have the removal command ready. (5) Communicate — notify the team before running the experiment. Post in the engineering Slack channel. If the experiment causes visible impact, the on-call engineer should know it is an experiment, not a real incident. (6) Document results — whether the system passed or failed, record what was tested, what was observed, and what follow-up actions are needed.

What are game days and how do you run one effectively?

A game day is a structured, team-wide chaos experiment with a specific learning objective. Unlike automated chaos (Chaos Monkey running continuously), game days are planned events where the team gathers to observe system behavior under controlled failure conditions. Running an effective game day: (1) Planning (1-2 weeks before) — define the hypothesis to test, select the failure to inject, set blast radius limits and abort criteria, ensure monitoring dashboards are ready, and schedule the experiment during business hours with all relevant engineers available. (2) Pre-experiment briefing (30 minutes before) — review the plan with the team, assign roles (experiment operator, metric observer, incident commander if things go wrong), and confirm abort procedures. (3) Execution (30-60 minutes) — inject the failure, observe system behavior in real time on dashboards, note what alerts fire and when, and watch how automated systems respond (auto-scaling, circuit breakers, retries). (4) Post-experiment review (immediately after) — was the hypothesis confirmed or disproved? What worked well? What gaps were discovered? Create action items for fixes. (5) Follow-up — track action items to completion and schedule the next game day to verify fixes. Run game days monthly for critical services; quarterly is sufficient for lower-risk services.