Question 1

What is the scientific method applied to chaos engineering?

Accepted Answer

Chaos engineering follows the scientific method: (1) define steady state: the measurable system behavior that indicates normal operation (for example, p99 latency < 200ms and error rate < 0.1%); (2) form a hypothesis, such as 'We believe the payment service will continue processing orders if the inventory service is unavailable, because we have a circuit breaker'; (3) design the experiment: choose the failure to inject (here, the inventory service returning 503); (4) run and observe: monitor the steady-state metrics during injection; (5) analyze: did steady state hold? If yes, the hypothesis is supported. If no, a weakness has been discovered. The result is either confidence that the system handles this failure or a concrete bug to fix. Chaos engineering without a hypothesis is just random destruction.
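The five steps above can be sketched as a small harness. This is a minimal illustration, not a reference implementation: `Sample`, `SteadyState`, `holds`, and `run_experiment` are hypothetical names, the thresholds are the example values from the answer, and `inject`, `rollback`, and `observe` stand in for whatever fault-injection and metrics tooling is actually in use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Sample:
    """One observed request (hypothetical shape)."""
    latency_ms: float
    ok: bool


@dataclass(frozen=True)
class SteadyState:
    """Step 1: example thresholds taken from the answer above."""
    max_p99_ms: float = 200.0
    max_error_rate: float = 0.001  # 0.1%


def p99(latencies: Sequence[float]) -> float:
    """Nearest-rank 99th percentile, using integer math for the rank."""
    ordered = sorted(latencies)
    rank = max(1, -(-99 * len(ordered) // 100))  # ceil(0.99 * n)
    return ordered[rank - 1]


def holds(state: SteadyState, samples: Sequence[Sample]) -> bool:
    """True if the samples satisfy both steady-state thresholds."""
    if not samples:
        return False  # no data means we cannot claim steady state
    errors = sum(1 for s in samples if not s.ok)
    return (p99([s.latency_ms for s in samples]) < state.max_p99_ms
            and errors / len(samples) < state.max_error_rate)


def run_experiment(
    state: SteadyState,
    inject: Callable[[], None],
    rollback: Callable[[], None],
    observe: Callable[[], Sequence[Sample]],
) -> str:
    """Steps 3 to 5: inject, observe, always roll back, then analyze."""
    if not holds(state, observe()):
        return "skipped: not in steady state before injection"
    inject()
    try:
        during = observe()
    finally:
        rollback()  # runs even if observation raises
    return "hypothesis supported" if holds(state, during) else "weakness found"
```

Checking steady state before injecting keeps a pre-existing incident from being misread as an experiment result, and the `finally` block ensures the fault is removed however the observation step ends.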

Question 2

What types of failures can chaos engineering inject?

Accepted Answer

Network failures: packet loss, latency injection (100ms, 1s), and network partitions between specific services, using Linux tc (traffic control) or Toxiproxy. Process failures: kill containers or pods at random (Chaos Monkey), trigger OOM kills, crash specific service instances. Resource exhaustion: CPU spike (cpu-burn tool), memory pressure (memory-hog), disk fill, file descriptor exhaustion. Dependency failures: make a downstream service return HTTP 500 for 50% of requests or add 500ms of latency, which exercises circuit breakers and fallbacks. Data failures: corrupt a small percentage of writes, inject stale reads from a replica. Start with process failures (the easiest to implement and recover from); graduate to network and data failures as the practice matures.
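As an illustration of the dependency-failure case, the sketch below wraps a client call so that a configurable fraction of requests fail with HTTP 500 after an added delay. It is an in-process stand-in for a proxy-level tool, not the API of Toxiproxy or any other product; `with_faults`, the `(status, body)` response shape, and the injectable `rng` and `sleep` parameters are all assumptions made for the example.

```python
import random
import time
from typing import Callable, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status code, body): hypothetical shape


def with_faults(
    call: Callable[..., Response],
    error_rate: float = 0.5,       # fraction of calls answered with 500
    added_latency_s: float = 0.5,  # extra delay per call (500ms)
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Response]:
    """Return a wrapper around `call` that injects latency and 500 errors."""
    if rng is None:
        rng = random.Random()

    def wrapped(*args, **kwargs) -> Response:
        if added_latency_s > 0:
            sleep(added_latency_s)  # latency applies to every call
        if rng.random() < error_rate:
            return 500, "injected failure"  # the real dependency is not called
        return call(*args, **kwargs)

    return wrapped


# Usage: a fake inventory lookup that now fails about half the time.
def get_stock(sku: str) -> Response:
    return 200, f"{sku}: in stock"


flaky_get_stock = with_faults(get_stock, rng=random.Random(7), sleep=lambda s: None)
```

Passing `rng` and `sleep` in makes the fault reproducible and fast to exercise in tests; in a real experiment the defaults would apply and the caller's circuit breaker or fallback would be the thing under observation.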

Question 3

What is a GameDay in chaos engineering?

Accepted Answer

A GameDay is a scheduled chaos experiment session in which the on-call team runs planned failure scenarios and observes the system's response in real time. Format: 2-4 hours; pre-defined scenarios (AZ failure, primary database failover, CDN outage, 3x traffic spike); the team on a video call with live dashboards open; experiments run one at a time, with a pause to discuss after each. The goal is twofold: (1) test the system: does it handle the failure as designed? (2) test the team: can engineers detect the failure quickly? Do runbooks work? Is on-call tooling effective? GameDays build incident response muscle memory. Start GameDays in staging; once the practice is mature, run them quarterly in production.

Question 4

What observability must be in place before running chaos experiments in production?

Accepted Answer

Before injecting failures in production, have the following in place: (1) metrics: every service must emit request rate, error rate, and latency (the RED method), and you need to see the impact of injected failures within 30 seconds; (2) distributed tracing: to follow a failure through the call graph and identify which service propagated the error; (3) structured logging: to correlate events across services during an incident; (4) alerting: alerts must fire when steady state is violated so you can abort the experiment if it goes too far; (5) abort mechanism: a kill switch that stops chaos injection immediately. If you cannot observe the impact of a failure quickly and reliably, you cannot safely run chaos experiments. Fix observability first.
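Items (4) and (5) work together: a steady-state check decides when to abort, and the kill switch carries it out. The sketch below shows one possible shape for that loop. `run_guarded` is a hypothetical helper, the 1% abort threshold and 5-second polling interval are assumed values (chosen only so that a violation is noticed well inside the 30-second window mentioned above), and `read_error_rate` stands in for a query against a real metrics backend.

```python
import time
from typing import Callable


def run_guarded(
    start_fault: Callable[[], None],
    stop_fault: Callable[[], None],      # the kill switch
    read_error_rate: Callable[[], float],
    abort_above: float = 0.01,           # assumed abort threshold (1%)
    checks: int = 10,
    interval_s: float = 5.0,             # assumed polling interval
    wait: Callable[[float], None] = time.sleep,
) -> str:
    """Inject a fault, poll the error rate, and stop as soon as it is too high."""
    start_fault()
    try:
        for _ in range(checks):
            if read_error_rate() > abort_above:
                return "aborted"
            wait(interval_s)
        return "completed"
    finally:
        # Runs on abort, on normal completion, and on unexpected exceptions,
        # so the injected fault never outlives the experiment.
        stop_fault()
```

Putting `stop_fault()` in `finally` means the fault is also removed if the metrics query itself fails, which is the situation where an experiment is hardest to reason about.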

Chaos Engineering: Low-Level Design

The Chaos Engineering Hypothesis

Experiment Design

Define Steady State

Failure Injection Methods

Blast Radius Control

GameDay Practice

Observability Prerequisites

Tools