
Network Chaos & Failure Simulation Tool

Difficulty: Advanced
Team Size: 3-5 people
Time: ~30-40 hours
Demo-ready by: Step 5
Prerequisites: Node.js, Docker, Linux networking, distributed systems
Built by: Chaos Monkey (Netflix), Gremlin, Litmus, Pumba

Skills you'll earn: Failure injection, network simulation (tc/iptables), container orchestration, resilience metrics, experiment design

Start by killing a container on command. End with a chaos engineering platform that injects failures, simulates network conditions, and measures system resilience.

Step 1: Kill a container on command (~2-3 hours)

The simplest chaos: make something die.

  • Build a CLI tool that takes a container name and kills it via the Docker API
  • Verify the container is gone: docker ps should not list it
  • Observe what happens to the system that depended on it
  • Log the kill: timestamp, target, method

You now have: Manual failure injection.
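
A minimal sketch in TypeScript, assuming the dockerode client (any Docker Engine API client works); the file name and log format are just examples:

```typescript
// kill.ts - kill a named container via the Docker API and log the action.
import Docker from "dockerode";

const docker = new Docker(); // defaults to /var/run/docker.sock

async function killContainer(name: string): Promise<void> {
  // getContainer accepts a container name or ID
  await docker.getContainer(name).kill(); // SIGKILL by default
  console.log(JSON.stringify({
    ts: new Date().toISOString(),
    target: name,
    method: "SIGKILL",
  }));
}

const target = process.argv[2];
if (!target) {
  console.error("usage: kill.ts <container-name>");
  process.exit(1);
}
killContainer(target).catch((err) => {
  console.error(`failed to kill ${target}: ${err.message}`);
  process.exit(1);
});
```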

Step 2: Random pod/container killing (~3-4 hours)

Killing a known target is predictable. Real failures are not.

  • List all running containers, pick one at random, kill it
  • Run this on a schedule (every N minutes) or as a one-shot
  • Add filters: only target containers with a specific label (opt-in chaos)
  • Avoid killing infrastructure containers (the chaos tool itself, the database)

You now have: Random failure injection.
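
A sketch of the random killer, again assuming dockerode; the chaos.enabled label and the exclusion list are conventions you'd define yourself:

```typescript
// random-kill.ts - pick a random opt-in container and kill it.
import Docker from "dockerode";

const docker = new Docker();
const EXCLUDED_IMAGES = ["postgres", "chaos-tool"]; // never kill infrastructure

async function killRandom(): Promise<void> {
  // Only containers that explicitly opted in via a label are candidates
  const candidates = (await docker.listContainers({
    filters: { label: ["chaos.enabled=true"] },
  })).filter((c) => !EXCLUDED_IMAGES.some((img) => c.Image.includes(img)));

  if (candidates.length === 0) {
    console.log("no eligible targets");
    return;
  }

  const victim = candidates[Math.floor(Math.random() * candidates.length)];
  await docker.getContainer(victim.Id).kill();
  console.log(JSON.stringify({
    ts: new Date().toISOString(),
    target: victim.Names[0].replace(/^\//, ""),
    method: "SIGKILL",
  }));
}

// One-shot by default; wrap in setInterval or cron for scheduled chaos.
killRandom().catch(console.error);
```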

Step 3: Network latency injection (~3-4 hours)

Not all failures are crashes. Slow networks are worse — they cause timeouts and cascading retries.

  • Use tc (traffic control) to add latency to a container's network interface
  • Configure: target container, latency (ms), jitter, duration
  • Also support packet loss: drop N% of packets
  • Remove the rules after the experiment ends

You now have: Network degradation simulation.
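
One way to drive tc from the tool, sketched below. It shells out to docker exec, which assumes the target container ships the tc binary and has the NET_ADMIN capability; tools like Pumba instead attach a helper container to the target's network namespace. The container name and parameters are illustrative:

```typescript
// netem.ts - add latency and packet loss to a container's eth0, then clean up.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function degradeNetwork(target: string, opts: {
  latencyMs: number; jitterMs: number; lossPct: number; durationMs: number;
}): Promise<void> {
  // tc qdisc add dev eth0 root netem delay <latency> <jitter> loss <pct>%
  await run("docker", ["exec", target,
    "tc", "qdisc", "add", "dev", "eth0", "root", "netem",
    "delay", `${opts.latencyMs}ms`, `${opts.jitterMs}ms`,
    "loss", `${opts.lossPct}%`,
  ]);
  try {
    await new Promise((r) => setTimeout(r, opts.durationMs));
  } finally {
    // Always remove the rules when the experiment ends
    await run("docker", ["exec", target, "tc", "qdisc", "del", "dev", "eth0", "root"]);
  }
}

degradeNetwork("payments-api", { latencyMs: 200, jitterMs: 50, lossPct: 5, durationMs: 60_000 })
  .catch(console.error);
```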

Step 4: Network partitions (~3-4 hours)

Two services can't reach each other. Does the system handle it?

  • Use iptables to block traffic between specific containers
  • Simulate a split-brain: partition the cluster into two groups that can't communicate
  • Hold the partition for a configurable duration, then heal it
  • Log the partition: which services were isolated, for how long

You now have: Network partition simulation.
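
A sketch of a two-container partition, assuming the tool runs on the Docker host with permission to modify iptables; it drops traffic in both directions on Docker's DOCKER-USER chain and heals by deleting the same rules. Container names are illustrative:

```typescript
// partition.ts - block traffic between two containers, hold, then heal.
import Docker from "dockerode";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const docker = new Docker();
const run = promisify(execFile);

async function containerIp(name: string): Promise<string> {
  const info = await docker.getContainer(name).inspect();
  return Object.values(info.NetworkSettings.Networks)[0].IPAddress;
}

async function partition(a: string, b: string, durationMs: number): Promise<void> {
  const [ipA, ipB] = await Promise.all([containerIp(a), containerIp(b)]);
  // Drop packets in both directions
  const rules = [
    ["-I", "DOCKER-USER", "-s", ipA, "-d", ipB, "-j", "DROP"],
    ["-I", "DOCKER-USER", "-s", ipB, "-d", ipA, "-j", "DROP"],
  ];
  for (const rule of rules) await run("iptables", rule);
  console.log(JSON.stringify({ ts: new Date().toISOString(), partitioned: [a, b], durationMs }));
  try {
    await new Promise((r) => setTimeout(r, durationMs));
  } finally {
    // Heal the partition: -D mirrors -I with the same rule spec
    for (const rule of rules) await run("iptables", ["-D", ...rule.slice(1)]);
  }
}

partition("orders", "inventory", 120_000).catch(console.error);
```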

Step 5: Experiment definitions and scheduling (~3-4 hours)

Ad-hoc chaos is useful for exploration. Repeated experiments need structure.

  • Define experiments in YAML: target, failure type, parameters, duration, steady-state hypothesis
  • Steady-state hypothesis: "the API returns 200 within 500ms" — checked before and after
  • If the hypothesis holds after chaos, the system is resilient. If not, you found a bug.
  • Schedule experiments: daily, weekly, or on-demand

You now have: Structured chaos experiments.
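
One shape the YAML could take; the field names below are an assumption, not a fixed schema:

```yaml
# experiments/kill-payments.yaml - illustrative experiment definition
name: payments-api-kill
schedule: "0 3 * * *"            # daily at 03:00; omit for on-demand runs
steady_state:
  probe: http
  url: http://payments-api:8080/health
  expect_status: 200
  max_latency_ms: 500            # checked before and after the failure
failure:
  type: container-kill           # or: latency, packet-loss, partition
  target:
    label: app=payments-api
duration: 5m
```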

Step 6: Dashboard and reporting (~3-4 hours)

  • Web UI showing experiment history, results (pass/fail), and affected services
  • Real-time view during an experiment: what's broken, system metrics
  • Link to Grafana dashboards for correlated metrics
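
One possible record shape for a finished run, which is what the history view and pass/fail report render; all field names are illustrative:

```typescript
// The experiment-run record stored after each run and shown in the dashboard.
interface ExperimentRun {
  experiment: string;               // e.g. "payments-api-kill"
  startedAt: string;                // ISO 8601 timestamps
  finishedAt: string;
  failureType: "container-kill" | "latency" | "packet-loss" | "partition";
  affectedServices: string[];
  steadyStateBefore: boolean;       // hypothesis held before injection
  steadyStateAfter: boolean;        // hypothesis held after injection
  result: "pass" | "fail" | "aborted";
  grafanaUrl?: string;              // link to correlated metrics
}
```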

Step 7: Blast radius control (~3-4 hours)

  • Abort experiments automatically if error rates exceed a safety threshold
  • Scope experiments to non-production environments first
  • Gradual rollout: start with 1% of traffic affected, then increase
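
A sketch of the automatic abort check, with placeholder hooks for the error-rate metric (e.g. a Prometheus query) and the cleanup routine from earlier steps:

```typescript
// watchdog.ts - poll an error-rate metric during an experiment and abort it
// if the rate crosses a safety threshold.
function watchBlastRadius(
  fetchErrorRate: () => Promise<number>, // placeholder: your metrics query
  abortExperiment: () => Promise<void>,  // placeholder: stop injection, heal rules
  threshold = 0.05,
  intervalMs = 5_000,
): NodeJS.Timeout {
  const timer = setInterval(async () => {
    const rate = await fetchErrorRate();
    if (rate > threshold) {
      clearInterval(timer);
      console.warn(`error rate ${rate.toFixed(3)} > ${threshold}, aborting experiment`);
      await abortExperiment();
    }
  }, intervalMs);
  return timer; // clear it yourself when the experiment ends normally
}
```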

Where to go from here

  • CPU and memory stress injection (exhaust resources without killing)
  • Disk I/O chaos (slow reads/writes, fill disk)
  • DNS failure simulation (return NXDOMAIN for specific services)
  • GameDay facilitation (guided chaos exercises for teams)
  • Integration with your CI/CD pipeline (run chaos tests before deploy)