
Chaos Engineering: Building Resilient Distributed Systems

Introduction

Distributed systems fail in unpredictable ways. Network partitions, resource exhaustion, cascading failures, and slow dependencies break assumptions baked into your code. Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. Unlike traditional testing, which validates known paths, chaos engineering surfaces unknown failure modes through controlled, hypothesis-driven experiments.

Netflix pioneered this approach with Chaos Monkey in 2011. Since then the practice has matured into a formal engineering discipline with tooling, principles, and production-proven workflows.

Core Principles

Chaos engineering rests on a set of principles formalized by Casey Rosenthal, Nora Jones, and others on the Netflix resilience team, published as the Principles of Chaos Engineering:

  1. Define steady state: measurable metrics that represent normal behavior
  2. Hypothesize about steady state: predict the system will remain stable under perturbation
  3. Introduce realistic failures: simulate events that happen in production
  4. Minimize blast radius: start small, verify, then expand
  5. Automate experiments: run continuously, not as one-off drills

The steady-state hypothesis is the central idea: you define what “healthy” looks like, introduce a failure, and measure whether the system maintains that health. If it does, your resilience hypothesis holds. If not, you found a weakness.

Steady-State Hypothesis in Practice

Define system health with concrete metrics before any experiment:

STEADY_STATE = {
    "availability": {"target": 0.999, "floor": 0.995},  # SLO target and hard floor
    "p99_latency_ms": {"api": 200, "db": 50, "cache": 5},
    "error_rate": {"5xx": 0.0, "4xx_above_baseline": 0.0},
    "throughput_rps": {"min": 1000, "max": 5000},
}

def steady_state_healthy(metrics: dict) -> bool:
    # Latency and 5xx checks shown; availability and throughput checks
    # follow the same pattern and are omitted for brevity.
    for service, threshold in STEADY_STATE["p99_latency_ms"].items():
        if metrics.get(f"latency.{service}", 0) > threshold:
            return False
    if metrics.get("error_rate.5xx", 0) > 0:
        return False
    return True
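
For example, a snapshot within all thresholds passes while one that breaches the API latency threshold fails (metric names follow the latency.<service> convention used above):

assert steady_state_healthy({"latency.api": 150, "error_rate.5xx": 0}) is True
# 250 ms exceeds the 200 ms p99 threshold for the api service
assert steady_state_healthy({"latency.api": 250, "error_rate.5xx": 0}) is False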

Why Chaos Engineering Matters

Traditional testing confirms known behavior. Unit tests verify functions. Integration tests verify service contracts. Load tests verify throughput under expected load. None of these validate how the system behaves when a dependency silently degrades, a pod crashes mid-request, or DNS resolution intermittently fails.

Chaos engineering reveals:

  • Weak or missing fallbacks: does the app handle timeouts or just hang?
  • Incorrect retry logic: retry storms kill systems faster than the original failure
  • Brittle configuration: hardcoded endpoints, tight timeouts, missing circuit breakers
  • Monitoring blind spots: if you can’t see the failure, you can’t respond

Production incidents follow a pattern: a single fault cascades through hidden dependencies. Chaos experiments reproduce that cascade in a controlled way so you can fix the root causes proactively.
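
The retry-storm failure mode deserves a concrete countermeasure. Below is a minimal sketch of capped exponential backoff with full jitter, the standard defense against synchronized retries; call_with_backoff is a hypothetical helper, not part of any tool covered here.

import random
import time

def call_with_backoff(fn, max_attempts=4, base_delay=0.1, cap=2.0):
    """Retry fn with capped exponential backoff and full jitter.

    Randomizing each delay desynchronizes clients, so a downstream
    blip does not turn into a retry storm.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the failure
            # Full jitter: sleep a random amount up to the capped backoff
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))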

Experiment Lifecycle

Every chaos experiment follows a six-stage lifecycle:

  1. Plan โ€” define the hypothesis and steady-state metrics
  2. Design โ€” choose failure type, scope, duration, and blast radius
  3. Approve โ€” review and schedule; production experiments need sign-off
  4. Execute โ€” inject the failure, monitor in real time
  5. Analyze โ€” compare observed state against steady-state hypothesis
  6. Remediate โ€” fix gaps, update runbooks, expand experiment scope

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class ExperimentStatus(Enum):
    PLANNED = "planned"
    RUNNING = "running"
    PASSED = "passed"
    FAILED = "failed"
    ABORTED = "aborted"

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str
    steady_state: dict
    target: str
    failure_type: str
    duration_seconds: int
    blast_radius: str
    status: ExperimentStatus = ExperimentStatus.PLANNED
    started_at: Optional[datetime] = None
    ended_at: Optional[datetime] = None

    def run(self, metrics_provider):
        """Execute the experiment; inject_failure is supplied by the
        fault-injection tooling (see the tools section below)."""
        self.status = ExperimentStatus.RUNNING
        self.started_at = datetime.now()
        baseline = metrics_provider.snapshot()
        inject_failure(self.target, self.failure_type, self.duration_seconds)
        observed = metrics_provider.snapshot()
        self.ended_at = datetime.now()
        self.status = (
            ExperimentStatus.PASSED
            if self._hypothesis_holds(baseline, observed)
            else ExperimentStatus.FAILED
        )
        return self.status

    def _hypothesis_holds(self, baseline, observed):
        # Each steady-state entry is an upper bound the observed metric
        # must stay under for the hypothesis to hold.
        for metric, threshold in self.steady_state.items():
            if observed.get(metric, 0) > threshold:
                return False
        return True
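
A minimal usage sketch, with a stub metrics provider and a no-op inject_failure standing in for real fault-injection tooling (both hypothetical):

def inject_failure(target, failure_type, duration_seconds):
    pass  # no-op stand-in; real injection comes from the tools below

class StubMetrics:
    def snapshot(self) -> dict:
        return {"error_rate.5xx": 0.0, "latency.api": 120}

experiment = ChaosExperiment(
    name="payment-pod-delete",
    hypothesis="Killing one payment pod keeps the 5xx rate at zero",
    steady_state={"error_rate.5xx": 0.0, "latency.api": 200},
    target="payment-service",
    failure_type="pod-delete",
    duration_seconds=30,
    blast_radius="single-pod",
)
print(experiment.run(StubMetrics()))  # ExperimentStatus.PASSED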

Chaos Engineering Tools

| Tool | Type | Installation | Kubernetes | Production Ready | Steady-State Verification |
|------|------|--------------|------------|------------------|---------------------------|
| Chaos Monkey | Instance termination | Spinnaker-integrated | Limited | Yes | Manual |
| LitmusChaos | Cloud-native CRD | Helm chart | Native | Yes | Probes & metrics |
| Chaos Mesh | Cloud-native CRD | Helm chart / Operator | Native | Yes | Stress + network + IO |
| Gremlin | SaaS + agent | Agent install | Yes | Yes | Built-in health checks |
| AWS FIS | AWS managed | Console / SDK | EKS only | Yes | CloudWatch integration |
| Toxiproxy | TCP proxy | Binary / Docker | Sidecar | Staging only | None |

LitmusChaos

LitmusChaos uses Kubernetes Custom Resource Definitions to define experiments. It runs as a set of controllers and operators inside the cluster.

# pod-delete-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-delete
  namespace: chaos
spec:
  appinfo:
    appns: payment
    applabel: "app=payment-service"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  engineState: active
  jobCleanUpPolicy: delete
  experiments:
  - name: pod-delete
    spec:
      probe:
      - name: check-payment-availability
        type: httpProbe
        mode: Continuous
        httpProbe/inputs:
          url: "http://payment-service:8080/health"
          insecureSkipVerify: true
          method:
            get:
              criteria: ==
              responseCode: "200"
        runProperties:
          probeTimeout: 5
          interval: 2
          retry: 1
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '30'
        - name: CHAOS_INTERVAL
          value: '10'
        - name: FORCE
          value: 'true'
        - name: PODS_AFFECTED_PERC
          value: '50'
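
Apply the manifest with kubectl apply -f pod-delete-experiment.yaml. Litmus records the outcome in a ChaosResult resource; the CI pipeline later in this post reads its verdict to pass or fail the build.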

Chaos Mesh

Chaos Mesh provides fine-grained fault types including network partition, pod kill, IO delay, and DNS chaos.

# network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-to-db-partition
  namespace: chaos
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - payment
    labelSelectors:
      app: payment-service
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - database
      labelSelectors:
        app: postgres
  duration: 30s
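Apply one-off runs with kubectl apply -f network-partition.yaml. Chaos Mesh 2.x removed the older inline scheduler field; to repeat the fault on a cron such as @every 10m, embed this spec in a Schedule resource instead.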

Gremlin (via CLI)

Gremlin offers a SaaS-managed platform with a CLI and API for fault injection.

# Attack a Kubernetes deployment with CPU exhaustion
$ gremlin attack kubernetes \
  --deployment payment-service \
  --cpu \
  --cpu-core-count 2 \
  --cpu-capacity-percent 80 \
  --length 60

# Attack with network latency
$ gremlin attack kubernetes \
  --deployment payment-service \
  --container payment \
  --latency \
  --latency-ms 500 \
  --length 30

AWS Fault Injection Simulator

AWS FIS integrates with CloudWatch alarms as stop conditions that halt an experiment automatically.

# aws-fis-experiment.yaml
# Properties for an AWS::FIS::ExperimentTemplate CloudFormation resource
Description: "Terminate EC2 instance in ASG"
RoleArn: "arn:aws:iam::123456789012:role/fis-experiment-role"  # placeholder
Targets:
  Instances:
    ResourceType: aws:ec2:instance
    SelectionMode: COUNT(1)
    ResourceTags:
      Environment: production
Actions:
  terminateInstances:
    ActionId: aws:ec2:terminate-instances
    Targets:
      Instances: Instances  # references the target set defined above
StopConditions:
  - Source: aws:cloudwatch:alarm
    Value: "arn:aws:cloudwatch:us-east-1:123456789012:alarm:payment-error-rate-high"  # alarm ARN (placeholder account/region)

Blast Radius Control

Blast radius is the scope of damage an experiment can cause if it goes wrong. Control it with layered constraints:

  • Scope: target a single pod, not a deployment; a single AZ, not a region
  • Duration: short experiments (30-60 seconds) with automatic rollback
  • Health checks: abort the experiment if key metrics cross thresholds
  • User impact: target only during low-traffic windows for production experiments

Blast Radius Configuration

# blast-radius-config.yaml
blastRadius:
  maxTargets: 1
  maxPercentage: 5
  allowedNamespaces:
    - staging
    - chaos
  blockedNamespaces:
    - production-critical
  autoAbort:
    enabled: true
    conditions:
      - metric: error_rate
        threshold: 0.01
        window: 30s
      - metric: p99_latency
        threshold: 1000
        window: 30s

def should_abort(metrics: dict, guards: list) -> bool:
    for guard in guards:
        value = metrics.get(guard["metric"], 0)
        if value > guard["threshold"]:
            print(f"[ABORT] {guard['metric']} = {value} > {guard['threshold']}")
            return True
    return False
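
For example, with guards mirroring the autoAbort conditions in the YAML above:

guards = [
    {"metric": "error_rate", "threshold": 0.01, "window": "30s"},
    {"metric": "p99_latency", "threshold": 1000, "window": "30s"},
]
# error_rate of 0.02 breaches the 0.01 guard, so the experiment aborts
print(should_abort({"error_rate": 0.02, "p99_latency": 450}, guards))  # True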

Automated vs Manual Experiments

| Aspect | Manual | Automated |
|--------|--------|-----------|
| Frequency | Monthly / per-release | Continuous (every hour / every deploy) |
| Approval | Human-in-loop | GitOps / policy-driven |
| Scope | Broad, exploratory | Narrow, regression |
| Response | Engineer investigates | Auto-remediate or page |
| Tooling | Ad-hoc scripts | GameDay platform (Litmus, Chaos Mesh) |

Automated experiments shine for known failure modes. Run pod-delete chaos on every deploy to verify your service recovers without intervention. Reserve manual experiments for novel failure scenarios where you need human analysis.

# Run automated chaos suite after each deployment
$ litmusctl create chaos-delegate --mode cluster
$ litmusctl create chaos-workflow \
  --name post-deploy-chaos \
  --experiment-file experiments/pod-delete.yaml

Production vs Staging

Staging chaos validates that your code handles failures. Production chaos validates that your system handles failures, including real traffic patterns, real data volumes, and real network conditions.

Staging-first progression:

  1. Run experiments in a dedicated staging environment
  2. Move to a canary namespace in production (shadow traffic only)
  3. Run on a single non-critical service in production
  4. Expand after each step passes

Never run experiments in production without the following safeguards (a minimal preflight gate is sketched after this list):

  • A documented hypothesis and expected outcome
  • Health check probes that can abort the experiment
  • A runbook for manual rollback
  • On-call engineer notified before execution
  • Time-bounded experiment with enforced max duration
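
A minimal preflight gate enforcing this checklist might look like the sketch below; the field names are illustrative, not taken from any particular tool.

REQUIRED_FIELDS = ("hypothesis", "expected_outcome", "rollback_runbook")

def preflight_ok(experiment: dict) -> bool:
    # Block production experiments that skip any safeguard above
    missing = [f for f in REQUIRED_FIELDS if not experiment.get(f)]
    if missing:
        print(f"[BLOCKED] missing: {', '.join(missing)}")
        return False
    if not experiment.get("abort_probes"):  # health checks that can halt the run
        print("[BLOCKED] no abort probes configured")
        return False
    if not experiment.get("oncall_acknowledged"):  # on-call notified beforehand
        print("[BLOCKED] on-call has not acknowledged")
        return False
    if experiment.get("max_duration_seconds", 0) <= 0:  # enforced time bound
        print("[BLOCKED] no enforced max duration")
        return False
    return True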

Observability Integration

Chaos experiments are useless without observability. You need real-time visibility into the system to validate steady state and detect anomalies.

# prometheus-metrics-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: chaos-exporter
  endpoints:
  - port: metrics
    interval: 5s
  namespaceSelector:
    matchNames:
      - chaos
      - staging

import time
from prometheus_api_client import PrometheusConnect

PROM = PrometheusConnect(url="http://prometheus:9090", disable_ssl=True)

def monitor_experiment(experiment_name: str, duration: int, interval: int = 5):
    start = time.time()
    while time.time() - start < duration:
        error_rate = PROM.custom_query(
            'sum(rate(http_requests_total{status=~"5.."}[1m]))'
        )
        latency = PROM.custom_query(
            'histogram_quantile(0.99, '
            'sum(rate(http_request_duration_seconds_bucket[1m])) by (le))'
        )
        print(f"[{experiment_name}] error_rate={error_rate}, p99_latency={latency}")
        # custom_query returns a list of samples; each "value" is (timestamp, str)
        if error_rate and float(error_rate[0]["value"][1]) > 0.01:
            print("[ALERT] Error rate exceeded threshold - aborting")
            return False
        time.sleep(interval)
    return True

Rollback Strategies

Every chaos experiment needs a rollback plan, even when the hypothesis is expected to hold.

| Strategy | Mechanism | Speed | Complexity |
|----------|-----------|-------|------------|
| Time-bound | Experiment self-terminates after T seconds | Instant | None |
| Metric-based | Abort when error rate / latency crosses threshold | Seconds | Medium |
| Manual halt | Engineer issues stop command | Seconds to minutes | Low |
| Git revert | Revert experiment config in GitOps pipeline | Minutes | High |

#!/bin/bash
# rollback-chaos.sh - halt all active experiments
set -e

echo "[ROLLBACK] Stopping all chaos experiments..."

# LitmusChaos
kubectl delete chaosengine --all -n chaos 2>/dev/null || true

# Chaos Mesh
kubectl delete networkchaos --all -n chaos 2>/dev/null || true
kubectl delete podchaos --all -n chaos 2>/dev/null || true
kubectl delete stresschaos --all -n chaos 2>/dev/null || true

echo "[ROLLBACK] Verifying system health..."
kubectl get pods -l app=payment-service -n payment

echo "[ROLLBACK] Complete"

Building a Chaos Experiment Pipeline

A mature chaos practice integrates into CI/CD. Here is a GitHub Actions workflow that runs chaos experiments after staging deployment:

# .github/workflows/chaos-pipeline.yaml
name: Post-Deploy Chaos
on:
  deployment_status:
jobs:
  chaos:
    runs-on: ubuntu-latest
    # deployment_status does not support event filtering, so gate in the job
    if: github.event.deployment_status.state == 'success' && github.event.deployment.environment == 'staging'
    steps:
    - uses: actions/checkout@v4  # needed for git rev-parse and the experiment file
    - name: Install Litmus CLI
      run: |
        curl -sSL https://litmusctl.sh | bash
    - name: Run pod-delete experiment
      run: |
        litmusctl create chaos-workflow \
          --name "post-deploy-$(git rev-parse --short HEAD)" \
          --experiment-file experiments/pod-delete.yaml \
          --namespace chaos
    - name: Wait for experiment completion
      run: |
        sleep 60
        verdict=$(kubectl get chaosresult -n chaos -o json \
          | jq -r '.items[].status.experimentStatus.verdict')
        echo "verdict: ${verdict}"
        # Fail the job unless every experiment passed so the notify step fires
        [ "${verdict}" = "Pass" ] || exit 1
    - name: Notify on failure
      if: failure()
      run: |
        curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
          -H 'Content-Type: application/json' \
          -d '{"text": "Chaos experiment FAILED for staging deploy"}'

Writing a Custom Fault Injection in Go

For failure types not covered by existing tools, write a lightweight injector:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func failHandler(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "injected failure", http.StatusInternalServerError)
}

func latencyHandler(w http.ResponseWriter, r *http.Request) {
	time.Sleep(2 * time.Second)
	io.WriteString(w, "delayed response")
}

func main() {
	failureType := os.Getenv("FAILURE_TYPE")
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	})

	switch failureType {
	case "http-500":
		mux.HandleFunc("/api/", failHandler)
	case "latency":
		mux.HandleFunc("/api/", latencyHandler)
	default:
		mux.HandleFunc("/api/", func(w http.ResponseWriter, r *http.Request) {
			io.WriteString(w, "normal response")
		})
	}

	fmt.Printf("Injector running on :%s with failure=%s\n", port, failureType)
	if err := http.ListenAndServe(":"+port, mux); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}

Deploy as a sidecar or a separate service, then route traffic through it to simulate targeted failures.
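
For a quick local check, run the program with FAILURE_TYPE=latency and request /api/; responses should arrive after the injected two-second delay while /health stays fast.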

Conclusion

Chaos engineering is not about breaking things; it is about proving your system can survive when things break. Start with a single hypothesis, a small blast radius, and a clear steady-state definition. Automate experiments that validate known failure modes. Integrate observability so you can see what happens. Move to production only after staging validation builds confidence.

The goal is not zero failures. The goal is zero surprise failures.
