
Chaos Engineering: Building Resilient Distributed Systems

Introduction

Distributed systems fail in unpredictable ways. Network partitions, resource exhaustion, cascading failures, and slow dependencies break assumptions baked into your code. Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. Unlike traditional testing, which validates known paths, chaos engineering surfaces unknown failure modes through controlled, hypothesis-driven experiments.

Netflix pioneered this approach with Chaos Monkey in 2011. Since then the practice has matured into a formal engineering discipline with tooling, principles, and production-proven workflows.

Core Principles

Chaos engineering rests on a set of principles formalized by Casey Rosenthal, Nora Jones, and others on the Netflix resilience team, published as the Principles of Chaos Engineering:

  1. Define steady state: measurable metrics that represent normal behavior
  2. Hypothesize about steady state: predict the system will remain stable under perturbation
  3. Introduce realistic failures: simulate events that happen in production
  4. Minimize blast radius: start small, verify, then expand
  5. Automate experiments: run continuously, not as one-off drills

The steady-state hypothesis is the central idea: you define what “healthy” looks like, introduce a failure, and measure whether the system maintains that health. If it does, your resilience hypothesis holds. If not, you found a weakness.

Steady-State Hypothesis in Practice

Define system health with concrete metrics before any experiment:

STEADY_STATE = {
    "availability": {"target": 0.999, "floor": 0.995},  # SLO target and hard floor
    "p99_latency_ms": {"api": 200, "db": 50, "cache": 5},
    "error_rate": {"5xx": 0.0, "4xx_above_baseline": 0.0},
    "throughput_rps": {"min": 1000, "max": 5000},
}

def steady_state_healthy(metrics: dict) -> bool:
    # Latency and 5xx checks shown; availability and throughput checks
    # follow the same pattern and are omitted for brevity.
    for service, threshold in STEADY_STATE["p99_latency_ms"].items():
        if metrics.get(f"latency.{service}", 0) > threshold:
            return False
    if metrics.get("error_rate.5xx", 0) > 0:
        return False
    return True
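
For example, a snapshot within all thresholds passes while one that breaches the API latency threshold fails (metric names follow the latency.<service> convention used above):

assert steady_state_healthy({"latency.api": 150, "error_rate.5xx": 0}) is True
# 250 ms exceeds the 200 ms p99 threshold for the api service
assert steady_state_healthy({"latency.api": 250, "error_rate.5xx": 0}) is False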

Why Chaos Engineering Matters

Traditional testing confirms known behavior. Unit tests verify functions. Integration tests verify service contracts. Load tests verify throughput under expected load. None of these validate how the system behaves when a dependency silently degrades, a pod crashes mid-request, or DNS resolution intermittently fails.

Chaos engineering reveals:

  • Weak or missing fallbacks: does the app handle timeouts or just hang?
  • Incorrect retry logic: retry storms kill systems faster than the original failure
  • Brittle configuration: hardcoded endpoints, tight timeouts, missing circuit breakers
  • Monitoring blind spots: if you can’t see the failure, you can’t respond

Production incidents follow a pattern: a single fault cascades through hidden dependencies. Chaos experiments reproduce that cascade in a controlled way so you can fix the root causes proactively.
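
The retry-storm failure mode deserves a concrete countermeasure. Below is a minimal sketch of capped exponential backoff with full jitter, the standard defense against synchronized retries; call_with_backoff is a hypothetical helper, not part of any tool covered here.

import random
import time

def call_with_backoff(fn, max_attempts=4, base_delay=0.1, cap=2.0):
    """Retry fn with capped exponential backoff and full jitter.

    Randomizing each delay desynchronizes clients, so a downstream
    blip does not turn into a retry storm.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the failure
            # Full jitter: sleep a random amount up to the capped backoff
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))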

Experiment Lifecycle

Every chaos experiment follows a six-stage lifecycle:

  1. Plan โ€” define the hypothesis and steady-state metrics
  2. Design โ€” choose failure type, scope, duration, and blast radius
  3. Approve โ€” review and schedule; production experiments need sign-off
  4. Execute โ€” inject the failure, monitor in real time
  5. Analyze โ€” compare observed state against steady-state hypothesis
  6. Remediate โ€” fix gaps, update runbooks, expand experiment scope

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class ExperimentStatus(Enum):
    PLANNED = "planned"
    RUNNING = "running"
    PASSED = "passed"
    FAILED = "failed"
    ABORTED = "aborted"

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str
    steady_state: dict
    target: str
    failure_type: str
    duration_seconds: int
    blast_radius: str
    status: ExperimentStatus = ExperimentStatus.PLANNED
    started_at: Optional[datetime] = None
    ended_at: Optional[datetime] = None

    def run(self, metrics_provider):
        """Execute the experiment; inject_failure is supplied by the
        fault-injection tooling (see the tools section below)."""
        self.status = ExperimentStatus.RUNNING
        self.started_at = datetime.now()
        baseline = metrics_provider.snapshot()
        inject_failure(self.target, self.failure_type, self.duration_seconds)
        observed = metrics_provider.snapshot()
        self.ended_at = datetime.now()
        self.status = (
            ExperimentStatus.PASSED
            if self._hypothesis_holds(baseline, observed)
            else ExperimentStatus.FAILED
        )
        return self.status

    def _hypothesis_holds(self, baseline, observed):
        # Each steady-state entry is an upper bound the observed metric
        # must stay under for the hypothesis to hold.
        for metric, threshold in self.steady_state.items():
            if observed.get(metric, 0) > threshold:
                return False
        return True
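
A minimal usage sketch, with a stub metrics provider and a no-op inject_failure standing in for real fault-injection tooling (both hypothetical):

def inject_failure(target, failure_type, duration_seconds):
    pass  # no-op stand-in; real injection comes from the tools below

class StubMetrics:
    def snapshot(self) -> dict:
        return {"error_rate.5xx": 0.0, "latency.api": 120}

experiment = ChaosExperiment(
    name="payment-pod-delete",
    hypothesis="Killing one payment pod keeps the 5xx rate at zero",
    steady_state={"error_rate.5xx": 0.0, "latency.api": 200},
    target="payment-service",
    failure_type="pod-delete",
    duration_seconds=30,
    blast_radius="single-pod",
)
print(experiment.run(StubMetrics()))  # ExperimentStatus.PASSED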

Chaos Engineering Tools

| Tool | Type | Installation | Kubernetes | Production Ready | Steady-State Verification |
|------|------|--------------|------------|------------------|---------------------------|
| Chaos Monkey | Instance termination | Spinnaker-integrated | Limited | Yes | Manual |
| LitmusChaos | Cloud-native CRD | Helm chart | Native | Yes | Probes & metrics |
| Chaos Mesh | Cloud-native CRD | Helm chart / Operator | Native | Yes | Stress + network + IO |
| Gremlin | SaaS + agent | Agent install | Yes | Yes | Built-in health checks |
| AWS FIS | AWS managed | Console / SDK | EKS only | Yes | CloudWatch integration |
| Toxiproxy | TCP proxy | Binary / Docker | Sidecar | Staging only | None |

LitmusChaos

LitmusChaos uses Kubernetes Custom Resource Definitions to define experiments. It runs as a set of controllers and operators inside the cluster.

# pod-delete-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-delete
  namespace: chaos
spec:
  appinfo:
    appns: payment
    applabel: "app=payment-service"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  engineState: active
  jobCleanUpPolicy: delete
  experiments:
  - name: pod-delete
    spec:
      probe:
      - name: check-payment-availability
        type: httpProbe
        mode: Continuous
        httpProbe/inputs:
          url: "http://payment-service:8080/health"
          insecureSkipVerify: true
          method:
            get:
              criteria: ==
              responseCode: "200"
        runProperties:
          probeTimeout: 5
          interval: 2
          retry: 1
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '30'
        - name: CHAOS_INTERVAL
          value: '10'
        - name: FORCE
          value: 'true'
        - name: PODS_AFFECTED_PERC
          value: '50'
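
Apply the manifest with kubectl apply -f pod-delete-experiment.yaml. Litmus records the outcome in a ChaosResult resource; the CI pipeline later in this post reads its verdict to pass or fail the build.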

Chaos Mesh

Chaos Mesh provides fine-grained fault types including network partition, pod kill, IO delay, and DNS chaos.

# network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-to-db-partition
  namespace: chaos
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - payment
    labelSelectors:
      app: payment-service
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - database
      labelSelectors:
        app: postgres
  duration: 30s
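Apply one-off runs with kubectl apply -f network-partition.yaml. Chaos Mesh 2.x removed the older inline scheduler field; to repeat the fault on a cron such as @every 10m, embed this spec in a Schedule resource instead.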

Gremlin (via CLI)

Gremlin offers a SaaS-managed platform with a CLI and API for fault injection.

# Attack a Kubernetes deployment with CPU exhaustion
$ gremlin attack kubernetes \
  --deployment payment-service \
  --cpu \
  --cpu-core-count 2 \
  --cpu-capacity-percent 80 \
  --length 60

# Attack with network latency
$ gremlin attack kubernetes \
  --deployment payment-service \
  --container payment \
  --latency \
  --latency-ms 500 \
  --length 30

AWS Fault Injection Simulator

AWS FIS integrates with CloudWatch alarms as stop conditions that halt an experiment automatically.

# aws-fis-experiment.yaml
# Properties for an AWS::FIS::ExperimentTemplate CloudFormation resource
Description: "Terminate EC2 instance in ASG"
RoleArn: "arn:aws:iam::123456789012:role/fis-experiment-role"  # placeholder
Targets:
  Instances:
    ResourceType: aws:ec2:instance
    SelectionMode: COUNT(1)
    ResourceTags:
      Environment: production
Actions:
  terminateInstances:
    ActionId: aws:ec2:terminate-instances
    Targets:
      Instances: Instances  # references the target set defined above
StopConditions:
  - Source: aws:cloudwatch:alarm
    Value: "arn:aws:cloudwatch:us-east-1:123456789012:alarm:payment-error-rate-high"  # alarm ARN (placeholder account/region)

Blast Radius Control

Blast radius is the scope of damage an experiment can cause if it goes wrong. Control it with layered constraints:

  • Scope: target a single pod, not a deployment; a single AZ, not a region
  • Duration: short experiments (30-60 seconds) with automatic rollback
  • Health checks: abort the experiment if key metrics cross thresholds
  • User impact: target only during low-traffic windows for production experiments

Blast Radius Configuration

# blast-radius-config.yaml
blastRadius:
  maxTargets: 1
  maxPercentage: 5
  allowedNamespaces:
    - staging
    - chaos
  blockedNamespaces:
    - production-critical
  autoAbort:
    enabled: true
    conditions:
      - metric: error_rate
        threshold: 0.01
        window: 30s
      - metric: p99_latency
        threshold: 1000
        window: 30s

def should_abort(metrics: dict, guards: list) -> bool:
    for guard in guards:
        value = metrics.get(guard["metric"], 0)
        if value > guard["threshold"]:
            print(f"[ABORT] {guard['metric']} = {value} > {guard['threshold']}")
            return True
    return False
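
For example, with guards mirroring the autoAbort conditions in the YAML above:

guards = [
    {"metric": "error_rate", "threshold": 0.01, "window": "30s"},
    {"metric": "p99_latency", "threshold": 1000, "window": "30s"},
]
# error_rate of 0.02 breaches the 0.01 guard, so the experiment aborts
print(should_abort({"error_rate": 0.02, "p99_latency": 450}, guards))  # True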

Automated vs Manual Experiments

| Aspect | Manual | Automated |
|--------|--------|-----------|
| Frequency | Monthly / per-release | Continuous (every hour / every deploy) |
| Approval | Human-in-loop | GitOps / policy-driven |
| Scope | Broad, exploratory | Narrow, regression |
| Response | Engineer investigates | Auto-remediate or page |
| Tooling | Ad-hoc scripts | GameDay platform (Litmus, Chaos Mesh) |

Automated experiments shine for known failure modes. Run pod-delete chaos on every deploy to verify your service recovers without intervention. Reserve manual experiments for novel failure scenarios where you need human analysis.

# Run automated chaos suite after each deployment
$ litmusctl create chaos-delegate --mode cluster
$ litmusctl create chaos-workflow \
  --name post-deploy-chaos \
  --experiment-file experiments/pod-delete.yaml

Production vs Staging

Staging chaos validates that your code handles failures. Production chaos validates that your system handles failures, including real traffic patterns, real data volumes, and real network conditions.

Staging-first progression:

  1. Run experiments in a dedicated staging environment
  2. Move to a canary namespace in production (shadow traffic only)
  3. Run on a single non-critical service in production
  4. Expand after each step passes

Never run experiments in production without the following safeguards (a minimal preflight gate is sketched after this list):

  • A documented hypothesis and expected outcome
  • Health check probes that can abort the experiment
  • A runbook for manual rollback
  • On-call engineer notified before execution
  • Time-bounded experiment with enforced max duration
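
A minimal preflight gate enforcing this checklist might look like the sketch below; the field names are illustrative, not taken from any particular tool.

REQUIRED_FIELDS = ("hypothesis", "expected_outcome", "rollback_runbook")

def preflight_ok(experiment: dict) -> bool:
    # Block production experiments that skip any safeguard above
    missing = [f for f in REQUIRED_FIELDS if not experiment.get(f)]
    if missing:
        print(f"[BLOCKED] missing: {', '.join(missing)}")
        return False
    if not experiment.get("abort_probes"):  # health checks that can halt the run
        print("[BLOCKED] no abort probes configured")
        return False
    if not experiment.get("oncall_acknowledged"):  # on-call notified beforehand
        print("[BLOCKED] on-call has not acknowledged")
        return False
    if experiment.get("max_duration_seconds", 0) <= 0:  # enforced time bound
        print("[BLOCKED] no enforced max duration")
        return False
    return True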

Observability Integration

Chaos experiments are useless without observability. You need real-time visibility into the system to validate steady state and detect anomalies.

# prometheus-metrics-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: chaos-exporter
  endpoints:
  - port: metrics
    interval: 5s
  namespaceSelector:
    matchNames:
      - chaos
      - staging

import time
from prometheus_api_client import PrometheusConnect

PROM = PrometheusConnect(url="http://prometheus:9090", disable_ssl=True)

def monitor_experiment(experiment_name: str, duration: int, interval: int = 5):
    start = time.time()
    while time.time() - start < duration:
        error_rate = PROM.custom_query(
            'sum(rate(http_requests_total{status=~"5.."}[1m]))'
        )
        latency = PROM.custom_query(
            'histogram_quantile(0.99, '
            'sum(rate(http_request_duration_seconds_bucket[1m])) by (le))'
        )
        print(f"[{experiment_name}] error_rate={error_rate}, p99_latency={latency}")
        # custom_query returns a list of samples; each "value" is (timestamp, str)
        if error_rate and float(error_rate[0]["value"][1]) > 0.01:
            print("[ALERT] Error rate exceeded threshold - aborting")
            return False
        time.sleep(interval)
    return True

Rollback Strategies

Every chaos experiment needs a rollback plan, even when the hypothesis is expected to hold.

| Strategy | Mechanism | Speed | Complexity |
|----------|-----------|-------|------------|
| Time-bound | Experiment self-terminates after T seconds | Instant | None |
| Metric-based | Abort when error rate / latency crosses threshold | Seconds | Medium |
| Manual halt | Engineer issues stop command | Seconds to minutes | Low |
| Git revert | Revert experiment config in GitOps pipeline | Minutes | High |

#!/bin/bash
# rollback-chaos.sh - halt all active experiments
set -e

echo "[ROLLBACK] Stopping all chaos experiments..."

# LitmusChaos
kubectl delete chaosengine --all -n chaos 2>/dev/null || true

# Chaos Mesh
kubectl delete networkchaos --all -n chaos 2>/dev/null || true
kubectl delete podchaos --all -n chaos 2>/dev/null || true
kubectl delete stresschaos --all -n chaos 2>/dev/null || true

echo "[ROLLBACK] Verifying system health..."
kubectl get pods -l app=payment-service -n payment

echo "[ROLLBACK] Complete"

Building a Chaos Experiment Pipeline

A mature chaos practice integrates into CI/CD. Here is a GitHub Actions workflow that runs chaos experiments after staging deployment:

# .github/workflows/chaos-pipeline.yaml
name: Post-Deploy Chaos
on:
  deployment_status:
jobs:
  chaos:
    runs-on: ubuntu-latest
    # deployment_status does not support event filtering, so gate in the job
    if: github.event.deployment_status.state == 'success' && github.event.deployment.environment == 'staging'
    steps:
    - uses: actions/checkout@v4  # needed for git rev-parse and the experiment file
    - name: Install Litmus CLI
      run: |
        curl -sSL https://litmusctl.sh | bash
    - name: Run pod-delete experiment
      run: |
        litmusctl create chaos-workflow \
          --name "post-deploy-$(git rev-parse --short HEAD)" \
          --experiment-file experiments/pod-delete.yaml \
          --namespace chaos
    - name: Wait for experiment completion
      run: |
        sleep 60
        verdict=$(kubectl get chaosresult -n chaos -o json \
          | jq -r '.items[].status.experimentStatus.verdict')
        echo "verdict: ${verdict}"
        # Fail the job unless every experiment passed so the notify step fires
        [ "${verdict}" = "Pass" ] || exit 1
    - name: Notify on failure
      if: failure()
      run: |
        curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
          -H 'Content-Type: application/json' \
          -d '{"text": "Chaos experiment FAILED for staging deploy"}'

Writing a Custom Fault Injection in Go

For failure types not covered by existing tools, write a lightweight injector:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func failHandler(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "injected failure", http.StatusInternalServerError)
}

func latencyHandler(w http.ResponseWriter, r *http.Request) {
	time.Sleep(2 * time.Second)
	io.WriteString(w, "delayed response")
}

func main() {
	failureType := os.Getenv("FAILURE_TYPE")
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	})

	switch failureType {
	case "http-500":
		mux.HandleFunc("/api/", failHandler)
	case "latency":
		mux.HandleFunc("/api/", latencyHandler)
	default:
		mux.HandleFunc("/api/", func(w http.ResponseWriter, r *http.Request) {
			io.WriteString(w, "normal response")
		})
	}

	fmt.Printf("Injector running on :%s with failure=%s\n", port, failureType)
	if err := http.ListenAndServe(":"+port, mux); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}

Deploy as a sidecar or a separate service, then route traffic through it to simulate targeted failures.
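
For a quick local check, run the program with FAILURE_TYPE=latency and request /api/; responses should arrive after the injected two-second delay while /health stays fast.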

Conclusion

Chaos engineering is not about breaking things; it is about proving your system can survive when things break. Start with a single hypothesis, a small blast radius, and a clear steady-state definition. Automate experiments that validate known failure modes. Integrate observability so you can see what happens. Move to production only after staging validation builds confidence.

The goal is not zero failures. The goal is zero surprise failures.
