⚡ Calmops

Chaos Engineering: Resilience Testing in Production

Introduction

Chaos engineering is the practice of intentionally injecting failures into production systems to identify weaknesses before they cause real outages. By proactively testing failure scenarios, organizations build more resilient systems and gain confidence in their ability to handle failures. However, many teams fear chaos engineering, viewing it as risky rather than recognizing it as a disciplined approach to improving reliability.

This comprehensive guide covers chaos engineering principles, tools, and real-world implementation strategies.


Core Concepts & Terminology

Chaos Engineering

Discipline of experimenting on systems to build confidence in their ability to withstand turbulent conditions.

Blast Radius

Scope of impact from a chaos experiment.

Steady State

Normal operating conditions of a system.

Hypothesis

Expected behavior when failure is introduced.

Experiment

Controlled injection of failure to test hypothesis.

Rollback

Stopping experiment and restoring normal state.

Observability

Ability to understand system state during experiment.

Resilience

System’s ability to recover from failures.

Failure Mode

Specific way a system can fail.
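These concepts fit together naturally in code: an experiment bundles a hypothesis, a steady-state check, a blast radius, and a rollback. A minimal sketch (the class and field names are illustrative, not from any particular tool):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChaosExperiment:
    """Ties the core terms together: one controlled failure injection."""
    name: str
    hypothesis: str                    # expected behavior under failure
    blast_radius: str                  # e.g. 'single pod', 'one region'
    steady_state: Callable[[], bool]   # returns True when system is normal
    inject: Callable[[], None]         # introduce the failure
    rollback: Callable[[], None]       # stop and restore normal state

    def run(self) -> bool:
        if not self.steady_state():
            return False  # never start an experiment outside steady state
        try:
            self.inject()
            return self.steady_state()  # did the hypothesis hold?
        finally:
            self.rollback()  # always restore, even if injection errored
```

The `finally` block encodes the rollback principle: no matter how the experiment ends, the system is restored.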


Chaos Engineering Principles

1. Start Small

# Start with low-risk experiments
experiments = [
    {
        'name': 'Kill single pod',
        'blast_radius': 'low',
        'duration': '30s',
        'target': 'single pod'
    },
    {
        'name': 'Increase latency 10%',
        'blast_radius': 'low',
        'duration': '1m',
        'target': 'single service'
    },
    {
        'name': 'Partition network 5%',
        'blast_radius': 'medium',
        'duration': '2m',
        'target': 'specific region'
    }
]

2. Define Steady State

class SteadyStateValidator:
    def __init__(self):
        # Upper bounds: healthy while the metric stays BELOW these
        self.max_thresholds = {
            'error_rate': 0.01,   # < 1%
            'p99_latency': 1.0,   # < 1s
            'cpu_usage': 0.7,     # < 70%
            'memory_usage': 0.8   # < 80%
        }
        # Lower bounds: healthy while the metric stays ABOVE these
        self.min_thresholds = {
            'availability': 0.999  # > 99.9%
        }
    
    def validate(self, current_metrics):
        """Check if system is in steady state"""
        for metric, threshold in self.max_thresholds.items():
            if current_metrics[metric] > threshold:
                return False, f"{metric} exceeded threshold"
        for metric, threshold in self.min_thresholds.items():
            if current_metrics[metric] < threshold:
                return False, f"{metric} below threshold"
        return True, "System in steady state"

3. Minimize Blast Radius

# Kubernetes chaos experiment with limited blast radius
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-single-pod
  namespace: default
spec:
  action: pod-kill
  mode: one  # kill only one pod
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  scheduler:
    cron: "0 2 * * *"  # Run at 2 AM

Chaos Experiments

1. Pod Failure

import time

from kubernetes import client, config

def kill_pod(namespace, pod_name):
    """Kill a pod to test recovery"""
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    
    try:
        v1.delete_namespaced_pod(
            name=pod_name,
            namespace=namespace,
            grace_period_seconds=0
        )
        print(f"Killed pod {pod_name}")
        return True
    except Exception as e:
        print(f"Failed to kill pod: {e}")
        return False

# Test pod recovery
kill_pod('production', 'api-server-1')
time.sleep(30)
verify_pod_recovered()
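The verify_pod_recovered call above is left to the reader. One way to sketch it, with the pure phase check split out so it can be exercised without a cluster (the namespace and label defaults are assumptions to match the example above):

```python
def all_running(phases):
    """Pure check: at least one pod exists and every phase is 'Running'."""
    return bool(phases) and all(ph == 'Running' for ph in phases)

def verify_pod_recovered(namespace='production', app_label='api-server'):
    """Fetch pod phases from the cluster and check recovery."""
    from kubernetes import client  # assumes config already loaded, as above
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(
        namespace, label_selector=f'app={app_label}')
    return all_running([p.status.phase for p in pods.items])
```

In practice you would poll this in a loop with a deadline rather than checking once, since the replacement pod may still be scheduling.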

2. Network Latency

import subprocess
import time

def inject_latency(interface, latency_ms, duration_s):
    """Inject network latency using tc (traffic control); requires root"""
    
    # Add latency
    subprocess.run([
        'tc', 'qdisc', 'add', 'dev', interface,
        'root', 'netem', 'delay', f'{latency_ms}ms'
    ], check=True)
    
    try:
        print(f"Injected {latency_ms}ms latency for {duration_s}s")
        time.sleep(duration_s)
    finally:
        # Always remove the qdisc, even if interrupted mid-experiment
        subprocess.run([
            'tc', 'qdisc', 'del', 'dev', interface, 'root'
        ], check=True)
    
    print("Latency injection complete")

# Test application behavior with latency
inject_latency('eth0', 500, 60)  # 500ms latency for 60 seconds

3. Resource Exhaustion

import multiprocessing
import time

def exhaust_cpu(duration_s, cpu_percent):
    """Exhaust CPU to test autoscaling"""
    
    def cpu_burn():
        end_time = time.time() + duration_s
        while time.time() < end_time:
            _ = sum(i*i for i in range(1000000))
    
    # Calculate number of processes (at least one)
    num_cpus = multiprocessing.cpu_count()
    num_processes = max(1, int(num_cpus * cpu_percent / 100))
    
    # Start processes
    processes = []
    for _ in range(num_processes):
        p = multiprocessing.Process(target=cpu_burn)
        p.start()
        processes.append(p)
    
    print(f"Burning {num_processes} CPUs for {duration_s}s")
    
    # Wait for completion
    for p in processes:
        p.join()
    
    print("CPU burn complete")

# Test autoscaling
exhaust_cpu(120, 80)  # 80% CPU for 2 minutes

4. Database Failure

import subprocess
import time

def simulate_database_failure(duration_s):
    """Simulate database connection failure (requires root for iptables)"""
    
    # Block outbound PostgreSQL connections
    subprocess.run([
        'iptables', '-A', 'OUTPUT',
        '-p', 'tcp', '--dport', '5432',
        '-j', 'DROP'
    ], check=True)
    
    try:
        print(f"Database blocked for {duration_s}s")
        time.sleep(duration_s)
    finally:
        # Always remove the rule, even if interrupted mid-experiment
        subprocess.run([
            'iptables', '-D', 'OUTPUT',
            '-p', 'tcp', '--dport', '5432',
            '-j', 'DROP'
        ], check=True)
    
    print("Database restored")

# Test database failover
simulate_database_failure(60)
verify_failover_worked()
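verify_failover_worked is application-specific; a minimal version just confirms the replica accepts connections once the primary is blocked. A sketch (the replica hostname and port are assumptions, adjust for your topology):

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def verify_failover_worked(replica_host='db-replica', port=5432):
    # With the primary blocked, the replica should accept connections
    return can_connect(replica_host, port)
```

A real check would go further, e.g. run a read query and verify replication lag, but reachability is the minimum bar.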

Chaos Tools

Chaos Mesh

# Kubernetes-native chaos engineering platform
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
  namespace: default
spec:
  action: partition
  mode: all
  duration: 5m
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  direction: both
  target:
    namespaces:
      - production
    labelSelectors:
      app: database

Gremlin

# Gremlin API for chaos experiments
import requests

class GremlinClient:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.gremlin.com/v1"
    
    def create_experiment(self, experiment):
        """Create chaos experiment"""
        headers = {
            'Authorization': f'Key {self.api_key}',
            'Content-Type': 'application/json'
        }
        
        response = requests.post(
            f'{self.base_url}/experiments',
            json=experiment,
            headers=headers
        )
        
        return response.json()
    
    def stop_experiment(self, experiment_id):
        """Stop running experiment"""
        headers = {
            'Authorization': f'Key {self.api_key}'
        }
        
        response = requests.delete(
            f'{self.base_url}/experiments/{experiment_id}',
            headers=headers
        )
        
        return response.json()

# Usage
client = GremlinClient('your-api-key')

experiment = {
    'name': 'Kill random pod',
    'description': 'Test pod recovery',
    'target': {
        'type': 'Kubernetes',
        'selector': {
            'namespace': 'production',
            'labels': {'app': 'api-server'}
        }
    },
    'command': {
        'type': 'kill',
        'args': ['-s', 'SIGKILL']
    },
    'duration': 30
}

result = client.create_experiment(experiment)
print(f"Experiment created: {result['id']}")

Observability During Chaos

import requests
import time

class ChaosObserver:
    def __init__(self, prometheus_url):
        self.prometheus_url = prometheus_url
    
    def get_metrics(self, query, start_time, end_time):
        """Get metrics during chaos experiment"""
        response = requests.get(
            f'{self.prometheus_url}/api/v1/query_range',
            params={
                'query': query,
                'start': start_time,
                'end': end_time,
                'step': '15s'
            }
        )
        return response.json()
    
    def validate_hypothesis(self, metrics, expected_behavior):
        """Validate if system behaved as expected.
        
        `metrics` maps metric name to an observed value parsed from
        the Prometheus response.
        """
        for metric_name, expected_value in expected_behavior.items():
            actual_value = metrics.get(metric_name)
            
            if actual_value is None:
                return False, f"{metric_name} missing from metrics"
            if actual_value > expected_value:
                return False, f"{metric_name} exceeded expected value"
        
        return True, "Hypothesis validated"

# Monitor during experiment
observer = ChaosObserver('http://prometheus:9090')

# Start experiment and give the system time to react
start_time = time.time()
kill_pod('production', 'api-server-1')
time.sleep(60)  # let recovery play out
end_time = time.time()

# Analyze results (parse the Prometheus response into a
# metric-name -> value dict before validating)
metrics = observer.get_metrics(
    'rate(http_requests_total[5m])',
    start_time,
    end_time
)

expected = {
    'error_rate': 0.05,  # Allow 5% errors
    'latency_p99': 2.0   # Allow 2s latency
}

success, message = observer.validate_hypothesis(metrics, expected)
print(f"Experiment result: {message}")

Best Practices

  1. Start Small: Begin with low-risk experiments
  2. Define Steady State: Know what normal looks like
  3. Minimize Blast Radius: Limit scope of experiments
  4. Automate Rollback: Automatically stop experiments
  5. Observe Everything: Comprehensive monitoring
  6. Document Findings: Record what you learn
  7. Iterate: Run experiments regularly
  8. Team Involvement: Include ops and dev teams
  9. Blameless Culture: Focus on learning, not blame
  10. Continuous Improvement: Use findings to improve systems
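Practice 4 is worth making concrete: a watchdog polls metrics while the experiment runs and aborts the moment steady state is violated. A sketch with an injected metrics source and stop callback (both are assumptions, wired to whatever tool you use):

```python
import time

def run_with_auto_rollback(get_metrics, stop_experiment,
                           max_error_rate=0.05, duration_s=60, poll_s=5):
    """Poll metrics during an experiment; abort on threshold breach.
    
    `get_metrics` returns a dict like {'error_rate': 0.01};
    `stop_experiment` halts the injection and restores normal state.
    Returns True if the experiment ran to completion, False if rolled back.
    """
    deadline = time.time() + duration_s
    while time.time() < deadline:
        if get_metrics().get('error_rate', 0.0) > max_error_rate:
            stop_experiment()
            return False  # rolled back early
        time.sleep(poll_s)
    stop_experiment()
    return True  # completed without breaching steady state
```

Keeping the poll interval short relative to the experiment duration bounds how long the system can stay outside steady state before rollback.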


Conclusion

Chaos engineering is a disciplined approach to building resilient systems. By intentionally injecting failures, you identify weaknesses before they cause real outages.

Start with small experiments, observe carefully, and use findings to improve your systems. Chaos engineering is not about breaking things; it is about building confidence in your ability to handle failures.

Chaos engineering builds resilient systems.
