Introduction
Chaos engineering is the practice of intentionally injecting failures into production systems to identify weaknesses before they cause real outages. By proactively testing failure scenarios, organizations build more resilient systems and gain confidence in their ability to handle failures. However, many teams fear chaos engineering, viewing it as risky rather than recognizing it as a disciplined approach to improving reliability.
This comprehensive guide covers chaos engineering principles, tools, and real-world implementation strategies.
Core Concepts & Terminology
Chaos Engineering
Discipline of experimenting on systems to build confidence in their ability to withstand turbulent conditions.
Blast Radius
Scope of impact from a chaos experiment.
Steady State
Normal operating conditions of a system.
Hypothesis
Expected behavior when failure is introduced.
Experiment
Controlled injection of failure to test a hypothesis.
Rollback
Stopping experiment and restoring normal state.
Observability
Ability to understand system state during experiment.
Resilience
System’s ability to recover from failures.
Failure Mode
Specific way a system can fail.
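To see how these terms relate, the sketch below groups them into one record. It is a minimal illustration, not a standard schema: the `ChaosExperiment` class, its field names, and the sample values are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ChaosExperiment:
    """Ties the glossary terms together; all field names are illustrative."""
    name: str
    hypothesis: str                 # expected behavior when failure is introduced
    blast_radius: str               # scope of impact, e.g. 'single pod'
    steady_state: Dict[str, float]  # metric name -> maximum allowed value
    inject: Callable[[], None]      # introduces the failure
    rollback: Callable[[], None]    # restores normal state

    def within_steady_state(self, observed: Dict[str, float]) -> bool:
        """True when every tracked metric is at or below its limit.

        A metric missing from `observed` counts as a violation.
        """
        return all(
            observed.get(metric, float('inf')) <= limit
            for metric, limit in self.steady_state.items()
        )


events = []  # stands in for real side effects in this demo
experiment = ChaosExperiment(
    name='kill-single-pod',
    hypothesis='Error rate stays below 1% while one pod restarts',
    blast_radius='single pod',
    steady_state={'error_rate': 0.01, 'p99_latency': 1.0},
    inject=lambda: events.append('inject'),
    rollback=lambda: events.append('rollback'),
)

print(experiment.within_steady_state({'error_rate': 0.002, 'p99_latency': 0.4}))  # True
print(experiment.within_steady_state({'error_rate': 0.05, 'p99_latency': 0.4}))   # False
```

Treating a missing metric as a violation is a deliberate choice here: without observability, the experiment cannot confirm the hypothesis.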
Chaos Engineering Principles
1. Start Small
# Start with low-risk experiments
experiments = [
    {
        'name': 'Kill single pod',
        'blast_radius': 'low',
        'duration': '30s',
        'target': 'single pod'
    },
    {
        'name': 'Increase latency 10%',
        'blast_radius': 'low',
        'duration': '1m',
        'target': 'single service'
    },
    {
        'name': 'Partition network 5%',
        'blast_radius': 'medium',
        'duration': '2m',
        'target': 'specific region'
    }
]
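One way to act on "start small" is to unlock a larger blast radius only after every smaller experiment has passed. The sketch below assumes experiment dictionaries shaped like the list above; the `plan_escalation` function, the `RADIUS_ORDER` ranking, and the `results` mapping are hypothetical names introduced for illustration.

```python
RADIUS_ORDER = {'low': 0, 'medium': 1, 'high': 2}


def plan_escalation(experiments, results):
    """Return names of experiments that may run next, smallest radius first.

    `results` maps experiment name -> True (passed) or False (failed).
    An experiment is runnable when it has no result yet and every
    experiment with a smaller blast radius has passed.
    """
    ordered = sorted(experiments, key=lambda e: RADIUS_ORDER[e['blast_radius']])
    runnable = []
    for exp in ordered:
        rank = RADIUS_ORDER[exp['blast_radius']]
        smaller = [e for e in ordered if RADIUS_ORDER[e['blast_radius']] < rank]
        if exp['name'] in results:
            continue  # already executed
        if all(results.get(e['name']) is True for e in smaller):
            runnable.append(exp['name'])
    return runnable


sample = [
    {'name': 'Kill single pod', 'blast_radius': 'low'},
    {'name': 'Increase latency 10%', 'blast_radius': 'low'},
    {'name': 'Partition network 5%', 'blast_radius': 'medium'},
]

# Nothing has run yet: only the low-radius experiments are allowed.
print(plan_escalation(sample, {}))
# Both low-radius experiments passed: the medium one unlocks.
print(plan_escalation(sample, {'Kill single pod': True, 'Increase latency 10%': True}))
```

A failed low-radius experiment keeps the medium one locked until the weakness is fixed and the experiment passes on a rerun.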
2. Define Steady State
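class SteadyStateValidator:
    def __init__(self):
        # Upper bounds: the metric must stay at or below the threshold
        self.max_thresholds = {
            'error_rate': 0.01,     # < 1%
            'p99_latency': 1.0,     # < 1s
            'cpu_usage': 0.7,       # < 70%
            'memory_usage': 0.8     # < 80%
        }
        # Lower bounds: the metric must stay at or above the threshold
        self.min_thresholds = {
            'availability': 0.999   # > 99.9%
        }

    def validate(self, current_metrics):
        """Check if system is in steady state"""
        for metric, threshold in self.max_thresholds.items():
            if current_metrics[metric] > threshold:
                return False, f"{metric} exceeded threshold"
        for metric, threshold in self.min_thresholds.items():
            if current_metrics[metric] < threshold:
                return False, f"{metric} fell below threshold"
        return True, "System in steady state"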
3. Minimize Blast Radius
# Kubernetes chaos experiment with limited blast radius
# (spec.scheduler is Chaos Mesh 1.x syntax; 2.x uses a separate Schedule resource)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-single-pod
  namespace: default
spec:
  action: pod-kill
  mode: one  # Kill only one pod
  duration: 30s
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  scheduler:
    cron: "0 2 * * *"  # Run at 2 AM
Chaos Experiments
1. Pod Failure
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException

def kill_pod(namespace, pod_name):
    """Kill a pod to test recovery"""
    config.load_incluster_config()  # assumes this runs inside the cluster
    v1 = client.CoreV1Api()
    try:
        # grace_period_seconds=0 requests immediate deletion
        v1.delete_namespaced_pod(
            name=pod_name,
            namespace=namespace,
            grace_period_seconds=0
        )
        print(f"Killed pod {pod_name}")
        return True
    except ApiException as e:
        print(f"Failed to kill pod: {e}")
        return False

# Test pod recovery
kill_pod('production', 'api-server-1')
time.sleep(30)
verify_pod_recovered()  # placeholder: supply your own recovery check
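The snippet above calls `verify_pod_recovered()` without defining it. One possible shape for such a check is sketched here, under stated assumptions: `wait_until` and `pods_ready` are hypothetical helpers, and the label selector `app=api-server` is borrowed from the Chaos Mesh example. Checking by label rather than by pod name matters because a replacement pod created by a Deployment gets a new name.

```python
import time


def wait_until(check, timeout_s=120.0, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `check()` until it returns True or `timeout_s` elapses.

    Returns the seconds waited on success, or None on timeout.
    `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


def pods_ready(namespace, label_selector):
    """True when at least one matching pod exists and all are Running and ready.

    Imports the Kubernetes client lazily; needs in-cluster credentials.
    """
    from kubernetes import client, config
    config.load_incluster_config()
    pods = client.CoreV1Api().list_namespaced_pod(
        namespace, label_selector=label_selector
    ).items
    return bool(pods) and all(
        pod.status.phase == 'Running'
        and all(cs.ready for cs in (pod.status.container_statuses or []))
        for pod in pods
    )


def verify_pod_recovered(namespace='production', label_selector='app=api-server'):
    """Return recovery time in seconds, or None if pods never became ready."""
    return wait_until(lambda: pods_ready(namespace, label_selector))
```

Polling with a deadline replaces the fixed `time.sleep(30)` guess and yields a recovery time that can be recorded as an experiment finding.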
2. Network Latency
import subprocess
import time

def inject_latency(interface, latency_ms, duration_s):
    """Inject network latency using tc (traffic control); requires root"""
    # Add latency; check=True raises if tc fails
    subprocess.run([
        'tc', 'qdisc', 'add', 'dev', interface,
        'root', 'netem', 'delay', f'{latency_ms}ms'
    ], check=True)
    try:
        print(f"Injected {latency_ms}ms latency for {duration_s}s")
        time.sleep(duration_s)
    finally:
        # Remove latency even if the wait is interrupted
        subprocess.run([
            'tc', 'qdisc', 'del', 'dev', interface, 'root'
        ], check=True)
    print("Latency injection complete")

# Test application behavior with latency
inject_latency('eth0', 500, 60)  # 500ms latency for 60 seconds
3. Resource Exhaustion
import multiprocessing
import time

def cpu_burn(duration_s):
    """Busy-loop on one core until the deadline passes"""
    # Defined at module level so it can be pickled under the 'spawn' start method
    end_time = time.time() + duration_s
    while time.time() < end_time:
        _ = sum(i*i for i in range(1000000))

def exhaust_cpu(duration_s, cpu_percent):
    """Saturate a share of the CPU cores to test scaling"""
    # Calculate number of processes: one fully busy process per core
    num_cpus = multiprocessing.cpu_count()
    num_processes = max(1, int(num_cpus * cpu_percent / 100))
    # Start processes
    processes = []
    for _ in range(num_processes):
        p = multiprocessing.Process(target=cpu_burn, args=(duration_s,))
        p.start()
        processes.append(p)
    print(f"Burning {num_processes} CPUs for {duration_s}s")
    # Wait for completion
    for p in processes:
        p.join()
    print("CPU burn complete")

if __name__ == '__main__':
    # Test autoscaling
    exhaust_cpu(120, 80)  # Saturate 80% of the cores for 2 minutes
4. Database Failure
import subprocess
import time

# Rule that drops outbound TCP traffic to the PostgreSQL port
DB_RULE = ['OUTPUT', '-p', 'tcp', '--dport', '5432', '-j', 'DROP']

def simulate_database_failure(duration_s):
    """Simulate database connection failure; requires root"""
    # Block database connections
    subprocess.run(['iptables', '-A'] + DB_RULE, check=True)
    try:
        print(f"Database blocked for {duration_s}s")
        time.sleep(duration_s)
    finally:
        # Restore connections even if the wait is interrupted
        subprocess.run(['iptables', '-D'] + DB_RULE, check=True)
    print("Database restored")

# Test database failover
simulate_database_failure(60)
verify_failover_worked()  # placeholder: supply your own failover check
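`verify_failover_worked()` is left undefined above. A simple starting point is a TCP reachability probe, sketched below. The `can_connect` and `check_failover` helpers and the host names passed to them are assumptions for illustration; a real check would more likely run a query against the promoted replica.

```python
import socket


def can_connect(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts (dropped packets), and DNS errors
        return False


def check_failover(primary_host, replica_host, port=5432):
    """Summarize reachability of both database endpoints.

    Run this while the primary is blocked: failover looks healthy when the
    primary is unreachable and the replica still accepts connections.
    """
    primary_up = can_connect(primary_host, port)
    replica_up = can_connect(replica_host, port)
    return {
        'primary_reachable': primary_up,
        'replica_reachable': replica_up,
        'failover_ok': (not primary_up) and replica_up,
    }
```

Because `simulate_database_failure` sleeps for the whole outage, a probe like `check_failover('db-primary', 'db-replica')` would need to run from a separate thread or process during that window; both host names are placeholders.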
Chaos Tools
Chaos Mesh
# Kubernetes-native chaos engineering platform
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
  namespace: default
spec:
  action: partition
  mode: all
  duration: 5m
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  direction: both
  # The target side takes its own mode and selector
  target:
    mode: all
    selector:
      namespaces:
        - production
      labelSelectors:
        app: database
Gremlin
# Gremlin API for chaos experiments
# Note: the '/experiments' paths and payload fields below are illustrative;
# verify them against the current Gremlin API reference before use.
import requests

class GremlinClient:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.gremlin.com/v1"

    def create_experiment(self, experiment):
        """Create chaos experiment"""
        headers = {
            'Authorization': f'Key {self.api_key}',
            'Content-Type': 'application/json'
        }
        response = requests.post(
            f'{self.base_url}/experiments',
            json=experiment,
            headers=headers,
            timeout=10
        )
        response.raise_for_status()  # surface HTTP errors instead of parsing them
        return response.json()

    def stop_experiment(self, experiment_id):
        """Stop running experiment"""
        headers = {
            'Authorization': f'Key {self.api_key}'
        }
        response = requests.delete(
            f'{self.base_url}/experiments/{experiment_id}',
            headers=headers,
            timeout=10
        )
        response.raise_for_status()
        return response.json()

# Usage
client = GremlinClient('your-api-key')
experiment = {
    'name': 'Kill random pod',
    'description': 'Test pod recovery',
    'target': {
        'type': 'Kubernetes',
        'selector': {
            'namespace': 'production',
            'labels': {'app': 'api-server'}
        }
    },
    'command': {
        'type': 'kill',
        'args': ['-s', 'SIGKILL']
    },
    'duration': 30
}
result = client.create_experiment(experiment)
print(f"Experiment created: {result['id']}")
Observability During Chaos
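import time

import requests

class ChaosObserver:
    def __init__(self, prometheus_url):
        self.prometheus_url = prometheus_url

    def get_metrics(self, query, start_time, end_time):
        """Get metrics during chaos experiment"""
        response = requests.get(
            f'{self.prometheus_url}/api/v1/query_range',
            params={
                'query': query,
                'start': start_time,
                'end': end_time,
                'step': '15s'
            },
            timeout=10
        )
        response.raise_for_status()
        return response.json()

    @staticmethod
    def peak_value(result):
        """Return the highest sample in a query_range response, or None"""
        samples = [
            float(value)
            for series in result['data']['result']
            for _, value in series['values']
        ]
        return max(samples) if samples else None

    def validate_hypothesis(self, observed, expected_behavior):
        """Validate if system behaved as expected"""
        for metric_name, expected_value in expected_behavior.items():
            actual_value = observed.get(metric_name)
            if actual_value is None:
                return False, f"{metric_name} has no data"
            if actual_value > expected_value:
                return False, f"{metric_name} exceeded expected value"
        return True, "Hypothesis validated"

# Monitor during experiment
observer = ChaosObserver('http://prometheus:9090')
# Start experiment, then leave time for the impact to show up in metrics
start_time = time.time()
kill_pod('production', 'api-server-1')
time.sleep(60)
end_time = time.time()
# Analyze results. These PromQL queries are examples; adjust the metric
# and label names to match what your services export.
queries = {
    'error_rate': 'sum(rate(http_requests_total{status=~"5.."}[1m]))'
                  ' / sum(rate(http_requests_total[1m]))',
    'latency_p99': 'histogram_quantile(0.99, sum(rate('
                   'http_request_duration_seconds_bucket[1m])) by (le))'
}
observed = {
    name: observer.peak_value(observer.get_metrics(query, start_time, end_time))
    for name, query in queries.items()
}
expected = {
    'error_rate': 0.05,  # Allow 5% errors
    'latency_p99': 2.0   # Allow 2s latency
}
success, message = observer.validate_hypothesis(observed, expected)
print(f"Experiment result: {message}")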
Best Practices
- Start Small: Begin with low-risk experiments
- Define Steady State: Know what normal looks like
- Minimize Blast Radius: Limit scope of experiments
- Automate Rollback: Automatically stop experiments
- Observe Everything: Comprehensive monitoring
- Document Findings: Record what you learn
- Iterate: Run experiments regularly
- Team Involvement: Include ops and dev teams
- Blameless Culture: Focus on learning, not blame
- Continuous Improvement: Use findings to improve systems
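Several of these practices (minimize blast radius, automate rollback, observe everything, document findings) can be combined in one small runner. The sketch below is a minimal illustration under stated assumptions: `run_experiment` and its parameters are hypothetical, and the metric readings are simulated rather than pulled from a monitoring system.

```python
import time


def run_experiment(inject, rollback, read_metrics, limits,
                   duration_s=60.0, poll_s=5.0, sleep=time.sleep):
    """Inject a failure, watch metrics, and always roll back.

    Aborts early when any metric exceeds its limit. Returns a findings
    dict that can be stored as the experiment record.
    """
    findings = {'aborted': False, 'breach': None, 'samples': []}
    try:
        inject()
        elapsed = 0.0
        while elapsed < duration_s:
            metrics = read_metrics()
            findings['samples'].append(metrics)
            breach = next(
                (name for name, limit in limits.items()
                 if metrics.get(name, 0.0) > limit),
                None,
            )
            if breach is not None:
                findings['aborted'] = True
                findings['breach'] = breach
                break
            sleep(poll_s)
            elapsed += poll_s
    finally:
        # Runs on normal completion, early abort, and exceptions alike
        rollback()
    return findings


# Demo with simulated readings: the third sample breaches the 5% error limit.
log = []
readings = iter([{'error_rate': 0.01}, {'error_rate': 0.02}, {'error_rate': 0.09}])
result = run_experiment(
    inject=lambda: log.append('inject'),
    rollback=lambda: log.append('rollback'),
    read_metrics=lambda: next(readings),
    limits={'error_rate': 0.05},
    duration_s=60, poll_s=5,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result['aborted'], result['breach'], log)
# True error_rate ['inject', 'rollback']
```

Putting `rollback()` in a `finally` clause is the key design choice: the system is restored even if metric collection itself fails mid-experiment.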
Conclusion
Chaos engineering is a disciplined approach to building resilient systems. By intentionally injecting failures, you identify weaknesses before they cause real outages.
Start with small experiments, observe carefully, and use findings to improve your systems. Chaos engineering is not about breaking things; it’s about building confidence in your ability to handle failures.
Chaos engineering builds resilient systems.