Introduction
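Chaos engineering tests system resilience proactively by injecting failures deliberately, under controlled conditions. Each experiment that a system survives builds confidence that it can withstand real-world failures; each one it does not survive exposes a weakness before an unplanned outage does.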
Principles
# Chaos Engineering Principles

# 1. Define steady state
steady_state = {
    "availability": 0.99,
    "latency_p99": 200,  # ms
    "error_rate": 0.01
}

# 2. Hypothesize about failure modes
hypotheses = [
    "If database fails, service returns cached data",
    "If API times out, circuit breaker opens",
    "If service restarts, requests are retried"
]

# 3. Design experiments
experiments = [
    {
        "name": "Kill database pod",
        "action": "terminate_pod",
        "target": "postgres-0",
        "expected": "Service uses cache, returns 200"
    },
    {
        "name": "Add network latency",
        "action": "add_latency",
        "target": "payment-service",
        "latency_ms": 500,
        "expected": "Circuit breaker opens"
    }
]

# 4. Verify in production
# Start with staging, then production
# Monitor metrics during experiments
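The steps above leave open how "steady state held" is actually decided. The following is a minimal sketch of one way to do it, not a definitive implementation. It assumes that availability is a lower bound while latency_p99 and error_rate are upper bounds, and that a missing measurement counts as a violation. The callables inject, rollback, and read_metrics are hypothetical placeholders for whatever fault-injection and monitoring tooling is in use.

```python
from typing import Callable, Dict, List

# Thresholds copied from the steady-state definition above.
STEADY_STATE = {
    "availability": 0.99,  # minimum acceptable fraction of successful requests
    "latency_p99": 200,    # maximum acceptable p99 latency, in ms
    "error_rate": 0.01,    # maximum acceptable fraction of failed requests
}

# Assumption: availability is a floor; every other metric is a ceiling.
HIGHER_IS_BETTER = {"availability"}


def steady_state_violations(
    observed: Dict[str, float],
    steady_state: Dict[str, float] = STEADY_STATE,
) -> List[str]:
    """Return one message per metric that is missing or out of bounds."""
    violations = []
    for metric, threshold in steady_state.items():
        if metric not in observed:
            # Assumption: an unmeasured metric is treated as a violation.
            violations.append(f"{metric}: no measurement")
            continue
        value = observed[metric]
        if metric in HIGHER_IS_BETTER:
            ok = value >= threshold
        else:
            ok = value <= threshold
        if not ok:
            violations.append(f"{metric}: observed {value}, threshold {threshold}")
    return violations


def run_experiment(
    experiment: Dict[str, object],
    inject: Callable[[Dict[str, object]], None],      # hypothetical: applies the fault
    rollback: Callable[[Dict[str, object]], None],    # hypothetical: removes the fault
    read_metrics: Callable[[], Dict[str, float]],     # hypothetical: queries monitoring
) -> Dict[str, object]:
    """Run one experiment and report whether steady state held."""
    baseline = steady_state_violations(read_metrics())
    if baseline:
        # Do not inject faults into a system that is already unhealthy.
        return {"name": experiment["name"], "status": "skipped", "violations": baseline}

    inject(experiment)
    try:
        violations = steady_state_violations(read_metrics())
    finally:
        rollback(experiment)  # always undo the fault, even if measurement fails

    status = "failed" if violations else "passed"
    return {"name": experiment["name"], "status": status, "violations": violations}
```

Checking the baseline first keeps an experiment from being blamed for a problem that already existed, and the try/finally ensures the fault is removed even when reading metrics raises an error. A real run would sample metrics over a time window instead of taking a single reading.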
LitmusChaos Experiments
# chaos-engineering/pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: "app=api"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
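When the same experiment has to run against several services, the manifest can be generated instead of copied by hand. The sketch below builds the same ChaosEngine structure as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The function name build_pod_delete_engine and its parameters are hypothetical. The bounds check on the interval is a sanity check added here, not a documented LitmusChaos rule, and using one namespace for both the engine and the target application is a simplification that happens to match the manifest above.

```python
import json


def build_pod_delete_engine(name, namespace, app_label,
                            duration_s=30, interval_s=10, force=False):
    """Build a ChaosEngine manifest with the same fields as the YAML above."""
    # Assumption: an interval longer than the total duration is a mistake.
    if interval_s <= 0 or duration_s < interval_s:
        raise ValueError("expected 0 < interval_s <= duration_s")

    # Kubernetes environment variable values must be strings.
    env = {
        "TOTAL_CHAOS_DURATION": str(duration_s),
        "CHAOS_INTERVAL": str(interval_s),
        "FORCE": "true" if force else "false",
    }
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Simplification: the target app lives in the engine's namespace.
            "appinfo": {"appns": namespace, "applabel": app_label},
            "chaosServiceAccount": "litmus-admin",
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [{"name": k, "value": v} for k, v in env.items()]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_pod_delete_engine("pod-delete-chaos", "default", "app=api")
    print(json.dumps(manifest, indent=2))
```

Converting the numeric settings to strings in one place avoids a common source of rejected manifests, since an unquoted 30 in an env value fails Kubernetes validation.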
Conclusion
Chaos engineering builds confidence through controlled experiments. Start with hypotheses, design small experiments, measure impact, and iterate. Build resilience before failures occur.
Resources
- Principles of Chaos Engineering
- LitmusChaos Documentation