⚡ Calmops

Chaos Engineering: Building Resilient Systems Through Controlled Experiments

Introduction

In 2026, chaos engineering has evolved from a radical practice to a core discipline in Site Reliability Engineering (SRE). As systems grow more complex and distributed, traditional testing approaches can’t catch all failure modes. Chaos engineering fills this gap by proactively injecting failures into production systems to discover weaknesses before they manifest as outages.

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production.

The Chaos Engineering Process

The Scientific Method

┌────────────────────────────────────────────────────────────┐
│                  Chaos Engineering Loop                    │
├────────────────────────────────────────────────────────────┤
│  1. Define Steady State                                    │
│     ↓                                                      │
│  2. Hypothesize System Behavior                            │
│     ↓                                                      │
│  3. Introduce Real-world Failure                           │
│     ↓                                                      │
│  4. Observe and Measure                                    │
│     ↓                                                      │
│  5. Correct and Improve                                    │
│     ↓                                                      │
│  6. Automate and Repeat                                    │
└────────────────────────────────────────────────────────────┘

Defining Steady State

What does “normal” look like for your system?

# Example steady state metrics
steady_state = {
    "error_rate": lambda: get_error_rate() < 0.01,  # < 1% errors
    "latency_p99": lambda: get_p99_latency() < 500,   # < 500ms
    "throughput": lambda: get_rps() > 1000,           # > 1000 RPS
    "availability": lambda: get_availability() > 0.999  # 99.9%
}
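The checks above can be evaluated in one pass to decide whether steady state holds. The sketch below stubs out the `get_*` helpers with fixed readings so it stands alone; a real system would query its monitoring backend (Prometheus, Datadog, etc.):

```python
def failed_checks(checks):
    """Return the names of checks that do not hold; an empty list means steady state."""
    return [name for name, check in checks.items() if not check()]

# Stubbed metric readings for illustration.
checks = {
    "error_rate": lambda: 0.005 < 0.01,   # passing: < 1% errors
    "latency_p99": lambda: 420 < 500,     # passing: < 500ms
    "throughput": lambda: 800 > 1000,     # failing: below 1000 RPS
}

print(failed_checks(checks))  # → ['throughput']
```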

Chaos Engineering Principles

1. Build a Hypothesis

hypothesis = {
    "statement": "If the payment service database fails over, " 
                 "users will experience errors but the system "
                 "will continue to serve traffic within SLA.",
    "steady_state_metrics": ["error_rate", "latency_p99"],
    "expected_outcome": "Error rate spikes < 5%, recovers in < 30s"
}
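The expected outcome is most useful when it is machine-checkable. This sketch (with hypothetical field names for the observed results) compares an experiment's outcome to the thresholds stated in the hypothesis:

```python
def hypothesis_verified(observed, max_error_spike=0.05, max_recovery_s=30):
    """True if the observed outcome stays within the hypothesized bounds."""
    return (observed["error_spike"] < max_error_spike
            and observed["recovery_s"] < max_recovery_s)

# Error rate spiked 3 points and recovered in 22s: within expectations.
print(hypothesis_verified({"error_spike": 0.03, "recovery_s": 22}))  # → True
```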

2. Vary Real-World Events

Category         Examples
---------------  -------------------------------------------
Infrastructure   Server failure, network latency, DNS issues
Application      Process crash, memory leak, exception
Dependencies     API timeout, third-party outage
Data             Data corruption, schema mismatch

3. Run in Production

Testing in production reveals true system behavior:

  • Real traffic patterns
  • Actual failure modes
  • True customer impact

4. Automate Experiments

# Chaos Mesh experiment definition
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-payment-pods
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: one
  duration: "30s"
  selector:
    namespaces:
      - payments
    labelSelectors:
      app: payment-service

Chaos Engineering Tools

Chaos Mesh

# Network latency experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency
spec:
  action: delay
  mode: one
  duration: "60s"
  selector:
    namespaces:
      - payments
  delay:
    latency: "500ms"
    correlation: "25"

LitmusChaos

# Kubernetes pod kill experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-chaos
  namespace: litmus
spec:
  engineState: "active"
  appinfo:
    appns: payments
    applabel: app=payment-service
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"

Gremlin

Commercial chaos engineering platform with:

  • Drag-and-drop experiment builder
  • Managed chaos infrastructure
  • Team collaboration features
  • Attack library

Implementing Chaos Engineering

Step 1: Start Small

Begin with low-impact experiments:

  1. Stop a single pod: Verify auto-scaling works
  2. Add network latency: Check timeout handling
  3. Fill disk space: Test logging behavior

Step 2: Build Observability

Before injecting chaos, ensure you can observe it:

# Metrics to track during experiments
experiment_metrics = {
    "system_metrics": [
        "cpu_usage",
        "memory_usage", 
        "disk_io",
        "network_io"
    ],
    "application_metrics": [
        "request_rate",
        "error_rate",
        "latency_p50",
        "latency_p95",
        "latency_p99"
    ],
    "business_metrics": [
        "orders_completed",
        "payments_processed",
        "user_signups"
    ]
}
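One way to use these metric lists, sketched here with illustrative readings, is to snapshot the same metrics before and during the experiment and inspect the deltas:

```python
def snapshot_delta(baseline, during):
    """Per-metric change between a baseline and an in-experiment snapshot."""
    return {m: during[m] - baseline[m] for m in baseline if m in during}

baseline = {"error_rate": 0.004, "latency_p99": 310, "request_rate": 1200}
during   = {"error_rate": 0.021, "latency_p99": 540, "request_rate": 1150}

deltas = snapshot_delta(baseline, during)
print(deltas["latency_p99"])  # → 230
```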

Step 3: Create a Chaos Committee

┌─────────────────────────────────────┐
│         Chaos Committee             │
├─────────────────────────────────────┤
│  📋 Experiment Approval             │
│  🔒 Safety Guardrails               │
│  📊 Results Review                  │
│  📈 Improvement Tracking            │
└─────────────────────────────────────┘

Step 4: Define Stop Conditions

# Experiment abort conditions
abort_conditions:
  - error_rate > 0.10  # 10% errors
  - latency_p99 > 2000ms
  - availability < 0.950
  - customer_complaints > 0
  
rollback_plan:
  - disable_experiment
  - scale_up_replicas
  - alert_on_call_engineer
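These abort conditions can be encoded as a guard evaluated on every polling interval. The threshold names and directions below are illustrative; note that some metrics abort on a ceiling (error rate) and others on a floor (availability):

```python
def should_abort(metrics, upper_limits, lower_limits):
    """True if any metric exceeds its ceiling or falls below its floor."""
    breached_high = any(metrics[m] > limit for m, limit in upper_limits.items())
    breached_low = any(metrics[m] < limit for m, limit in lower_limits.items())
    return breached_high or breached_low

upper = {"error_rate": 0.10, "latency_p99_ms": 2000, "customer_complaints": 0}
lower = {"availability": 0.950}

# Healthy readings: the experiment continues.
healthy = {"error_rate": 0.02, "latency_p99_ms": 800,
           "customer_complaints": 0, "availability": 0.998}
print(should_abort(healthy, upper, lower))  # → False
```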

Common Chaos Experiments

1. Service Failure

# Kill a specific microservice
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-user-service
spec:
  action: pod-failure
  mode: one
  duration: "60s"
  selector:
    namespaces:
      - users
    labelSelectors:
      app: user-service

2. Network Partition

# Partition services
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-services
spec:
  action: partition
  mode: all
  duration: "120s"
  selector:
    namespaces:
      - payments
  direction: both
  target:
    selector:
      namespaces:
        - orders

3. Resource Exhaustion

# Memory stress
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
spec:
  mode: one
  duration: "60s"
  selector:
    namespaces:
      - payments
  stressors:
    memory:
      workers: 1
      size: "1GB"

4. Dependency Failure

# External API failure
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-failure
spec:
  action: error
  mode: one
  duration: "30s"
  selector:
    namespaces:
      - payments
  patterns:
    - "api.example.com"

Measuring Chaos Engineering Success

Key Metrics

Metric                 Target       Description
---------------------  -----------  -----------------------------
MTTR                   < 15 min     Mean time to recovery
Experiment frequency   Weekly       How often you run experiments
Blast radius           < 5%         Max affected users
New findings           Increasing   Weaknesses discovered
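MTTR in the table can be computed directly from incident timestamps. A minimal sketch, using made-up incident times:

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to recovery, in minutes, over (start, recovered) pairs."""
    durations = [(end - start).total_seconds() / 60 for start, end in incidents]
    return sum(durations) / len(durations)

incidents = [
    (datetime(2026, 1, 5, 10, 0), datetime(2026, 1, 5, 10, 10)),  # 10 min
    (datetime(2026, 1, 9, 14, 0), datetime(2026, 1, 9, 14, 20)),  # 20 min
]
print(mttr_minutes(incidents))  # → 15.0
```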

The Resilience Score

def calculate_resilience_score(results):
    score = 100
    
    # Deduct for impact
    score -= results.error_rate_increase * 100
    score -= results.latency_increase / 10
    
    # Add for recovery
    if results.auto_recovered:
        score += 10
    
    # Cap at 0-100
    return max(0, min(100, score))
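For a self-contained usage example, assume a small results record whose fields (hypothetical names) match the function above: an experiment that raised the error rate by 2 points, added 150ms of p99 latency, and auto-recovered scores 100 - 2 - 15 + 10 = 93.

```python
from dataclasses import dataclass

@dataclass
class ExperimentResults:
    error_rate_increase: float  # absolute increase, e.g. 0.02 = +2 points
    latency_increase: float     # p99 increase in ms
    auto_recovered: bool

def calculate_resilience_score(results):
    score = 100
    # Deduct for impact
    score -= results.error_rate_increase * 100
    score -= results.latency_increase / 10
    # Add for recovery
    if results.auto_recovered:
        score += 10
    # Clamp to 0-100
    return max(0, min(100, score))

run = ExperimentResults(error_rate_increase=0.02,
                        latency_increase=150,
                        auto_recovered=True)
print(calculate_resilience_score(run))  # → 93.0
```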

Best Practices

  1. Start in non-production: Prove concepts in staging before moving to production
  2. Keep stakeholders informed: Communication is key
  3. Increase complexity incrementally: Build up to more complex experiments
  4. Document everything: Capture learnings from every experiment
  5. Share results: Build organizational knowledge

Anti-Patterns

Avoid these mistakes:

  • โŒ Testing in production without approval
  • โŒ No stop button for experiments
  • โŒ Running experiments during peak hours
  • โŒ Ignoring experiment results
  • โŒ Blaming teams for discovered issues
  • โŒ Testing too frequently (alert fatigue)

Conclusion

Chaos engineering transforms how we think about system reliability. By proactively injecting failures, teams discover weaknesses before customers do. The key is to start small, build observability, and maintain a culture of learning rather than blame.

In 2026, chaos engineering is essential for any organization running complex distributed systems. The question isn’t whether to adopt chaos engineering, but how quickly you can start discovering your system’s hidden weaknesses.
