
Alerting Strategy: Alert Fatigue, Runbooks, Escalation

Introduction

Alert fatigue is real. When engineers receive too many alerts, they ignore them, and critical issues get missed. Effective alerting requires careful design and continuous improvement.

Why it matters (commonly cited industry figures):

  • On-call engineers can receive 100+ alerts per day
  • A majority of alerts (often around 70%) are not actionable
  • Missed or ignored alerts are a recurring factor in outages and breaches
  • SRE teams report spending roughly a quarter of their time handling alerts

Alert Hierarchy

Alert Tiers

Tier 1: Critical (P1)
├── Service down
├── Data loss imminent
├── Security breach
└── Response: Immediate, automated escalation

Tier 2: High (P2)
├── Performance degradation (>50%)
├── Error rate increase (>5%)
└── Response: Within 15 minutes, on-call paged

Tier 3: Medium (P3)
├── Minor anomalies
├── Capacity warnings (>80%)
└── Response: Within 1 hour, ticket created

Tier 4: Low (P4)
├── Informational
├── Scheduled maintenance
└── Response: Next business day
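The tier table above can be encoded directly as a lookup, which keeps routing logic and documentation from drifting apart. A minimal sketch (the dictionary and function names are illustrative, not any specific tool's API):

```python
from datetime import timedelta

# Hypothetical encoding of the tiers above; timings mirror the table.
ALERT_TIERS = {
    "critical": {"tier": 1, "response": timedelta(minutes=0),  "action": "page immediately, auto-escalate"},
    "high":     {"tier": 2, "response": timedelta(minutes=15), "action": "page on-call"},
    "medium":   {"tier": 3, "response": timedelta(hours=1),    "action": "create ticket"},
    "low":      {"tier": 4, "response": timedelta(days=1),     "action": "review next business day"},
}

def response_deadline(severity: str) -> timedelta:
    """Return the maximum time-to-response for a given severity label."""
    return ALERT_TIERS[severity]["response"]
```

Keeping this as data (rather than scattered if/else branches) makes it easy to audit that every severity has exactly one response expectation.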

Alert Design

Good vs Bad Alerts

# BAD ALERTS:
# ❌ Alert on every error
- alert: AnyError
  expr: rate(errors_total[5m]) > 0
  # Problem: false positives, noise

# ❌ No context
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(latency_bucket[5m])) > 1
  # Problem: no severity, no runbook, no suggested action

# ❌ Fragile conditions
- alert: PodRestarted
  expr: changes(kube_pod_container_status_restarts_total[5m]) > 0
  # Problem: too sensitive

# GOOD ALERTS:
# ✅ Clear severity
- alert: ServiceOutage
  expr: up{job="api"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "API service is down"
    description: "API has been down for 2 minutes"
    runbook_url: "https://wiki/runbooks/service-outage"

# ✅ Actionable
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) 
    / 
    sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "Error rate above 5%"
    description: "Error rate is {{ $value | humanizePercentage }}"
    runbook_url: "https://wiki/runbooks/high-error-rate"

# ✅ Meaningful threshold
- alert: DatabaseConnectionPoolExhausted
  expr: |
    sum(database_connections_active)
    /
    sum(database_max_connections) > 0.9
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Database connection pool > 90%"
    description: "Pool utilization is {{ $value | humanizePercentage }}"
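The HighErrorRate rule above reduces to a simple ratio; spelling it out in plain Python (illustrative function, not part of any Prometheus API) makes the threshold easy to sanity-check:

```python
# What the HighErrorRate expression computes: the ratio of the 5xx request
# rate to the total request rate over the evaluation window.
ERROR_THRESHOLD = 0.05  # mirrors the > 0.05 condition in the rule

def error_ratio(rate_5xx: float, rate_total: float) -> float:
    """Fraction of requests failing; guards against a zero denominator."""
    return rate_5xx / rate_total if rate_total else 0.0

def would_fire(rate_5xx: float, rate_total: float) -> bool:
    """True if the condition holds (the `for: 5m` hold still applies)."""
    return error_ratio(rate_5xx, rate_total) > ERROR_THRESHOLD
```

Note that the PromQL version fires only after the condition has held for the full `for: 5m` window, which suppresses short error spikes.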

Runbooks

Runbook Structure

# Runbook Template
runbook:
  title: "Service High Latency"
  id: "HIGH-LATENCY-001"
  severity: "P2"
  impact: "Users experiencing slow response times"
  
  triggers:
    - "P99 latency > 2 seconds for 5 minutes"
    - "P99 latency > 5 seconds for 2 minutes"
  
  steps:
    - name: "Check service health"
      description: |
        1. Check Grafana dashboard: [Link]
        2. Look for recent deployments
        3. Check error rates
      
    - name: "Check dependencies"
      description: |
        1. Check if database is slow
        2. Check if cache is saturated
        3. Check upstream services
      
    - name: "Check capacity"
      description: |
        1. CPU/Memory utilization
        2. Auto-scaling status
        3. Database connections
      
    - name: "Remediation"
      description: |
        1. Scale up if needed: `kubectl scale deployment api --replicas=10`
        2. Rollback if recent deploy: `kubectl rollout undo deployment/api`
        3. Enable caching if DB is bottleneck
      
    - name: "Communications"
      description: |
        1. Update status page
        2. Notify #ops channel
        3. Update on-call if longer than 30 min
  
  automation:
    - name: "Auto-scale"
      action: |
        kubectl autoscale deployment api --min=5 --max=20 --cpu-percent=70
      
    - name: "Auto-rollback"
      trigger: "If deployment within last 15 minutes"
      action: |
        kubectl rollout undo deployment/api
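Runbooks in this shape are easy to lint before they enter a registry. A minimal validation sketch, assuming the template above has been parsed into a Python dict (field names follow the template; the helper is hypothetical):

```python
# Required top-level fields, taken from the runbook template above.
REQUIRED_FIELDS = {"title", "id", "severity", "impact", "triggers", "steps"}

def validate_runbook(runbook: dict) -> list[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - runbook.keys())]
    for i, step in enumerate(runbook.get("steps", [])):
        if "name" not in step:
            errors.append(f"step {i} has no name")
    return errors

example = {
    "title": "Service High Latency", "id": "HIGH-LATENCY-001",
    "severity": "P2", "impact": "Users experiencing slow response times",
    "triggers": ["P99 latency > 2 seconds for 5 minutes"],
    "steps": [{"name": "Check service health"}],
}
```

Running a check like this in CI catches runbooks that drift out of shape before an incident reveals the gap.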

Runbook Automation

#!/usr/bin/env python3
"""Automated runbook execution."""

class RunbookAutomation:
    """Execute runbook steps automatically."""
    
    def __init__(self, k8s_client, alert_manager, runbooks=None):
        self.k8s = k8s_client
        self.alert_manager = alert_manager
        # Registry of runbooks keyed by alert name (illustrative storage).
        self.runbooks = runbooks or {}

    def get_runbook(self, alert_name):
        """Look up the runbook for an alert; returns None if unregistered."""
        return self.runbooks.get(alert_name)
    
    def execute_runbook(self, alert):
        """Execute runbook for alert."""
        
        runbook = self.get_runbook(alert.name)
        
        if not runbook:
            return {"status": "no_runbook", "message": "No runbook found"}
        
        results = []
        
        for step in runbook.steps:
            if step.get("automated"):
                result = self.execute_step(step)
                results.append({
                    "step": step["name"],
                    "status": result["status"],
                    "output": result.get("output")
                })
                
                if result["status"] == "failed":
                    break
        
        return {
            "alert": alert.name,
            "runbook": runbook.title,
            "steps_executed": results
        }
    
    def execute_step(self, step):
        """Execute automated step."""
        
        action = step["action"]
        
        try:
            if action.startswith("kubectl"):
                output = self.k8s.execute(action)
                return {"status": "success", "output": output}
            
            return {"status": "skipped", "message": "Action not implemented"}
        
        except Exception as e:
            return {"status": "failed", "error": str(e)}
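Before wiring automation like this to a live cluster, it is worth gating which commands may run unattended. A sketch of an allow-list guard (the verb list is an assumption; tailor it to your risk tolerance):

```python
import shlex

# Hypothetical allow-list of kubectl verbs the automation may run
# without a human in the loop. Destructive verbs are deliberately absent.
SAFE_VERBS = {"get", "describe", "rollout", "scale", "autoscale"}

def is_safe_action(action: str) -> bool:
    """Return True only for kubectl commands using allow-listed verbs."""
    parts = shlex.split(action)
    return len(parts) >= 2 and parts[0] == "kubectl" and parts[1] in SAFE_VERBS
```

A guard like this would slot into `execute_step` before the command is handed to the cluster client, turning anything unexpected into a "skipped" result rather than an unattended mutation.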

Escalation

Escalation Policy

# Escalation Policy
escalation_policy:
  name: "Standard On-Call Escalation"
  levels:
    - level: 1
      on_call: "Primary On-Call"
      timeout: 15m
      notify:
        - type: "pagerduty"
          target: "primary"
        - type: "sms"
          target: "phone"
        - type: "email"
          target: "email"
    
    - level: 2
      on_call: "Secondary On-Call"
      timeout: 15m
      notify:
        - type: "pagerduty"
          target: "secondary"
    
    - level: 3
      on_call: "Engineering Manager"
      timeout: 30m
      notify:
        - type: "phone"
        - type: "slack"
          channel: "#incidents"
    
    - level: 4
      on_call: "VP Engineering"
      timeout: 60m
      notify:
        - type: "email"

# Severity-based routing
routing_rules:
  - severity: "critical"
    policy: "Standard On-Call Escalation"
    auto_escalate: true
    
  - severity: "high"
    policy: "Standard On-Call Escalation"
    auto_escalate: true
    
  - severity: "medium"
    policy: "Standard On-Call Escalation"
    auto_escalate: false
    
  - severity: "low"
    policy: "No Escalation"
    auto_escalate: false
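The escalation walk implied by the policy above is mechanical: each level holds the alert for its timeout, then hands off. A sketch of that logic (level names and timeouts mirror the policy; the function is illustrative):

```python
# Levels from the policy above as (name, timeout_minutes) pairs.
LEVELS = [
    ("Primary On-Call", 15),
    ("Secondary On-Call", 15),
    ("Engineering Manager", 30),
    ("VP Engineering", 60),
]

def current_level(elapsed_minutes: int) -> str:
    """Who owns an alert that has gone unacknowledged for `elapsed_minutes`."""
    for name, timeout in LEVELS:
        if elapsed_minutes < timeout:
            return name
        elapsed_minutes -= timeout  # this level's window has been consumed
    return LEVELS[-1][0]  # stay at the top level once all timeouts pass
```

So an alert unacknowledged for 20 minutes sits with the secondary on-call, and one unacknowledged past 60 minutes has reached the engineering manager.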

Alert Reduction

SLO-Based Alerting

# Alert only when SLO is at risk
groups:
  - name: "slo-alerts"
    rules:
      # Alert when burn rate is high
      - alert: "SLOBurnRateHigh"
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) 
            / 
            sum(rate(http_requests_total[1h]))
          ) 
          / (1 - 0.999) > 1.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SLO burn rate is high"
          description: "Error budget is being consumed {{ $value | humanize }} times faster than sustainable"
      
      # Alert when budget exhausted
      - alert: "SLOBudgetExhausted"
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[30d])) 
            / 
            sum(rate(http_requests_total[30d]))
          ) 
          / (1 - 0.999) >= 1
        labels:
          severity: critical
        annotations:
          summary: "SLO budget exhausted!"
          description: "Error budget has been exhausted. Immediate action required."
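The burn-rate arithmetic behind these rules is worth seeing in the open. For a 99.9% SLO over a 30-day window, the error budget is 0.1% of requests, and the burn rate is the observed error ratio divided by that budget:

```python
# Burn-rate math for a 99.9% SLO over a 30-day window, matching the
# (error_ratio) / (1 - 0.999) expressions in the rules above.
SLO = 0.999
WINDOW_DAYS = 30

def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """How many times faster than sustainable the budget is burning."""
    return error_ratio / (1 - slo)

def days_to_exhaustion(error_ratio: float) -> float:
    """At the current rate, days until the windowed budget is gone."""
    return WINDOW_DAYS / burn_rate(error_ratio)
```

For example, a sustained 0.15% error rate against a 0.1% budget is a burn rate of 1.5, exhausting the 30-day budget in 20 days; that is exactly the threshold where SLOBurnRateHigh starts firing.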
