Introduction
Alert fatigue is real. When engineers receive too many alerts, they start ignoring them, and critical issues slip through unnoticed. Effective alerting requires careful design and continuous improvement.
Key statistics (commonly cited industry figures; exact numbers vary by survey):
- The average on-call engineer receives 100+ alerts daily
- Around 70% of alerts are not actionable
- Alert fatigue is a contributing factor in a significant share of missed incidents
- SRE teams can spend roughly a quarter of their time handling alerts
Alert Hierarchy
┌──────────────────────────────────────────────────┐
│                   Alert Tiers                    │
├──────────────────────────────────────────────────┤
│                                                  │
│  Tier 1: Critical (P1)                           │
│  ├── Service down                                │
│  ├── Data loss imminent                          │
│  ├── Security breach                             │
│  └── Response: Immediate, automated escalation   │
│                                                  │
│  Tier 2: High (P2)                               │
│  ├── Performance degradation (>50%)              │
│  ├── Error rate increase (>5%)                   │
│  └── Response: Within 15 minutes, on-call paged  │
│                                                  │
│  Tier 3: Medium (P3)                             │
│  ├── Minor anomalies                             │
│  ├── Capacity warnings (>80%)                    │
│  └── Response: Within 1 hour, ticket created     │
│                                                  │
│  Tier 4: Low (P4)                                │
│  ├── Informational                               │
│  ├── Scheduled maintenance                      │
│  └── Response: Next business day                 │
│                                                  │
└──────────────────────────────────────────────────┘
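The tiers above map cleanly to code: an alert's severity determines its priority and response window. A minimal sketch of that mapping, assuming a `severity` label like the one used in the Prometheus rules later in this section (the `Tier` type and `response_policy` helper are illustrative, not from any library):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tier:
    priority: str
    response: str
    window_minutes: Optional[int]  # None = respond immediately


# Encodes the tier table above; keys match alert severity labels.
TIERS = {
    "critical": Tier("P1", "Immediate, automated escalation", None),
    "high": Tier("P2", "On-call paged", 15),
    "medium": Tier("P3", "Ticket created", 60),
    "low": Tier("P4", "Next business day", 24 * 60),
}


def response_policy(severity: str) -> Tier:
    """Map an alert severity to its tier; unknown severities default to low."""
    return TIERS.get(severity, TIERS["low"])
```

Defaulting unknown severities to the low tier keeps a mislabeled alert from paging anyone; defaulting to critical instead is the safer choice if a missed page is worse than noise.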
Alert Design
Good vs Bad Alerts
# BAD ALERTS:

# ❌ Alert on every error
- alert: AnyError
  expr: rate(errors_total[5m]) > 0
  # Problem: fires on any error at all; false positives, noise

# ❌ No context
- alert: HighLatency
  expr: histogram_quantile(0.99, sum(rate(latency_bucket[5m])) by (le)) > 1
  # Problem: no severity, no runbook, no suggested action

# ❌ Fragile condition
- alert: PodRestarted
  expr: changes(kube_pod_container_status_restarts_total[5m]) > 0
  # Problem: a single restart pages someone; too sensitive
# GOOD ALERTS:

# ✅ Clear severity
- alert: ServiceOutage
  expr: up{job="api"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "API service is down"
    description: "API has been down for 2 minutes"
    runbook_url: "https://wiki/runbooks/service-outage"

# ✅ Actionable
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "Error rate above 5%"
    description: "Error rate is {{ $value | humanizePercentage }}"
    runbook_url: "https://wiki/runbooks/high-error-rate"

# ✅ Meaningful threshold
- alert: DatabaseConnectionPoolExhausted
  expr: |
    sum(database_connections_in_use)
    /
    sum(database_max_connections) > 0.9
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Database connection pool > 90%"
    description: "Pool utilization is {{ $value | humanizePercentage }}"
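The difference between the bad and good rules above is mechanical enough to check in CI: every rule should carry a `for` duration, a severity label, and summary/runbook annotations. A sketch of such a linter, assuming rules have already been parsed into dicts shaped like the YAML above (the `lint_alert_rule` name is made up):

```python
def lint_alert_rule(rule: dict) -> list:
    """Return a list of problems with a parsed Prometheus alert rule."""
    problems = []
    if "for" not in rule:
        problems.append("missing 'for': rule may flap on transient spikes")
    if rule.get("labels", {}).get("severity") is None:
        problems.append("missing severity label")
    annotations = rule.get("annotations", {})
    if "runbook_url" not in annotations:
        problems.append("missing runbook_url annotation")
    if "summary" not in annotations:
        problems.append("missing summary annotation")
    return problems
```

Running this over every parsed rules file in CI blocks the "no context" class of alerts before they ever reach an on-call engineer.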
Runbooks
Runbook Structure
# Runbook Template
runbook:
  title: "Service High Latency"
  id: "HIGH-LATENCY-001"
  severity: "P2"
  impact: "Users experiencing slow response times"

  triggers:
    - "P99 latency > 2 seconds for 5 minutes"
    - "P99 latency > 5 seconds for 2 minutes"

  steps:
    - name: "Check service health"
      description: |
        1. Check Grafana dashboard: [Link]
        2. Look for recent deployments
        3. Check error rates
    - name: "Check dependencies"
      description: |
        1. Check if database is slow
        2. Check if cache is saturated
        3. Check upstream services
    - name: "Check capacity"
      description: |
        1. CPU/Memory utilization
        2. Auto-scaling status
        3. Database connections
    - name: "Remediation"
      description: |
        1. Scale up if needed: `kubectl scale deployment api --replicas=10`
        2. Rollback if recent deploy: `kubectl rollout undo deployment/api`
        3. Enable caching if DB is bottleneck
    - name: "Communications"
      description: |
        1. Update status page
        2. Notify #ops channel
        3. Update on-call if longer than 30 min

  automation:
    - name: "Auto-scale"
      action: |
        kubectl autoscale deployment api --min=5 --max=20 --cpu-percent=70
    - name: "Auto-rollback"
      trigger: "If deployment within last 15 minutes"
      action: |
        kubectl rollout undo deployment/api
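A template only helps if runbooks actually follow it. A small structural check over a parsed runbook, mirroring the required fields of the template above (`validate_runbook` is an illustrative helper, not a standard tool):

```python
# Required fields taken from the runbook template above.
REQUIRED_FIELDS = ("title", "id", "severity", "impact", "triggers", "steps")


def validate_runbook(runbook: dict) -> list:
    """Return the template fields a runbook is missing or leaves empty."""
    return [field for field in REQUIRED_FIELDS if not runbook.get(field)]
```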
Runbook Automation
#!/usr/bin/env python3
"""Automated runbook execution."""


class RunbookAutomation:
    """Execute runbook steps automatically."""

    def __init__(self, k8s_client, alert_manager, runbooks=None):
        self.k8s = k8s_client
        self.alert_manager = alert_manager
        self.runbooks = runbooks or {}  # alert name -> runbook

    def get_runbook(self, alert_name):
        """Look up the runbook registered for an alert."""
        return self.runbooks.get(alert_name)

    def execute_runbook(self, alert):
        """Execute the runbook for an alert, one automated step at a time."""
        runbook = self.get_runbook(alert.name)
        if not runbook:
            return {"status": "no_runbook", "message": "No runbook found"}
        results = []
        for step in runbook.steps:
            if step.get("automated"):
                result = self.execute_step(step)
                results.append({
                    "step": step["name"],
                    "status": result["status"],
                    "output": result.get("output"),
                })
                if result["status"] == "failed":
                    break  # stop on first failure; leave the rest to a human

        return {
            "alert": alert.name,
            "runbook": runbook.title,
            "steps_executed": results,
        }

    def execute_step(self, step):
        """Execute a single automated step."""
        action = step["action"]
        try:
            if action.startswith("kubectl"):
                output = self.k8s.execute(action)
                return {"status": "success", "output": output}
            return {"status": "skipped", "message": "Action not implemented"}
        except Exception as e:
            return {"status": "failed", "error": str(e)}
Escalation
Escalation Policy
# Escalation Policy
escalation_policy:
  name: "Standard On-Call Escalation"
  levels:
    - level: 1
      on_call: "Primary On-Call"
      timeout: 15m
      notify:
        - type: "pagerduty"
          target: "primary"
        - type: "sms"
          target: "phone"
        - type: "email"
          target: "email"
    - level: 2
      on_call: "Secondary On-Call"
      timeout: 15m
      notify:
        - type: "pagerduty"
          target: "secondary"
    - level: 3
      on_call: "Engineering Manager"
      timeout: 30m
      notify:
        - type: "phone"
        - type: "slack"
          channel: "#incidents"
    - level: 4
      on_call: "VP Engineering"
      timeout: 60m
      notify:
        - type: "email"

# Severity-based routing
routing_rules:
  - severity: "critical"
    policy: "Standard On-Call Escalation"
    auto_escalate: true
  - severity: "high"
    policy: "Standard On-Call Escalation"
    auto_escalate: true
  - severity: "medium"
    policy: "Standard On-Call Escalation"
    auto_escalate: false
  - severity: "low"
    policy: "No Escalation"
    auto_escalate: false
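The level timeouts above compose into a simple schedule: level 1 holds the alert from t=0, level 2 takes over after 15 minutes, level 3 after 30, level 4 after 60. A sketch of that lookup (the level names and timeouts mirror the policy; the `current_level` helper is made up):

```python
# Escalation schedule derived from the policy's per-level timeouts above.
LEVELS = [
    ("Primary On-Call", 15),
    ("Secondary On-Call", 15),
    ("Engineering Manager", 30),
    ("VP Engineering", 60),
]


def current_level(minutes_since_alert, auto_escalate=True):
    """Return who should currently hold an unacknowledged alert."""
    if not auto_escalate:
        return LEVELS[0][0]  # stays with the primary until acknowledged
    elapsed = minutes_since_alert
    for on_call, timeout in LEVELS:
        if elapsed < timeout:
            return on_call
        elapsed -= timeout  # this level's window expired; escalate
    return LEVELS[-1][0]  # the last level holds it indefinitely
```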
Alert Reduction
SLO-Based Alerting
# Alert only when the SLO is at risk
groups:
  - name: "slo-alerts"
    rules:
      # Alert when the error-budget burn rate is high
      - alert: "SLOBurnRateHigh"
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )
          / (1 - 0.999) > 1.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SLO burn rate is high"
          description: "Error budget is burning {{ $value | humanize }}x faster than the sustainable rate"

      # Alert when the budget is exhausted
      - alert: "SLOBudgetExhausted"
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[30d]))
            /
            sum(rate(http_requests_total[30d]))
          )
          / (1 - 0.999) >= 1
        labels:
          severity: critical
        annotations:
          summary: "SLO budget exhausted!"
          description: "Error budget has been exhausted. Immediate action required."
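The burn-rate expression is (observed error rate) / (error budget), where the budget for a 99.9% SLO is 1 - 0.999 = 0.001. A burn rate of 1 consumes the budget in exactly the 30-day window; at burn rate B it lasts 30/B days. A quick check of that arithmetic (function names are illustrative):

```python
def burn_rate(error_rate, slo=0.999):
    """Ratio of the observed error rate to the SLO's error budget."""
    return error_rate / (1 - slo)


def days_until_exhaustion(rate, window_days=30):
    """At a constant burn rate, how long the window's error budget lasts."""
    return window_days / rate


# A 0.3% error rate against a 99.9% SLO burns budget ~3x too fast,
# so the 30-day budget is gone in ~10 days, well past the 1.5x threshold.
b = burn_rate(0.003)
days = days_until_exhaustion(b)
```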