Introduction
Alert fatigue is real. When engineers receive too many alerts, they start ignoring them, and critical issues slip through unnoticed. Effective alerting requires careful design and continuous improvement.
Key statistics (commonly cited industry figures; exact numbers vary by survey):
- The average on-call engineer receives 100+ alerts daily
- Around 70% of alerts are not actionable
- Alert fatigue is a contributing factor in a significant share of missed incidents
- SRE teams can spend roughly a quarter of their time handling alerts
Alert Hierarchy
┌──────────────────────────────────────────────────┐
│                   Alert Tiers                    │
├──────────────────────────────────────────────────┤
│                                                  │
│  Tier 1: Critical (P1)                           │
│  ├── Service down                                │
│  ├── Data loss imminent                          │
│  ├── Security breach                             │
│  └── Response: Immediate, automated escalation   │
│                                                  │
│  Tier 2: High (P2)                               │
│  ├── Performance degradation (>50%)              │
│  ├── Error rate increase (>5%)                   │
│  └── Response: Within 15 minutes, on-call paged  │
│                                                  │
│  Tier 3: Medium (P3)                             │
│  ├── Minor anomalies                             │
│  ├── Capacity warnings (>80%)                    │
│  └── Response: Within 1 hour, ticket created     │
│                                                  │
│  Tier 4: Low (P4)                                │
│  ├── Informational                               │
│  ├── Scheduled maintenance                      │
│  └── Response: Next business day                 │
│                                                  │
└──────────────────────────────────────────────────┘
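The tiers above map cleanly to code: an alert's severity determines its priority and response window. A minimal sketch of that mapping, assuming a `severity` label like the one used in the Prometheus rules later in this section (the `Tier` type and `response_policy` helper are illustrative, not from any library):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tier:
    priority: str
    response: str
    window_minutes: Optional[int]  # None = respond immediately


# Encodes the tier table above; keys match alert severity labels.
TIERS = {
    "critical": Tier("P1", "Immediate, automated escalation", None),
    "high": Tier("P2", "On-call paged", 15),
    "medium": Tier("P3", "Ticket created", 60),
    "low": Tier("P4", "Next business day", 24 * 60),
}


def response_policy(severity: str) -> Tier:
    """Map an alert severity to its tier; unknown severities default to low."""
    return TIERS.get(severity, TIERS["low"])
```

Defaulting unknown severities to the low tier keeps a mislabeled alert from paging anyone; defaulting to critical instead is the safer choice if a missed page is worse than noise.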
Alert Design
Good vs Bad Alerts
# BAD ALERTS:

# ❌ Alert on every error
- alert: AnyError
  expr: rate(errors_total[5m]) > 0
  # Problem: fires on any error at all; false positives, noise

# ❌ No context
- alert: HighLatency
  expr: histogram_quantile(0.99, sum(rate(latency_bucket[5m])) by (le)) > 1
  # Problem: no severity, no runbook, no suggested action

# ❌ Fragile condition
- alert: PodRestarted
  expr: changes(kube_pod_container_status_restarts_total[5m]) > 0
  # Problem: a single restart pages someone; too sensitive
# GOOD ALERTS:

# ✅ Clear severity
- alert: ServiceOutage
  expr: up{job="api"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "API service is down"
    description: "API has been down for 2 minutes"
    runbook_url: "https://wiki/runbooks/service-outage"

# ✅ Actionable
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "Error rate above 5%"
    description: "Error rate is {{ $value | humanizePercentage }}"
    runbook_url: "https://wiki/runbooks/high-error-rate"

# ✅ Meaningful threshold
- alert: DatabaseConnectionPoolExhausted
  expr: |
    sum(database_connections_in_use)
    /
    sum(database_max_connections) > 0.9
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Database connection pool > 90%"
    description: "Pool utilization is {{ $value | humanizePercentage }}"
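The difference between the bad and good rules above is mechanical enough to check in CI: every rule should carry a `for` duration, a severity label, and summary/runbook annotations. A sketch of such a linter, assuming rules have already been parsed into dicts shaped like the YAML above (the `lint_alert_rule` name is made up):

```python
def lint_alert_rule(rule: dict) -> list:
    """Return a list of problems with a parsed Prometheus alert rule."""
    problems = []
    if "for" not in rule:
        problems.append("missing 'for': rule may flap on transient spikes")
    if rule.get("labels", {}).get("severity") is None:
        problems.append("missing severity label")
    annotations = rule.get("annotations", {})
    if "runbook_url" not in annotations:
        problems.append("missing runbook_url annotation")
    if "summary" not in annotations:
        problems.append("missing summary annotation")
    return problems
```

Running this over every parsed rules file in CI blocks the "no context" class of alerts before they ever reach an on-call engineer.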
Runbooks
Runbook Structure
# Runbook Template
runbook:
  title: "Service High Latency"
  id: "HIGH-LATENCY-001"
  severity: "P2"
  impact: "Users experiencing slow response times"

  triggers:
    - "P99 latency > 2 seconds for 5 minutes"
    - "P99 latency > 5 seconds for 2 minutes"

  steps:
    - name: "Check service health"
      description: |
        1. Check Grafana dashboard: [Link]
        2. Look for recent deployments
        3. Check error rates
    - name: "Check dependencies"
      description: |
        1. Check if database is slow
        2. Check if cache is saturated
        3. Check upstream services
    - name: "Check capacity"
      description: |
        1. CPU/Memory utilization
        2. Auto-scaling status
        3. Database connections
    - name: "Remediation"
      description: |
        1. Scale up if needed: `kubectl scale deployment api --replicas=10`
        2. Rollback if recent deploy: `kubectl rollout undo deployment/api`
        3. Enable caching if DB is bottleneck
    - name: "Communications"
      description: |
        1. Update status page
        2. Notify #ops channel
        3. Update on-call if longer than 30 min

  automation:
    - name: "Auto-scale"
      action: |
        kubectl autoscale deployment api --min=5 --max=20 --cpu-percent=70
    - name: "Auto-rollback"
      trigger: "If deployment within last 15 minutes"
      action: |
        kubectl rollout undo deployment/api
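A template only helps if runbooks actually follow it. A small structural check over a parsed runbook, mirroring the required fields of the template above (`validate_runbook` is an illustrative helper, not a standard tool):

```python
# Required fields taken from the runbook template above.
REQUIRED_FIELDS = ("title", "id", "severity", "impact", "triggers", "steps")


def validate_runbook(runbook: dict) -> list:
    """Return the template fields a runbook is missing or leaves empty."""
    return [field for field in REQUIRED_FIELDS if not runbook.get(field)]
```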
Runbook Automation
#!/usr/bin/env python3
"""Automated runbook execution."""


class RunbookAutomation:
    """Execute runbook steps automatically."""

    def __init__(self, k8s_client, alert_manager, runbooks=None):
        self.k8s = k8s_client
        self.alert_manager = alert_manager
        self.runbooks = runbooks or {}  # alert name -> runbook

    def get_runbook(self, alert_name):
        """Look up the runbook registered for an alert."""
        return self.runbooks.get(alert_name)

    def execute_runbook(self, alert):
        """Execute the runbook for an alert, one automated step at a time."""
        runbook = self.get_runbook(alert.name)
        if not runbook:
            return {"status": "no_runbook", "message": "No runbook found"}
        results = []
        for step in runbook.steps:
            if step.get("automated"):
                result = self.execute_step(step)
                results.append({
                    "step": step["name"],
                    "status": result["status"],
                    "output": result.get("output"),
                })
                if result["status"] == "failed":
                    break  # stop on first failure; leave the rest to a human

        return {
            "alert": alert.name,
            "runbook": runbook.title,
            "steps_executed": results,
        }

    def execute_step(self, step):
        """Execute a single automated step."""
        action = step["action"]
        try:
            if action.startswith("kubectl"):
                output = self.k8s.execute(action)
                return {"status": "success", "output": output}
            return {"status": "skipped", "message": "Action not implemented"}
        except Exception as e:
            return {"status": "failed", "error": str(e)}
Escalation
Escalation Policy
# Escalation Policy
escalation_policy:
  name: "Standard On-Call Escalation"
  levels:
    - level: 1
      on_call: "Primary On-Call"
      timeout: 15m
      notify:
        - type: "pagerduty"
          target: "primary"
        - type: "sms"
          target: "phone"
        - type: "email"
          target: "email"
    - level: 2
      on_call: "Secondary On-Call"
      timeout: 15m
      notify:
        - type: "pagerduty"
          target: "secondary"
    - level: 3
      on_call: "Engineering Manager"
      timeout: 30m
      notify:
        - type: "phone"
        - type: "slack"
          channel: "#incidents"
    - level: 4
      on_call: "VP Engineering"
      timeout: 60m
      notify:
        - type: "email"

# Severity-based routing
routing_rules:
  - severity: "critical"
    policy: "Standard On-Call Escalation"
    auto_escalate: true
  - severity: "high"
    policy: "Standard On-Call Escalation"
    auto_escalate: true
  - severity: "medium"
    policy: "Standard On-Call Escalation"
    auto_escalate: false
  - severity: "low"
    policy: "No Escalation"
    auto_escalate: false
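The level timeouts above compose into a simple schedule: level 1 holds the alert from t=0, level 2 takes over after 15 minutes, level 3 after 30, level 4 after 60. A sketch of that lookup (the level names and timeouts mirror the policy; the `current_level` helper is made up):

```python
# Escalation schedule derived from the policy's per-level timeouts above.
LEVELS = [
    ("Primary On-Call", 15),
    ("Secondary On-Call", 15),
    ("Engineering Manager", 30),
    ("VP Engineering", 60),
]


def current_level(minutes_since_alert, auto_escalate=True):
    """Return who should currently hold an unacknowledged alert."""
    if not auto_escalate:
        return LEVELS[0][0]  # stays with the primary until acknowledged
    elapsed = minutes_since_alert
    for on_call, timeout in LEVELS:
        if elapsed < timeout:
            return on_call
        elapsed -= timeout  # this level's window expired; escalate
    return LEVELS[-1][0]  # the last level holds it indefinitely
```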
Alert Reduction
SLO-Based Alerting
# Alert only when the SLO is at risk
groups:
  - name: "slo-alerts"
    rules:
      # Alert when the error-budget burn rate is high
      - alert: "SLOBurnRateHigh"
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )
          / (1 - 0.999) > 1.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SLO burn rate is high"
          description: "Error budget is burning {{ $value | humanize }}x faster than the sustainable rate"

      # Alert when the budget is exhausted
      - alert: "SLOBudgetExhausted"
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[30d]))
            /
            sum(rate(http_requests_total[30d]))
          )
          / (1 - 0.999) >= 1
        labels:
          severity: critical
        annotations:
          summary: "SLO budget exhausted!"
          description: "Error budget has been exhausted. Immediate action required."
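The burn-rate expression is (observed error rate) / (error budget), where the budget for a 99.9% SLO is 1 - 0.999 = 0.001. A burn rate of 1 consumes the budget in exactly the 30-day window; at burn rate B it lasts 30/B days. A quick check of that arithmetic (function names are illustrative):

```python
def burn_rate(error_rate, slo=0.999):
    """Ratio of the observed error rate to the SLO's error budget."""
    return error_rate / (1 - slo)


def days_until_exhaustion(rate, window_days=30):
    """At a constant burn rate, how long the window's error budget lasts."""
    return window_days / rate


# A 0.3% error rate against a 99.9% SLO burns budget ~3x too fast,
# so the 30-day budget is gone in ~10 days, well past the 1.5x threshold.
b = burn_rate(0.003)
days = days_until_exhaustion(b)
```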