Alerting Strategy: Reducing Alert Fatigue and Building Effective Alerts
TL;DR: This guide covers building an effective alerting strategy. Learn alert types, severity levels, runbooks, reducing alert fatigue, and creating actionable alerts.
Introduction
Effective alerting is critical but challenging:
- Too many alerts → alert fatigue
- Too few alerts → missed incidents
- Poor alerts → wasted time
Goal: alerts that are actionable, where every page demands a response someone can actually carry out.
Alert Types
1. Prometheus-Based Alerts
```yaml
groups:
  - name: application-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
      # High latency: aggregate buckets by le before taking the quantile
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
```
2. Recording Rules
```yaml
groups:
  - name: application-recording
    rules:
      # Precompute common queries so dashboards and alerts stay cheap
      - record: service:http_requests:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (service)
      - record: service:http_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      # The by (le) clause must sit inside histogram_quantile's argument
      - record: service:http_latency:p95
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```
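Alerts can then be written against the precomputed series, which keeps rule expressions short and evaluation cheap. A sketch, assuming the recording rules above; the 5% threshold is illustrative:

```yaml
# Sketch: alert on recorded series instead of repeating the raw query
- alert: HighErrorRatio
  expr: |
    service:http_errors:rate5m / service:http_requests:rate5m > 0.05
  for: 5m
  labels:
    severity: warning
```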
Severity Levels
Severity Definition
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| P1 - Critical | Service down, data loss | < 15 min | Service unavailable, security breach |
| P2 - High | Major impact, degradation | < 1 hour | High error rate, slow response |
| P3 - Medium | Minor impact | < 4 hours | Slow queries, minor issues |
| P4 - Low | Informational | Next business day | Disk space warning, deprecation |
Severity Configuration
```yaml
labels:
  severity: critical
  # Used for routing in Alertmanager
  team: on-call
  escalation_policy: default
```
Runbooks
Runbook Structure
````markdown
# Runbook: High Error Rate

## Description
Fires when the error rate exceeds 5% for 5 minutes.

## Impact
- Users experiencing failures
- Potential data loss

## Symptoms
- HTTP 5xx errors
- Customer complaints

## Verification Steps
1. Check Kibana for error logs
2. Review recent deployments
3. Check database connectivity

## Remediation Steps
1. Identify the error pattern in the logs
2. If caused by a recent deploy:
   a. Roll back to the previous version
   b. Deploy a fix
3. If it is a database issue:
   a. Check the connection pool
   b. Restart the database if needed

## Rollback Steps
```
kubectl rollout undo deployment/app
```
````
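The first remediation step, identifying the error pattern, can be sketched as a one-liner against an access log. The log path and line format here are hypothetical:

```shell
# Hypothetical access log lines: METHOD PATH STATUS
printf '%s\n' \
  'GET /api/users 200' \
  'GET /api/orders 500' \
  'POST /api/orders 503' \
  'GET /health 200' > /tmp/access.log

# Count 5xx responses per path to spot the failing endpoint
awk '$3 ~ /^5/ {print $2}' /tmp/access.log | sort | uniq -c | sort -rn
```

Here the count surfaces `/api/orders` as the endpoint producing all the 5xx responses, which narrows the search before anyone opens Kibana.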
Linking Runbooks
```yaml
annotations:
  runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard_url: "https://grafana.example.com/d/123"
  message: "Check error logs in Kibana"
```
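Notification templates can then surface those annotations so the link lands in the page itself. A sketch for a Slack receiver, using Alertmanager's Go templating; the annotation names assume the block above:

```yaml
slack_configs:
  - channel: '#alerts'
    title: 'Alert: {{ .GroupLabels.alertname }}'
    text: >-
      {{ range .Alerts }}{{ .Annotations.summary }}
      Runbook: {{ .Annotations.runbook_url }}
      {{ end }}
```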
Reducing Alert Fatigue
1. Use the `for:` Clause

```yaml
# Alert only if the condition persists for 5 minutes
- alert: HighErrorRate
  expr: error_rate > 0.05
  for: 5m  # Prevents brief spikes from triggering pages
```
2. Noise Reduction Techniques
```yaml
# Group similar alerts
- alert: HighErrorRate
  expr: error_rate > 0.05
  for: 5m
  labels:
    group: application
  annotations:
    # Template labels into the summary so one alert names the affected service
    summary: "High error rate on {{ $labels.service }}"
```
3. Dead Man's Switch
An expression like `up{job="prometheus"} == 0` cannot catch Prometheus itself being down, because a dead Prometheus evaluates no rules. A dead man's switch inverts the logic: an always-firing alert is routed to an external service that pages when the heartbeat stops arriving.

```yaml
# Always-firing heartbeat; an external dead man's switch
# service pages when this notification stops arriving
- alert: DeadMansSwitch
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: "Heartbeat for the alerting pipeline; silence means it is broken"
```
4. Repeat Intervals
`repeat_interval` controls how often Alertmanager re-sends a still-firing alert. It belongs in the Alertmanager routing configuration, not in the alert rule's annotations.

```yaml
# Alert rule (Prometheus)
- alert: ServiceDown
  expr: up{job="my-service"} == 0
  for: 1m
  labels:
    severity: critical
```

```yaml
# alertmanager.yml: re-notify critical alerts at most once per hour
routes:
  - match:
      severity: critical
    receiver: pagerduty
    repeat_interval: 1h
```
Alert Routing
Routing Configuration
```yaml
route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default
  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
    # Warning alerts to Slack
    - match:
        severity: warning
      receiver: slack
    # Database alerts to the DBA team
    - match:
        component: database
      receiver: dba-team

receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
  - name: slack
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
```
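Routing pairs well with inhibition: while a critical alert for a service is firing, related warnings add no information and can be suppressed. A sketch using Alertmanager's `inhibit_rules` (matcher syntax as in Alertmanager 0.22+):

```yaml
inhibit_rules:
  # Suppress a service's warnings while its critical alert fires
  - source_matchers: ['severity = critical']
    target_matchers: ['severity = warning']
    equal: ['service']
```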
On-Call Best Practices
Escalation Policy
```yaml
# Escalation policy (the exact shape varies by paging tool)
escalation_policies:
  - name: default
    steps:
      - delay: 15m
        receiver: on-call-primary
      - delay: 30m
        receiver: on-call-secondary
      - delay: 1h
        receiver: manager
```
Handoff Process
- Pre-handoff: Review active alerts
- Handoff meeting: 15-minute overlap
- Transfer: Explicitly transfer ownership
- Documentation: Update on-call log
Metrics for Alerting
Alert Quality Metrics
| Metric | Target | Description |
|---|---|---|
| Alerts per day | < 50 | Manageable alert volume |
| Alert accuracy | > 90% | Alerts requiring action |
| MTTR | < 1 hour | Time to resolution |
| False positive rate | < 10% | Alerts with no action |
| On-call load | < 8 hours/week | Sustainable pace |
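Some of these metrics can be measured from Prometheus itself via the built-in `ALERTS` series. A sketch for tracking daily alert volume; the recording-rule name is illustrative:

```yaml
groups:
  - name: alert-quality
    rules:
      # Distinct alerts that fired over the last day, per alertname
      - record: alerts:fired:count1d
        expr: |
          count(count_over_time(ALERTS{alertstate="firing"}[1d])) by (alertname)
```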
Conclusion
Effective alerting requires:
- Clear severity levels - Match response to impact
- Actionable alerts - Each alert needs response
- Runbooks - Document remediation
- Reduce noise - Use the `for:` clause and grouping
- Continuous improvement - Review and tune alerts