
Alerting Strategy: Reducing Alert Fatigue and Building Effective Alerts


TL;DR: This guide covers building an effective alerting strategy. Learn alert types, severity levels, runbooks, reducing alert fatigue, and creating actionable alerts.


Introduction

Effective alerting is critical but challenging:

  • Too many alerts → Alert fatigue
  • Too few alerts → Missed incidents
  • Poor alerts → Wasted time

Goal: Alerts should be actionable; every alert demands a specific human response.


Alert Types

1. Prometheus-Based Alerts

groups:
  - name: application-alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) 
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
          
      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
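
The HighErrorRate expression is just a ratio of rates compared against a 5% threshold. As a sanity check, the same condition can be sketched in Python (the request counts are hypothetical, not real Prometheus data):

```python
# Sketch of the HighErrorRate condition: 5xx rate / total rate > 5%
def error_rate(status_counts: dict[str, float]) -> float:
    """status_counts maps HTTP status class ('2xx', '5xx', ...) to requests/sec."""
    total = sum(status_counts.values())
    errors = sum(v for k, v in status_counts.items() if k.startswith("5"))
    return errors / total if total else 0.0

def should_alert(status_counts: dict[str, float], threshold: float = 0.05) -> bool:
    return error_rate(status_counts) > threshold

# 40 req/s of 2xx plus 3 req/s of 5xx is ~7% errors, above the 5% threshold
print(should_alert({"2xx": 40.0, "5xx": 3.0}))  # True
```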

2. Recording Rules

groups:
  - name: application-recording
    rules:
      # Recording rule for common queries
      - record: service:http_requests:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (service)
          
      - record: service:http_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          
      - record: service:http_latency:p95
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
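
Recording rules pay off when alert expressions reuse them. For example, the HighErrorRate alert from above could be rewritten against the precomputed series (a sketch, assuming the recording rules in this section are loaded):

```yaml
# HighErrorRate rewritten to use the recorded series, per service
- alert: HighErrorRate
  expr: |
    service:http_errors:rate5m / service:http_requests:rate5m > 0.05
  for: 5m
  labels:
    severity: critical
```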

Severity Levels

Severity Definition

| Severity | Description | Response Time | Examples |
|---|---|---|---|
| P1 - Critical | Service down, data loss | < 15 min | Service unavailable, security breach |
| P2 - High | Major impact, degradation | < 1 hour | High error rate, slow response |
| P3 - Medium | Minor impact | < 4 hours | Slow queries, minor issues |
| P4 - Low | Informational | Next business day | Disk space warning, deprecation |

Severity Configuration

labels:
  severity: critical
  # routing
  team: on-call
  escalation_policy: default

Runbooks

Runbook Structure

# Runbook: High Error Rate

## Description
When error rate exceeds 5% for 5 minutes.

## Impact
- Users experiencing failures
- Potential data loss

## Symptoms
- HTTP 5xx errors
- Customer complaints

## Verification Steps
1. Check Kibana for error logs
2. Review recent deployments
3. Check database connectivity

## Remediation Steps
1. Identify error pattern in logs
2. If caused by recent deploy:
   a. Roll back to previous version
   b. Deploy fix
3. If database issue:
   a. Check connection pool
   b. Restart database if needed

## Rollback Steps
kubectl rollout undo deployment/app

Linking Runbooks

annotations:
  runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard_url: "https://grafana.example.com/d/123"
  message: "Check error logs in Kibana"

Reducing Alert Fatigue

1. Use for: Clause

# Alert only if condition persists for 5 minutes
- alert: HighErrorRate
  expr: error_rate > 0.05
  for: 5m  # Prevents spikes from triggering alerts
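
The effect of for: can be modeled as a small state machine: the alert becomes pending when the condition first turns true, and only fires once it has stayed true for the full duration. A minimal sketch in Python (counting evaluation intervals; this is a model, not Prometheus's actual implementation):

```python
# Model of the `for:` clause: fire only after the condition has held
# for `for_intervals` consecutive evaluations.
def evaluate(samples: list[bool], for_intervals: int) -> list[str]:
    states = []
    consecutive = 0
    for breached in samples:
        consecutive = consecutive + 1 if breached else 0
        if consecutive == 0:
            states.append("inactive")
        elif consecutive >= for_intervals:
            states.append("firing")
        else:
            states.append("pending")
    return states

# A 2-interval spike never fires with for_intervals=3; a sustained breach does
print(evaluate([True, True, False, True, True, True], for_intervals=3))
```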

2. Noise Reduction Techniques

# Group similar alerts
- alert: HighErrorRate
  expr: error_rate > 0.05
  for: 5m
  labels:
    group: application
  annotations:
    # {{ $labels }} templates this alert's label values; Alertmanager's
    # group_by (see Alert Routing) collapses similar alerts into one notification
    summary: "High error rate on {{ $labels.service }}"

3. Dead Man’s Switch

A dead man’s switch inverts the usual logic: an alert that always fires. An external service (for example, a PagerDuty heartbeat integration) expects to receive it continuously and pages you when it stops arriving, which means the alerting pipeline itself is broken. Note that an expression like up{job="prometheus"} == 0 cannot do this job: if Prometheus is down, nothing is left to evaluate the rule.

# Always-firing watchdog; route it to an external heartbeat service
- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: "Alerting pipeline heartbeat (should always be firing)"

4. Repeat Intervals

repeat_interval controls how often a still-firing alert is re-notified. It is an Alertmanager routing setting, not an alert-rule annotation (placing it under annotations has no effect):

# Alertmanager route: re-notify still-firing critical alerts at most hourly
routes:
  - match:
      severity: critical
    receiver: pagerduty
    repeat_interval: 1h

Alert Routing

Routing Configuration

route:
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default
  
  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
      
    # Warning alerts to Slack
    - match:
        severity: warning
      receiver: slack
      
    # Database alerts to DBA team
    - match:
        component: database
      receiver: dba-team
      
receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
        
  - name: slack
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
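
The routing tree above is first-match-wins unless a route sets continue: true. That matching logic can be sketched in a few lines (a simplification of Alertmanager's behavior, not its actual implementation):

```python
# Simplified Alertmanager-style routing: walk routes in order, collect the
# receiver of each matching route, and stop at the first match that does
# not set continue=True. Fall back to the default receiver.
def route_alert(labels: dict, routes: list[dict], default: str) -> list[str]:
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                return receivers
    return receivers or [default]

routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "slack"},
    {"match": {"component": "database"}, "receiver": "dba-team"},
]

# A critical database alert pages PagerDuty and also notifies the DBA team
print(route_alert({"severity": "critical", "component": "database"}, routes, "default"))
```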

On-Call Best Practices

Escalation Policy

# Escalation policy
escalation_policies:
  - name: default
    steps:
      - delay: 15m
        receiver: on-call-primary
      - delay: 30m
        receiver: on-call-secondary  
      - delay: 1h
        receiver: manager
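
Under a policy like this, who has been paged depends on how long the alert has gone unacknowledged. A small sketch of that timing logic (interpreting each delay as time from the initial alert; actual semantics vary by on-call tool):

```python
# Who has been paged after `minutes` of an unacknowledged alert,
# treating each step's delay as time-from-alert-start.
STEPS = [(15, "on-call-primary"), (30, "on-call-secondary"), (60, "manager")]

def paged_after(minutes: int, steps=STEPS) -> list[str]:
    return [who for delay, who in steps if minutes >= delay]

# At 45 minutes, primary and secondary have been paged; the manager has not
print(paged_after(45))
```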

Handoff Process

  1. Pre-handoff: Review active alerts
  2. Handoff meeting: 15-minute overlap
  3. Transfer: Explicitly transfer ownership
  4. Documentation: Update on-call log

Metrics for Alerting

Alert Quality Metrics

| Metric | Target | Description |
|---|---|---|
| Alerts per day | < 50 | Manageable alert volume |
| Alert accuracy | > 90% | Alerts requiring action |
| MTTR | < 1 hour | Time to resolution |
| False positive rate | < 10% | Alerts with no action |
| On-call load | < 8 hours/week | Sustainable pace |
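
These numbers are straightforward to compute from an incident log. A sketch, assuming each alert record notes whether it required action (the field names here are made up for illustration):

```python
# Compute alert-quality metrics from a simple alert log.
def alert_metrics(alerts: list[dict]) -> dict:
    total = len(alerts)
    actionable = sum(1 for a in alerts if a["required_action"])
    return {
        "alerts": total,
        "accuracy": actionable / total if total else 0.0,  # target > 90%
        "false_positive_rate": (total - actionable) / total if total else 0.0,  # target < 10%
    }

log = [
    {"name": "HighErrorRate", "required_action": True},
    {"name": "HighLatency", "required_action": True},
    {"name": "DiskSpaceWarning", "required_action": False},
    {"name": "ServiceDown", "required_action": True},
]
print(alert_metrics(log))  # 75% accuracy: below the 90% target, time to tune
```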

Conclusion

Effective alerting requires:

  1. Clear severity levels - Match response to impact
  2. Actionable alerts - Each alert needs response
  3. Runbooks - Document remediation
  4. Reduce noise - Use for: and grouping
  5. Continuous improvement - Review and tune alerts
