## Introduction
Every production system will experience incidents. The difference between organizations that recover quickly and those that struggle lies in their incident management practices. This guide covers building effective incident response processes, on-call rotations, and creating a culture that learns from failures.
Incident management is the discipline of responding to and recovering from service disruptions efficiently and effectively.
## Incident Lifecycle
### Phases
```
┌──────────────────────────────────────────────────────────┐
│                    Incident Lifecycle                    │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌────────┐   ┌─────────┐   ┌─────────┐   ┌────────┐     │
│  │ Detect │──▶│ Respond │──▶│ Resolve │──▶│ Review │     │
│  └────────┘   └─────────┘   └─────────┘   └────────┘     │
│      │             │             │            │          │
│      ▼             ▼             ▼            ▼          │
│   Alerting     Escalation     Recovery     Learning      │
│                                                          │
└──────────────────────────────────────────────────────────┘
```
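The four phases form a simple ordered progression. As a minimal sketch, the lifecycle can be modeled as an enum with a helper that returns the next phase (the `Phase` enum and `next_phase` function are illustrative, not from any standard library):

```python
from enum import Enum

class Phase(Enum):
    DETECT = 1
    RESPOND = 2
    RESOLVE = 3
    REVIEW = 4

def next_phase(phase):
    """Return the phase that follows, or None after REVIEW."""
    order = list(Phase)
    i = order.index(phase)
    return order[i + 1] if i + 1 < len(order) else None
```

Modeling the lifecycle explicitly makes it easy to drive tooling (paging, status updates, review reminders) off the current phase.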
### Detection
```yaml
# Prometheus alerting rules
groups:
  - name: payment-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(payment_requests_total{status=~"5.."}[5m]))
            / sum(rate(payment_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on payment service"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(payment_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on payment service"
```
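The `HighErrorRate` expression divides the rate of 5xx responses by the rate of all responses. For intuition, the same check can be sketched in plain Python over a window of status-code counts (function names and the dict shape are illustrative):

```python
def error_rate(counts_by_status):
    """Fraction of requests in the window with a 5xx status code."""
    total = sum(counts_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in counts_by_status.items()
                 if str(status).startswith("5"))
    return errors / total

def should_alert(counts_by_status, threshold=0.05):
    """Mirror the Prometheus rule: fire when the 5xx ratio exceeds 5%."""
    return error_rate(counts_by_status) > threshold
```

For example, 60 errors out of 1,000 requests is a 6% error rate and would fire; 20 out of 1,000 would not.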
## On-Call Practices
### Building Rotations
```python
# On-call rotation configuration
oncall_config = {
    "primary": {
        "engineer": "[email protected]",
        "phone": "+1234567890",
        "hours": "Monday 9am - Monday 9am"
    },
    "secondary": {
        "engineer": "[email protected]",
        "hours": "Always available for escalation"
    },
    "rotation": {
        "type": "weekly",
        "start": "Monday 9am UTC",
        "handoff_meeting": "Friday 3pm"
    }
}
```
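A weekly rotation like this can be computed rather than maintained by hand. A minimal sketch, assuming handoffs happen exactly every seven days from a fixed start time and engineers take turns in list order:

```python
from datetime import datetime, timedelta, timezone

def oncall_for(engineers, rotation_start, now):
    """Weekly rotation: engineers rotate in list order,
    handing off every 7 days from rotation_start."""
    weeks_elapsed = int((now - rotation_start) / timedelta(weeks=1))
    return engineers[weeks_elapsed % len(engineers)]
```

In practice a paging tool owns this schedule; computing it in code is mainly useful for testing overrides and verifying handoff boundaries.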
### Escalation Policy
```yaml
# Escalation tiers
escalation_policy:
  - level: 1
    name: "On-Call Engineer"
    response_time: "15 minutes"
    contacts:
      - phone: oncall_phone
      - slack: "#oncall"
  - level: 2
    name: "Team Lead"
    response_time: "30 minutes"
    contacts:
      - phone: team_lead_phone
      - slack: "#engineering"
  - level: 3
    name: "Engineering Manager"
    response_time: "1 hour"
    contacts:
      - phone: manager_phone
      - slack: "#leadership"
  - level: 4
    name: "VP Engineering"
    response_time: "2 hours"
    contacts:
      - phone: vp_phone
```
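One way to read the tiers: after N minutes without acknowledgment, every level whose response-time window has elapsed should have been paged. A sketch of that interpretation (the `POLICY` table mirrors the YAML above; the function name is illustrative):

```python
# (minutes_before_escalation, role) pairs, mirroring the YAML tiers
POLICY = [
    (15, "On-Call Engineer"),
    (30, "Team Lead"),
    (60, "Engineering Manager"),
    (120, "VP Engineering"),
]

def escalation_targets(minutes_unacknowledged):
    """Return every role whose response window has already elapsed."""
    return [name for limit, name in POLICY
            if minutes_unacknowledged >= limit]
```

So an incident unacknowledged for 45 minutes would have paged both the on-call engineer and the team lead.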
### Runbooks
# Runbook: Payment Service High Error Rate

## Symptoms
- Alert: HighErrorRate on payment-service
- Error rate > 5%

## Impact
- Users cannot complete payments
- Immediate revenue impact

## Steps

### 1. Check Service Health
```bash
kubectl get pods -n payments
kubectl describe deployment payment-service
```

### 2. Check Recent Deployments
```bash
kubectl rollout history deployment payment-service
git log --oneline -10 --since="1 hour ago"
```

### 3. Check Logs
```bash
kubectl logs -n payments -l app=payment-service --tail=100
```

### 4. Common Fixes
If recent deployment:
```bash
kubectl rollout undo deployment/payment-service
```
If database issue:
```bash
kubectl exec -it payment-db-0 -- psql -c "SELECT * FROM pg_stat_activity"
```
If third-party outage:
- Check payment provider status page
- Enable circuit breaker

### 5. Escalate if:
- Issue persists > 30 minutes
- Database investigation needed
- Rollback is required

## Contact
- Primary: @oncall
- Secondary: @payments-team-lead
- Emergency: @engineering-manager
## Incident Communication
### Status Page
```yaml
# Automated status page updates
status_page:
  components:
    - name: "Payment Processing"
      status: "operational"  # operational, degraded, outage
    - name: "User Authentication"
      status: "operational"
  incidents:
    - id: "inc-123"
      status: "investigating"
      title: "Payment Processing Issues"
      body: "We are currently investigating reports of payment failures"
      updates:
        - status: "identified"
          body: "Root cause identified as database connection issue"
          created_at: "2026-03-12T10:30:00Z"
        - status: "resolved"
          body: "Issue has been resolved"
          created_at: "2026-03-12T11:00:00Z"
```
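Status-page tools usually accept these updates through an API. As a neutral sketch (no specific status-page product or API is assumed), here is a small formatter that renders one update line in the shape shown above:

```python
from datetime import datetime, timezone

def format_update(incident_id, status, body, when=None):
    """Render a single status-page update as a timestamped line."""
    when = when or datetime.now(timezone.utc)
    return f"[{when:%Y-%m-%dT%H:%M:%SZ}] {incident_id} {status.upper()}: {body}"
```

Keeping updates machine-generated from one source of truth avoids the status page drifting out of sync with the incident channel.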
### Incident Comms Template
# Incident Communication Template

## Initial Alert (within 15 min)
**Subject:** [INCIDENT] Payment Service Outage - Investigating
**Status:** 🔴 Investigating
**Impact:** Users unable to complete purchases
**What we're doing:** Looking into elevated error rates
**Next update:** In 30 minutes

---

## Update (every 30 min)
**Subject:** [INCIDENT] Payment Service Outage - Update #2
**Status:** 🟡 Identified
**Root cause:** Database connection pool exhausted due to connection leak
**What we're doing:**
- Rolling back to previous version
- Clearing stuck connections
**ETA:** 15 minutes
**Next update:** In 15 minutes

---

## Resolution
**Subject:** [INCIDENT] Payment Service - Resolved
**Status:** 🟢 Resolved
**Summary:** Deployed fix for connection leak, service recovering
**Root cause:** Connection leak in payment processor library
**Action items:**
- [ ] Monitor connection pool metrics
- [ ] Add circuit breaker
- [ ] Schedule post-mortem
**Learnings:** Post-mortem to follow within 48 hours
## Post-Incident Reviews
### Blameless Post-Mortem Template
# Post-Mortem: Payment Service Outage
**Date:** March 12, 2026
**Duration:** 45 minutes
**Severity:** SEV-1
## Summary
Payment service experienced complete outage for 45 minutes affecting all transactions.
## Timeline (UTC)
- 10:15 - Alert fired: High error rate
- 10:17 - On-call acknowledged
- 10:22 - Root cause identified: Connection pool exhausted
- 10:35 - Rollback initiated
- 10:45 - Service recovered
- 11:00 - All systems operational
## Root Cause
Connection leak in payment processor library introduced in v2.3.0.
Connections were not being returned to pool after successful transactions.
## Impact
- 2,347 failed transactions
- ~$50,000 in estimated lost revenue
- 45 minutes of degraded service
## What Went Well
- Alert fired quickly (< 2 min from issue start)
- Team responded promptly
- Rollback worked correctly
## What Could Be Improved
- Add connection pool monitoring before deployment
- Better staging environment parity
- Faster escalation path for DB issues
## Action Items
| Task | Owner | Due |
|------|-------|-----|
| Add connection_pool_available SLI | @alice | Mar 15 |
| Add circuit breaker to payment calls | @bob | Mar 20 |
| Improve staging DB load testing | @carol | Apr 1 |
| Review all library upgrades | @team | Ongoing |
## Metrics and Monitoring
### Key Metrics
| Metric | Target | Why |
|---|---|---|
| MTTR | < 30 min | Minimize impact |
| MTTD | < 5 min | Detect fast |
| MTBF | > 7 days | System reliability |
| False positive rate | < 5% | Trust in alerts |
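MTTR and MTTD are averages over per-incident intervals: detection-to-resolution for MTTR, incident-start-to-detection for MTTD. A minimal helper for computing either from timestamp pairs:

```python
from datetime import datetime, timedelta

def mean_minutes(intervals):
    """Average duration of (start, end) datetime pairs, in minutes.
    Pass (detected, resolved) pairs for MTTR, (started, detected) for MTTD."""
    total = sum((end - start for start, end in intervals), timedelta())
    return total.total_seconds() / 60 / len(intervals)
```

For example, two incidents taking 30 and 60 minutes to resolve yield an MTTR of 45 minutes.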
### Alert Quality
```python
# Alert effectiveness metrics
alert_metrics = {
    "total_alerts": 100,
    "actionable_alerts": 85,
    "false_positives": 5,
    "noise_rate": "5%",
    "time_to_acknowledge_avg": "3 min",
    "time_to_resolve_avg": "15 min"
}
```
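The percentage fields are derived from the raw counts, so it helps to compute them rather than record them separately. A small illustrative helper, matching the numbers above:

```python
def alert_quality(total, actionable, false_positives):
    """Derive actionable and noise percentages from raw alert counts."""
    return {
        "actionable_rate": f"{actionable / total:.0%}",
        "noise_rate": f"{false_positives / total:.0%}",
    }
```

Deriving rates this way keeps the dashboard consistent when the underlying counts change.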
## Best Practices
- **Automate detection:** Don't rely on user reports
- **Respond quickly:** Acknowledge within 15 minutes
- **Communicate transparently:** Keep stakeholders informed
- **Document everything:** Capture timeline and actions
- **Review blamelessly:** Focus on systems, not people
- **Continuously improve:** Act on learnings
## Conclusion
Effective incident management minimizes impact and builds team confidence. By investing in detection, response processes, and learning from failures, organizations can continuously improve their reliability and customer trust.