## Introduction
Every production system will experience incidents. The difference between organizations that recover quickly and those that struggle lies in their incident management practices. This guide covers building effective incident response processes, on-call rotations, and creating a culture that learns from failures.
Incident management is the discipline of responding to and recovering from service disruptions efficiently and effectively.
## Incident Lifecycle
### Phases
```
┌──────────────────────────────────────────────────────────┐
│                    Incident Lifecycle                    │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  ┌────────┐   ┌─────────┐   ┌─────────┐   ┌────────┐     │
│  │ Detect │──▶│ Respond │──▶│ Resolve │──▶│ Review │     │
│  └────────┘   └─────────┘   └─────────┘   └────────┘     │
│      │             │             │            │          │
│      ▼             ▼             ▼            ▼          │
│   Alerting     Escalation     Recovery     Learning      │
│                                                          │
└──────────────────────────────────────────────────────────┘
```
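The four phases form a simple ordered progression. As a minimal sketch, the lifecycle can be modeled as an enum with a helper that returns the next phase (the `Phase` enum and `next_phase` function are illustrative, not from any standard library):

```python
from enum import Enum

class Phase(Enum):
    DETECT = 1
    RESPOND = 2
    RESOLVE = 3
    REVIEW = 4

def next_phase(phase):
    """Return the phase that follows, or None after REVIEW."""
    order = list(Phase)
    i = order.index(phase)
    return order[i + 1] if i + 1 < len(order) else None
```

Modeling the lifecycle explicitly makes it easy to drive tooling (paging, status updates, review reminders) off the current phase.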
### Detection
```yaml
# Prometheus alerting rules
groups:
  - name: payment-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(payment_requests_total{status=~"5.."}[5m]))
            / sum(rate(payment_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on payment service"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(payment_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on payment service"
```
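The `HighErrorRate` expression divides the rate of 5xx responses by the rate of all responses. For intuition, the same check can be sketched in plain Python over a window of status-code counts (function names and the dict shape are illustrative):

```python
def error_rate(counts_by_status):
    """Fraction of requests in the window with a 5xx status code."""
    total = sum(counts_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in counts_by_status.items()
                 if str(status).startswith("5"))
    return errors / total

def should_alert(counts_by_status, threshold=0.05):
    """Mirror the Prometheus rule: fire when the 5xx ratio exceeds 5%."""
    return error_rate(counts_by_status) > threshold
```

For example, 60 errors out of 1,000 requests is a 6% error rate and would fire; 20 out of 1,000 would not.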
## On-Call Practices
### Building Rotations
```python
# On-call rotation configuration
oncall_config = {
    "primary": {
        "engineer": "[email protected]",
        "phone": "+1234567890",
        "hours": "Monday 9am - Monday 9am"
    },
    "secondary": {
        "engineer": "[email protected]",
        "hours": "Always available for escalation"
    },
    "rotation": {
        "type": "weekly",
        "start": "Monday 9am UTC",
        "handoff_meeting": "Friday 3pm"
    }
}
```
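A weekly rotation like this can be computed rather than maintained by hand. A minimal sketch, assuming handoffs happen exactly every seven days from a fixed start time and engineers take turns in list order:

```python
from datetime import datetime, timedelta, timezone

def oncall_for(engineers, rotation_start, now):
    """Weekly rotation: engineers rotate in list order,
    handing off every 7 days from rotation_start."""
    weeks_elapsed = int((now - rotation_start) / timedelta(weeks=1))
    return engineers[weeks_elapsed % len(engineers)]
```

In practice a paging tool owns this schedule; computing it in code is mainly useful for testing overrides and verifying handoff boundaries.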
### Escalation Policy
```yaml
# Escalation tiers
escalation_policy:
  - level: 1
    name: "On-Call Engineer"
    response_time: "15 minutes"
    contacts:
      - phone: oncall_phone
      - slack: "#oncall"
  - level: 2
    name: "Team Lead"
    response_time: "30 minutes"
    contacts:
      - phone: team_lead_phone
      - slack: "#engineering"
  - level: 3
    name: "Engineering Manager"
    response_time: "1 hour"
    contacts:
      - phone: manager_phone
      - slack: "#leadership"
  - level: 4
    name: "VP Engineering"
    response_time: "2 hours"
    contacts:
      - phone: vp_phone
```
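One way to read the tiers: after N minutes without acknowledgment, every level whose response-time window has elapsed should have been paged. A sketch of that interpretation (the `POLICY` table mirrors the YAML above; the function name is illustrative):

```python
# (minutes_before_escalation, role) pairs, mirroring the YAML tiers
POLICY = [
    (15, "On-Call Engineer"),
    (30, "Team Lead"),
    (60, "Engineering Manager"),
    (120, "VP Engineering"),
]

def escalation_targets(minutes_unacknowledged):
    """Return every role whose response window has already elapsed."""
    return [name for limit, name in POLICY
            if minutes_unacknowledged >= limit]
```

So an incident unacknowledged for 45 minutes would have paged both the on-call engineer and the team lead.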
### Runbooks
# Runbook: Payment Service High Error Rate

## Symptoms
- Alert: HighErrorRate on payment-service
- Error rate > 5%

## Impact
- Users cannot complete payments
- Immediate revenue impact

## Steps

### 1. Check Service Health
```bash
kubectl get pods -n payments
kubectl describe deployment payment-service
```

### 2. Check Recent Deployments
```bash
kubectl rollout history deployment payment-service
git log --oneline -10 --since="1 hour ago"
```

### 3. Check Logs
```bash
kubectl logs -n payments -l app=payment-service --tail=100
```

### 4. Common Fixes
If recent deployment:
```bash
kubectl rollout undo deployment/payment-service
```
If database issue:
```bash
kubectl exec -it payment-db-0 -- psql -c "SELECT * FROM pg_stat_activity"
```
If third-party outage:
- Check payment provider status page
- Enable circuit breaker

### 5. Escalate if:
- Issue persists > 30 minutes
- Database investigation needed
- Rollback is required

## Contact
- Primary: @oncall
- Secondary: @payments-team-lead
- Emergency: @engineering-manager
## Incident Communication
### Status Page
```yaml
# Automated status page updates
status_page:
  components:
    - name: "Payment Processing"
      status: "operational"  # operational, degraded, outage
    - name: "User Authentication"
      status: "operational"
  incidents:
    - id: "inc-123"
      status: "investigating"
      title: "Payment Processing Issues"
      body: "We are currently investigating reports of payment failures"
      updates:
        - status: "identified"
          body: "Root cause identified as database connection issue"
          created_at: "2026-03-12T10:30:00Z"
        - status: "resolved"
          body: "Issue has been resolved"
          created_at: "2026-03-12T11:00:00Z"
```
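Status-page tools usually accept these updates through an API. As a neutral sketch (no specific status-page product or API is assumed), here is a small formatter that renders one update line in the shape shown above:

```python
from datetime import datetime, timezone

def format_update(incident_id, status, body, when=None):
    """Render a single status-page update as a timestamped line."""
    when = when or datetime.now(timezone.utc)
    return f"[{when:%Y-%m-%dT%H:%M:%SZ}] {incident_id} {status.upper()}: {body}"
```

Keeping updates machine-generated from one source of truth avoids the status page drifting out of sync with the incident channel.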
### Incident Comms Template
# Incident Communication Template

## Initial Alert (within 15 min)
**Subject:** [INCIDENT] Payment Service Outage - Investigating
**Status:** 🔴 Investigating
**Impact:** Users unable to complete purchases
**What we're doing:** Looking into elevated error rates
**Next update:** In 30 minutes

---

## Update (every 30 min)
**Subject:** [INCIDENT] Payment Service Outage - Update #2
**Status:** 🟡 Identified
**Root cause:** Database connection pool exhausted due to connection leak
**What we're doing:**
- Rolling back to previous version
- Clearing stuck connections
**ETA:** 15 minutes
**Next update:** In 15 minutes

---

## Resolution
**Subject:** [INCIDENT] Payment Service - Resolved
**Status:** 🟢 Resolved
**Summary:** Deployed fix for connection leak, service recovering
**Root cause:** Connection leak in payment processor library
**Action items:**
- [ ] Monitor connection pool metrics
- [ ] Add circuit breaker
- [ ] Schedule post-mortem
**Learnings:** Post-mortem to follow within 48 hours
## Post-Incident Reviews
### Blameless Post-Mortem Template
# Post-Mortem: Payment Service Outage
**Date:** March 12, 2026
**Duration:** 45 minutes
**Severity:** SEV-1
## Summary
Payment service experienced complete outage for 45 minutes affecting all transactions.
## Timeline (UTC)
- 10:15 - Alert fired: High error rate
- 10:17 - On-call acknowledged
- 10:22 - Root cause identified: Connection pool exhausted
- 10:35 - Rollback initiated
- 10:45 - Service recovered
- 11:00 - All systems operational
## Root Cause
Connection leak in payment processor library introduced in v2.3.0.
Connections were not being returned to pool after successful transactions.
## Impact
- 2,347 failed transactions
- ~$50,000 in estimated lost revenue
- 45 minutes of degraded service
## What Went Well
- Alert fired quickly (< 2 min from issue start)
- Team responded promptly
- Rollback worked correctly
## What Could Be Improved
- Add connection pool monitoring before deployment
- Better staging environment parity
- Faster escalation path for DB issues
## Action Items
| Task | Owner | Due |
|------|-------|-----|
| Add connection_pool_available SLI | @alice | Mar 15 |
| Add circuit breaker to payment calls | @bob | Mar 20 |
| Improve staging DB load testing | @carol | Apr 1 |
| Review all library upgrades | @team | Ongoing |
## Metrics and Monitoring
### Key Metrics
| Metric | Target | Why |
|---|---|---|
| MTTR | < 30 min | Minimize impact |
| MTTD | < 5 min | Detect fast |
| MTBF | > 7 days | System reliability |
| False positive rate | < 5% | Trust in alerts |
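MTTR and MTTD are averages over per-incident intervals: detection-to-resolution for MTTR, incident-start-to-detection for MTTD. A minimal helper for computing either from timestamp pairs:

```python
from datetime import datetime, timedelta

def mean_minutes(intervals):
    """Average duration of (start, end) datetime pairs, in minutes.
    Pass (detected, resolved) pairs for MTTR, (started, detected) for MTTD."""
    total = sum((end - start for start, end in intervals), timedelta())
    return total.total_seconds() / 60 / len(intervals)
```

For example, two incidents taking 30 and 60 minutes to resolve yield an MTTR of 45 minutes.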
### Alert Quality
```python
# Alert effectiveness metrics
alert_metrics = {
    "total_alerts": 100,
    "actionable_alerts": 85,
    "false_positives": 5,
    "noise_rate": "5%",
    "time_to_acknowledge_avg": "3 min",
    "time_to_resolve_avg": "15 min"
}
```
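The percentage fields are derived from the raw counts, so it helps to compute them rather than record them separately. A small illustrative helper, matching the numbers above:

```python
def alert_quality(total, actionable, false_positives):
    """Derive actionable and noise percentages from raw alert counts."""
    return {
        "actionable_rate": f"{actionable / total:.0%}",
        "noise_rate": f"{false_positives / total:.0%}",
    }
```

Deriving rates this way keeps the dashboard consistent when the underlying counts change.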
## Best Practices
- **Automate detection:** Don't rely on user reports
- **Respond quickly:** Acknowledge within 15 minutes
- **Communicate transparently:** Keep stakeholders informed
- **Document everything:** Capture timeline and actions
- **Review blamelessly:** Focus on systems, not people
- **Continuously improve:** Act on learnings
## Conclusion
Effective incident management minimizes impact and builds team confidence. By investing in detection, response processes, and learning from failures, organizations can continuously improve their reliability and customer trust.