Introduction
Production incidents happen. How you respond determines whether they become learning opportunities or recurring problems. This guide covers the complete incident management lifecycle from preparation to post-mortem.
Incident Fundamentals
What Is an Incident?
An incident is an unplanned interruption to a service or a reduction in its quality. Common examples:
- Service outage
- Performance degradation
- Data loss
- Security breach
Severity Levels
| Severity | Impact | Example | Response Time |
|---|---|---|---|
| SEV1 | Critical - Full outage | All users affected | Immediate |
| SEV2 | Major - Partial outage | ~50% of users affected | 15 min |
| SEV3 | Moderate - Degradation | Slow response times | 1 hour |
| SEV4 | Minor - Small impact | Few users affected | 4 hours |
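The severity table above is easiest to apply consistently when it is encoded in tooling. A minimal sketch: the level names and response targets come from the table, while the enum and helper function are hypothetical illustrations.

```python
from datetime import timedelta
from enum import Enum

class Severity(Enum):
    SEV1 = 1  # Critical - full outage
    SEV2 = 2  # Major - partial outage
    SEV3 = 3  # Moderate - degradation
    SEV4 = 4  # Minor - small impact

# Target response times, mirroring the table above.
RESPONSE_SLA = {
    Severity.SEV1: timedelta(0),           # immediate
    Severity.SEV2: timedelta(minutes=15),
    Severity.SEV3: timedelta(hours=1),
    Severity.SEV4: timedelta(hours=4),
}

def response_deadline(declared_at, severity):
    """Return the time by which the on-call should have responded."""
    return declared_at + RESPONSE_SLA[severity]
```

Keeping the mapping in one place means paging tools, dashboards, and post-mortems all agree on what each level demands.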
Preparation
Monitoring
Essential metrics to track:
- Availability: Is the service up?
- Latency: How fast are requests served?
- Errors: What is failing, and how often?
- Saturation: How close are resources to their limits?
Tools
- Prometheus: Metrics
- Grafana: Visualization
- Datadog: APM + monitoring
- PagerDuty: Alerting
On-Call Setup
# PagerDuty schedule example
schedule:
  rotation:
    - name: Primary
      users: [alice, bob]
      start: Mon 9am
    - name: Secondary
      users: [charlie]
  escalation:
    - level: 1
      after: 15 minutes
    - level: 2
      after: 30 minutes
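The escalation policy above reduces to a simple rule: page the next level whenever an alert has gone unacknowledged past each threshold. A sketch of that logic, with the thresholds taken from the schedule example and the function itself hypothetical:

```python
# Escalation thresholds in minutes, matching the schedule example:
# level 1 after 15 minutes unacknowledged, level 2 after 30.
ESCALATION_THRESHOLDS = [(15, 1), (30, 2)]

def escalation_level(minutes_unacked):
    """Return the escalation level for an alert unacknowledged this long.

    Level 0 means the primary on-call is still within their window.
    """
    level = 0
    for threshold, lvl in ESCALATION_THRESHOLDS:
        if minutes_unacked >= threshold:
            level = lvl
    return level
```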
Runbooks
Runbooks are documented, step-by-step procedures for known failure modes. Example:
# Database Connection Errors Runbook
## Symptoms
- 5xx errors on API
- Database connection pool exhausted
## Diagnosis
1. Check for long-running queries: `SELECT pid, state, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state <> 'idle';`
2. Check connection count: `SELECT count(*) FROM pg_stat_activity;`
## Resolution
1. Kill long-running queries with `SELECT pg_terminate_backend(pid);`
2. Scale up database
3. Restart app pods if needed
## Prevention
- Add connection pool limits
- Set query timeouts
- Add monitoring alerts
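The diagnosis step above boils down to comparing the `pg_stat_activity` count against the pool limit. A hypothetical helper that turns that count into a runbook decision (the thresholds are illustrative, not canonical):

```python
def pool_status(active_connections, max_connections, warn_ratio=0.8):
    """Classify connection-pool pressure from a connection count.

    Returns 'ok', 'warning', or 'exhausted'; the runbook's resolution
    steps apply at 'warning' and above. Thresholds are illustrative.
    """
    if active_connections >= max_connections:
        return "exhausted"
    if active_connections >= warn_ratio * max_connections:
        return "warning"
    return "ok"
```

Wiring a check like this into monitoring turns the runbook's manual diagnosis into an alert that fires before the pool is fully exhausted.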
Detection
Alerting Strategy
- High signal: Few false positives
- Actionable: Clear next steps
- Timely: Fast enough to respond
Alert Types
# Good alerts
- name: High Error Rate
  condition: error_rate > 0.05 for 5 minutes
  action: Page on-call
- name: High Latency
  condition: p99_latency > 2s for 10 minutes
  action: Create incident

# Avoid
- name: Metric Spike
  condition: any spike
  action: ignore
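What makes the "good" alerts above high-signal is the `for 5 minutes` clause: the condition must hold for a sustained window, not a single scrape. A minimal sketch of that debouncing logic (the function and its names are hypothetical):

```python
def sustained_breach(samples, threshold, window):
    """True if the last `window` samples all exceed `threshold`.

    Mimics a rule like `error_rate > 0.05 for 5 minutes` evaluated
    over per-minute samples: one noisy spike does not page,
    a sustained breach does.
    """
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])
```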
Response
Incident Commander
The incident commander (IC) leads the response:
- Coordinate all efforts
- Make decisions
- Communicate status
- Delegate tasks
Roles
| Role | Responsibility |
|---|---|
| IC | Overall coordination |
| Lead | Technical investigation |
| Comms | Internal/external comms |
| Scribe | Document timeline |
Communication
Internal
- Incident channel
- Regular updates
- Status page updates
External
- Status page
- Customer emails
- Social media
Example Update
[UPDATE] 2:45 PM PST
We are investigating elevated error rates on our API.
Impact: ~15% of requests failing
Team: Working on identifying root cause
ETA: 30 minutes to next update
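Updates like the one above follow a fixed shape (timestamp, summary, impact, current action, next-update ETA), and a small template helper keeps them consistent under pressure. A hypothetical sketch:

```python
def format_update(time_str, summary, impact, status, next_update):
    """Render a status update in the fixed shape shown above.

    Hypothetical helper; the field labels mirror the example update.
    """
    return (
        f"[UPDATE] {time_str}\n"
        f"{summary}\n"
        f"Impact: {impact}\n"
        f"Team: {status}\n"
        f"ETA: {next_update}"
    )
```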
Resolution
Investigation
- Gather data: Logs, metrics, traces
- Identify changes: Recent deploys, config
- Hypothesis: What is the likely cause?
- Test: Validate hypothesis
- Fix: Implement solution
Common Fixes
| Issue | Quick Fix | Long-term |
|---|---|---|
| Memory leak | Restart pods | Fix leak |
| DB overload | Scale up | Optimize queries |
| Bad deploy | Rollback | Better testing |
| External dependency down | Fail over | Add redundancy |
Rollback
# Kubernetes rollback
kubectl rollout undo deployment/app
# Git revert
git revert HEAD
git push
# Feature flag
feature_flag.disable('new_feature')
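The `feature_flag.disable(...)` call above is pseudocode; the underlying idea is a kill switch checked at runtime, so a bad feature can be turned off without a deploy. A hypothetical in-process sketch (real systems back this with a shared store such as Redis or a flag service, so the disable takes effect across all instances):

```python
class FeatureFlags:
    """Toy in-process flag store; flags default to disabled.

    Hypothetical illustration -- production systems persist flags in
    a shared store so one disable reaches every running instance.
    """
    def __init__(self):
        self._flags = {}

    def enable(self, name):
        self._flags[name] = True

    def disable(self, name):
        self._flags[name] = False

    def is_enabled(self, name):
        return self._flags.get(name, False)

flags = FeatureFlags()
flags.enable("new_feature")
# During an incident: turn the feature off without redeploying.
flags.disable("new_feature")
```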
Post-Mortem
What Is a Post-Mortem?
A blameless analysis conducted after the incident, covering:
- What happened
- Why it happened
- Impact
- How to prevent recurrence
Template
# Incident Post-Mortem
## Summary
Brief overview of the incident.
## Impact
- Users affected: 10,000
- Duration: 45 minutes
- Revenue impact: $X
## Root Cause
Detailed explanation.
## Timeline
- 10:00 Alert triggered
- 10:15 Incident declared
- 10:30 Root cause identified
- 10:45 Fix deployed
## Action Items
- [ ] Add alert for X (Owner: Alice, Due: Jan 15)
- [ ] Improve database monitoring (Owner: Bob, Due: Jan 20)
- [ ] Update runbook (Owner: Charlie, Due: Jan 10)
Blameless Culture
- Focus on systems, not people
- Ask “what” not “who”
- Share learnings
- Improve together
Prevention
Lessons to Actions
| Finding | Action |
|---|---|
| Missing alert | Add monitoring |
| Slow detection | Improve alerts |
| Unclear process | Update runbook |
| Bad rollback | Test rollbacks |
Continuous Improvement
- Regular incident reviews
- Game days (chaos testing)
- Track action items to completion
- Share learnings
On-Call Best Practices
Being On-Call
- Know your systems
- Have access ready
- Document findings
- Hand off properly
Supporting On-Call
- Clear escalation paths
- Reasonable schedules
- Training
- Recognition
Tools
Incident Management
- PagerDuty: Alerting and on-call
- OpsGenie: Alert management
- VictorOps: Incident response
- FireHydrant: Incident management platform
Post-Incident
- Confluence: Documentation
- Notion: Knowledge base
- GitHub Issues: Action tracking
Conclusion
Effective incident management requires preparation, clear processes, and a learning mindset. Build robust monitoring, train your teams, and use incidents as opportunities to improve your systems.