Introduction
Production incidents are inevitable. The difference between resilient and fragile organizations isn’t whether incidents happen, but how quickly they’re detected, how effectively they’re resolved, and how well the team learns from them.
This guide covers the full incident lifecycle: from detection to postmortem, with practical templates and tools.
Incident Severity Classification
Clear severity levels prevent ambiguity during stressful incidents:
| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV1 (Critical) | Complete service outage, data loss, security breach | Immediate (< 5 min) | Site down, payment processing failed |
| SEV2 (High) | Major feature broken, significant user impact | < 15 minutes | Login broken for 20% of users |
| SEV3 (Medium) | Degraded performance, minor feature broken | < 1 hour | Slow search, non-critical feature down |
| SEV4 (Low) | Minor issue, workaround available | Next business day | UI glitch, non-critical error spike |
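A severity matrix like this can be encoded so tooling applies it consistently during triage. A minimal sketch; the field names and the 10% threshold for "significant user impact" are illustrative assumptions, not part of the table above:

```python
def classify_severity(outage: bool, data_loss: bool, security_breach: bool,
                      affected_fraction: float, workaround_available: bool) -> str:
    """Map incident impact to a severity level, roughly following the table above."""
    if outage or data_loss or security_breach:
        return "SEV1"
    if affected_fraction >= 0.10:       # assumed cutoff for "significant user impact"
        return "SEV2"
    if not workaround_available:
        return "SEV3"
    return "SEV4"

# Example: login broken for 20% of users
print(classify_severity(False, False, False, 0.20, False))  # SEV2
```

Encoding the rules removes one judgment call from the first five minutes of an incident, when stress is highest.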
The Incident Response Process
Phase 1: Detection
Incidents are detected through:
- Monitoring alerts (Prometheus, Datadog, PagerDuty)
- User reports (support tickets, social media)
- Internal discovery (engineer notices something wrong)
Goal: Detect before users do. Invest in monitoring.
Phase 2: Triage (First 5 Minutes)
1. Acknowledge the alert
2. Assess impact: How many users? Which features? Since when?
3. Classify severity
4. Declare incident if SEV1/SEV2
5. Page the on-call engineer if not already paged
Phase 3: Declare and Mobilize
For SEV1/SEV2, immediately:
1. Create incident channel: #incident-2026-03-30-payment-outage
2. Assign Incident Commander (IC)
3. Post initial status update
4. Start incident timeline document
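Consistent channel naming (step 1 above) is easy to automate so nobody invents a format under pressure. A sketch of a name generator matching the `#incident-YYYY-MM-DD-slug` pattern used above:

```python
from datetime import date

def incident_channel(slug, day=None):
    """Build a dated incident channel name, e.g. #incident-2026-03-30-payment-outage."""
    day = day or date.today()
    slug = "-".join(slug.lower().split())   # "Payment Outage" -> "payment-outage"
    return f"#incident-{day.isoformat()}-{slug}"

print(incident_channel("Payment Outage", date(2026, 3, 30)))
# #incident-2026-03-30-payment-outage
```

In practice this would live in a chat-ops bot or incident tooling that also creates the channel and posts the initial status template.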
Phase 4: Investigate and Mitigate
1. Identify symptoms (what's broken)
2. Form hypotheses (why it's broken)
3. Test hypotheses (check logs, metrics, traces)
4. Apply mitigation (rollback, feature flag, hotfix)
5. Verify mitigation worked
Mitigation vs Fix: Mitigation restores service quickly (rollback, disable feature). The permanent fix comes later.
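One common mitigation, disabling a feature behind a flag, can be sketched in-process. This is illustrative only; real systems use a flag service (e.g. LaunchDarkly) or a config store, and the flag and function names here are invented:

```python
# In-process kill-switch sketch. Real deployments read flags from a
# central store so the flip takes effect without a deploy.
FLAGS = {"new_payment_processor": True}

def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)

def process_payment(amount_cents: int) -> str:
    if is_enabled("new_payment_processor"):
        return f"v2 path: charged {amount_cents}"
    return f"v1 fallback: charged {amount_cents}"

# Mitigation: flip the flag off to route traffic back to the known-good path.
FLAGS["new_payment_processor"] = False
print(process_payment(500))   # v1 fallback: charged 500
```

The key property is that the flip is instant and reversible, unlike a hotfix, which is why flags and rollbacks are preferred for mitigation.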
Phase 5: Resolution
1. Confirm service is restored
2. Monitor for 15-30 minutes to ensure stability
3. Declare incident resolved
4. Send all-clear communication
5. Schedule postmortem
Incident Roles
Incident Commander (IC)
The IC coordinates the response; they don’t necessarily fix the problem themselves, but they ensure the right people are working on it effectively.
Responsibilities:
- Maintain situational awareness
- Make decisions when the team is stuck
- Manage communication (internal and external)
- Keep the incident timeline
- Prevent “too many cooks” chaos
Technical Lead
Leads the technical investigation and remediation. Reports status to the IC.
Communications Lead
For SEV1 incidents, a dedicated person handles:
- Status page updates
- Customer communications
- Stakeholder updates
- Social media monitoring
Scribe
Documents the incident timeline in real-time:
14:32 - Alert fired: payment_error_rate > 5%
14:33 - On-call acknowledged
14:35 - IC declared SEV1, created #incident channel
14:38 - Hypothesis: recent deploy to payment-service
14:42 - Confirmed: deploy at 14:15 introduced regression
14:45 - Rollback initiated
14:52 - Rollback complete, error rate returning to normal
15:00 - Incident resolved, monitoring
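The scribe’s timestamps also feed postmortem metrics such as time to detect and time to mitigate. A small sketch computing them from the entries above (HH:MM strings, same-day assumption):

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Key moments pulled from the timeline above
deploy, alert, mitigated, resolved = "14:15", "14:32", "14:52", "15:00"
print("time to detect:  ", minutes_between(deploy, alert), "min")      # 17 min
print("time to mitigate:", minutes_between(alert, mitigated), "min")   # 20 min
print("total duration:  ", minutes_between(deploy, resolved), "min")   # 45 min
```

Tracking these numbers across incidents shows whether investments in monitoring and runbooks are actually shortening the lifecycle.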
Communication Templates
Initial Incident Declaration
🚨 INCIDENT DECLARED - SEV1
Summary: Payment processing is failing for all users
Impact: ~100% of checkout attempts failing since 14:15 UTC
Status: Investigating
IC: @alice
Tech Lead: @bob
Bridge: #incident-2026-03-30-payments
Next update in 15 minutes.
Status Update (Every 15-30 min for SEV1)
INCIDENT UPDATE - 14:50 UTC
Summary: Payment processing outage
Status: MITIGATING
Progress:
- Root cause identified: deploy at 14:15 introduced null pointer in payment processor
- Rollback initiated at 14:45
- Error rate declining: 80% → 40% → 15%
ETA to resolution: ~10 minutes
Next update: 15:05 UTC
Resolution Notice
✅ INCIDENT RESOLVED - 15:02 UTC
Summary: Payment processing outage
Duration: 47 minutes (14:15 - 15:02 UTC)
Impact: ~100% of payment attempts failed during this window
Resolution: Rolled back payment-service to v2.3.1
Postmortem scheduled: 2026-04-01 10:00 UTC
Status Page
Maintain a public status page (Statuspage.io, Cachet, or custom):
Operational ✅
Degraded Performance ⚠️
Partial Outage 🟡
Major Outage 🔴
Under Maintenance 🚧
Update it within 5 minutes of declaring a SEV1/SEV2 incident.
Runbooks
Runbooks are step-by-step guides for common incidents. They reduce cognitive load during stressful situations:
# Runbook: High Database CPU
## Symptoms
- DB CPU > 80% for > 5 minutes
- Slow query alerts firing
- Application response times increasing
## Immediate Actions
1. Check slow query log: `SHOW PROCESSLIST;`
2. Kill long-running queries if safe: `KILL QUERY <id>;`
3. Check for missing indexes: `EXPLAIN <slow query>`
## Escalation
- If CPU > 95% for > 10 min: page DBA team
- If queries can't be killed: consider read replica failover
## Root Cause Investigation
- Check recent deploys for new queries
- Review query patterns in APM tool
- Check for lock contention: `SHOW ENGINE INNODB STATUS;`
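The "kill long-running queries if safe" step benefits from codified safety rules rather than a human eyeballing `SHOW PROCESSLIST` under pressure. A sketch of the selection logic, given rows shaped like processlist output; the 60-second threshold and the read-only-only rule are illustrative assumptions:

```python
def kill_candidates(processes: list[dict], max_seconds: int = 60) -> list[int]:
    """Pick process IDs that are long-running and read-only, hence safe to kill.
    Writes are never auto-killed: interrupting them risks partial state."""
    candidates = []
    for p in processes:
        query = (p.get("Info") or "").lstrip().upper()
        if p["Time"] > max_seconds and query.startswith("SELECT"):
            candidates.append(p["Id"])
    return candidates

procs = [
    {"Id": 11, "Time": 300, "Info": "SELECT * FROM orders"},     # long read: kill
    {"Id": 12, "Time": 500, "Info": "UPDATE accounts SET ..."},  # write: never auto-kill
    {"Id": 13, "Time": 5,   "Info": "SELECT 1"},                 # fast: leave it
]
print(kill_candidates(procs))   # [11]
```

Each returned ID would then be passed to `KILL QUERY <id>;` per the runbook, ideally with the decision logged to the incident timeline.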
Postmortem Process
A blameless postmortem focuses on systems and processes, not individuals.
Postmortem Template
# Postmortem: Payment Processing Outage
**Date:** 2026-03-30
**Duration:** 47 minutes
**Severity:** SEV1
**Author:** @alice
## Summary
A deploy to payment-service introduced a null pointer exception that caused
100% of payment attempts to fail for 47 minutes.
## Impact
- ~2,400 failed payment attempts
- Estimated revenue impact: $48,000
- 0 data loss or security issues
## Timeline
| Time (UTC) | Event |
|---|---|
| 14:15 | payment-service v2.4.0 deployed |
| 14:32 | Alert: payment_error_rate > 5% |
| 14:33 | On-call acknowledged |
| 14:35 | SEV1 declared |
| 14:42 | Root cause identified: null pointer in PaymentProcessor |
| 14:45 | Rollback initiated |
| 15:02 | Service restored |
## Root Cause
The v2.4.0 deploy introduced a change to PaymentProcessor that assumed
`customer.paymentMethod` was always non-null. For customers with no saved
payment method, this caused a NullPointerException.
## Contributing Factors
1. No integration test covering the null payment method case
2. Staging environment doesn't have customers without payment methods
3. Deploy went to 100% of traffic immediately (no canary)
## What Went Well
- Alert fired within 17 minutes of deploy
- Root cause identified quickly via distributed tracing
- Rollback was straightforward and fast
## What Went Poorly
- No canary deployment → full blast radius immediately
- Missing test case for null payment method
- Status page updated 8 minutes after declaration (should be < 5 min)
## Action Items
| Action | Owner | Due Date |
|---|---|---|
| Add null check test for PaymentProcessor | @bob | 2026-04-06 |
| Implement canary deployments for payment-service | @carol | 2026-04-15 |
| Add payment method null to staging test data | @dave | 2026-04-06 |
| Update status page runbook to < 5 min | @alice | 2026-04-01 |
On-Call Best Practices
Rotation Design
- Minimum 1 week rotations (shorter = too much context switching)
- Maximum 2 weeks (longer = burnout)
- Business hours primary + 24/7 secondary
- Clear escalation path: primary → secondary → manager
Reducing Alert Fatigue
1. Every alert should be actionable: if you can't do anything, remove it
2. Tune thresholds: alert on symptoms, not causes
3. Group related alerts: one page for "payment system degraded"
4. Track alert frequency: high-frequency alerts need fixing, not silencing
5. Post-incident: review which alerts fired and which should have
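Point 3 above, grouping related alerts into one page, is the idea behind Alertmanager’s `group_by`. A pure-logic sketch of the same grouping; the label names are illustrative:

```python
from collections import defaultdict

def group_alerts(alerts: list[dict]) -> dict[str, list[str]]:
    """Collapse individual alerts into one page per service."""
    pages = defaultdict(list)
    for a in alerts:
        pages[a["service"]].append(a["name"])
    return dict(pages)

alerts = [
    {"service": "payments", "name": "high_error_rate"},
    {"service": "payments", "name": "latency_p99_high"},
    {"service": "search",   "name": "slow_queries"},
]
print(group_alerts(alerts))   # one page for payments, one for search
```

Two correlated payment alerts become a single "payment system degraded" page instead of two pages a minute apart, which is what actually reduces fatigue.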
On-Call Wellness
- Compensate on-call time (extra pay or time off)
- No meetings the day after a night incident
- Track on-call burden per person
- Invest in automation to reduce toil
Tools
| Category | Open Source | Commercial |
|---|---|---|
| Alerting | Alertmanager, Grafana | PagerDuty, OpsGenie |
| Status Page | Cachet, Upptime | Atlassian Statuspage |
| Incident Management | Dispatch (Netflix) | PagerDuty, incident.io |
| Runbooks | Markdown in a Git repo | Confluence, Notion |
| Postmortems | GitHub Issues | Blameless, Jeli |
Resources
- Google SRE Book: Incident Management
- PagerDuty Incident Response Guide
- Atlassian Incident Management
- Blameless Postmortems