
Incident Management: Responding to Production Outages

Introduction

Production incidents are inevitable. The difference between resilient and fragile organizations isn't whether incidents happen, but how quickly they're detected, how effectively they're resolved, and how well the team learns from them.

This guide covers the full incident lifecycle: from detection to postmortem, with practical templates and tools.

Incident Severity Classification

Clear severity levels prevent ambiguity during stressful incidents:

| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV1 (Critical) | Complete service outage, data loss, security breach | Immediate (< 5 min) | Site down, payment processing failed |
| SEV2 (High) | Major feature broken, significant user impact | < 15 minutes | Login broken for 20% of users |
| SEV3 (Medium) | Degraded performance, minor feature broken | < 1 hour | Slow search, non-critical feature down |
| SEV4 (Low) | Minor issue, workaround available | Next business day | UI glitch, non-critical error spike |
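For teams that auto-triage from monitoring data, the table above can be sketched as a classification helper. The fields and thresholds here are illustrative assumptions, not part of any standard:

```python
# Sketch: map measured impact to the severity levels in the table above.
# Field names and the 10% user-impact threshold are illustrative.
from dataclasses import dataclass

@dataclass
class Impact:
    service_down: bool          # complete outage?
    data_loss: bool
    security_breach: bool
    affected_user_pct: float    # 0-100
    workaround_available: bool

def classify(impact: Impact) -> str:
    # Any SEV1 condition wins outright.
    if impact.service_down or impact.data_loss or impact.security_breach:
        return "SEV1"
    if impact.affected_user_pct >= 10:
        return "SEV2"
    # Below the SEV2 bar: SEV3 if users have no workaround, else SEV4.
    if not impact.workaround_available:
        return "SEV3"
    return "SEV4"
```

A helper like this only suggests a starting severity; the on-call engineer should still override it based on context.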

The Incident Response Process

Phase 1: Detection

Incidents are detected through:

  • Monitoring alerts (Prometheus, Datadog, PagerDuty)
  • User reports (support tickets, social media)
  • Internal discovery (engineer notices something wrong)

Goal: Detect before users do. Invest in monitoring.

Phase 2: Triage (First 5 Minutes)

1. Acknowledge the alert
2. Assess impact: How many users? Which features? Since when?
3. Classify severity
4. Declare incident if SEV1/SEV2
5. Page the on-call engineer if not already paged

Phase 3: Declare and Mobilize

For SEV1/SEV2, immediately:

1. Create incident channel: #incident-2026-03-30-payment-outage
2. Assign Incident Commander (IC)
3. Post initial status update
4. Start incident timeline document
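Consistent channel names make past incidents searchable. A minimal sketch of the naming convention above, assuming simple slugify rules:

```python
# Sketch: generate the #incident-YYYY-MM-DD-slug channel name shown
# above. The slugify rules (lowercase, non-alphanumerics to hyphens)
# are an assumption.
import re
from datetime import date

def incident_channel(title: str, on: date) -> str:
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"#incident-{on.isoformat()}-{slug}"
```

For example, `incident_channel("Payment Outage", date(2026, 3, 30))` returns `#incident-2026-03-30-payment-outage`.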

Phase 4: Investigate and Mitigate

1. Identify symptoms (what's broken)
2. Form hypotheses (why it's broken)
3. Test hypotheses (check logs, metrics, traces)
4. Apply mitigation (rollback, feature flag, hotfix)
5. Verify mitigation worked

Mitigation vs Fix: Mitigation restores service quickly (rollback, disable feature). The permanent fix comes later.

Phase 5: Resolution

1. Confirm service is restored
2. Monitor for 15-30 minutes to ensure stability
3. Declare incident resolved
4. Send all-clear communication
5. Schedule postmortem
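Step 2 can be turned into an explicit gate before declaring resolution: only give the all-clear once the error rate has stayed under a threshold for a full observation window. A sketch, assuming per-minute error-rate samples:

```python
# Sketch of the "monitor for 15-30 minutes" step. The sample format
# (minute, error_rate) and the 1% threshold are assumptions.
def stable(samples, threshold=0.01, window=15):
    """Return True only if the last `window` samples all sit below
    `threshold` -- i.e. the service has been healthy for the whole
    observation period, not just at a single point."""
    recent = samples[-window:]
    return len(recent) == window and all(rate < threshold for _, rate in recent)
```

A single healthy data point is not enough; the `len(recent) == window` check forces the full observation period to elapse.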

Incident Roles

Incident Commander (IC)

The IC coordinates the response: they don't necessarily fix the problem, they ensure the right people are working on it effectively.

Responsibilities:

  • Maintain situational awareness
  • Make decisions when the team is stuck
  • Manage communication (internal and external)
  • Keep the incident timeline
  • Prevent “too many cooks” chaos

Technical Lead

Leads the technical investigation and remediation. Reports status to the IC.

Communications Lead

For SEV1 incidents, a dedicated person handles:

  • Status page updates
  • Customer communications
  • Stakeholder updates
  • Social media monitoring

Scribe

Documents the incident timeline in real-time:

14:32 - Alert fired: payment_error_rate > 5%
14:33 - On-call acknowledged
14:35 - IC declared SEV1, created #incident channel
14:38 - Hypothesis: recent deploy to payment-service
14:42 - Confirmed: deploy at 14:15 introduced regression
14:45 - Rollback initiated
14:52 - Rollback complete, error rate returning to normal
15:00 - Incident resolved, monitoring
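A timeline in this format also yields the basic incident metrics (time to detect, time to mitigate, time to resolve). A rough parser, assuming the `HH:MM - event` layout above with all events on one day:

```python
# Rough parser for the scribe timeline format above. Assumes
# "HH:MM - event" lines, all within a single day.
from datetime import datetime

def parse_timeline(text):
    events = []
    for line in text.strip().splitlines():
        ts, _, event = line.partition(" - ")
        events.append((datetime.strptime(ts.strip(), "%H:%M"), event.strip()))
    return events

def minutes_between(events, i, j):
    """Whole minutes between two timeline entries."""
    return int((events[j][0] - events[i][0]).total_seconds() // 60)
```

On the timeline above, the span from the alert (14:32) to resolution (15:00) comes out to 28 minutes.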

Communication Templates

Initial Incident Declaration

🚨 INCIDENT DECLARED - SEV1

Summary: Payment processing is failing for all users
Impact: ~100% of checkout attempts failing since 14:15 UTC
Status: Investigating

IC: @alice
Tech Lead: @bob
Bridge: #incident-2026-03-30-payments

Next update in 15 minutes.

Status Update (Every 15-30 min for SEV1)

📊 INCIDENT UPDATE - 14:50 UTC

Summary: Payment processing outage
Status: MITIGATING

Progress:
- Root cause identified: deploy at 14:15 introduced null pointer in payment processor
- Rollback initiated at 14:45
- Error rate declining: 80% → 40% → 15%

ETA to resolution: ~10 minutes
Next update: 15:05 UTC

Resolution Notice

✅ INCIDENT RESOLVED - 15:02 UTC

Summary: Payment processing outage
Duration: 47 minutes (14:15 - 15:02 UTC)
Impact: ~100% of payment attempts failed during this window

Resolution: Rolled back payment-service to v2.3.1

Postmortem scheduled: 2026-04-01 10:00 UTC

Status Page

Maintain a public status page (Statuspage.io, Cachet, or custom):

Operational ✅
Degraded Performance ⚠️
Partial Outage 🟡
Major Outage 🔴
Under Maintenance 🔧

Update it within 5 minutes of declaring a SEV1/SEV2 incident.

Runbooks

Runbooks are step-by-step guides for common incidents. They reduce cognitive load during stressful situations:

# Runbook: High Database CPU

## Symptoms
- DB CPU > 80% for > 5 minutes
- Slow query alerts firing
- Application response times increasing

## Immediate Actions
1. Inspect currently running queries: `SHOW PROCESSLIST;` (and review the slow query log)
2. Kill long-running queries if safe: `KILL QUERY <id>;`
3. Check for missing indexes: `EXPLAIN <slow query>`

## Escalation
- If CPU > 95% for > 10 min: page DBA team
- If queries can't be killed: consider read replica failover

## Root Cause Investigation
- Check recent deploys for new queries
- Review query patterns in APM tool
- Check for lock contention: `SHOW ENGINE INNODB STATUS;`
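The "kill long-running queries if safe" step benefits from an explicit safety filter rather than ad-hoc judgment under pressure. A sketch that picks kill candidates from `SHOW PROCESSLIST` rows (the 300-second threshold and the safety rules, such as skipping system threads and only killing `SELECT`s, are assumptions for illustration):

```python
# Sketch: given rows from SHOW FULL PROCESSLIST (as dicts, however
# your driver returns them), pick safe candidates for KILL QUERY.
# Threshold and safety rules are illustrative assumptions.
def kill_candidates(processlist, max_seconds=300):
    out = []
    for row in processlist:
        if row.get("User") in ("system user", "event_scheduler"):
            continue  # never touch replication/system threads
        info = (row.get("Info") or "").lstrip().upper()
        # Only read-only statements: killing writes mid-flight is riskier.
        if row.get("Time", 0) > max_seconds and info.startswith("SELECT"):
            out.append(row["Id"])  # then run: KILL QUERY <id>
    return out
```

The function only proposes IDs; actually issuing `KILL QUERY` should stay a deliberate human action during the incident.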

Postmortem Process

A blameless postmortem focuses on systems and processes, not individuals.

Postmortem Template

# Postmortem: Payment Processing Outage
**Date:** 2026-03-30
**Duration:** 47 minutes
**Severity:** SEV1
**Author:** @alice

## Summary
A deploy to payment-service introduced a null pointer exception that caused
100% of payment attempts to fail for 47 minutes.

## Impact
- ~2,400 failed payment attempts
- Estimated revenue impact: $48,000
- 0 data loss or security issues

## Timeline
| Time (UTC) | Event |
|---|---|
| 14:15 | payment-service v2.4.0 deployed |
| 14:32 | Alert: payment_error_rate > 5% |
| 14:33 | On-call acknowledged |
| 14:35 | SEV1 declared |
| 14:42 | Root cause identified: null pointer in PaymentProcessor |
| 14:45 | Rollback initiated |
| 15:02 | Service restored |

## Root Cause
The v2.4.0 deploy introduced a change to PaymentProcessor that assumed
`customer.paymentMethod` was always non-null. For customers with no saved
payment method, this caused a NullPointerException.

## Contributing Factors
1. No integration test covering the null payment method case
2. Staging environment doesn't have customers without payment methods
3. Deploy went to 100% of traffic immediately (no canary)

## What Went Well
- Alert fired within 17 minutes of deploy
- Root cause identified quickly via distributed tracing
- Rollback was straightforward and fast

## What Went Poorly
- No canary deployment: full blast radius immediately
- Missing test case for null payment method
- Status page updated 8 minutes after declaration (should be < 5 min)

## Action Items
| Action | Owner | Due Date |
|---|---|---|
| Add null check test for PaymentProcessor | @bob | 2026-04-06 |
| Implement canary deployments for payment-service | @carol | 2026-04-15 |
| Add payment method null to staging test data | @dave | 2026-04-06 |
| Update status page runbook to < 5 min | @alice | 2026-04-01 |

On-Call Best Practices

Rotation Design

- Minimum 1 week rotations (shorter = too much context switching)
- Maximum 2 weeks (longer = burnout)
- Business hours primary + 24/7 secondary
- Clear escalation path: primary → secondary → manager
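A rotation like this is easy to generate mechanically, which avoids manual scheduling errors. A sketch, where the start date and the primary/secondary pairing scheme are illustrative choices:

```python
# Sketch: generate week-long primary/secondary rotations from a roster.
# Pairing each primary with the *next* engineer as secondary means the
# incoming primary already has a week of context. This scheme is an
# illustrative assumption, not a standard.
from datetime import date, timedelta

def rotation(engineers, start: date, weeks: int):
    n = len(engineers)
    for w in range(weeks):
        yield (start + timedelta(weeks=w),
               engineers[w % n],            # primary
               engineers[(w + 1) % n])      # secondary
```

With a three-person roster the pattern repeats every three weeks, which also makes the per-person on-call burden easy to audit.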

Reducing Alert Fatigue

1. Every alert should be actionable: if you can't do anything about it, remove it
2. Tune thresholds: alert on symptoms, not causes
3. Group related alerts: one page for "payment system degraded"
4. Track alert frequency: high-frequency alerts need fixing, not silencing
5. Post-incident: review which alerts fired and which should have
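Point 4 is straightforward to track: count pages per alert over the review period and flag anything noisier than a chosen cutoff. A sketch, where the cutoff of roughly one page per day is an illustrative assumption:

```python
# Sketch: flag high-frequency alerts from a log of received pages.
# The one-page-per-day cutoff is an illustrative assumption.
from collections import Counter

def noisy_alerts(pages, days, per_day_cutoff=1.0):
    """pages: iterable of alert names, one entry per page received
    during the review period of `days` days."""
    counts = Counter(pages)
    return {name: count for name, count in counts.items()
            if count / days > per_day_cutoff}
```

Anything this flags goes on the backlog for a threshold fix or automation, not into a silence rule.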

On-Call Wellness

  • Compensate on-call time (extra pay or time off)
  • No meetings the day after a night incident
  • Track on-call burden per person
  • Invest in automation to reduce toil

Tools

| Category | Open Source | Commercial |
|---|---|---|
| Alerting | Alertmanager, Grafana OnCall | PagerDuty, Opsgenie |
| Status Page | Cachet, cState | Statuspage (Atlassian), Statuspal |
| Incident Management | Dispatch (Netflix) | PagerDuty, incident.io, Rootly |
| Runbooks | Wiki.js | Confluence, Notion |
| Postmortems | GitHub Issues | Blameless, Jeli |
