
Incident Management: Handling Production Outages

Created: March 9, 2026 · CalmOps · 4 min read

Introduction

Production incidents happen. How you respond determines whether they become learning opportunities or recurring problems. This guide covers the complete incident management lifecycle from preparation to post-mortem.

Incident Fundamentals

What Is an Incident?

An incident is an unplanned interruption or a reduction in the quality of a service, for example:

  • Service outage
  • Performance degradation
  • Data loss
  • Security breach

Severity Levels

Severity | Impact                 | Example                | Response Time
SEV1     | Critical - full outage | All users affected     | Immediate
SEV2     | Major - partial outage | ~50% of users affected | 15 min
SEV3     | Moderate - degradation | Slow response times    | 1 hour
SEV4     | Minor - small impact   | Few users affected     | 4 hours
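
Encoding these levels makes paging decisions automatic instead of something debated mid-incident. A minimal sketch (names and thresholds here are illustrative, not from any specific tool):

from enum import Enum


class Severity(Enum):
    SEV1 = 1  # critical: full outage
    SEV2 = 2  # major: partial outage
    SEV3 = 3  # moderate: degradation
    SEV4 = 4  # minor: small impact


# Target time-to-response per severity, in minutes (0 = immediate).
RESPONSE_SLA_MINUTES = {
    Severity.SEV1: 0,
    Severity.SEV2: 15,
    Severity.SEV3: 60,
    Severity.SEV4: 240,
}


def should_page(severity: Severity) -> bool:
    """Page a human immediately for SEV1/SEV2; queue SEV3/SEV4 for working hours."""
    return severity in (Severity.SEV1, Severity.SEV2)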

Preparation

Monitoring

Essential metrics to track:

  • Availability: Is the service up?
  • Latency: How fast are requests served?
  • Errors: What’s failing, and how often?
  • Saturation: Are any resources near exhaustion?
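
To make these concrete, here is a minimal sketch that derives the first three signals from raw request records (the record shape is an assumption for illustration). Saturation usually comes from host metrics such as CPU, memory, or pool usage rather than request logs, so it is omitted here.

import math

# Each record: (latency_seconds, http_status). Shape assumed for illustration.
requests = [(0.12, 200), (0.34, 200), (2.10, 500), (0.08, 200), (1.90, 503)]

total = len(requests)
errors = sum(1 for _, status in requests if status >= 500)

availability = 1 - errors / total  # Is the service up?
error_rate = errors / total        # What's failing?
latencies = sorted(lat for lat, _ in requests)
p99_latency = latencies[min(total - 1, math.ceil(0.99 * total) - 1)]  # How fast?

print(f"availability={availability:.2%} error_rate={error_rate:.2%} p99={p99_latency}s")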

Tools

  • Prometheus: Metrics
  • Grafana: Visualization
  • Datadog: APM + monitoring
  • PagerDuty: Alerting

On-Call Setup

# PagerDuty schedule example
schedule:
  rotation:
    - name: Primary
      users: [alice, bob]
      start: Mon 9am
    - name: Secondary
      users: [charlie]
  escalation:
    - level: 1
      after: 15 minutes
    - level: 2
      after: 30 minutes
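
The escalation policy above boils down to a timer loop: page the current level, and if nobody acknowledges before the window expires, page the next one. A minimal sketch of that logic (the notify function is a stand-in, not a real paging API):

import time
from typing import Callable

ESCALATION_LEVELS = [
    {"level": 1, "targets": ["alice", "bob"], "after_minutes": 15},
    {"level": 2, "targets": ["charlie"], "after_minutes": 30},
]


def notify(target: str, incident_id: str) -> None:
    # Stand-in for a real paging call (SMS, push, phone).
    print(f"paging {target} for {incident_id}")


def escalate(incident_id: str, acknowledged: Callable[[str], bool]) -> None:
    """Walk the escalation chain until someone acknowledges the page."""
    start = time.monotonic()
    for step in ESCALATION_LEVELS:
        for target in step["targets"]:
            notify(target, incident_id)
        # Wait until this level's window (measured from the start) expires or someone acks.
        while time.monotonic() - start < step["after_minutes"] * 60:
            if acknowledged(incident_id):
                return
            time.sleep(10)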

Runbooks

Documented, step-by-step procedures for known failure modes. Example:

# Database Connection Errors Runbook

## Symptoms
- 5xx errors on API
- Database connection pool exhausted

## Diagnosis
1. Check active queries: `SELECT pid, state, query_start, query FROM pg_stat_activity WHERE state = 'active'`
2. Check connection count: `SELECT count(*) FROM pg_stat_activity`

## Resolution
1. Kill long-running queries
2. Scale up database
3. Restart app pods if needed

## Prevention
- Add connection pool limits
- Set query timeouts
- Add monitoring alerts
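
For the "kill long-running queries" step, a hedged sketch using psycopg2 against Postgres (the connection string and the 5-minute threshold are assumptions to adapt):

import psycopg2

# Connection string is a placeholder; use your actual credentials/secret store.
conn = psycopg2.connect("dbname=app user=ops")
conn.autocommit = True

with conn.cursor() as cur:
    # Find queries active for more than 5 minutes (threshold is an assumption).
    cur.execute(
        """
        SELECT pid, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state = 'active'
          AND now() - query_start > interval '5 minutes'
          AND pid <> pg_backend_pid()
        """
    )
    for pid, runtime, query in cur.fetchall():
        print(f"terminating pid={pid} runtime={runtime}: {query[:80]}")
        cur.execute("SELECT pg_terminate_backend(%s)", (pid,))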

Detection

Alerting Strategy

  • High signal: Few false positives
  • Actionable: Clear next steps
  • Timely: Fast enough to respond

Alert Types

# Good alerts
- name: High Error Rate
  condition: error_rate > 0.05 for 5 minutes
  action: Page on-call

- name: High Latency
  condition: p99_latency > 2s for 10 minutes
  action: Create incident

# Avoid
- name: Metric Spike
  condition: any spike
  action: ignore
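
The "for 5 minutes" clause is what keeps the good alerts high-signal: the condition must hold continuously across the whole window, so a one-sample blip never pages anyone. A minimal sketch of that evaluation (sample shapes and intervals are illustrative):

from collections import deque

WINDOW_SECONDS = 300     # "for 5 minutes"
THRESHOLD = 0.05         # error_rate > 0.05
SCRAPE_INTERVAL = 15     # seconds between samples

# Keep exactly one window of samples.
samples = deque(maxlen=WINDOW_SECONDS // SCRAPE_INTERVAL)


def observe(error_rate: float) -> bool:
    """Return True (fire the alert) only if every sample in the window breached."""
    samples.append(error_rate)
    window_full = len(samples) == samples.maxlen
    return window_full and all(s > THRESHOLD for s in samples)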

Response

Incident Commander

The Incident Commander (IC) leads the response:

  • Coordinate all efforts
  • Make decisions
  • Communicate status
  • Delegate tasks

Roles

Role   | Responsibility
IC     | Overall coordination
Lead   | Technical investigation
Comms  | Internal and external communications
Scribe | Documenting the timeline

Communication

Internal

  • Incident channel
  • Regular updates
  • Status page updates

External

  • Status page
  • Customer emails
  • Social media

Example Update

[UPDATE] 2:45 PM PST
We are investigating elevated error rates on our API.
Impact: ~15% of requests failing
Team: Working on identifying root cause
ETA: 30 minutes to next update

Resolution

Investigation

  1. Gather data: Logs, metrics, traces
  2. Identify changes: Recent deploys, config changes (see the sketch below)
  3. Hypothesis: What’s the likely cause?
  4. Test: Validate the hypothesis
  5. Fix: Implement the solution
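
Step 2 often comes down to lining up the error-rate timeline against recent changes. A toy sketch (data shapes and the 3x cutoff are invented for illustration) that flags deploys followed by an error-rate jump:

# (timestamp_seconds, error_rate) samples and deploy timestamps; shapes are illustrative.
error_rates = [(0, 0.01), (60, 0.01), (120, 0.15), (180, 0.16)]
deploys = [{"sha": "abc123", "at": 110}, {"sha": "def456", "at": 10}]


def rate_near(ts: float, window: float = 60) -> float:
    """Average error rate in the window starting at ts."""
    pts = [r for t, r in error_rates if ts <= t < ts + window]
    return sum(pts) / len(pts) if pts else 0.0


# Flag deploys where the error rate jumped afterwards (3x is an arbitrary cutoff).
for deploy in deploys:
    before = rate_near(deploy["at"] - 60)
    after = rate_near(deploy["at"])
    if before > 0 and after / before > 3:
        print(f"suspect deploy {deploy['sha']}: {before:.2%} -> {after:.2%}")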

Common Fixes

Issue               | Quick Fix    | Long-term Fix
Memory leak         | Restart pods | Fix the leak
DB overload         | Scale up     | Optimize queries
Bad deploy          | Roll back    | Better testing
External dependency | Fail over    | Add redundancy

Rollback

# Kubernetes rollback
kubectl rollout undo deployment/app

# Git revert
git revert HEAD
git push

# Feature flag (pseudocode; the exact API depends on your flag system)
feature_flag.disable('new_feature')
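
The feature-flag line above is pseudocode; most flag systems reduce to something like this minimal in-process sketch (all names are invented for illustration, and real systems back the store with a shared service):

import threading


class FeatureFlags:
    """Tiny in-process flag store; illustrative only."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._enabled: set[str] = {"new_feature"}

    def is_enabled(self, name: str) -> bool:
        with self._lock:
            return name in self._enabled

    def disable(self, name: str) -> None:
        # The incident "rollback": flip the flag off without redeploying.
        with self._lock:
            self._enabled.discard(name)


feature_flag = FeatureFlags()
feature_flag.disable("new_feature")
assert not feature_flag.is_enabled("new_feature")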

Post-Mortem

What Is a Post-Mortem?

A blameless analysis, conducted after the incident, that covers:

  • What happened
  • Why it happened
  • Impact
  • How to prevent recurrence

Template

# Incident Post-Mortem

## Summary
Brief overview of the incident.

## Impact
- Users affected: 10,000
- Duration: 45 minutes
- Revenue impact: $X

## Root Cause
Detailed explanation.

## Timeline
- 10:00 Alert triggered
- 10:15 Incident declared
- 10:30 Root cause identified
- 10:45 Fix deployed

## Action Items
- [ ] Add alert for X (Owner: Alice, Due: Jan 15)
- [ ] Improve database monitoring (Owner: Bob, Due: Jan 20)
- [ ] Update runbook (Owner: Charlie, Due: Jan 10)

Blameless Culture

  • Focus on systems, not people
  • Ask “what” not “who”
  • Share learnings
  • Improve together

Prevention

Lessons to Actions

Finding         | Action
Missing alert   | Add monitoring
Slow detection  | Improve alerts
Unclear process | Update the runbook
Bad rollback    | Test rollbacks regularly

Continuous Improvement

  • Regular incident reviews
  • Game days (chaos testing)
  • Follow through on action items
  • Share learnings

On-Call Best Practices

Being On-Call

  • Know your systems
  • Have access ready
  • Document findings
  • Hand off properly

Supporting On-Call

  • Clear escalation paths
  • Reasonable schedules
  • Training
  • Recognition

Tools

Incident Management

  • PagerDuty: Alerting and on-call scheduling
  • Opsgenie: Alert management
  • VictorOps (now Splunk On-Call): Incident response
  • FireHydrant: Incident management platform

Post-Incident

  • Confluence: Documentation
  • Notion: Knowledge base
  • GitHub Issues: Action tracking

Conclusion

Effective incident management requires preparation, clear processes, and a learning mindset. Build robust monitoring, train your teams, and use incidents as opportunities to improve your systems.

