
Incident Management: Handling Production Outages

Introduction

Production incidents happen. How you respond determines whether they become learning opportunities or recurring problems. This guide covers the complete incident management lifecycle from preparation to post-mortem.

Incident Fundamentals

What Is an Incident?

An incident is an unplanned interruption or reduction in service quality, for example:

  • Service outage
  • Performance degradation
  • Data loss
  • Security breach

Severity Levels

| Severity | Impact | Example | Response Time |
|----------|--------|---------|---------------|
| SEV1 | Critical: full outage | All users affected | Immediate |
| SEV2 | Major: partial outage | ~50% of users affected | 15 min |
| SEV3 | Moderate: degradation | Slow response times | 1 hour |
| SEV4 | Minor: small impact | Few users affected | 4 hours |
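Encoding the table in code lets tooling apply it consistently, for example when auto-assigning severity in an incident bot. A minimal sketch (the minute values mirror the table; the function name is hypothetical):

```python
# Severity-to-response-target mapping, mirroring the table above.
# The minute values are illustrative; tune them to your own SLAs.
RESPONSE_SLA_MINUTES = {
    "SEV1": 0,    # immediate page
    "SEV2": 15,
    "SEV3": 60,
    "SEV4": 240,
}

def response_deadline_minutes(severity: str) -> int:
    """Return the response-time target, defaulting to strictest if unknown."""
    return RESPONSE_SLA_MINUTES.get(severity.upper(), 0)
```

Defaulting unknown severities to the strictest target is deliberate: an unclassified incident should over-page rather than under-page.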

Preparation

Monitoring

Essential metrics to track:

  • Availability: Is the service up?
  • Latency: How fast?
  • Errors: What’s failing?
  • Saturation: Any resource exhaustion?
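To make two of these signals concrete, here is a rough sketch computing error rate and p99 latency from a window of request records (the `Request` record and field names are assumptions for illustration, not any monitoring library's API):

```python
from dataclasses import dataclass
import math

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code

def golden_signals(requests: list[Request]) -> dict:
    """Compute error rate and p99 latency over a window of requests."""
    if not requests:
        return {"error_rate": 0.0, "p99_latency_ms": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank p99: the observation below which 99% of samples fall.
    idx = max(0, math.ceil(0.99 * len(latencies)) - 1)
    return {
        "error_rate": errors / len(requests),
        "p99_latency_ms": latencies[idx],
    }
```

In practice these come from your metrics stack (Prometheus histograms, Datadog distributions) rather than raw request logs, but the definitions are the same.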

Tools

  • Prometheus: Metrics
  • Grafana: Visualization
  • Datadog: APM + monitoring
  • PagerDuty: Alerting

On-Call Setup

# Example on-call schedule (illustrative; not actual PagerDuty config syntax)
schedule:
  rotation:
    - name: Primary
      users: [alice, bob]
      start: Mon 9am
    - name: Secondary
      users: [charlie]
  escalation:
    - level: 1
      after: 15 minutes
    - level: 2
      after: 30 minutes

Runbooks

Documented procedures:

# Database Connection Errors Runbook

## Symptoms
- 5xx errors on API
- Database connection pool exhausted

## Diagnosis
1. Check active queries: `SELECT * FROM pg_stat_activity WHERE state = 'active'`
2. Check connection count: `SELECT count(*) FROM pg_stat_activity`

## Resolution
1. Kill long-running queries
2. Scale up database
3. Restart app pods if needed

## Prevention
- Add connection pool limits
- Set query timeouts
- Add monitoring alerts

Detection

Alerting Strategy

  • High signal: Few false positives
  • Actionable: Clear next steps
  • Timely: Fast enough to respond

Alert Types

# Good alerts: high signal, actionable
- name: High Error Rate
  condition: error_rate > 0.05 for 5 minutes
  action: page on-call

- name: High Latency
  condition: p99_latency > 2s for 10 minutes
  action: create incident

# Avoid: noisy alerts with no clear next step
- name: Metric Spike
  condition: any brief spike
  action: none defined  # trains responders to ignore pages
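The "for 5 minutes" clause is what separates the good alerts from the noisy one: a threshold must be breached continuously before anyone is paged. A toy evaluator of that sustained-threshold logic (a sketch of the idea, not any real alerting engine's API):

```python
def sustained_breach(samples, threshold, duration, interval):
    """Return True if `samples` (oldest first, one per `interval` seconds)
    stay above `threshold` for at least `duration` seconds ending now."""
    needed = duration // interval  # consecutive samples required
    if needed <= 0 or len(samples) < needed:
        return False
    return all(s > threshold for s in samples[-needed:])

# With 60s samples and a 5-minute duration, a 2-minute error-rate spike
# does not fire, but a sustained breach does.
spiky = [0.01] * 10 + [0.20] * 2
sustained = [0.01] * 5 + [0.08] * 5
```

Prometheus expresses the same idea with the `for:` field on an alerting rule; the principle is identical regardless of tool.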

Response

Incident Commander

The incident commander (IC) leads the response:

  • Coordinate all efforts
  • Make decisions
  • Communicate status
  • Delegate tasks

Roles

| Role | Responsibility |
|------|----------------|
| IC | Overall coordination |
| Lead | Technical investigation |
| Comms | Internal/external communications |
| Scribe | Document the timeline |

Communication

Internal

  • Incident channel
  • Regular updates
  • Status page updates

External

  • Status page
  • Customer emails
  • Social media

Example Update

[UPDATE] 2:45 PM PST
We are investigating elevated error rates on our API.
Impact: ~15% of requests failing
Team: Working on identifying root cause
ETA: 30 minutes to next update

Resolution

Investigation

  1. Gather data: Logs, metrics, traces
  2. Identify changes: Recent deploys, config
  3. Hypothesize: What is the likely cause?
  4. Test: Validate hypothesis
  5. Fix: Implement solution

Common Fixes

| Issue | Quick Fix | Long-Term Fix |
|-------|-----------|---------------|
| Memory leak | Restart pods | Fix the leak |
| DB overload | Scale up | Optimize queries |
| Bad deploy | Roll back | Better testing |
| External dependency failure | Fail over | Add redundancy |

Rollback

# Kubernetes rollback
kubectl rollout undo deployment/app

# Git revert
git revert HEAD
git push

# Feature flag (pseudocode; the API depends on your flag system)
feature_flag.disable('new_feature')

Post-Mortem

What Is a Post-Mortem?

A blameless analysis conducted after the incident, covering:

  • What happened
  • Why it happened
  • Impact
  • How to prevent recurrence

Template

# Incident Post-Mortem

## Summary
Brief overview of the incident.

## Impact
- Users affected: 10,000
- Duration: 45 minutes
- Revenue impact: $X

## Root Cause
Detailed explanation.

## Timeline
- 10:00 Alert triggered
- 10:15 Incident declared
- 10:30 Root cause identified
- 10:45 Fix deployed

## Action Items
- [ ] Add alert for X (Owner: Alice, Due: Jan 15)
- [ ] Improve database monitoring (Owner: Bob, Due: Jan 20)
- [ ] Update runbook (Owner: Charlie, Due: Jan 10)

Blameless Culture

  • Focus on systems, not people
  • Ask “what” not “who”
  • Share learnings
  • Improve together

Prevention

Lessons to Actions

| Finding | Action |
|---------|--------|
| Missing alert | Add monitoring |
| Slow detection | Improve alerts |
| Unclear process | Update runbook |
| Bad rollback | Test rollbacks |

Continuous Improvement

  • Regular incident reviews
  • Game days (chaos testing)
  • Track post-mortem action items to completion
  • Share learnings across teams

On-Call Best Practices

Being On-Call

  • Know your systems
  • Have access ready
  • Document findings
  • Hand off properly

Supporting On-Call

  • Clear escalation paths
  • Reasonable schedules
  • Training
  • Recognition

Tools

Incident Management

  • PagerDuty: Alerting and on-call management
  • Opsgenie: Alert management
  • Splunk On-Call (formerly VictorOps): Incident response
  • FireHydrant: Incident management platform

Post-Incident

  • Confluence: Documentation
  • Notion: Knowledge base
  • GitHub Issues: Action tracking

Conclusion

Effective incident management requires preparation, clear processes, and a learning mindset. Build robust monitoring, train your teams, and use incidents as opportunities to improve your systems.

