Introduction
Production incidents happen. How you respond determines whether they become learning opportunities or recurring problems. This guide covers the complete incident management lifecycle from preparation to post-mortem.
Incident Fundamentals
What Is an Incident?
An incident is an unplanned interruption to a service or a reduction in its quality. Common examples:
- Service outage
- Performance degradation
- Data loss
- Security breach
Severity Levels
| Severity | Impact | Example | Response Time |
|---|---|---|---|
| SEV1 | Critical - Full outage | All users affected | Immediate |
| SEV2 | Major - Partial outage | ~50% of users affected | 15 min |
| SEV3 | Moderate - Degradation | Slow response times | 1 hour |
| SEV4 | Minor - Small impact | Few users affected | 4 hours |
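The severity table above is easiest to apply consistently when it is encoded in tooling. A minimal sketch: the level names and response targets come from the table, while the enum and helper function are hypothetical illustrations.

```python
from datetime import timedelta
from enum import Enum

class Severity(Enum):
    SEV1 = 1  # Critical - full outage
    SEV2 = 2  # Major - partial outage
    SEV3 = 3  # Moderate - degradation
    SEV4 = 4  # Minor - small impact

# Target response times, mirroring the table above.
RESPONSE_SLA = {
    Severity.SEV1: timedelta(0),           # immediate
    Severity.SEV2: timedelta(minutes=15),
    Severity.SEV3: timedelta(hours=1),
    Severity.SEV4: timedelta(hours=4),
}

def response_deadline(declared_at, severity):
    """Return the time by which the on-call should have responded."""
    return declared_at + RESPONSE_SLA[severity]
```

Keeping the mapping in one place means paging tools, dashboards, and post-mortems all agree on what each level demands.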
Preparation
Monitoring
Essential metrics to track:
- Availability: Is the service up?
- Latency: How fast are requests served?
- Errors: What is failing, and how often?
- Saturation: How close are resources to their limits?
Tools
- Prometheus: Metrics
- Grafana: Visualization
- Datadog: APM + monitoring
- PagerDuty: Alerting
On-Call Setup
# PagerDuty schedule example
schedule:
  rotation:
    - name: Primary
      users: [alice, bob]
      start: Mon 9am
    - name: Secondary
      users: [charlie]
  escalation:
    - level: 1
      after: 15 minutes
    - level: 2
      after: 30 minutes
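The escalation policy above reduces to a simple rule: page the next level whenever an alert has gone unacknowledged past each threshold. A sketch of that logic, with the thresholds taken from the schedule example and the function itself hypothetical:

```python
# Escalation thresholds in minutes, matching the schedule example:
# level 1 after 15 minutes unacknowledged, level 2 after 30.
ESCALATION_THRESHOLDS = [(15, 1), (30, 2)]

def escalation_level(minutes_unacked):
    """Return the escalation level for an alert unacknowledged this long.

    Level 0 means the primary on-call is still within their window.
    """
    level = 0
    for threshold, lvl in ESCALATION_THRESHOLDS:
        if minutes_unacked >= threshold:
            level = lvl
    return level
```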
Runbooks
Runbooks are documented, step-by-step procedures for known failure modes. Example:
# Database Connection Errors Runbook
## Symptoms
- 5xx errors on API
- Database connection pool exhausted
## Diagnosis
1. Check for long-running queries: `SELECT pid, state, now() - query_start AS runtime, query FROM pg_stat_activity WHERE state <> 'idle';`
2. Check connection count: `SELECT count(*) FROM pg_stat_activity;`
## Resolution
1. Kill long-running queries with `SELECT pg_terminate_backend(pid);`
2. Scale up database
3. Restart app pods if needed
## Prevention
- Add connection pool limits
- Set query timeouts
- Add monitoring alerts
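The diagnosis step above boils down to comparing the `pg_stat_activity` count against the pool limit. A hypothetical helper that turns that count into a runbook decision (the thresholds are illustrative, not canonical):

```python
def pool_status(active_connections, max_connections, warn_ratio=0.8):
    """Classify connection-pool pressure from a connection count.

    Returns 'ok', 'warning', or 'exhausted'; the runbook's resolution
    steps apply at 'warning' and above. Thresholds are illustrative.
    """
    if active_connections >= max_connections:
        return "exhausted"
    if active_connections >= warn_ratio * max_connections:
        return "warning"
    return "ok"
```

Wiring a check like this into monitoring turns the runbook's manual diagnosis into an alert that fires before the pool is fully exhausted.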
Detection
Alerting Strategy
- High signal: Few false positives
- Actionable: Clear next steps
- Timely: Fast enough to respond
Alert Types
# Good alerts
- name: High Error Rate
  condition: error_rate > 0.05 for 5 minutes
  action: Page on-call
- name: High Latency
  condition: p99_latency > 2s for 10 minutes
  action: Create incident

# Avoid
- name: Metric Spike
  condition: any spike
  action: ignore
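What makes the "good" alerts above high-signal is the `for 5 minutes` clause: the condition must hold for a sustained window, not a single scrape. A minimal sketch of that debouncing logic (the function and its names are hypothetical):

```python
def sustained_breach(samples, threshold, window):
    """True if the last `window` samples all exceed `threshold`.

    Mimics a rule like `error_rate > 0.05 for 5 minutes` evaluated
    over per-minute samples: one noisy spike does not page,
    a sustained breach does.
    """
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])
```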
Response
Incident Commander
The incident commander (IC) leads the response:
- Coordinate all efforts
- Make decisions
- Communicate status
- Delegate tasks
Roles
| Role | Responsibility |
|---|---|
| IC | Overall coordination |
| Lead | Technical investigation |
| Comms | Internal/external comms |
| Scribe | Document timeline |
Communication
Internal
- Incident channel
- Regular updates
- Status page updates
External
- Status page
- Customer emails
- Social media
Example Update
[UPDATE] 2:45 PM PST
We are investigating elevated error rates on our API.
Impact: ~15% of requests failing
Team: Working on identifying root cause
ETA: 30 minutes to next update
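Updates like the one above follow a fixed shape (timestamp, summary, impact, current action, next-update ETA), and a small template helper keeps them consistent under pressure. A hypothetical sketch:

```python
def format_update(time_str, summary, impact, status, next_update):
    """Render a status update in the fixed shape shown above.

    Hypothetical helper; the field labels mirror the example update.
    """
    return (
        f"[UPDATE] {time_str}\n"
        f"{summary}\n"
        f"Impact: {impact}\n"
        f"Team: {status}\n"
        f"ETA: {next_update}"
    )
```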
Resolution
Investigation
- Gather data: Logs, metrics, traces
- Identify changes: Recent deploys, config
- Hypothesis: What is the likely cause?
- Test: Validate hypothesis
- Fix: Implement solution
Common Fixes
| Issue | Quick Fix | Long-term |
|---|---|---|
| Memory leak | Restart pods | Fix leak |
| DB overload | Scale up | Optimize queries |
| Bad deploy | Rollback | Better testing |
| External dependency down | Fail over | Add redundancy |
Rollback
# Kubernetes rollback
kubectl rollout undo deployment/app
# Git revert
git revert HEAD
git push
# Feature flag
feature_flag.disable('new_feature')
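The `feature_flag.disable(...)` call above is pseudocode; the underlying idea is a kill switch checked at runtime, so a bad feature can be turned off without a deploy. A hypothetical in-process sketch (real systems back this with a shared store such as Redis or a flag service, so the disable takes effect across all instances):

```python
class FeatureFlags:
    """Toy in-process flag store; flags default to disabled.

    Hypothetical illustration -- production systems persist flags in
    a shared store so one disable reaches every running instance.
    """
    def __init__(self):
        self._flags = {}

    def enable(self, name):
        self._flags[name] = True

    def disable(self, name):
        self._flags[name] = False

    def is_enabled(self, name):
        return self._flags.get(name, False)

flags = FeatureFlags()
flags.enable("new_feature")
# During an incident: turn the feature off without redeploying.
flags.disable("new_feature")
```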
Post-Mortem
What Is a Post-Mortem?
A blameless analysis conducted after the incident, covering:
- What happened
- Why it happened
- Impact
- How to prevent recurrence
Template
# Incident Post-Mortem
## Summary
Brief overview of the incident.
## Impact
- Users affected: 10,000
- Duration: 45 minutes
- Revenue impact: $X
## Root Cause
Detailed explanation.
## Timeline
- 10:00 Alert triggered
- 10:15 Incident declared
- 10:30 Root cause identified
- 10:45 Fix deployed
## Action Items
- [ ] Add alert for X (Owner: Alice, Due: Jan 15)
- [ ] Improve database monitoring (Owner: Bob, Due: Jan 20)
- [ ] Update runbook (Owner: Charlie, Due: Jan 10)
Blameless Culture
- Focus on systems, not people
- Ask “what” not “who”
- Share learnings
- Improve together
Prevention
Lessons to Actions
| Finding | Action |
|---|---|
| Missing alert | Add monitoring |
| Slow detection | Improve alerts |
| Unclear process | Update runbook |
| Bad rollback | Test rollbacks |
Continuous Improvement
- Regular incident reviews
- Game days (chaos testing)
- Track action items to completion
- Share learnings
On-Call Best Practices
Being On-Call
- Know your systems
- Have access ready
- Document findings
- Hand off properly
Supporting On-Call
- Clear escalation paths
- Reasonable schedules
- Training
- Recognition
Tools
Incident Management
- PagerDuty: Alerting and on-call
- OpsGenie: Alert management
- VictorOps: Incident response
- FireHydrant: Incident management platform
Post-Incident
- Confluence: Documentation
- Notion: Knowledge base
- GitHub Issues: Action tracking
Conclusion
Effective incident management requires preparation, clear processes, and a learning mindset. Build robust monitoring, train your teams, and use incidents as opportunities to improve your systems.