
Incident Response: Postmortems & Prevention Systems

Introduction

Incidents are inevitable in complex systems. What matters is how quickly you detect and recover from them, and what you learn afterwards. A structured response process limits the impact of each incident, and blameless postmortems turn it into concrete improvements. Organizations without either tend to recover slowly and repeat the same failures.

This guide covers a severity-based incident response process, a blameless postmortem template, and preventive monitoring and alerting.


Core Concepts

Incident

Unplanned interruption or reduction in quality of service.

Severity

Impact level of incident (critical, high, medium, low).

MTTR (Mean Time To Recovery)

Average time to restore service after an incident begins.

MTTD (Mean Time To Detect)

Average time between the start of an incident and its detection.

MTBF (Mean Time Between Failures)

Average operating time between the end of one incident and the start of the next.

Postmortem

Analysis of incident to identify root causes and improvements.

Blameless Culture

Focus on systems and processes, not individual blame.

Root Cause

Underlying reason incident occurred.

Prevention

Changes to prevent similar incidents.
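
The three metrics defined above can be computed from a simple incident log. The following is a minimal sketch, assuming each incident is recorded with a start, detection, and recovery time in minutes; the `IncidentRecord` type and the sample `history` are hypothetical and exist only for illustration. Definitions vary between teams (some measure recovery from detection rather than from the start of the failure), so adjust the subtraction to match your own convention.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class IncidentRecord:
    # Times are minutes since an arbitrary reference point (illustrative format)
    started: float    # when the failure began
    detected: float   # when an alert or a person noticed it
    recovered: float  # when service was restored


def mttd(incidents):
    """Mean time from start of failure to detection."""
    return mean(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time from start of failure to recovery."""
    return mean(i.recovered - i.started for i in incidents)


def mtbf(incidents):
    """Mean operating time between one recovery and the next failure."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.recovered for prev, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


# Made-up sample data
history = [
    IncidentRecord(started=0, detected=5, recovered=35),
    IncidentRecord(started=1000, detected=1003, recovered=1023),
    IncidentRecord(started=3000, detected=3010, recovered=3070),
]

print(mttd(history))  # 6
print(mtbf(history))  # 1471
```

`mtbf` needs at least two incidents, since it averages the gaps between consecutive ones.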


Incident Response Process

Incident Severity Levels

class IncidentSeverity:
    CRITICAL = {
        'level': 1,
        'description': 'Complete service outage',
        'response_time': '5 minutes',
        'escalation': 'VP Engineering',
        'communication': 'Every 15 minutes'
    }
    
    HIGH = {
        'level': 2,
        'description': 'Significant degradation',
        'response_time': '15 minutes',
        'escalation': 'Engineering Manager',
        'communication': 'Every 30 minutes'
    }
    
    MEDIUM = {
        'level': 3,
        'description': 'Minor degradation',
        'response_time': '1 hour',
        'escalation': 'Team Lead',
        'communication': 'Hourly'
    }
    
    LOW = {
        'level': 4,
        'description': 'Minimal impact',
        'response_time': '4 hours',
        'escalation': 'On-call engineer',
        'communication': 'Daily'
    }
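
One way to make these levels actionable is to store the response targets as durations and check acknowledgements against them. This is a sketch under stated assumptions: the `Severity` enum, the `RESPONSE_TARGET` mapping, and `acknowledged_in_time` are hypothetical names, and the targets are copied from the class above.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4


# Response-time targets taken from the severity levels above
RESPONSE_TARGET = {
    Severity.CRITICAL: timedelta(minutes=5),
    Severity.HIGH: timedelta(minutes=15),
    Severity.MEDIUM: timedelta(hours=1),
    Severity.LOW: timedelta(hours=4),
}


def acknowledged_in_time(severity, alerted_at, acknowledged_at):
    """Return True if the acknowledgement met the target for this severity."""
    return acknowledged_at - alerted_at <= RESPONSE_TARGET[severity]


alerted = datetime(2025, 1, 15, 14, 30)
acked = datetime(2025, 1, 15, 14, 32)
print(acknowledged_in_time(Severity.HIGH, alerted, acked))  # True
```

Storing the targets as `timedelta` values rather than strings such as '15 minutes' lets the comparison be done directly.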

Incident Response Workflow

import time

# generate_id(), determine_severity(), notify_team(), create_incident_channel(),
# update_status(), start_investigation(), execute_mitigation(),
# verify_service_restored(), log_metric(), schedule_postmortem(), and
# notify_stakeholders() are placeholders for your own tooling.

class IncidentResponse:
    def __init__(self):
        self.incident_id = None
        self.status = 'open'
        self.severity = None
        self.responder = None
        self.start_time = None
        self.end_time = None

    def detect_incident(self, alert):
        """Open an incident from an alert"""
        self.incident_id = generate_id()
        self.start_time = time.time()
        self.severity = determine_severity(alert)

        # Notify team
        notify_team(self.incident_id, self.severity)

        # Create incident channel
        create_incident_channel(self.incident_id)

        return self.incident_id

    def acknowledge_incident(self, responder):
        """Acknowledge incident"""
        self.responder = responder

        # Update status locally and in the tracking system
        self.status = 'acknowledged'
        update_status('acknowledged')

        # Start investigation
        start_investigation()

    def mitigate_incident(self, mitigation):
        """Apply mitigation"""
        # Execute mitigation
        execute_mitigation(mitigation)

        # Mark as mitigated only once the service is verified as restored
        if verify_service_restored():
            self.status = 'mitigated'
            self.end_time = time.time()

            # Time to recovery for this incident, measured from detection.
            # MTTR is the mean of this value across incidents.
            time_to_recovery = self.end_time - self.start_time
            log_metric('time_to_recovery', time_to_recovery)

    def resolve_incident(self):
        """Resolve incident"""
        self.status = 'resolved'

        # Schedule postmortem
        schedule_postmortem(self.incident_id)

        # Notify stakeholders
        notify_stakeholders('incident_resolved')
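
The workflow above moves an incident through the statuses open, acknowledged, mitigated, and resolved. A small transition table can guard against skipping a step, for example resolving an incident that was never mitigated. This is a minimal sketch; `ALLOWED_TRANSITIONS`, `IncidentStateError`, and `transition` are hypothetical names, and the strict linear order is an assumption that some teams may want to relax.

```python
# Status names match those used in the workflow above
ALLOWED_TRANSITIONS = {
    'open': {'acknowledged'},
    'acknowledged': {'mitigated'},
    'mitigated': {'resolved'},
    'resolved': set(),
}


class IncidentStateError(Exception):
    """Raised when a status change skips or reverses a workflow step."""


def transition(current, new):
    """Return the new status if the move is allowed, otherwise raise."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise IncidentStateError(f"cannot move from {current!r} to {new!r}")
    return new


status = 'open'
status = transition(status, 'acknowledged')
status = transition(status, 'mitigated')
status = transition(status, 'resolved')
print(status)  # resolved
```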

Blameless Postmortem

Postmortem Template

# Incident Postmortem

## Incident Summary
- **Incident ID**: INC-2025-001
- **Date**: 2025-01-15
- **Duration**: 45 minutes (alert at 14:30 to all-clear at 15:15)
- **Severity**: High
- **Impact**: 10% of users affected

## Timeline
- 14:30 - Alert triggered for high error rate
- 14:32 - On-call engineer acknowledged
- 14:35 - Root cause identified: database connection pool exhausted
- 14:40 - Mitigation applied: restarted database connection pool
- 14:45 - Service restored
- 15:15 - All-clear confirmed

## Root Cause Analysis
The database connection pool was exhausted due to:
1. Increased traffic from marketing campaign
2. Slow database queries causing connections to be held longer
3. Connection pool size not adjusted for increased load

## Contributing Factors
- No alerting on connection pool usage
- Load testing did not simulate marketing campaign traffic
- Database query optimization not prioritized

## What Went Well
- Alert triggered quickly
- On-call engineer responded immediately
- Mitigation was straightforward
- Communication was clear

## What Could Be Improved
- Earlier detection of connection pool exhaustion
- Better load testing
- Database query optimization
- Capacity planning for marketing campaigns

## Action Items
1. Add alerting for connection pool usage (Owner: Database Team, Due: 1 week)
2. Optimize slow database queries (Owner: Backend Team, Due: 2 weeks)
3. Increase connection pool size (Owner: DevOps, Due: 3 days)
4. Update load testing to include marketing scenarios (Owner: QA, Due: 1 week)
5. Implement capacity planning process (Owner: Engineering Manager, Due: 2 weeks)

## Lessons Learned
- Connection pool exhaustion can cause cascading failures
- Marketing campaigns need coordination with engineering
- Load testing must simulate real-world scenarios
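
The intervals in a postmortem timeline can be derived rather than typed by hand, which keeps the summary consistent with the timeline. The sketch below uses the timestamps from the template above and assumes all events fall on the same day; the `timeline` dictionary keys and `minutes_between` are hypothetical names.

```python
from datetime import datetime

# Timestamps taken from the example timeline in the template
timeline = {
    'alert': '14:30',
    'acknowledged': '14:32',
    'restored': '14:45',
    'all_clear': '15:15',
}


def minutes_between(start, end):
    """Whole minutes between two HH:MM times on the same day."""
    fmt = '%H:%M'
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


print(minutes_between(timeline['alert'], timeline['acknowledged']))  # 2
print(minutes_between(timeline['alert'], timeline['restored']))      # 15
print(minutes_between(timeline['alert'], timeline['all_clear']))     # 45
```

In this example the service was restored after 15 minutes, while the 45-minute duration in the summary runs to the all-clear.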

Postmortem Best Practices

class PostmortemFacilitator:
    def __init__(self, incident_id):
        self.incident_id = incident_id
        self.participants = []
        self.findings = []
    
    def facilitate_postmortem(self):
        """Facilitate blameless postmortem"""
        
        # 1. Gather facts
        facts = self.gather_facts()
        
        # 2. Identify root causes (not blame)
        root_causes = self.identify_root_causes(facts)
        
        # 3. Discuss contributing factors
        contributing_factors = self.discuss_contributing_factors()
        
        # 4. Identify action items
        action_items = self.identify_action_items(root_causes)
        
        # 5. Document findings, including contributing factors
        self.document_findings(facts, root_causes, contributing_factors, action_items)
        
        return {
            'facts': facts,
            'root_causes': root_causes,
            'contributing_factors': contributing_factors,
            'action_items': action_items
        }
    
    def gather_facts(self):
        """Gather objective facts"""
        facts = []
        
        # Timeline of events
        facts.append(self.get_timeline())
        
        # System metrics during incident
        facts.append(self.get_metrics())
        
        # Log entries
        facts.append(self.get_logs())
        
        return facts
    
    def identify_root_causes(self, facts):
        """Identify root causes (not blame)"""
        # Focus on systems and processes, not people
        
        root_causes = []
        
        for fact in facts:
            # Ask "why" multiple times
            cause = self.ask_why(fact, depth=5)
            root_causes.append(cause)
        
        return root_causes
    
    def ask_why(self, fact, depth=5):
        """Ask why recursively to find root cause"""
        if depth == 0:
            return fact
        
        # Ask why this happened
        why = input(f"Why did {fact} happen? ")
        
        return self.ask_why(why, depth - 1)
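
The `ask_why` method above relies on interactive input, which is hard to test or replay. A variant that takes the answer source as a parameter and keeps the whole chain of answers might look like the following sketch. The function name `five_whys`, the `answer_fn` parameter, and the scripted answers (loosely based on the example postmortem) are all illustrative assumptions.

```python
def five_whys(problem, answer_fn, depth=5):
    """Follow the chain of 'why' answers, stopping early if none is given."""
    chain = [problem]
    for _ in range(depth):
        answer = answer_fn(chain[-1])
        if not answer:
            break
        chain.append(answer)
    return chain


# Scripted answers for illustration; in a meeting these come from participants
scripted = {
    'High error rate': 'Database connection pool exhausted',
    'Database connection pool exhausted': 'Connections were held longer than usual',
    'Connections were held longer than usual': 'Slow database queries under increased traffic',
    'Slow database queries under increased traffic': 'Pool size and queries were not tuned for campaign load',
}

chain = five_whys('High error rate', scripted.get)
for step in chain:
    print(step)
```

Keeping the full chain, rather than only the final answer, records the reasoning that led to the root cause so it can go into the postmortem document.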

Prevention Systems

Monitoring and Alerting

class PreventionSystem:
    def __init__(self):
        self.alerts = []
        self.thresholds = {}
    
    def add_alert(self, metric, threshold, action):
        """Add preventive alert"""
        self.alerts.append({
            'metric': metric,
            'threshold': threshold,
            'action': action
        })
    
    def check_metrics(self, current_metrics):
        """Check metrics against thresholds"""
        for alert in self.alerts:
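            metric_value = current_metrics.get(alert['metric'])
            
            # Skip metrics missing from this sample; comparing None raises TypeError
            if metric_value is None:
                continue
            
            if metric_value > alert['threshold']:
                # Execute preventive action
                self.execute_action(alert['action'])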
    
    def execute_action(self, action):
        """Execute preventive action"""
        if action == 'scale_up':
            scale_up_instances()
        elif action == 'increase_pool':
            increase_connection_pool()
        elif action == 'notify':
            notify_team('Preventive alert threshold exceeded')

# Setup prevention system
prevention = PreventionSystem()

# Add preventive alerts
prevention.add_alert(
    metric='connection_pool_usage',
    threshold=0.8,  # 80%
    action='increase_pool'
)

prevention.add_alert(
    metric='cpu_usage',
    threshold=0.75,  # 75%
    action='scale_up'
)

prevention.add_alert(
    metric='error_rate',
    threshold=0.01,  # 1%
    action='notify'
)
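
Before wiring alerts to real actions such as scaling, it can help to dry-run the threshold logic against a sample of metrics and inspect which actions would fire. The following self-contained sketch mirrors the three alerts configured above; `triggered_actions` and the `sample` values are hypothetical.

```python
def triggered_actions(alerts, current_metrics):
    """Return the actions whose metric exceeds its threshold, without running them."""
    actions = []
    for alert in alerts:
        value = current_metrics.get(alert['metric'])
        # Metrics missing from the sample are skipped rather than treated as zero
        if value is not None and value > alert['threshold']:
            actions.append(alert['action'])
    return actions


alerts = [
    {'metric': 'connection_pool_usage', 'threshold': 0.8, 'action': 'increase_pool'},
    {'metric': 'cpu_usage', 'threshold': 0.75, 'action': 'scale_up'},
    {'metric': 'error_rate', 'threshold': 0.01, 'action': 'notify'},
]

# Made-up sample: pool usage is above its threshold, CPU is not, error rate is missing
sample = {'connection_pool_usage': 0.85, 'cpu_usage': 0.40}
print(triggered_actions(alerts, sample))  # ['increase_pool']
```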

Best Practices

  1. Blameless Culture: Focus on systems and processes, not people
  2. Timely Postmortems: Conduct within 48 hours, while details are fresh
  3. Diverse Participation: Include multiple perspectives
  4. Action Items: Assign an owner and a deadline to each item
  5. Follow-up: Track action items until they are completed
  6. Share Learnings: Distribute postmortems widely
  7. Prevent Recurrence: Prioritize fixes that address root causes
  8. Metrics: Track MTTR, MTTD, and MTBF over time
  9. Automation: Automate repetitive response steps such as notification and channel creation
  10. Continuous Improvement: Learn from every incident
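
Practices 4 and 5 are easier to follow when action items are stored as data and checked regularly. This is a minimal sketch: the `ActionItem` type and `overdue` helper are hypothetical, and the due dates are assumptions derived by counting from the example incident date of 2025-01-15.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Return open action items whose deadline has passed."""
    return [item for item in items if not item.done and item.due < today]


# Items from the example postmortem; dates are illustrative
items = [
    ActionItem('Add alerting for connection pool usage', 'Database Team', date(2025, 1, 22)),
    ActionItem('Increase connection pool size', 'DevOps', date(2025, 1, 18)),
    ActionItem('Optimize slow database queries', 'Backend Team', date(2025, 1, 29)),
]

for item in overdue(items, today=date(2025, 1, 20)):
    print(f"OVERDUE: {item.title} (owner: {item.owner}, due: {item.due})")
```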

Conclusion

Effective incident response and blameless postmortems are essential for building reliable systems. By focusing on learning rather than blame, organizations build stronger systems and healthier teams.

Implement structured incident response, conduct thorough postmortems, and use findings to prevent future incidents.

Incident response is the foundation of reliability.
